Paperid:1 Oral
Authors:Lei Shi,Jiapeng Yang,Pengtao Lv,Lu Yuan,Feifei Kou,Jia Luo,MingYing Xu
Title: Self-derived Knowledge Graph Contrastive Learning for Recommendation
Abstract:
Knowledge Graphs (KGs) serve as valuable auxiliary information to improve the accuracy of recommendation systems. Previous methods have leveraged the knowledge graph to enhance item representation and thus achieve excellent performance. However, these approaches heavily rely on high-quality knowledge graphs and learn enhanced representations with the assistance of carefully designed triplets. Furthermore, the emergence of knowledge graphs has led to models that ignore the inherent relationships between items and entities. To address these challenges, we propose a Self-Derived Knowledge Graph Contrastive Learning framework (CL-SDKG) to enhance recommendation systems. Specifically, we employ the variational graph reconstruction technique to estimate the Gaussian distribution of user-item nodes corresponding to the graph neural network aggregation layer. This process generates multiple KGs, referred to as self-derived KGs. The self-derived KG acquires more robust perceptual representations through the consistency of the estimated structure. Besides, the self-derived KG allows models to focus on user-item interactions and reduce the negative impact of miscellaneous dependencies introduced by conventional KGs. Finally, we apply contrastive learning to the self-derived KG to further improve the robustness of CL-SDKG through the traditional KG contrast-enhanced process. We conducted comprehensive experiments on three public datasets, and the results demonstrate that our CL-SDKG outperforms state-of-the-art baselines.
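As a hedged illustration of the contrastive step mentioned above, the sketch below shows a generic node-level InfoNCE objective that aligns two embeddings of the same user-item node (for example, one from a self-derived KG view and one from a conventional KG view). The exact loss, temperature, and view construction used by CL-SDKG are not specified here; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Node-level InfoNCE: row i of z1 and row i of z2 are two views of the same node."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau               # (N, N) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)   # positives lie on the diagonal

# Toy usage: embeddings of the same user-item nodes under two KG views.
z_self_derived = torch.randn(128, 64)
z_conventional = torch.randn(128, 64)
loss = info_nce(z_self_derived, z_conventional)
```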



Paperid:2 Oral
Authors:Navonil Majumder,Chia-Yu Hung,Deepanway Ghosal,Wei-Ning Hsu,Rada Mihalcea,Soujanya Poria
Abstract:
Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on large datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using the diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic and manual evaluation metrics.
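For readers unfamiliar with diffusion-DPO, the sketch below illustrates the general form of the preference objective, assuming that per-sample denoising errors are compared between the fine-tuned model and a frozen reference for the winner and loser audio; the beta value and the exact weighting used in the paper may differ, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref, beta: float = 2000.0):
    """Inputs are per-sample denoising errors (MSE between predicted and true noise) for the
    winner (w) and loser (l) audio under the trained model (theta) and a frozen reference."""
    winner_gain = err_w_theta - err_w_ref   # how much the trained model improves on the winner
    loser_gain = err_l_theta - err_l_ref    # how much it improves on the loser
    return -F.logsigmoid(-beta * (winner_gain - loser_gain)).mean()

# Toy usage with made-up per-sample errors.
e = lambda: torch.rand(8)
loss = diffusion_dpo_loss(e(), e(), e(), e())
```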



Paperid:3 Oral
Authors:Wenxin Xu,Hexin Jiang,xuefeng liang
Abstract:
Multimodal Emotion Recognition (MER) may encounter incomplete multimodal scenarios caused by sensor damage or privacy protection in practical applications. Existing incomplete multimodal learning methods focus on learning better joint representations across modalities. However, our investigation shows that they fall short in learning unimodal representations, which are also rather discriminative. Instead, we propose a novel framework named Mixture of Modality Knowledge Experts (MoMKE) with two-stage training. In unimodal expert training, each expert learns the unimodal knowledge from the corresponding modality. In experts mixing training, both unimodal and joint representations are learned by leveraging the knowledge of all modality experts. In addition, we design a special Soft Router that can enrich the modality representations by dynamically mixing the unimodal representations and the joint representations. Various incomplete multimodal experiments on three benchmark datasets showcase the robust performance of MoMKE, especially under severely incomplete conditions. Visualization analysis further reveals the considerable value of unimodal and joint representations.
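A minimal sketch of how a soft router could mix unimodal expert outputs with a joint representation via learned softmax gates; the gating input, dimensions, and class names below are assumptions for illustration rather than MoMKE's exact design.

```python
import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    """Gating network that produces per-sample mixing weights over the unimodal
    expert representations plus the joint representation."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim * (num_experts + 1), num_experts + 1)

    def forward(self, unimodal: list, joint: torch.Tensor) -> torch.Tensor:
        reps = unimodal + [joint]                                             # (E + 1) tensors of shape (B, dim)
        weights = torch.softmax(self.gate(torch.cat(reps, dim=-1)), dim=-1)   # (B, E + 1)
        stacked = torch.stack(reps, dim=1)                                    # (B, E + 1, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                   # (B, dim)

# Toy usage: audio / visual / text experts plus a joint representation.
router = SoftRouter(dim=128, num_experts=3)
a, v, t = (torch.randn(4, 128) for _ in range(3))
fused = router([a, v, t], joint=torch.randn(4, 128))
```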



Paperid:4 Oral
Authors:Yixuan Zhou,Xiaoyu Qin,Zeyu Jin,Shuoyi Zhou,Shun Lei,Songtao Zhou,Zhiyong Wu,Jia Jia
Abstract:
Recent AIGC systems possess the capability to generate digital multimedia content based on human language instructions, such as text, image, and video. However, when it comes to speech, existing methods related to human instruction-to-speech generation exhibit two limitations. Firstly, they require the division of inputs into the content prompt (transcript) and description prompt (style and speaker), instead of directly supporting human instruction. This division is less natural in form and does not align with other AIGC models. Secondly, the practice of utilizing an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthens the adherence of the generated speech to human instructions. Furthermore, our model architecture and training strategies allow for the simultaneous support of combining speech prompt and descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt.



Paperid:5 Oral
Authors:Esmee Henrieke Anne de Haas,LIK-HANG LEE,Yiming Huang,Carlos BERMEJO FERNANDEZ,Pan Hui,Zijun Lin
Abstract:
E-commerce has emerged as a significant endeavour in which technological advancements influence the shopping experience. Simultaneously, the metaverse is the next breakthrough to transform multimedia engagement. However, in such settings, deceptive designs aimed at steering users into making desired choices might be more successful. This paper proposes the design space of manipulative techniques in e-commerce applications for the metaverse. We construct our arguments by evaluating user interaction with manipulative design in metaverse shopping experiences, followed by a survey among users to understand the effect of counteracting manipulative e-commerce scenarios. Our findings can reinforce understanding of design guidelines for metaverse e-commerce experiences and highlight opportunities to improve user awareness of manipulative experiences.



Paperid:6 Oral
Authors:Xuan Han,Yihao Zhao,Mingyu You
Abstract:
Scene image is one of the important windows for showcasing product design. To obtain it, the standard 3D-based pipeline requires designers not only to create the 3D model of the product, but also to manually construct the entire scene in software, which hinders its adaptability in situations requiring rapid evaluation. This study aims to realize a novel conditional synthesis method to create the scene image based on a single-model rendering of the desired object and the scene description. In this task, the major challenges are ensuring the strict appearance fidelity of the drawn object and the overall visual harmony of the synthesized image. The former relies on maintaining an appropriate condition-output constraint, while the latter necessitates a well-balanced generation process for all regions of the image. In this work, we propose the Scene Diffusion framework to meet these challenges. Its first component is the Shading Adaptive Condition Alignment (SACA), which functions as an intensive training objective to promote the appearance consistency between condition and output image without hindering the network's learning of global shading coherence. Afterwards, a novel low-to-high Frequency Progression Training Schedule (FPTS) is utilized to maintain the visual harmony of the entire image by moderating the growth of high-frequency signals in the object area. Extensive qualitative and quantitative results are presented to support the advantages of the proposed method. In addition, we also demonstrate the broader uses of Scene Diffusion, such as its incorporation with ControlNet.



Paperid:7 Oral
Authors:Pengqiang Bi,Yifei Zou,Mengbai Xiao,Dongxiao Yu,yijunli,zhixiong.liu,qunxie
Abstract:
QUIC is the underlying protocol of the next-generation HTTP/3, serving as the major vehicle delivering video data nowadays. As a userspace protocol based on UDP, QUIC features low transmission latency and has been widely deployed by content providers. However, the high computational overhead of QUIC shifts the system bottleneck to the CPU in high-bandwidth scenarios. When CPU resources become the constraint, HTTP/3 exhibits even lower throughput than HTTP/1.1. In this paper, we carefully analyze the performance bottleneck of QUIC and find it results from ACK processing, packet sending, and data encryption. By reducing the ACK frequency, activating UDP generic segmentation offload (GSO), and incorporating PicoTLS, a high-performance encryption library, the CPU overhead of QUIC can be effectively reduced in stable network environments. However, simply reducing the ACK frequency also impairs the transmission throughput of QUIC under poor network conditions. To solve this, we develop LiteQUIC, which involves two mechanisms for alleviating the overhead of ACK processing in addition to GSO and PicoTLS. We evaluate LiteQUIC in DASH-based video streaming, and the results show that LiteQUIC achieves 1.20$\times$ higher average bitrate and 93.3% lower rebuffering time than an optimized version of QUIC with GSO and PicoTLS. In a network environment with a 0.1% packet loss ratio, LiteQUIC outperforms HTTP/1.1 by up to 1.63$\times$ in terms of average bitrate and reduces the rebuffering time by 95.1%.



Paperid:8 Oral
Authors:Dongyu Xie,Chaofan Qiao,Lanyue Liang,Zhiwen Wang,Tianyu Li,Qiao Liu,Chongyi Li,Guoqing Wang,Yang Yang
Abstract:
ISP (Image Signal Processor) serves as a pipeline converting unprocessed raw images to sRGB images, positioned before nearly all visual tasks. Due to the varying spectral sensitivities of cameras, raw images captured by different cameras exist in different color spaces, making it challenging to deploy an ISP across cameras with consistent performance. To address this challenge, it is intuitive to incorporate a raw-to-raw mapping (mapping raw images across camera color spaces) module into the ISP. However, the lack of paired data (i.e., images of the same scene captured by different cameras) makes it difficult to train a raw-to-raw model using supervised learning methods. In this paper, we aim to achieve ISP generalization by proposing the first unsupervised raw-to-raw model. To be specific, we propose a CSTPP (Color Space Transformation Parameters Predictor) module to predict the space transformation parameters in a patch-wise manner, which can accurately perform color space transformation and flexibly manage complex lighting conditions. Additionally, we design a CycleGAN-style training framework to realize unsupervised learning, overcoming the deficiency of paired data. Our proposed unsupervised model achieves performance comparable to that of the state-of-the-art semi-supervised method in the raw-to-raw task. Furthermore, to assess its ability to generalize the ISP model across different cameras, we formulate the cross-camera ISP task for the first time and demonstrate the performance of our method through extensive experiments. Codes will be publicly available.
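CycleGAN-style training as mentioned above typically relies on a cycle-consistency term. The sketch below shows that term for two raw color spaces, with identity functions standing in for the CSTPP-based generators; it is an illustrative rendering, not the paper's exact objective (which would also include adversarial losses).

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_ab, G_ba, raw_a: torch.Tensor, raw_b: torch.Tensor) -> torch.Tensor:
    """G_ab maps camera-A raw images to camera-B's color space; G_ba maps back.
    Unpaired images from each camera should survive a round trip."""
    loss_a = F.l1_loss(G_ba(G_ab(raw_a)), raw_a)
    loss_b = F.l1_loss(G_ab(G_ba(raw_b)), raw_b)
    return loss_a + loss_b

# Toy usage with identity "generators" standing in for the learned mappings.
identity = lambda x: x
loss = cycle_consistency_loss(identity, identity,
                              torch.rand(2, 4, 64, 64), torch.rand(2, 4, 64, 64))
```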



Paperid:9 Oral
Authors:Yingxuan Li,Ryota Hinami,Kiyoharu Aizawa,Yusuke Matsui
Abstract:
Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches, such as training character classifiers that require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.



Paperid:10 Oral
Authors:Ziyi Ye,Jingtao Zhan,Qingyao Ai,Yiqun LIU,Maarten de Rijke,Christina Lioma,Tuukka Ruotsalo
Abstract:
In the information retrieval scenario, query augmentation is an essential technique to refine semantically imprecise queries to align closely with users' actual information needs. Traditional methods typically rely on extracting signals from user interactions such as browsing or clicking behaviors to augment the queries, which may not accurately reflect the actual user intent due to inherent noise and the dependency on initial user interactions. To overcome these limitations, we introduce Brain-Aug, a novel approach that decodes semantic information directly from brain signals of users to augment query representation. Brain-Aug explores three-fold techniques: (1) Structurally, an adapter network is utilized to project brain signals into the embedding space of a language model, allowing query augmentation conditioned on both the users' initial query and their brain signals. (2) During training, we use a next token prediction task for query augmentation and adopt prompt tuning to efficiently train the brain adapter. (3) At the inference stage, a ranking-oriented decoding strategy is implemented, enabling Brain-Aug to generate augmentations that improve ranking performance. We evaluate our approach on multiple functional magnetic resonance imaging (fMRI) datasets, demonstrating that Brain-Aug not only produces semantically richer queries but also significantly improves document ranking accuracy, particularly for ambiguous queries. These results validate the effectiveness of our proposed Brain-Aug approach, and reveal the great potential of leveraging internal cognitive states to understand and augment text-based queries.
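The adapter idea in technique (1) can be illustrated with a small module that maps an fMRI feature vector to a few soft prompt tokens in the language model's embedding space. The dimensions, token count, and two-layer design below are assumptions for illustration, not Brain-Aug's actual architecture.

```python
import torch
import torch.nn as nn

class BrainAdapter(nn.Module):
    """Projects an fMRI feature vector into a short sequence of tokens that live in the
    language model's embedding space, so they can be prepended to the query embeddings."""
    def __init__(self, fmri_dim: int, lm_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(fmri_dim, lm_dim * num_tokens),
            nn.GELU(),
            nn.Linear(lm_dim * num_tokens, lm_dim * num_tokens),
        )

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        batch = fmri.size(0)
        return self.proj(fmri).view(batch, self.num_tokens, -1)  # (B, num_tokens, lm_dim)

adapter = BrainAdapter(fmri_dim=2048, lm_dim=768)
brain_tokens = adapter(torch.randn(2, 2048))
# brain_tokens would be concatenated with the query's token embeddings before next-token prediction.
```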



Paperid:11 Oral
Authors:Yiheng Huang,Yang Hui,Chuanchen Luo,Yuxi Wang,Shibiao Xu,Zhaoxiang Zhang,Man Zhang,Junran Peng
Abstract:
Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies. The effect of the design of each component is still unclear. In addition, the iterative denoising process consumes considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this in-depth analysis, we tailor each component for efficient, high-quality human motion generation. Despite the promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods.



Paperid:12 Oral
Authors:Yi Bin,WENHAO SHI,Yujuan Ding,Zhiqiang Hu,Zheng WANG,Yang Yang,See-Kiong Ng,Heng Tao Shen
Abstract:
Artwork analysis is an important and fundamental skill for art appreciation, which can enrich personal aesthetic sensibility and facilitate critical thinking. Understanding artworks is challenging due to their subjective nature, diverse interpretations, and complex visual elements, requiring expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model capability, previous works on automatically analyzing artworks mainly focus on classification, retrieval, and other simple tasks, which is far from the goal of AI. To facilitate research progress, in this paper, we take a step further and compose comprehensive analyses, inspired by the remarkable perception and generation ability of large multimodal models. Specifically, we first propose the task of composing paragraph analyses for artworks, i.e., paintings in this paper, focusing only on visual characteristics to formulate a more comprehensive understanding of artworks. To support the research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a superior large multimodal model for painting analysis composing, dubbed GalleryGPT, which is slightly modified from the LLaVA architecture and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its superb art analysis and generalization ability. The codes and model will be released after the double-blind review.



Paperid:13 Oral
Authors:Zhenxi Song,Ruihan Qin,Huixia Ren,Zhen Liang,Yi Guo,Min zhang,Zhiguo Zhang
Abstract:
Cross-center data heterogeneity and annotation unreliability significantly challenge the intelligent diagnosis of diseases using brain signals. A notable example is the EEG-based diagnosis of neurodegenerative diseases, which features subtler abnormal neural dynamics typically observed in small-group settings. To advance this area, in this work, we introduce a transferable framework employing Manifold Attention and Confidence Stratification (MACS) to diagnose neurodegenerative disorders based on EEG signals sourced from four centers with unreliable annotations. The MACS framework’s effectiveness stems from these features: 1) The Augmentor generates various EEG-represented brain variants to enrich the data space; 2) The Switcher enhances the feature space for trusted samples and reduces overfitting on incorrectly labeled samples; 3) The Encoder uses the Riemannian manifold and Euclidean metrics to capture spatiotemporal variations and dynamic synchronization in EEG; 4) The Projector, equipped with dual heads, monitors consistency across multiple brain variants and ensures diagnostic accuracy; 5) The Stratifier adaptively stratifies learned samples by confidence levels throughout the training process; 6) Forward and backpropagation in MACS are constrained by confidence stratification to stabilize the learning system amid unreliable annotations. Our subject-independent experiments, conducted on both neurocognitive and movement disorders using cross-center corpora, have demonstrated superior performance compared to existing related algorithms. This work not only improves EEG-based diagnostics for cross-center and small-setting brain diseases but also offers insights into extending MACS techniques to other data analyses, tackling data heterogeneity and annotation unreliability in multimedia and multimodal content understanding. We have released our code here: https://anonymous.4open.science/r/EEG-Disease-MACS-0B4A.



Paperid:14 Oral
Authors:Haodong Hong,Sen Wang,Zi Huang,Qi Wu,Jiajun Liu
Abstract:
Real-world navigation often involves dealing with unexpected obstructions such as closed doors, moved objects, and unpredictable entities. However, mainstream Vision-and-Language Navigation (VLN) tasks typically assume instructions perfectly align with the fixed and predefined navigation graphs without any obstructions. This assumption overlooks potential discrepancies in actual navigation graphs and given instructions, which can cause major failures for both indoor and outdoor agents. To address this issue, we integrate diverse obstructions into the R2R dataset by modifying both the navigation graphs and visual observations, introducing an innovative dataset and task, R2R with UNexpected Obstructions (R2R-UNO). R2R-UNO contains various types and numbers of path obstructions to generate instruction-reality mismatches for VLN research. Experiments on R2R-UNO reveal that state-of-the-art VLN methods inevitably encounter significant challenges when facing such mismatches, indicating that they rigidly follow instructions rather than navigate adaptively. Therefore, we propose a novel method called ObVLN (Obstructed VLN), which includes a curriculum training strategy and virtual graph construction to help agents effectively adapt to obstructed environments. Empirical results show that ObVLN not only maintains robust performance in unobstructed scenarios but also achieves a substantial performance advantage with unexpected obstructions. The source code is available at \url{https://anonymous.4open.science/r/ObstructedVLN-D579}.



Paperid:15 Oral
Authors:Wenjie Zheng,Jianfei Yu,Rui Xia
Abstract:
Multimodal Multi-Label Emotion Recognition (MMER) aims to identify one or more emotion categories expressed by an utterance of a speaker. Despite obtaining promising results, previous studies on MMER represent each emotion category using a one-hot vector and ignore the intrinsic relations between emotions. Moreover, existing works mainly learn the unimodal representation based on the multimodal supervision signal of a single sample, failing to explicitly capture the unique emotional state of each modality as well as its emotional correlation between samples. To overcome these issues, we propose a $\textbf{Uni}$modal $\textbf{V}$alence-$\textbf{A}$rousal driven contrastive learning framework (UniVA) for the MMER task. Specifically, we adopt the valence-arousal (VA) space to represent each emotion category and regard the emotion correlation in the VA space as priors to learn the emotion category representation. Moreover, we employ pre-trained unimodal VA models to obtain the VA scores for each modality of the training samples, and then leverage the VA scores to construct positive and negative samples, followed by applying supervised contrastive learning to learn the VA-aware unimodal representations for multi-label emotion prediction. Experimental results on two benchmark datasets MOSEI and M$^3$ED show that the proposed UniVA framework consistently outperforms a number of existing methods for the MMER task.
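A hedged sketch of VA-driven supervised contrastive learning: samples whose valence-arousal scores lie close together are treated as positives in a SupCon-style loss. The distance threshold and the exact positive-pair construction below are illustrative assumptions rather than UniVA's published formulation.

```python
import torch
import torch.nn.functional as F

def va_supcon_loss(feats: torch.Tensor, va: torch.Tensor, tau: float = 0.1, radius: float = 0.2):
    """feats: (N, D) unimodal representations; va: (N, 2) valence-arousal scores.
    Pairs whose VA distance is below `radius` are positives; self-pairs are excluded."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / tau                                  # (N, N) similarity logits
    eye = torch.eye(len(feats), device=feats.device)
    pos = (torch.cdist(va, va) < radius).float() * (1 - eye)       # positive-pair mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye.bool(), float('-inf')),
                                     dim=1, keepdim=True)
    denom = pos.sum(1).clamp(min=1)                                # avoid division by zero
    return -((pos * log_prob).sum(1) / denom).mean()

loss = va_supcon_loss(torch.randn(16, 128), torch.rand(16, 2))
```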



Paperid:16 Oral
Authors:Xin Li,Shangfei Wang,Xuandong Huang
Abstract:
With the popularity and advancement of the Internet and video-sharing platforms, video affective content analysis has been greatly developed. Nevertheless, existing methods often utilize simple models to extract semantic information. This might not capture comprehensive emotional cues in videos. In addition, these methods tend to overlook the presence of substantial irrelevant information in videos, as well as the uneven importance of modalities for emotional tasks. This could result in noise from both temporal fragments and modalities, thus diminishing the capability of the model to identify crucial temporal fragments and recognize emotions. To tackle the above issues, in this paper, we propose a Temporal Enhancement (TE) method. Specifically, we employ three encoders for extracting features at various levels and sample features to enhance temporal data, thereby enriching video representation and improving the model's robustness to noise. Subsequently, we design a cross-modal temporal enhancement module to enhance temporal information for every modal feature. This module interacts with multiple modalities at once to emphasize critical temporal fragments while suppressing irrelevant ones. The experimental results on four benchmark datasets show that the proposed temporal enhancement method achieves state-of-the-art performance in video affective content analysis. Moreover, the effectiveness of each module is confirmed through ablation experiments.



Paperid:17 Oral
Authors:Hu Lin,Chengjiang Long,Yifeng Fei,qianchen xia,Erwei Yin,Baocai Yin,Xin Yang
Abstract:
Camera relocalization is the task of estimating camera pose within a known scene. It has important applications in the fields of Virtual Reality (VR), Augmented Reality (AR), robotics, and more within the domain of computer vision. Learning-based camera relocalizers have demonstrated leading pose accuracy, yet all current methods invariably utilize all the information within an image for pose estimation. This may offer robustness under challenging viewpoints but impacts the localization accuracy for viewpoints that are easier to localize. In this paper, we propose a method to gauge the credibility of image pose, enabling our approach to achieve more accurate localization on keyframes. Additionally, we have devised a keypoint selection method predicated on matching rate. Furthermore, we have developed a keypoint evaluation technique based on reprojection error, which estimates the scene coordinates for points within the scene that truly warrant attention, thereby enhancing the localization performance for keyframes. We also introduce a gated camera pose estimation strategy, employing an updated keypoint-based network for keyframes with higher credibility and a more robust network for difficult viewpoints. By adopting an effective curriculum learning scheme, we have achieved higher accuracy within a training span of just 20 minutes. Our method's superior performance is validated through rigorous experimentation. The code will be released.
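The keypoint evaluation based on reprojection error can be illustrated with a standard pinhole projection check: predicted scene coordinates are projected with the estimated pose and compared against the observed 2D keypoints. The sketch below is generic multiple-view geometry, not the paper's specific evaluation pipeline.

```python
import numpy as np

def reprojection_error(scene_pts, keypoints_2d, K, R, t):
    """Projects 3D scene coordinates with pose (R, t) and intrinsics K, and returns the
    per-keypoint pixel error against the observed 2D keypoints."""
    cam = R @ scene_pts.T + t.reshape(3, 1)        # (3, N) points in the camera frame
    proj = K @ cam
    proj = (proj[:2] / proj[2:]).T                 # (N, 2) pixel coordinates
    return np.linalg.norm(proj - keypoints_2d, axis=1)

# Toy usage: identity pose, simple intrinsics, points in front of the camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts3d = np.array([[0.1, 0.0, 2.0], [0.0, 0.2, 3.0]])
obs2d = np.array([[345.0, 240.0], [320.0, 273.0]])
errors = reprojection_error(pts3d, obs2d, K, np.eye(3), np.zeros(3))
```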



Paperid:18 Oral
Authors:Zhihong Zhu,Xuxin Cheng,Zhaorun Chen,Yuyan Chen,Yunyan Zhang,Xian Wu,Yefeng Zheng,Bowen Xing
Abstract:
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse modalities, which has received widespread attention in dialogue systems. Despite the promising advancements in complex fusion mechanisms or architecture designs, challenges remain due to: (1) various noise and redundancy in both visual and audio modalities and (2) long-tailed distributions of intent categories. In this paper, to tackle the above two issues, we propose InMu-Net, a simple yet effective framework for MID from the Information bottleneck and Multi-sensory processing perspective. Our contributions lie in three aspects. First, we devise a denoising bottleneck module to filter out the intent-irrelevant information in the fused feature; second, we introduce a saliency preservation loss to prevent the dropping of intent-relevant information; third, kurtosis regulation is introduced to maintain representation smoothness during the filtering process, mitigating the adverse impact of the long-tailed distribution. Comprehensive experiments on two MID benchmark datasets demonstrate the effectiveness of InMu-Net and its vital components. Impressively, a series of analyses reveal our denoising potential and robustness in low-resource, modality corruption, cross-architecture and cross-task scenarios.



Paperid:19 Oral
Authors:Jiao PAN,Liang Li,Hiroshi Yamaguchi,Kyoko Hasegawa,Fadjar Ibnu Thufail,Brahmantara,Xiaojuan Ban,Satoshi Tanaka
Abstract:
Relief-type cultural heritage objects are commonly found at historical sites but often manifest with varying degrees of damage and deterioration. The traditional process of reconstructing these reliefs is laborious and requires extensive manual intervention and specialized archaeological knowledge. By utilizing a single old photo containing pre-damage information of a given relief, monocular depth estimation can be used to reconstruct 3D digital models. However, extracting depth variations along the edges is challenging in the relief scenario due to the heavy compression of the depth values, which results in low-curvature edges. This paper proposes an innovative solution that leverages a multi-task neural network to enhance the depth estimation task by integrating the edge detection and semantic segmentation tasks. We redefine edge detection of relief data as a multi-class classification task rather than a typical binary classification task. In this paper, an edge matching module that performs this novel task is proposed to refine depth estimations specifically for edge regions. The proposed approach achieves better depth estimation results with finer details along the edge region. Additionally, the semantic and edge outputs provide a comprehensive reference for multi-modal understanding and analysis. This paper not only advances computer vision tasks but also provides effective technical support for the protection of relief-type cultural heritage objects.



Paperid:20 Oral
Authors:Cheng Ye,Weidong Chen,Jingyu Li,Lei Zhang,Zhendong Mao
Abstract:
Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. The essence of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during caption generation, which is neglected by traditional video captioning. Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with the video features to guide the emotional caption generation, which neglects two characteristics of the EVC task. First, these methods neglect the dynamic, subtle changes in the intrinsic emotions of the video, which makes it difficult to meet the needs of common scenes with diverse and changeable emotions. Second, as these methods incorporate emotional cues into each step, the guidance role of emotion is overemphasized, which causes factual content to be more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions through collaborative learning. The two paths promote each other and significantly improve the generation performance. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects the emotional cues to be re-composed at each stage and re-composes them to achieve emotion evolution by dynamically enhancing or suppressing the semantics of subspaces at different granularities. Besides, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion-adaptive decoder, which first estimates emotion intensity via the alignment of emotional features and historical caption features at each generation step, and then adaptively incorporates emotional guidance into the caption generation based on that intensity. Thus, our method can generate emotion-related words at the necessary time steps, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module.



Paperid:21 Oral
Authors:Ziyan Li,Jianfei Yu,Jia Yang,Wenya Wang,Li Yang,Rui Xia
Abstract:
As an important task in multimodal information extraction, Multimodal Named Entity Recognition (MNER) has recently attracted considerable attention. One key challenge of MNER lies in the lack of sufficient fine-grained annotated data, especially in low-resource scenarios. Although data augmentation is a widely used technique to tackle the above issue, it is challenging to simultaneously generate synthetic text-image pairs and their corresponding high-quality entity annotations. In this work, we propose a novel Generative Multimodal Data Augmentation (GMDA) framework for MNER, which contains two stages: Multimodal Text Generation and Multimodal Image Generation. Specifically, we first transform each annotated sentence into a linearized labeled sequence, and then train a Label-aware Multimodal Large Language Model (LMLLM) to generate the labeled sequence based on a label-aware prompt and its associated image. After using the trained LMLLM to generate synthetic labeled sentences, we further employ a Stable Diffusion model to generate the synthetic images that are semantically related to these sentences. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed GMDA framework, which consistently boosts the performance of several competitive methods for two subtasks of MNER in both full-supervision and low-resource settings.



Paperid:22 Oral
Authors:Rishikesh Devanathan,APOORVA SINGH,A.S. Poornash,Sriparna Saha
Abstract:
Complaints are pivotal expressions within e-commerce communication, yet the intricate nuances of human interaction present formidable challenges for AI agents to grasp comprehensively. While recent attention has been drawn to analyzing complaints within a multimodal context, relying solely on text and images is insufficient for organizations. The true value lies in the ability to pinpoint complaints within the intricate structures of discourse, scrutinizing them at a granular aspect level. Our research delves into the discourse structure of e-commerce video-based product reviews, pioneering a novel task we term Aspect-Level Complaint Detection from Discourse (ACDD). Embedded in a multimodal framework, this task entails identifying aspect categories and assigning complaint/non-complaint labels at a nuanced aspect level. To facilitate this endeavour, we have curated a unique multimodal product review dataset, meticulously annotated at the utterance level with aspect categories and associated complaint labels. To support this undertaking, we introduce a Multimodal Aspect-Aware Complaint Analysis (MAACA) model that incorporates a novel pre-training strategy and a global feature fusion technique across the three modalities. Additionally, the proposed framework leverages a moment retrieval step to identify the relevant portion of the clip, crucial for accurately detecting the fine-grained aspect categories and conducting aspect-level complaint detection. Extensive experiments conducted on the proposed dataset showcase that our framework outperforms unimodal and bimodal baselines, offering valuable insights into the application of video-audio-text representation learning frameworks for downstream tasks.



Paperid:23 Oral
Authors:Yuxin Hong,Xiao Zhang,Xin Zhang,Joey Tianyi Zhou
Abstract:
In the medical field, managing high-dimensional massive medical imaging data and performing reliable medical analysis from it is a critical challenge, especially in resource-limited environments such as remote medical facilities and mobile devices. This necessitates effective dataset compression techniques to reduce storage, transmission, and computational cost. However, existing coreset selection methods are primarily designed for natural image datasets, and exhibit doubtful effectiveness when applied to medical image datasets due to challenges such as intra-class variation and inter-class similarity. In this paper, we propose a novel coreset selection strategy termed Evolution-aware VAriance (EVA), which captures the evolutionary process of model training through a dual-window approach and reflects the fluctuation of sample importance more precisely through variance measurement. Extensive experiments on medical image datasets demonstrate the effectiveness of our strategy over previous SOTA methods, especially at high compression rates. EVA achieves 98.27% accuracy with only 10% of the training data, compared to 97.20% for the full training set. None of the compared baseline methods can exceed Random at a 5% selection rate, while EVA outperforms Random by 5.61%, showcasing its potential for efficient medical image analysis.
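As an illustrative reading of "evolution-aware variance", the sketch below scores each sample by the variance of its recorded training loss inside two windows of the training history (the dual-window idea) and keeps the highest-scoring fraction. The importance signal, window choice, and scoring rule are assumptions; EVA's actual formula may differ.

```python
import numpy as np

def eva_scores(loss_history: np.ndarray, early_window: slice, late_window: slice) -> np.ndarray:
    """loss_history: (num_epochs, num_samples) per-sample losses recorded during training.
    Scores each sample by how much its difficulty fluctuates as the model evolves."""
    early_var = loss_history[early_window].var(axis=0)
    late_var = loss_history[late_window].var(axis=0)
    return early_var + late_var

def select_coreset(scores: np.ndarray, ratio: float = 0.1) -> np.ndarray:
    """Keeps the `ratio` fraction of samples with the highest scores."""
    k = int(len(scores) * ratio)
    return np.argsort(scores)[-k:]

# Toy usage: 30 recorded epochs, 1000 training samples, 10% coreset.
history = np.random.rand(30, 1000)
coreset_idx = select_coreset(eva_scores(history, slice(0, 10), slice(20, 30)))
```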



Paperid:24 Oral
Authors:Mengze Li,Kairong Han,Jiahe Xu,Yueying Li,Tao Wu,Zhou Zhao,Jiaxu Miao,Shengyu Zhang,Jingyuan Chen
Abstract:
Hypothesis inference, a sophisticated cognitive process that allows humans to construct plausible explanations for incomplete observations, is paramount to our ability to make sense of the world around us. Despite the universality of this skill, it remains under-explored within the context of multi-modal AI, which necessitates analyzing observations, recalling information in the mind, and generating explanations. In this work, we propose the Cross-modal Observation hypothesIs iNference task (COIN). Given a textual description of a partially observed event, COIN strives to recall the most probable event from the visual mind (video pool), and infer the subsequent action flow connecting the visual mind event and the observed textual event. To advance the development of this field, we propose a large-scale text-video dataset, Tex-COIN, that contains 39,796 meticulously annotated hypothesis inference examples and auxiliary commonsense knowledge (appearance, clothing, action, etc.) for key video characters. Based on the proposed Tex-COIN dataset, we design a strong baseline, COINNet, which features two perspectives: 1) aligning temporally displaced textual observations with target videos via transformer-based multi-task learning, and 2) inferring the action flow with non-parametric graph-based inference grounded in graph theory. Extensive experiments on the Tex-COIN dataset validate the effectiveness of our COINNet by significantly outperforming the state of the art.



Paperid:25 Oral
Authors:Ruohao Guo,Liao Qu,Dantong Niu,Yanyu Qi,Wenzhen Yue,Ji Shi,Bowei Xing,Xianghua Ying
Abstract:
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: \textbf{open-vocabulary audio-visual semantic segmentation}, extending AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%.
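The open-vocabulary classification module described above can be illustrated by matching mask embeddings of segmented sounding objects against text embeddings of arbitrary category names from a pre-trained vision-language model. The sketch assumes such embeddings are already computed and does not reproduce OV-AVSS's exact classification head.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(mask_feats: torch.Tensor, text_embeds: torch.Tensor, tau: float = 0.07):
    """mask_feats: (M, D) visual embeddings of segmented sounding objects;
    text_embeds: (C, D) embeddings of category names (CLIP-style). New categories can be
    added at test time simply by encoding new names; no retraining is needed."""
    mask_feats = F.normalize(mask_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = mask_feats @ text_embeds.t() / tau      # (M, C) similarity scores
    return logits.argmax(dim=-1)                     # predicted category index per mask

pred = open_vocab_classify(torch.randn(5, 512), torch.randn(20, 512))
```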



Paperid:26 Oral
Authors:Alejandro Galán-Cuenca,Jose J. Valero-Mas,Juan C. Martinez-Sevilla,Antonio Hidalgo-Centeno,Antonio Pertusa,Jorge Calvo-Zaragoza
Abstract:
Multimodal audio-image music transcription has been recently posed as a means of retrieving a digital score representation by leveraging the individual estimations from Automatic Music Transcription (AMT)---acoustic recordings---and Optical Music Recognition (OMR)---image scores---systems. Nevertheless, while proven to outperform single-modality recognition rates, this approach has been exclusively validated under controlled scenarios---monotimbral and monophonic synthetic data---mainly due to a lack of collections with symbolic score-level annotations for both recordings and graphical sheets. To promote research on this topic, this work presents the $\textit{Multimodal mUSic Collection for Automatic Transcription}$ (MUSCAT) assortment of acoustic recordings, image sheets, and their score-level annotations in several notation formats. This dataset comprises almost 80 hours of real recordings with varied instrumentation and polyphony degrees---from piano to orchestral music---1251 scanned sheets, and 880 symbolic scores from 37 composers, which may also be used in other tasks involving metadata such as instrument identification or composer recognition. A fragmented subset of this collection exclusively focused on acoustic data for score-level AMT---the $\textit{MUSic Collection for aUtomatic Transcription - fragmented Subset}$ (MUSCUTS) assortment---is also presented together with a baseline experimentation, concluding the need to foster research on this field with real recordings. Finally, a web-based service is also provided to increase the size of the collections collaboratively.



Paperid:27 Oral
Authors:Xueyuan Xu,Li Zhuo,Jinxin Lu,Xia Wu
Abstract:
Due to the small size of valid samples, multi-source EEG features with high dimensionality can easily cause problems such as overfitting and poor real-time performance of the emotion recognition classifier. Feature selection has been demonstrated as an effective means to solve these problems. Current EEG feature selection research assumes that all dimensions of emotional labels are complete. However, owing to the open acquisition environment, subjective variability, and border ambiguity of individual perceptions of emotion, the training data in the practical application often includes missing information, i.e., multi-dimensional emotional labels of several instances are incomplete. The aforementioned incomplete information directly restricts the accurate construction of the EEG feature selection model for multi-dimensional emotion recognition. To wrestle with the aforementioned problem, we propose a novel EEG feature selection model with weighted self-expression learning (WSEL). The model utilizes self-representation learning and least squares regression to reconstruct the label space through the second-order correlation and higher-order correlation within the multi-dimensional emotional labels and simultaneously realize the EEG feature subset selection under the incomplete information. We have utilized two multimedia-induced emotion datasets with EEG recordings, DREAMER and DEAP, to confirm the effectiveness of WSEL in the partial multi-dimensional emotional feature selection challenge. Compared to nine state-of-the-art feature selection approaches, the experimental results demonstrate that the EEG feature subsets chosen by WSEL can achieve optimal performance in terms of six performance metrics.



Paperid:28 Oral
Authors:Wan Zhang,Sheng Tang,Jiawei Wei,Ruize Zhang,Juan Cao
Abstract:
In recent years, diffusion models have achieved tremendous success in the field of video generation, with controllable video generation receiving significant attention. However, existing control methods still face two limitations: firstly, control conditions (such as depth maps, 3D mesh) are difficult for ordinary users to obtain directly; secondly, it’s challenging to drive multiple objects through complex motions with multiple trajectories simultaneously. In this paper, we introduce DragEntity, a video generation model that utilizes entity representation for controlling the motion of multiple objects. In comparison to previous methods, DragEntity offers two main advantages: 1) Trajectory-based methods are more user-friendly for interaction. Users only need to draw trajectories during the interaction to generate videos. 2) We use entity representation to represent any object in the image, and multiple objects can maintain relative spatial relationships. Therefore, we allow multiple trajectories to control multiple objects in the image with different levels of complexity simultaneously. Our experiments validate the effectiveness of DragEntity, demonstrating its superior performance in fine-grained control in video generation.



Paperid:29 Oral
Authors:Hongjie Wu,Linchao He,Mingqin Zhang,Dongdong Chen,Kunming Luo,Mengting Luo,Ji-Zhe Zhou,Hu Chen,Jiancheng Lv
Abstract:
Diffusion models have demonstrated remarkable efficacy in generating high-quality samples. Existing diffusion-based image restoration algorithms exploit pre-trained diffusion models to leverage data priors, yet they still preserve elements inherited from the unconditional generation paradigm. These strategies initiate the denoising process with pure white noise and incorporate random noise at each generative step, leading to over-smoothed results. In this paper, we introduce a refined paradigm for diffusion-based image restoration. Specifically, we opt for a sample consistent with the measurement identity at each generative step, exploiting the sampling selection as an avenue for output stability and enhancement. Besides, we start the restoration process with an initialization combined with the measurement signal, providing supplementary information to better align the generative process. Extensive experimental results and analyses validate that our proposed method significantly enhances image restoration performance while consuming negligible additional computational resources.
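One way to read the sampling-selection idea above is to pick, at each generative step, the candidate clean-image estimate most consistent with the measurement under the known degradation operator. The sketch below encodes that reading with an assumed average-pooling degradation and is not the paper's exact selection rule.

```python
import torch

def select_consistent_sample(candidates: torch.Tensor, forward_op, y: torch.Tensor) -> torch.Tensor:
    """candidates: (K, C, H, W) candidate estimates at the current step; forward_op: the known
    degradation operator A; y: the observed measurement. Returns the candidate whose
    re-degraded version best matches the measurement."""
    residuals = torch.stack([((forward_op(x) - y) ** 2).mean() for x in candidates])
    return candidates[residuals.argmin()]

# Toy usage: 2x average-pooling stands in for a real degradation operator.
A = lambda x: torch.nn.functional.avg_pool2d(x.unsqueeze(0), 2).squeeze(0)
cands = torch.rand(4, 3, 64, 64)
y_obs = A(torch.rand(3, 64, 64))
best = select_consistent_sample(cands, A, y_obs)
```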



Paperid:30 Oral
Authors:Yuqing Wang,Lei Meng,Haokai Ma,Yuqing Wang,Haibei HUANG,Xiangxu Meng
Abstract:
Classifying videos differs from classifying images in the need to capture information about what has happened, instead of what is in the frames. Conventional methods typically follow the data-driven approach, which uses transformer-based attention models to extract and aggregate the features of video frames as the representation of the entire video. However, this approach tends to extract the object information of frames and may face difficulties in classifying classes describing events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for the spatio-temporal modeling of both the in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which simultaneously builds the in-frame causal graph with the background and foreground information and models their cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module is introduced to eliminate the spurious correlations in contexts and objects via the back- and front-door interventions, respectively. The former involves visual context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets verify that ECRL better captures cross-frame correlations to describe videos with event-level features. The source code is provided in the supplementary material.



Paperid:31 Oral
Authors:Ruiqi Wang,Jinyang Huang,Jie Zhang,Xin Liu,Xiang Zhang,Zhi Liu,Peng Zhao,Sigui Chen,Xiao Sun
Abstract:
Depression is a prevalent mental health disorder that significantly impacts individuals' lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, many end-to-end deep learning methods have leveraged facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce more computational cost due to the selection of CNN-based backbones and redundant facial features. To address the above limitations, by considering the timing correlation of facial expressions, we propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. By harnessing the bidirectional nature and proficiently addressing long-term dependencies, the Facial Motion Modeling Module (FMMM) is designed in FacialPulse to fully capture temporal features. Since the proposed FMMM has parallel processing capabilities and a gate mechanism to mitigate gradient vanishing, this module can also significantly boost the training speed. Besides, to effectively use facial landmarks to replace original images and decrease information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors and further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and the MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse in recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 22% and the recognition speed increased by 100% compared to state-of-the-art baselines.
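To make the landmark-sequence modeling concrete, the sketch below uses a bidirectional GRU over per-frame landmark vectors as an illustrative stand-in for the FMMM; the hidden size, temporal pooling, and regression head are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LandmarkBiGRU(nn.Module):
    """Bidirectional recurrent model over per-frame facial-landmark vectors that
    regresses a depression score from the pooled temporal features."""
    def __init__(self, landmark_dim: int = 68 * 2, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(landmark_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(hidden * 2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, landmark_dim)
        out, _ = self.rnn(x)
        return self.head(out.mean(dim=1)).squeeze(-1)     # temporal average pooling

model = LandmarkBiGRU()
score = model(torch.randn(2, 300, 136))                   # 300 frames, 68 (x, y) landmarks
```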



Paperid:32 Oral
Authors:Junjie Shi,Caozhi Shang,Zhaobin Sun,Li Yu,Xin Yang,Zengqiang Yan
Abstract:
Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which to design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code will be available upon acceptance.
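The pixel-wise self-distillation component can be illustrated with a temperature-softened KL term between the per-pixel predictions of two branches; treating the stronger branch as a frozen teacher and the temperature value are assumptions for illustration, not PASSION's exact balancing scheme.

```python
import torch
import torch.nn.functional as F

def pixelwise_self_distillation(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                                T: float = 2.0) -> torch.Tensor:
    """KL divergence between softened per-pixel class distributions of a weaker
    (missing-modality) branch and a stronger (more complete) branch.
    Logits have shape (B, C, H, W); the teacher is detached from the graph."""
    s = F.log_softmax(student_logits / T, dim=1)
    t = F.softmax(teacher_logits.detach() / T, dim=1)
    return F.kl_div(s, t, reduction='batchmean') * (T * T)

loss = pixelwise_self_distillation(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32))
```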



Paperid:33 Oral
Authors:Han Wang,Rui Yang Tan,Usman Naseem,Roy Ka-Wei Lee
Abstract:
Hate speech is a pressing issue in modern society, with significant repercussions both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents $\textsf{MultiHateClip}$, a novel multilingual dataset curated through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, encompassing content in both English and Chinese languages. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as $\textit{VLM}$ and $\textit{GPT-4V}$, on $\textsf{MultiHateClip}$ highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. $\textsf{MultiHateClip}$ serves as a foundational step towards developing more effective hateful video detection solutions, emphasizing the importance of a multimodal and culturally sensitive approach in the ongoing fight against online hate speech.



Paperid:34 Oral
Authors:Huilin Tian,Jingke Meng,Wei-Shi Zheng,Yuan-Ming Li,Junkai Yan,Yunong Zhang
Abstract:
Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction was completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglect the crucial role of the agent’s spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on the grounding of outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. In this work, we introduce a novel framework, Locating before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform spatial localization before planning a decision action based on the corresponding guidance, which comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, we propose to learn a position predictor that measures how far the agent is from the next intersection to reflect its position, which is achieved by the BAL module. After the locating process, we propose the SAP module to incorporate spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms the SOTA methods.



Paperid:35 Oral
Authors:Qiwen Zhu,Yanjie Wang,Shilv Cai,Liqun Chen,Jiahuan Zhou,Luxin Yan,Sheng Zhong,Xu Zou
Abstract:
In this paper, we introduce a novel approach to single-image super-resolution (SISR) that balances perceptual quality and distortion through multi-objective optimization (MOO). Traditional pixel-based distortion metrics like PSNR and SSIM often fail to align with human perceptual quality, resulting in blurry outputs despite high scores. To address this, we propose the Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework, which dynamically adjusts loss weights during training. This reduces the need for manual hyperparameter tuning and lessens computational demands compared to AutoML. Our method conceptualizes the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions, optimized to achieve an optimal perception-distortion Pareto frontier. Extensive experiments demonstrate that MOBOSR surpasses current state-of-the-art methods in both perception and distortion, significantly advancing the perception-distortion Pareto frontier. Our work lays a foundation for future exploration of the balance between perceptual quality and fidelity in image restoration tasks. Source codes and pretrained models are available at: https://github.com/ZhuKeven/MOBOSR.
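To make the black-box view concrete, the sketch below scalarizes a distortion term and a perceptual term with externally proposed weights and keeps a Pareto front of (distortion, perception) outcomes. The specific losses, metrics, and the Bayesian optimizer that would propose the weights are assumptions, not MOBOSR's implementation.

```python
import torch
import torch.nn.functional as F

def combined_sr_loss(sr, hr, perceptual_fn, w_pixel: float, w_percep: float):
    """Scalarized training loss: pixel distortion (L1) plus a perceptual term,
    weighted by values proposed by an outer black-box optimizer."""
    return w_pixel * F.l1_loss(sr, hr) + w_percep * perceptual_fn(sr, hr)

def update_pareto_front(front: list, candidate: tuple) -> list:
    """candidate = (distortion, perceptual_score); lower is better for both.
    Keeps only non-dominated trade-offs observed so far."""
    dominated = lambda a, b: b[0] <= a[0] and b[1] <= a[1] and b != a
    front = [p for p in front if not dominated(p, candidate)]
    if not any(dominated(candidate, p) for p in front):
        front.append(candidate)
    return front

# Toy usage: evaluate a weight setting, then record its metric trade-off.
loss = combined_sr_loss(torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8),
                        lambda a, b: F.mse_loss(a, b), w_pixel=1.0, w_percep=0.05)
front = []
for d, p in [(0.9, 0.3), (0.5, 0.5), (0.6, 0.6), (0.4, 0.8)]:
    front = update_pareto_front(front, (d, p))
# front -> [(0.9, 0.3), (0.5, 0.5), (0.4, 0.8)]
```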



Paperid:36 Oral
Authors:Jingjie Zeng,Zhihao Yang,Qi Yang,Liang Yang,Hongfei Lin
Abstract:
By integrating various modules with the Vision Transformer (ViT), we facilitate an interpretation of image processing across each layer and attention head. This method allows us to explore the connections both within and across the layers, enabling an analysis of how images are processed at different layers. We conduct an analysis of the contributions from each layer and attention head, shedding light on the intricate interactions and functionalities within the model's layers. This in-depth exploration not only highlights the visual cues between layers but also examines their capacity to navigate the transition from abstract concepts to tangible objects. It unveils the model's mechanism for building an understanding of images, providing a strategy for adjusting attention heads between layers, thus enabling targeted pruning and enhancement of performance for specific tasks. Our research indicates that achieving a scalable understanding of transformer models is within reach, offering ways for the refinement and enhancement of such models.



Paperid:37 Oral
Authors:Kento Shigyo,Yi-Fan Cao,Kentaro Takahira,Mingming Fan,Huamin Qu
Abstract:
The growing prevalence of psychological disorders underscores the critical importance of mental health research in today's society. In psychotherapy, particularly Acceptance and Commitment Therapy (ACT), cognitive exercises employing mental imagery are used to manage negative thoughts. However, the challenge of maintaining vivid imagery diminishes their therapeutic effectiveness. Virtual reality (VR) offers untapped potential for increasing engagement and therapeutic efficacy. However, there is still a gap in exploration regarding how to effectively leverage the potential of VR to enhance traditional cognitive exercises with mental imagery. This study investigates effective HCI design and the comparative efficacy of a VR-mediated exercise for promoting cognitive defusion to address negative thoughts grounded in ACT. Using a co-design approach with clinicians and potential users (postgraduate students), we developed a VR system that materializes negative thoughts into tangible objects. This allows users to visually modify and transpose these objects onto a surface, facilitating mental detachment from negative thoughts. In an evaluation study with 20 non-clinical participants, divided into VR and mental imagery groups, we assessed the impact of the cognitive defusion exercise on their perception of negative thoughts and psychological measures using standardized questionnaires. Results show improvement in both groups, with significant enhancements in negative thought perception and mental detachment from negative thoughts exclusively in the VR group, whereas the mental imagery group did not demonstrate significant changes. Interviews emphasize the VR system's capability to present vivid visualizations of negative thoughts effortlessly, highlighting its effectiveness and engagement in facilitating cognitive exercises in psychotherapy.



Paperid:38 Oral
Authors:Wenqiang Xu,Wenrui Dai,Ziyang Zheng,Chenglin Li,Junni Zou,Hongkai Xiong
Abstract:
Point cloud upsampling is crucial for 3D reconstruction, with recent research significantly benefitting from the advances in deep learning technologies. The majority of existing methods, which focus on a sequence of processes including feature extraction, augmentation, and the reconstruction of coordinates, encounter significant challenges in interpreting the geometric attributes they uncover, particularly with respect to the intricacies of transitioning feature dimensionality. In this paper, we delve deeper into modeling Partial Differential Equations (PDEs) specifically tailored for the inverse heat dissipation process in dense point clouds. Our goal is to detect gradients within the dense point cloud data distribution and refine the accuracy of interpolated points’ positions along with their complex geometric nuances through a systematic iterative approximation method. Simultaneously, we adopt multivectors from geometric algebra as the primary tool for representing the geometric characteristics of point clouds, moving beyond the conventional vector space representations. The use of geometric products of multivectors enables us to capture the complex relationships between scalars, vectors, and their components more effectively. This methodology not only offers a robust framework for depicting the geometric features of point clouds but also enhances our modeling capabilities for inverse heat dissipation PDEs. Through both qualitative and quantitative assessments, we demonstrate that our results significantly outperform existing state-of-the-art techniques in terms of widely recognized point cloud evaluation metrics and 3D visual reconstruction fidelity.



Paperid:39 Oral
Authors:Yangqin Jiang,Lianghao Xia,Wei Wei,Da Luo,Kangyi Lin,Chao Huang
Abstract:
The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniques to enhance recommender systems. However, these methods often rely on simplistic random augmentation or intuitive cross-view information, which can introduce irrelevant noise and fail to accurately align the multi-modal context with user-item interaction modeling. To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning. This integration facilitates better alignment between multi-modal feature information and collaborative relation modeling. Our approach leverages diffusion models’ generative capabilities to automatically generate a user-item graph that is aware of different modalities, facilitating the incorporation of useful multi-modal knowledge in modeling user-item interactions. We conduct extensive experiments on three public datasets, consistently demonstrating the superiority of our DiffMM over various competitive baselines.
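As a point of reference for the cross-modal contrastive learning paradigm mentioned above, the sketch below shows a generic InfoNCE-style objective that pulls together two views of the same user and pushes apart different users. Tensor names such as `modality_emb` and `collab_emb` are illustrative assumptions, not the authors' code.

```python
# Generic cross-view InfoNCE contrastive loss (sketch only; names and shapes
# are assumptions, not the DiffMM implementation).
import torch
import torch.nn.functional as F

def info_nce(modality_emb: torch.Tensor, collab_emb: torch.Tensor, tau: float = 0.2):
    """modality_emb, collab_emb: [num_users, dim] embeddings of the same users
    from a modality-aware view and a collaborative view."""
    a = F.normalize(modality_emb, dim=-1)
    b = F.normalize(collab_emb, dim=-1)
    logits = a @ b.t() / tau                       # pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)         # positives lie on the diagonal

loss = info_nce(torch.randn(256, 64), torch.randn(256, 64))
```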



Paperid:40 Oral
Authors:Wei Zhang
Abstract:
Automated diagnosis of depression is crucial for early detection and timely intervention. Previous research has largely concentrated on visual indicators, often neglecting the value of leveraging a variety of data types. Although some studies have attempted to employ multiple modalities, they typically fall short in investigating the complex dynamics between features from various modalities over time. To address this challenge, we present an innovative Multi-modal Dual-Attention Aggregation Architecture for Depression Recognition (MDDR). This framework capitalizes on multi-modal pre-trained features and introduces two attention aggregation mechanisms: the Feature Alignment and Aggregation (FAA) module and the Sequence Encoding and Aggregation (SEA) module. The FAA module is designed to dynamically evaluate the relevance of multi-modal features for each instance, facilitating a dynamic integration of these features over time. Following this, the SEA module determines the importance of the amalgamated features for each frame, ensuring that aggregation is conducted based on their significance, to extract the most relevant features for accurately diagnosing depression. Moreover, we propose a unique loss calculation method specifically designed for depression assessment, named DRLoss. Our approach, evaluated on the AVEC2013 and AVEC2014 depression audiovisual datasets, achieves unparalleled performance.
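To make the frame-level aggregation idea concrete, here is a minimal attention-pooling sketch in the spirit of the SEA module: each frame's fused feature is scored and the sequence is aggregated by the learned weights. The module name, dimensions, and scoring network are assumptions for illustration, not the MDDR architecture.

```python
# Sketch of frame-level attention pooling: score each frame's fused feature,
# then aggregate by the softmax-normalized weights. Illustrative only.
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                    nn.Linear(dim // 2, 1))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: [batch, num_frames, dim] multimodal features per frame
        weights = torch.softmax(self.scorer(fused), dim=1)  # [B, T, 1]
        return (weights * fused).sum(dim=1)                 # [B, dim]

pooled = FrameAttentionPool(128)(torch.randn(4, 300, 128))
```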



Paperid:41 Oral
Authors:Zhiqi Ge,Hongzhe Huang,Mingze Zhou,Juncheng Li,Guoming Wang,Siliang Tang,Yueting Zhuang
Abstract:
World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Evaluations on WorldNet directly demonstrate WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential in serving as a world simulator, helping multimodal agents generalize to unfamiliar domains by efficiently synthesising multimodal instruction instances, which prove to be as reliable as authentic data for fine-tuning purposes. The project is available on the anonymous website.



Paperid:42 Oral
Authors:Tan Yu,Jingjing Wang,Jiawen Wang,Jiamin Luo,Guodong Zhou
Abstract:
In the literature, existing studies on text-to-motion generation (TMG) routinely focus on exploring the objective alignment of text and motion, largely ignoring the subjective emotion information, especially the limb-level emotion information. With this in mind, this paper proposes a new Emotion-enriched Text-to-Motion Generation (ETMG) task, aiming to generate motions with subjective emotion information. Further, this paper believes that injecting emotions into limbs (named intra-limb emotion injection) and ensuring the coordination and coherence of emotional motions after injecting emotion information (named inter-limb emotion disturbance) are rather important and challenging in this ETMG task. To this end, this paper proposes an LLM-guided Limb-level Emotion Manipulating (${\rm L^{3}EM}$) approach to ETMG. Specifically, this approach designs an LLM-guided intra-limb emotion modeling block to inject emotion into limbs, followed by a graph-structured inter-limb relation modeling block to ensure the coordination and coherence of emotional motions. Particularly, this paper constructs a coarse-grained Emotional Text-to-Motion (EmotionalT2M) dataset and a fine-grained Limb-level Emotional Text-to-Motion (Limb-ET2M) dataset to justify the effectiveness of the proposed ${\rm L^{3}EM}$ approach. Detailed evaluation demonstrates the significant advantage of our ${\rm L^{3}EM}$ approach over the state-of-the-art baselines on ETMG. This justifies the importance of the limb-level emotion information for ETMG and the effectiveness of our ${\rm L^{3}EM}$ approach in coherently manipulating such information.



Paperid:43 Oral
Authors:Tao Tang,Longfei Gao,Guangrun Wang,Yixing Lao,Peng Chen,Hengshuang Zhao,Dayang Hao,Xiaodan Liang,Mathieu Salzmann,Kaicheng Yu
Abstract:
We introduce a new task, novel view synthesis for LiDAR sensors. While traditional model-based LiDAR simulators with style-transfer neural networks can be applied to render novel views, they fall short of producing accurate and realistic LiDAR patterns because the renderers rely on explicit 3D reconstruction and exploit game engines, which ignore important attributes of LiDAR points. We address this challenge by formulating, to the best of our knowledge, the first differentiable end-to-end LiDAR rendering framework, LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint learning of geometry and the attributes of 3D points. However, simply employing NeRF cannot achieve satisfactory results, as it only focuses on learning individual pixels while ignoring local information, especially in low-texture areas, resulting in poor geometry. To this end, we address this issue by introducing a structural regularization method to preserve local structural details. To evaluate the effectiveness of our approach, we establish an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains observations of objects from 9 categories seen from 360-degree viewpoints captured with multiple LiDAR sensors. Our extensive experiments on the scene-level KITTI-360 dataset and on our object-level NeRF-MVL show that our LiDAR-NeRF surpasses the model-based algorithms significantly.



Paperid:44 Oral
Authors:Puyi Wang,Wei Sun,Zicheng Zhang,Jun Jia,Yanwei Jiang,Zhichao Zhang,Xiongkuo Min,Guangtao Zhai
Abstract:
Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNN) or Transformers to learn the quality-aware feature representation, achieving commendable performance on natural scene images. However, when applied to AI-Generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs caused by the uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture of experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets, AIGCQA-20k and AGIQA-3k, show that MA-AGIQA achieves state-of-the-art performance and demonstrate its superior generalization capabilities in assessing the quality of AGIs. The code will be available.
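A minimal sketch of the mixture-of-experts idea referenced above: a gating network dynamically weights a semantic feature (e.g., from a multimodal model) against a quality-aware feature from a conventional IQA backbone before regressing a quality score. Shapes, names, and the two-expert simplification are assumptions for illustration, not the MA-AGIQA design.

```python
# Minimal two-expert gating sketch: dynamically weight a semantic feature
# against a quality-aware feature, then regress a quality score. Illustrative.
import torch
import torch.nn as nn

class TwoExpertGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)   # produces per-expert mixing weights
        self.head = nn.Linear(dim, 1)       # quality score regressor

    def forward(self, semantic_feat, quality_feat):
        g = torch.softmax(self.gate(torch.cat([semantic_feat, quality_feat], dim=-1)), dim=-1)
        fused = g[..., :1] * semantic_feat + g[..., 1:] * quality_feat
        return self.head(fused)

score = TwoExpertGate(256)(torch.randn(8, 256), torch.randn(8, 256))
```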



Paperid:45 Oral
Authors:Liu Mengzhen,Mengyu Wang,Henghui Ding,Yilong Xu,Yao Zhao,Yunchao Wei
Abstract:
Although the Segment Anything Model (SAM) has achieved impressive results in many segmentation tasks and benchmarks, its performance noticeably deteriorates when applied to high-resolution images for high-precision segmentation, limiting its use in many real-world applications. In this work, we explore transferring SAM into the domain of high-resolution images and propose Pi-SAM. Compared to the original SAM and its variants, Pi-SAM demonstrates the following superiorities. Firstly, Pi-SAM possesses a strong perception capability for the extremely fine details in high-resolution images, enabling it to generate high-precision segmentation masks. As a result, Pi-SAM significantly surpasses previous methods on four high-resolution datasets. Secondly, Pi-SAM supports more precise user interactions. In addition to the native promptable ability of SAM, Pi-SAM allows users to interactively refine the segmentation predictions simply by clicking, while the original SAM fails to achieve this on high-resolution images. Thirdly, building upon SAM, Pi-SAM freezes all its original parameters and introduces very few additional parameters and computational costs to achieve the above performance. This ensures highly efficient model fine-tuning while also retaining the powerful semantic information contained in the original SAM.



Paperid:46 Oral
Authors:Rintaro Yanagi,Ren Togo,Takahiro Ogawa,Miki Haseyama
Abstract:
Screening similar but non-target images in text-based image retrieval is crucial for pinpointing the user's desired images accurately. However, conventional methods mainly focus on enhancing text-image matching performance, often failing to identify images that exactly match the retrieval intention because of the query quality. User-provided queries frequently lack adequate information for screening similar but not target images, especially when the target database (DB) contains numerous similar images. Therefore, a novel approach is needed to extract valuable information from users for effective screening. In this paper, we propose a DB question generation (DQG) model to enhance exact cross-modal image retrieval performance. Our DQG model learns to generate effective questions that precisely screen similar but non-target images using DB contents information. By answering the questions generated from our model, users can reach their desired images by only answering the presented questions even within DBs with similar content. Experimental results on publicly available datasets show that our proposed approach can significantly improve exact cross-modal image retrieval performance. Code is available in the supplemental materials and will be publicly available.



Paperid:47 Oral
Authors:Xinfa Zhu,Wenjie Tian,Xinsheng Wang,Lei He,Yujia Xiao,Xi Wang,Xu Tan,sheng zhao,Lei Xie
Abstract:
Understanding the speaking style, such as the emotion of the interlocutor's speech, and responding with speech in an appropriate style is a natural occurrence in human conversations. Technically, however, existing research on speech synthesis and speaking style captioning typically proceeds independently. In this work, an innovative framework, referred to as UniStyle, is proposed to incorporate the capabilities of both speaking style captioning and style-controllable speech synthesis. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The role of the UniConnector is to bridge the gap between different modalities, namely speech audio and text descriptions. It enables the generation of text descriptions with speech as input and the creation of style representations from text descriptions for speech synthesis with the speech generator. Besides, to overcome the issue of data scarcity, we propose a two-stage and semi-supervised training strategy, which reduces data requirements while boosting performance. Extensive experiments conducted on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.



Paperid:48 Oral
Authors:Zhanyu Wang,Longyue Wang,Zhen Zhao,Minghao Wu,Chenyang Lyu,Huayang Li,Deng Cai,Luping Zhou,Shuming Shi,Zhaopeng Tu
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have constituted a significant leap forward in the field, particularly in the processing of videos, which encompasses inherent challenges such as spatiotemporal relationships. However, existing MLLMs are predominantly focused on the comprehension of video inputs, with limited capabilities in generating video content. In this paper, we present GPT4Video, a unified framework that seamlessly and lightly integrates with LLMs, visual feature extractors, and stable diffusion generative models for cohesive video understanding and generation. Moreover, we propose a text-only finetuning approach to equip models for instruction-following and safeguarding in multimodal conversations without requiring costly annotated video-based instructions. Additionally, we construct multi-turn and caption-interleaved datasets for finetuning and benchmarking MLLMs, which serve as solid resources for advancing this field. Through quantitative and qualitative assessments, GPT4Video demonstrates the following advantages: 1) The framework incorporates video generation ability without adding extra training parameters, ensuring seamless compatibility with various video generators. 2) The model achieves superior performances across a variety of benchmarks. For instance, it outperforms Valley by 11.8% on video question answering, and surpasses NExt-GPT by 2.3% on text-to-video generation. 3) As safety pioneers in open-source MLLMs, we developed finetuning and evaluation datasets, securing an F1 score exceeding 80% in blocking harmful content during understanding and generating videos. In general, GPT4Video shows potential to function as a real-life assistant, marked by its effectiveness, adaptability, and safety. We will open-source our code, data, and models.



Paperid:49 Oral
Authors:Jinbo Yan,Rui Peng,Luyang Tang,Ronggang Wang
Abstract:
Reconstructing dynamic scenes from video sequences is a highly promising task in the multimedia domain. While previous methods have made progress, they often struggle with slow rendering and managing temporal complexities such as significant motion and object appearance/disappearance. In this paper, we propose SaRO-GS as a novel dynamic scene representation capable of achieving real-time rendering while effectively handling temporal complexities in dynamic scenes. To address the issue of slow rendering speed, we adopt a Gaussian primitive-based representation and optimize the Gaussians in 4D space, which facilitates real-time rendering with the assistance of 3D Gaussian Splatting. Additionally, to handle temporally complex dynamic scenes, we introduce a Scale-aware Residual Field. This field considers the size information of each Gaussian primitive while encoding its residual feature and aligns with the self-splitting behavior of Gaussian primitives. Furthermore, we propose an Adaptive Optimization Schedule, which assigns different optimization strategies to Gaussian primitives based on their distinct temporal properties, thereby expediting the reconstruction of dynamic regions. Through evaluations on monocular and multi-view datasets, our method has demonstrated state-of-the-art performance.



Paperid:50 Oral
Authors:Yang Lu,junxianli,Zhitong Cui,Jiapeng Hu,Yanna Lin,Shijian Luo
Abstract:
Virtual reality (VR) is a revolutionary method of presenting data visualizations, which brings new possibilities for enhancing analytical activities. However, applying this method to visualize complex data flows remains largely underexplored, especially for Sankey diagrams, which have an advantageous capacity to represent trends in data flows. In this work, we explored a novel design for an immersive Sankey diagram system within VR environments, utilizing a three-dimensional visual design and several interaction techniques that leverage VR's spatial and immersive capabilities. Through two comparative user studies, we found that the VR Sankey diagram system improves task performance and engagement and reduces cognitive workload in complex data analysis. We contribute an interactive, immersive Sankey diagram system for VR environments, empirical evidence of its advantages, and design lessons for future immersive visualization tools.



Paperid:51 Oral
Authors:Qian Guo,Xinyan Liang,Yuhua Qian,Zhihua Cui,Jie Wen
Abstract:
In multi-modal classification tasks, a good fusion algorithm can effectively integrate and process multi-modal data, thereby significantly improving performance. Researchers often focus on the design of complex fusion operators and have proposed numerous fusion operators, while paying less attention to how fusion is used, specifically how features should be fused to better facilitate multi-modal classification tasks. In this article, we propose a progressive skip reasoning fusion network (PSRFN) to address this issue. Firstly, unlike most existing multi-modal fusion methods that only use one fusion operator in a single stage to fuse all view features, PSRFN utilizes the progressive skip reasoning (PSR) block to fuse all views with a fusion operator at each layer. Specifically, each PSR block utilizes all view features and the fused features from the previous layer to jointly obtain the fused features for the current layer. Secondly, each PSR block utilizes a dual-weighted fusion strategy with learnable parameters to adaptively allocate weights during the fusion process. The first level of weighting assigns weights to each view feature, while the second level assigns weights to the fused features from the previous layer and the fused features obtained from the first level of weighting in the current layer. This strategy ensures that the PSR block can dynamically adjust the weights based on the actual contribution of features. Finally, to enable the model to fully utilize feature information from different levels for feature fusion, skip connections are adopted between PSR blocks. Extensive experimental results on six real multi-modal datasets show that better usage of fusion operators can indeed improve performance.
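The dual-weighted, layer-wise fusion described above can be illustrated with a small sketch: per-view weights are learned at the first level, and a second weight balances the previous layer's fused feature against the newly fused one. All names, shapes, and the projection layer are assumptions, not the PSRFN code.

```python
# Sketch of one progressive fusion layer with two levels of learnable weights.
import torch
import torch.nn as nn

class DualWeightedFusionLayer(nn.Module):
    def __init__(self, num_views: int, dim: int):
        super().__init__()
        self.view_logits = nn.Parameter(torch.zeros(num_views))  # level 1: per-view weights
        self.mix_logits = nn.Parameter(torch.zeros(2))            # level 2: previous vs. current
        self.proj = nn.Linear(dim, dim)

    def forward(self, views, prev_fused):
        # views: list of [B, dim] view features; prev_fused: [B, dim] from the previous layer
        w = torch.softmax(self.view_logits, dim=0)
        fused_now = sum(wi * v for wi, v in zip(w, views))
        m = torch.softmax(self.mix_logits, dim=0)
        return self.proj(m[0] * prev_fused + m[1] * fused_now)

layer = DualWeightedFusionLayer(num_views=3, dim=64)
out = layer([torch.randn(8, 64) for _ in range(3)], torch.randn(8, 64))
```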



Paperid:52 Oral
Authors:Wenxuan Wang,Haonan Bai,Jen-tse Huang,Yuxuan WAN,Youliang Yuan,Haoyi Qiu,Nanyun Peng,Michael Lyu
Abstract:
Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias within image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel evaluation framework that can accurately, automatically and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender, race, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on the significant changes related to gender, race, and age. BiasPainter adopts a key insight that these characteristics should not be modified when subjected to neutral prompts. Built upon this design, BiasPainter can trigger the social bias and evaluate the fairness of image generation models. We use BiasPainter to evaluate six widely-used image generation models, such as stable diffusion and Midjourney. Experimental results show that BiasPainter can successfully trigger social bias in image generation models. According to our human evaluation, BiasPainter can achieve 90.8% accuracy on automatic bias detection, which is significantly higher than the results reported in previous work. All the code, data, and experimental results will be released to facilitate future research.



Paperid:53 Oral
Authors:Yili Jin,Duan Xize,Fangxin Wang,Xue Liu
Abstract:
Virtual Reality (VR) headsets have become increasingly popular for remote collaboration, but video conferencing poses challenges when the user's face is covered by the headset. Existing solutions have limitations in terms of accessibility. In this paper, we propose HeadSetOff, a novel system that achieves photorealistic video conferencing on economical VR headsets by leveraging voice-driven face reconstruction. HeadSetOff consists of three main components: a multimodal attention-based predictor, a generator, and an adaptive controller. The predictor effectively predicts user future behavior based on different modalities. The generator employs voice input, head motion, and eye blink to animate the human face. The adaptive controller dynamically selects the appropriate generator model based on the trade-off between video quality and delay, aiming to maximize Quality of Experience while minimizing latency. Experimental results demonstrate the effectiveness of HeadSetOff in achieving high-quality, low-latency video conferencing on economical VR headsets.



Paperid:54 Oral
Authors:Bo Wu,Tong Li,cheng luo,Xu Yan,FuYu Wang,Xinle Du,Ke Xu
Abstract:
Due to the limited permissions for upgrading dual-side (i.e., server-side and client-side) loss tolerance schemes from the perspective of CDN vendors in a multi-supplier market, modern large-scale live streaming services are still using the automatic-repeat-request (ARQ) based paradigm for loss recovery, which only requires server-side modifications. In this paper, we first conduct a large-scale measurement study with a collection of up to 50 million live streams. We find that loss shows dynamics and live streaming contains frequent on-off mode switching in the wild. We further find that the recovery latency, enlarged by the ubiquitous retransmission loss, is a critical factor affecting client-side QoE (e.g., video freezing) of live streaming. We then propose an enhanced recovery mechanism called AutoRec, which can transform the disadvantages of on-off mode switching into an advantage for reducing loss recovery latency without any modifications on the client side. AutoRec also adopts an online learning-based scheduler to fit the dynamics of loss, balancing the tradeoff between the recovery latency and the incurred overhead. We implement AutoRec upon QUIC and evaluate it via both testbed and real-world deployments of commercial services. The experimental results demonstrate the practicability and profitability of AutoRec, in which the 95th-percentile times and duration of client-side video freezing can be lowered by 34.1% and 16.0%, respectively.



Paperid:55 Oral
Authors:Usman Naseem,Adam Dunn,Matloob Khushi,Jinman Kim
Abstract:
Identifying social media posts that spread vaccine misinformation can inform emerging public health risks and aid in designing effective communication interventions. Existing studies, while promising, often rely on single user posts, potentially leading to flawed conclusions. This highlights the necessity of modeling users' historical posts for a comprehensive understanding of their stance towards vaccines. However, users' historical posts may contain a diverse range of content that adds noise and leads to low performance. To address this gap, in this study, we present VaxMine, a cooperative multi-agent reinforcement learning method that automatically selects relevant textual and visual content from a user's posts, reducing noise. To evaluate the performance of the proposed method, we create and release a new dataset of 2,072 users with historical posts due to the unavailability of publicly available datasets. The experimental results show that our approach outperforms state-of-the-art methods with an F1-Score of 0.94 (an absolute increase of 13%), demonstrating that extracting relevant content from users' historical posts and understanding both modalities are essential to identify anti-vaccine users on social media. We further analyze the robustness and generalizability of VaxMine, showing that extracting relevant textual and visual content from a user's posts improves performance. We conclude with a discussion on the practical implications of our study by explaining how computational methods used in surveillance can benefit from our work, with flow-on effects on the design of health communication interventions to counter vaccine misinformation on social media. We suggest that releasing a robustly annotated dataset will support further advances and benchmarking of methods.



Paperid:56 Oral
Authors:Mo Yujian,Yan Wu,Junqiao Zhao,Hou zhenjie,weiquan Huang,Hu Yinghao,Jijun Wang,Jun Yan
Abstract:
Current LiDAR-only 3D detection methods are limited by the sparsity of point clouds. Previous methods used pseudo points generated by depth completion to supplement the LiDAR point cloud, but the pseudo-point sampling process was complex and the distribution of pseudo points was uneven. Meanwhile, due to the imprecision of depth completion, the pseudo points suffer from noise and local structural ambiguity, which limit further improvements in detection accuracy. This paper presents SQDNet, a novel framework designed to address these challenges. SQDNet incorporates two key components: the SQD module and a sparse 3D backbone. The SQD module achieves sparse-to-dense matching via grid position indices, allowing rapid sampling of large-scale pseudo points directly on the dense depth map and thus streamlining the data preprocessing pipeline; it also uses the density of LiDAR points within these grids to alleviate the uneven distribution and noise problems of pseudo points. The sparse 3D backbone is designed to capture long-distance dependencies, thereby improving voxel feature extraction and mitigating local structural blur in pseudo points. The experimental results validate the effectiveness of SQD and achieve considerable detection performance for difficult-to-detect instances on the KITTI test set.



Paperid:57 Oral
Authors:Fuqiang Niu,Zebang Cheng,Xianghua Fu,Xiaojiang Peng,Genan Dai,Yin Chen,Hu Huang,Bowen Zhang
Abstract:
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content, including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show the state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research. For reproducibility, we will release the data and code upon acceptance.



Paperid:58 Oral
Authors:Jianing Zhao,Jingjing Wang,Yujie Jin,Jiamin Luo,Guodong Zhou
Abstract:
In real-world recon-videos such as surveillance and drone reconnaissance videos, commonly used explicit language, acoustic and facial expression information is often missing. However, these videos are always rich in anomalous sentiments (e.g., criminal tendencies), which urgently requires the implicit scene information (e.g., actions and object relations) to quickly and precisely identify these anomalous sentiments. Motivated by this, this paper proposes a new chat-paradigm Implicit anomalous sentiment Discovering and grounding (IasDig) task, aiming to interactively and quickly discover and ground anomalous sentiments in recon-videos by leveraging the implicit scene information (i.e., actions and object relations). Furthermore, this paper believes that this IasDig task faces two key challenges, i.e., scene modeling and scene balancing. To this end, this paper proposes a new Scene-enhanced Video Large Language Model named Hawkeye, i.e., acting like a raptor (e.g., a Hawk) to discover and locate prey, for the IasDig task. Specifically, this approach designs a graph-structured scene modeling module and a balanced heterogeneous MoE module to address the above two challenges, respectively. Extensive experimental results on our constructed scene-sparsity and scene-density IasDig datasets demonstrate the great advantage of Hawkeye over the advanced Video-LLM baselines on IasDig, especially on the metric of false negative rates. This justifies the importance of the scene information for identifying implicit anomalous sentiments and the impressive practicality of Hawkeye for real-world applications.



Paperid:59 Oral
Authors:Ying Liu,Lihong Liu,Cai Xu,Xiangyu Song,Ziyu Guan,Wei Zhao
Abstract:
Multi-view learning methods often focus on improving decision accuracy while neglecting decision uncertainty, limiting their suitability for safety-critical applications. To mitigate this, researchers have proposed trusted multi-view learning methods that estimate classification probabilities and uncertainty by learning the class distributions for each instance. However, these methods assume that the data from each view can effectively differentiate all categories, ignoring the semantic vagueness phenomenon in real-world multi-view data. Our findings demonstrate that this phenomenon significantly suppresses the learning of view-specific evidence in existing methods. We propose a Consistent and Complementary-aware trusted Multi-view Learning (CCML) method to solve this problem. We first construct view-opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Next, we dynamically decouple the consistent and complementary evidence. The consistent evidence is derived from the shared portions across all views, while the complementary evidence is obtained by averaging the differing portions across all views. We ensure that the opinion constructed from the consistent evidence strictly aligns with the ground-truth category. For the opinion constructed from the complementary evidence, we only require it to reflect the probability of the true category, allowing for potential vagueness in the evidence. We compare CCML with state-of-the-art baselines on one synthetic and six real-world datasets. The results validate the effectiveness of the dynamic evidence decoupling strategy and show that CCML significantly outperforms baselines in accuracy and reliability. We will release the code and all datasets on GitHub and provide the link here.
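For context on the evidential machinery such trusted multi-view methods build on, the sketch below shows the standard subjective-logic bookkeeping: non-negative evidence is turned into Dirichlet parameters, per-class belief masses, and an uncertainty mass. This is generic background, not CCML's consistent/complementary decoupling.

```python
# Standard evidential-learning bookkeeping: evidence -> Dirichlet parameters
# -> belief masses and uncertainty mass (generic sketch, not the CCML code).
import torch
import torch.nn.functional as F

def opinion_from_evidence(evidence: torch.Tensor):
    """evidence: [batch, num_classes], non-negative (e.g., softplus of logits)."""
    alpha = evidence + 1.0                        # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)    # Dirichlet strength S
    belief = evidence / strength                  # per-class belief mass b_k = e_k / S
    uncertainty = evidence.size(-1) / strength    # uncertainty mass u = K / S
    return belief, uncertainty

b, u = opinion_from_evidence(F.softplus(torch.randn(4, 6)))
```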



Paperid:60 Oral
Authors:Huishan Ji,Qingyi Si,Zheng Lin,Weiping Wang
Abstract:
Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a developed test field, limitations of VQA evaluation, like the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes to use semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. As the characteristics of VQA make such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, to systematically analyze the behaviour and compare the performance of various evaluators, including LLM-based ones, we propose three key properties, i.e., Alignment, Consistency and Generalization, and a corresponding dataset, Assessing VQA Evaluators (AVE), to facilitate analysis. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of the VQA response evaluation task. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs.



Paperid:61 Oral
Authors:Xiyu Wang,Yufei Wang,Satoshi Tsutsui,Weisi Lin,Bihan Wen,Alex Kot
Abstract:
Diffusion-based models for story visualization have shown promise in generating content-coherent images for storytelling tasks. However, how to effectively integrate new characters into existing narratives while maintaining character consistency remains an open problem, particularly with limited data. Two major limitations hinder progress: (1) the absence of a suitable benchmark due to potential character leakage and inconsistent text labeling, and (2) the challenge of distinguishing between new and old characters, leading to ambiguous results. To address these challenges, we introduce the NewEpisode benchmark, comprising refined datasets designed to evaluate generative models' adaptability in generating new stories with fresh characters using just a single example story. The refined dataset involves refined text prompts and eliminates character leakage. Additionally, to mitigate the character confusion of generated results, we propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics. EpicEvo introduces a novel adversarial character alignment module to progressively align the generated images with exemplar images of new characters during the diffusion process, while applying knowledge distillation to prevent forgetting of characters and background details. Our evaluation quantitatively demonstrates that EpicEvo outperforms existing baselines on the NewEpisode benchmark, and qualitative studies confirm its superior customization of visual story generation in diffusion models. In summary, EpicEvo provides an effective way to incorporate new characters using only one example story, unlocking new possibilities for applications such as serialized cartoons.



Paperid:62 Oral
Authors:Zixuan Yang,Yushu Zhang,Tao Wang,Zhongyun Hua,Zhihua Xia,Jian Weng
Abstract:
As billions of face images stored on cloud platforms contain sensitive information to human vision, the public confronts substantial threats to visual face privacy. In response, the community has proposed some perturbation-based schemes to mitigate visual privacy leakage. However, these schemes need to generate a new protective perturbation for each image, failing to satisfy the real-time requirement of cloud platforms. To address this issue, we present an efficient visual face privacy protection scheme by utilizing person-specific veils, which can be conveniently applied to all images of the same user without regeneration. The protected images exhibit significant visual differences from the originals but remain identifiable to face recognition models. Furthermore, the protected images can be recovered to originals under certain circumstances. In the process of generating the veils, we propose a feature alignment loss to promote consistency between the recognition outputs of protected and original images with approximate construction of feature subspace. Meanwhile, the block variance loss is designed to enhance the concealment of visual identity information. Extensive experimental results demonstrate that our scheme can significantly eliminate the visual appearance of original images and almost has no impact on face recognition models.
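The abstract above does not define its block variance loss, so the sketch below shows only one plausible reading as an assumption: split the protected image into non-overlapping blocks and aggregate per-block pixel variance, a quantity a training objective could then push to conceal local appearance. Function name, block size, and the aggregation are all illustrative.

```python
# One plausible "block variance" building block (an assumption, not the
# paper's definition): mean per-block pixel variance of an image tensor.
import torch
import torch.nn.functional as F

def block_variance(img: torch.Tensor, block: int = 16) -> torch.Tensor:
    """img: [B, C, H, W]; returns the mean variance over non-overlapping blocks."""
    patches = F.unfold(img, kernel_size=block, stride=block)      # [B, C*block*block, L]
    B, _, L = patches.shape
    patches = patches.view(B, img.size(1), block * block, L)      # split channel and pixels
    return patches.var(dim=2).mean()                              # variance within each block

bv = block_variance(torch.rand(2, 3, 128, 128))
```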



Paperid:63 Oral
Authors:Yunqiang Pei,Kaiyue Zhang,Hongrong yang,Yong Tao,Qihang Tang,Jialei Tang,Guoqing Wang,Zhitao Liu,Ning Xie,Peng Wang,Yang Yang,Heng Tao Shen
Abstract:
Previous research has demonstrated the potential of Augmented Reality in enhancing psychological comfort in Human-Robot Interaction (AR-HRI) through shared robot intent, enhanced visual feedback, and increased expressiveness and creativity in interaction methods. However, the challenge of selecting interaction methods that enhance physical comfort in varying scenarios remains. This study purposes a dynamic dual-layer interaction adjustment mechanism to improve user comfort and interaction efficiency. The mechanism comprises two models: an general layer model, grounded in ergonomics principles, identifies appropriate areas for various interaction methods; a individual layer model predicts user discomfort levels using physiological signals. Interaction methods are dynamically adjusted based on continuous discomfort level changes, enabling the system to adapt to individual differences and dynamic changes, thereby reducing misjudgments and enhancing comfort management. The mechanism's success in authoring tasks validates its effectiveness, significantly advancing AR-HRI and fostering more comfortable and enhancing efficient human-centered interactions.



Paperid:64 Oral
Authors:Fangtao Zhou,xiaofeng huang,Peng Zhang,Meng Wang,Zhao Wang,Yang Zhou,Haibing YIN
Abstract:
With the rapid development of video conferencing and online education applications, screen content image (SCI) compression has become increasingly crucial. Recently, deep learning techniques have made significant strides in compressing natural images, surpassing the performance of traditional standards like versatile video coding. However, directly applying these methods to SCIs is challenging due to the unique characteristics of SCIs. In this paper, we propose a synergistic approach to preserve structural fidelity and text integrity for SCIs. Firstly, external prior guidance is proposed to enhance structural fidelity and text integrity by providing global spatial attention. Then, a structural enhancement module is proposed to improve the preservation of structural information by enhanced spatial feature transform. Finally, the loss function is optimized for better compression efficiency in text regions by weighted mean square error. Experimental results show that the proposed method achieves 13.3% BD-Rate saving compared to the baseline window attention convolutional neural networks (WACNN) on the JPEGAI, SIQAD, SCID, and MLSCID datasets on average.
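The text-region weighting described above can be illustrated with a minimal region-weighted MSE: pixels under a text mask receive a larger weight than background pixels. The mask source and the weight value are assumptions for illustration, not the paper's exact loss.

```python
# Minimal region-weighted MSE sketch: up-weight reconstruction error inside a
# text mask. Mask provenance and weight value are illustrative assumptions.
import torch

def weighted_mse(pred, target, text_mask, text_weight: float = 4.0):
    """pred, target: [B, C, H, W]; text_mask: [B, 1, H, W] with values in {0, 1}."""
    weights = 1.0 + (text_weight - 1.0) * text_mask   # 1 outside text, text_weight inside
    return (weights * (pred - target) ** 2).mean()

loss = weighted_mse(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                    (torch.rand(1, 1, 64, 64) > 0.8).float())
```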



Paperid:65 Oral
Authors:Dongjie Fu,Xize Cheng,Xiaoda Yang,Wang Hanting,Zhou Zhao,Tao Jin
Abstract:
In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has predominantly concentrated on training paradigms tailored for high-quality resources. However, owing to the challenges inherent in real-world data collection, audio-visual data are frequently affected by modality distortion, which encompasses audio-visual asynchrony, video noise and audio noise. The recognition accuracy of existing AVSR methods is significantly compromised when multiple modality distortions coexist in low-resource data. In light of the above challenges, we propose PCD: Cluster-Prompt with Contrastive Decomposition, a robust framework for modality-distortion speech recognition, specifically devised to transpose the pre-trained knowledge from the high-resource domain to the targeted domain by leveraging contrast-augmented prompts. In contrast to previous studies, we take into consideration the possibility of various types of distortion in both the audio and visual modalities. Concretely, we design bespoke prompts to delineate each modality distortion, guiding the model to achieve speech recognition applicable to various distortion scenarios with very few learnable parameters. To materialize the prompt mechanism, we employ multiple cluster-based strategies that better suit the pre-trained audio-visual model. Additionally, we design a contrastive decomposition mechanism to restrict the explicit relationships among various modality conditions, given their shared task knowledge and disparate modality priors. Extensive results on the LRS2 dataset demonstrate that PCD achieves state-of-the-art performance for audio-visual speech recognition under the constraints of distorted resources.



Paperid:66 Oral
Authors:Hsiang-Hui Hung,Huu-Phu Do,Yung-Hui Li,Ching-Chun Huang
Abstract:
We present TimeNeRF, a generalizable neural rendering approach for rendering novel views at arbitrary viewpoints and at arbitrary times, even with few input views. For real-world applications, it is expensive to collect multiple views and inefficient to re-optimize for unseen scenes. Moreover, as the digital realm, particularly the metaverse, strives for increasingly immersive experiences, the ability to model 3D environments that naturally transition between day and night becomes paramount. While current techniques based on Neural Radiance Fields (NeRF) have shown remarkable proficiency in synthesizing novel views, the exploration of NeRF's potential for temporal 3D scene modeling remains limited, with no dedicated datasets available for this purpose. To this end, our approach harnesses the strengths of multi-view stereo, neural radiance fields, and disentanglement strategies across diverse datasets. This equips our model with the capability for generalizability in a few-shot setting, allows us to construct an implicit content radiance field for scene representation, and further enables the building of neural radiance fields at any arbitrary time. Finally, we synthesize novel views of that time via volume rendering. Experiments show that TimeNeRF can render novel views in a few-shot setting without per-scene optimization. Most notably, it excels in creating realistic novel views that transition smoothly across different times, adeptly capturing intricate natural scene changes from dawn to dusk.



Paperid:67 Oral
Authors:Te Yang,Jian Jia,Bo Wang,Yanhua cheng,Yan Li,Dongze Hao,Xipeng Cao,Quan Chen,Han Li,Peng Jiang,Xiangyu Zhu,Zhen Lei
Abstract:
In the mobile internet era, short videos are inundating people's lives. However, research on visual language models specifically designed for short videos has yet to be fully explored. Short videos are not just videos of limited duration; their prominent visual details and high information density differentiate them from long videos. In this paper, we propose SpatioTemporal Fine-grained Description (STFVD), emphasizing the uniqueness of short videos, which entails capturing the intricate details of the main subject and fine-grained movements. To this end, we create a comprehensive Short Video Advertisement Description (SVAD) dataset, comprising 34,930 clips from 5,046 videos. The dataset covers a range of topics, including 191 sub-industries, 649 popular products, and 470 trending games. Various efforts have been made in the data annotation process to ensure the inclusion of fine-grained spatiotemporal information, resulting in 34,930 high-quality annotations. Compared to existing datasets, samples in SVAD exhibit a superior text information density, suggesting that SVAD is more appropriate for the analysis of short videos. Based on the SVAD dataset, we develop SVAD-VLM to generate spatiotemporal fine-grained descriptions for short videos. We use a prompt-guided keyword generation task to efficiently learn key visual information. Moreover, we also utilize dual visual alignment to exploit the advantage of mixed-dataset training. Experiments on the SVAD dataset demonstrate the challenge of STFVD and the competitive performance of the proposed method compared to previous ones.



Paperid:68 Oral
Authors:Yichi Zhang,Zhuo Chen,Lingbing Guo,yajing Xu,Wen Zhang,Huajun Chen
Abstract:
Large language model (LLM) based knowledge graph completion (KGC) aims to predict the missing triples in KGs with LLMs. However, research on LLM-based KGC fails to sufficiently harness LLMs' inference proficiencies, overlooking critical structural information integral to KGs. In this paper, we explore methods to incorporate structural information into LLMs, with the overarching goal of facilitating structure-aware reasoning. We first discuss the existing LLM paradigms like in-context learning and instruction tuning, proposing basic structural information injection approaches. Then we propose a Knowledge Prefix Adapter (KoPA) to fulfill this stated goal. KoPA uses a structural pre-training phase to comprehend the intricate entities and relations within KGs, representing them as structural embeddings. Then KoPA communicates such cross-modal structural information understanding to the LLMs through a knowledge prefix adapter, which projects the structural embeddings into the textual space and obtains virtual knowledge tokens positioned as a prefix of the input prompt. We conduct comprehensive experiments and provide incisive analysis concerning how the introduction of cross-modal structural information benefits the LLM's factual knowledge reasoning ability. Our code and data are available at https://anonymous.4open.science/r/KoPA-3415.
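The prefix-adapter idea can be sketched as follows: project pre-trained structural embeddings of a triple's head, relation, and tail into the LLM's token-embedding space and prepend them to the text tokens as virtual knowledge tokens. Dimensions and the single linear projection are assumptions for illustration, not the released KoPA implementation.

```python
# Sketch of a knowledge prefix adapter: map structural embeddings into the
# LLM embedding space and prepend them as virtual tokens. Illustrative only.
import torch
import torch.nn as nn

class KnowledgePrefixAdapter(nn.Module):
    def __init__(self, struct_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(struct_dim, llm_dim)   # structural space -> textual space

    def forward(self, head_e, rel_e, tail_e, text_token_embs):
        # each structural embedding: [B, struct_dim]; text_token_embs: [B, T, llm_dim]
        prefix = self.proj(torch.stack([head_e, rel_e, tail_e], dim=1))  # [B, 3, llm_dim]
        return torch.cat([prefix, text_token_embs], dim=1)               # virtual tokens first

adapter = KnowledgePrefixAdapter(struct_dim=128, llm_dim=4096)
inputs = adapter(torch.randn(2, 128), torch.randn(2, 128), torch.randn(2, 128),
                 torch.randn(2, 32, 4096))
```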



Paperid:69 Oral
Authors:Haicheng Liao,Yongkang Li,Zhenning Li,Chengyue Wang,Yanchen Guan,KaHou Tam,Chunlin Tian,Li Li,Cheng-zhong Xu
Abstract:
As autonomous driving systems increasingly become part of daily transportation, the ability to accurately anticipate and mitigate potential traffic accidents is paramount. Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, this study introduces a novel framework that integrates Large Language Models (LLMs) to enhance predictive capabilities across multiple dimensions—what, when, and where accidents might occur. We develop an innovative chain-based attention mechanism that dynamically adjusts to prioritize high-risk elements within complex driving scenes. This mechanism is complemented by a three-stage model that processes outputs from smaller models into detailed multimodal inputs for LLMs, thus enabling a more nuanced understanding of traffic dynamics. Empirical validation on the DAD, CCD, and A3D datasets demonstrates superior performance in Average Precision (AP) and Mean Time-To-Accident (mTTA), establishing new benchmarks for accident prediction technology. Our approach not only advances the technological framework for autonomous driving safety but also enhances human-AI interaction, making the predictive insights generated by autonomous systems more intuitive and actionable.



Paperid:70 Oral
Authors:Yujia Wang,Fang-Lue Zhang,Neil A. Dodgson
Abstract:
Scanpath generation in 360° images aims to model the realistic trajectories of gaze points that viewers follow when exploring panoramic environments. Existing methods for scanpath generation suffer from various limitations, including a lack of global attention to panoramic environments, insufficient diversity in generated scanpaths, and inadequate consideration of the temporal sequence of gaze points. To address these challenges, we propose a novel approach named ScanTD which employs a conditional Diffusion Model-based method to generate multiple scanpaths. Notably, a transformer-based time-series (TTS) module with a novel attention mechanism is integrated into ScanTD to capture the temporal dependency of gaze points effectively. Additionally, ScanTD utilizes a Vision Transformer-based method for image feature extraction, enabling better learning of scene semantic information. Experimental results demonstrate that our approach outperforms state-of-the-art methods across three datasets. We further demonstrate its generalizability by applying it to the 360° saliency detection task.



Paperid:71 Oral
Authors:Wei Liu,Yufei Chen,Xiaodong Yue
Abstract:
Uncertainty-aware multi-view deep classification methods have markedly improved the reliability of results amidst the challenges posed by noisy multi-view data, primarily by quantifying the uncertainty of predictions. Despite their efficacy, these methods encounter limitations in real-world applications: 1) They are limited to providing a single class prediction per instance, which can lead to inaccuracies when dealing with samples that are difficult to classify due to inconsistencies across multiple views. 2) While these methods offer a quantification of prediction uncertainty, the magnitude of such uncertainty often varies with different datasets, leading to confusion among decision-makers due to the lack of a standardized measure for uncertainty intensity. To address these issues, we introduce Conformalized Multi-view Deep Classification (CMDC), a novel method that generates set-valued rather than single-valued predictions and integrates uncertain predictions as an explicit class category. Through end-to-end training, CMDC minimizes the size of prediction sets while guaranteeing that the set-valued predictions contain the true label with a user-defined probability, building trust in decision-making. The superiority of CMDC is validated through comprehensive theoretical analysis and empirical experiments on various multi-view datasets.
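For background on the set-valued prediction machinery referenced above, the sketch below shows standard split-conformal calibration for classification: a nonconformity threshold is estimated on a calibration split and each test input receives the set of classes passing that threshold. This is the generic procedure only; CMDC's end-to-end training and explicit "uncertain" category are not reproduced.

```python
# Standard split-conformal prediction sets for classification (generic sketch).
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: [n, K] softmax outputs on a calibration split; cal_labels: [n]."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]  # nonconformity
    n = len(scores)
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(test_probs, qhat):
    # keep every class whose nonconformity score is below the calibrated threshold
    return [np.where(1.0 - p <= qhat)[0].tolist() for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=100)
qhat = conformal_threshold(cal_probs, rng.integers(0, 5, size=100), alpha=0.1)
sets = prediction_set(rng.dirichlet(np.ones(5), size=3), qhat)
```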



Paperid:72 Oral
Authors:Dian Xie,Peiang Zhao,Jiarui Zhang,Kangqi Wei,Xiaobao Ni,Jiong Xia
Abstract:
Reconstructing visual stimuli from brain activities is crucial for deciphering the underlying mechanism of the human visual system. While recent studies have achieved notable results by leveraging deep generative models, challenges persist due to the lack of large-scale datasets and the inherent noise from non-invasive measurement methods. In this study, we draw inspiration from the mechanism of human memory and propose BrainRAM, a novel two-stage dual-guided framework for visual stimuli reconstruction. BrainRAM incorporates a Retrieval-Augmented Module (RAM) and diffusion prior to enhance the quality of reconstructed images from the brain. Specifically, in stage I, we transform fMRI voxels into the latent space of image and text embeddings via diffusion priors, obtaining preliminary estimates of the visual stimuli's semantics and structure. In stage II, based on previous estimates, we retrieve data from the LAION-2B-en dataset and employ the proposed RAM to refine them, yielding high-quality reconstruction results. Extensive experiments demonstrate that our BrainRAM outperforms current state-of-the-art methods both qualitatively and quantitatively, providing a new perspective for visual stimuli reconstruction.



Paperid:73 Oral
Authors:Jingxiong Li,Sunyi Zheng,Chenglu Zhu,Yuxuan Sun,Pingyi Chen,Zhongyi Shui,Yunlong Zhang,Honglin Li,Lin Yang
Abstract:
In digital pathology, cancer lesions are identified by analyzing the spatial context within pathology images. Synthesizing such complex spatial context is challenging as pathology whole slide images typically exhibit high resolution, low inter-class variety, and are sparsely labeled. To address these challenges, we propose PathUp, a novel diffusion model tailored for the synthesis of multi-class high-resolution pathology images. Our approach includes a latent space patch-wise timestep tracking, which helps to generate high-quality images without tiling artifacts. Expert pathology knowledge is integrated into the model through our patho-align mechanism. To ensure robust generation of lesion subtypes and scale information, we introduce a feature entropy loss function. We substantiate the effectiveness of our method through both qualitative and quantitative evaluations, supplemented by assessments from human experts, demonstrating the authenticity of the synthetic data produced. Furthermore, we highlight the potential utility of our generated images as an augmentation method, thereby enhancing the performance of downstream tasks such as cancer subtype classification.



Paperid:74 Oral
Authors:Shuai Yu,Xiaoliang He,Ke Chen,Yi Yu
Abstract:
Singing melody extraction is a key task in the field of music information retrieval (MIR). However, decades of research have uncovered two difficult issues. First, binary classification on frequency-domain audio features (e.g., spectrograms) is regarded as the primary method, which ignores the potential associations of musical information at different frequency bins, as well as their varying significance for output decisions. Second, existing semi-supervised singing melody extraction models ignore the accuracy of the pseudo labels they generate, which largely limits further improvements of the model. To address these two issues, in this paper, we propose a heterogeneous knowledge distillation framework for semi-supervised singing melody extraction using harmonic supervision, termed HKDSME. We begin by proposing a four-class classification paradigm for determining the results of singing melody extraction using harmonic supervision. This enables the model to capture more information regarding melodic relations in spectrograms. To improve the accuracy of pseudo labels, we then build a semi-supervised method by leveraging the extracted harmonics as a consistency regularization. Different from previous methods, it judges the availability of unlabeled data in terms of the inner positional relations of extracted harmonics. To further build a lightweight semi-supervised model, we propose a heterogeneous knowledge distillation (HKD) module, which enables prior knowledge transfer between heterogeneous models. We also propose a novel confidence-guided loss, which is incorporated with the proposed HKD module to reduce wrong pseudo labels. We evaluate our proposed method on several well-known publicly available datasets, and the findings demonstrate its efficacy.
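For readers unfamiliar with distillation, the sketch below shows a standard temperature-scaled distillation loss that transfers a teacher's soft predictions to a lighter student. It is only a generic illustration under assumed tensor shapes and temperature; the paper's heterogeneous knowledge distillation (HKD) module and confidence-guided loss are not reproduced here.

```python
# Hedged sketch: a plain temperature-scaled knowledge-distillation loss, not the
# paper's HKD module. Shapes and the temperature value are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: batch of 8 frames, 4 classes (e.g., a four-class harmonic paradigm).
student = torch.randn(8, 4, requires_grad=True)
teacher = torch.randn(8, 4)
loss = distillation_loss(student, teacher)
loss.backward()
```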



Paperid:75 Oral
Authors:Yudong Li,Xianxu Hou,Dezhi Zheng,Linlin Shen,Zhe Zhao
Abstract:
While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, there is still a noticeable gap in the availability of such datasets within the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. The construction of FLIP-80M utilizes large-scale publicly available image-text-pair datasets, filtering 5 billion samples from the general domain, and incorporates AI-Generated Content (AIGC) methods for quality management and data augmentation. The data creation process involves a mixed-method pipeline to filter face-related pairs from both visual and linguistic perspectives, including face detection, face caption classification, text de-noising, and AIGC augmentation. As a result, FLIP-80M stands as the largest face-text dataset to date. It shows exceptional data quality and demonstrates the potential to enhance the performance of face representation models. To assess the efficacy of our dataset, we use a contrastive learning objective to train FLIP (Facial Language-Image Pretraining) and evaluate its representation capabilities across various downstream tasks. Experimental results reveal that our FLIP model achieves state-of-the-art results across 10 different face analysis tasks, such as face parsing, face alignment, and face attribute classification. The dataset and models will be publicly available.



Paperid:76 Oral
Authors:Dunyun Chen,Xin Liao,Xiaoshuai Wu,Shiwei Chen
Abstract:
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take anti-forensic capabilities into consideration and are the first to propose an end-to-end training framework for anti-forensic image inpainting, named SafePaint. Specifically, we formulate image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we meticulously craft a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show our approach achieves comparable results to existing image inpainting methods while offering anti-forensic capabilities not available in other methods.



Paperid:77 Oral
Authors:Fujun Han,Peng Ye,Shukai Duan,Lidan Wang
Abstract:
Vision-based intrusion detection has many applications in real-life environments, e.g., security, intelligent monitoring, and autonomous driving. Previous works improve the performance of intrusion detection under unknown environments by introducing unsupervised domain adaptation (UDA) methods. However, these works do not fully meet practical requirements due to the performance gap between UDA and fully supervised methods. To address the problem, we develop a new and vital active domain adaptation intrusion detection task, namely ADA-ID. Our aim is to query and annotate the most informative samples of the target domain at the lowest possible cost, striving for a balance between achieving high performance and keeping annotation expenses low. Specifically, we propose a multi-task joint active domain adaptation intrusion detection framework, namely ADAID-YOLO. It consists of a lower branch for detection and an upper branch for segmentation. Further, three effective strategies are designed to better achieve the ADA-ID task: 1) An efficient Dynamic Diffusion Pseudo-Labeling method (DDPL) is introduced to obtain pseudo ground truth and help identify areas of uncertainty in segmentation. 2) An Enhanced Region Impurity and Prediction Uncertainty sampling strategy (Enhanced-RIPU) is proposed to better capture the uncertainty of the segmentation region. 3) A Multi-Element Joint sampling strategy (MEJ) is designed to comprehensively calculate the uncertainty of the detection. Finally, comprehensive experiments and comparisons are conducted on multiple dominant intrusion detection datasets. The results show that our method outperforms other classic and promising active domain adaptation methods and reaches current SOTA performance, even surpassing UDA and full supervision on Normal→Foggy with only 0.1% and 10% data annotation, respectively. All source code and trained models will be made public.



Paperid:78 Oral
Authors:Jielong Lu,Zhihao Wu,Zhaoliang Chen,Zhiling Cai,Shiping Wang
Abstract:
Facing the increasing heterogeneity of data in the real world, multi-view learning has become a crucial area of research. Many researchers favor using graph convolutional networks for their adeptness at modeling both the topology and the attributes. However, these approaches typically only consider the construction of static topologies within individual views, overlooking the potential relationships between views in multi-view data. Furthermore, there is a glaring absence of theoretical guidance for constructing topologies of multi-view data, leaving uncertainties about whether graph embeddings are progressing toward the desired state. To tackle these challenges, we introduce a framework named energy-constrained multi-view graph diffusion. This approach establishes a mathematical correspondence between multi-view data and graph convolution via graph diffusion. It derives a feature propagation process with inter-view perception by considering both inter- and intra-view feature flows across the entire system, treating multi-view data as a holistic entity. Additionally, an energy function is introduced to guide the inter- and intra-view diffusion functions, ensuring that the representations converge towards global consistency. The empirical research on several benchmark datasets substantiates the benefits of the proposed method and demonstrates its significant performance improvement.



Paperid:79 Oral
Authors:Ruilin Yao,Shengwu Xiong,Yichen Zhao,Yi Rong
Abstract:
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which are generated from the multimodal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight design and efficiency of our method.



Paperid:80 Oral
Authors:Feiyu Chen,Cong Xu,Qi Jia,Yihua Wang,Yuhan Liu,Zhang Haotian,Endong Wang
Abstract:
Typical dense video captioning mostly concentrates on third-person videos, which are generally characterized by relatively delineated steps among events, as seen in edited instructional videos. However, such videos do not genuinely reflect the way we perceive our real lives. Instead, we observe the world from an egocentric viewpoint and witness only continuous unedited footage. To facilitate further research, we introduce a new task, Egocentric Vehicle Dense Video Captioning, in the classic first-person driving scenario. This is a multi-modal, multi-task project for a comprehensive understanding of untrimmed, egocentric driving videos. It consists of three sub-tasks that focus on event location, event captioning, and vehicle state estimation separately. Accomplishing these tasks requires dealing with at least three challenges: extracting relevant ego-motion information, describing driving behavior and understanding the underlying rationale, and resolving the boundary ambiguity problem. In response, we devise corresponding solutions, encompassing a vehicle ego-motion learning strategy and a novel adjacent contrastive learning strategy, which effectively address the aforementioned issues to a certain extent. We validate our method by conducting extensive experiments on the BDD-X dataset, which show promising results and achieve new state-of-the-art performance on most metrics, proving the effectiveness of our approach.



Paperid:81 Oral
Authors:Xiaoxuan Shen,Fenghua Yu,yaqi Liu,Ruxia Liang,Qian Wan,Kai Yang,Jianwen Sun
Abstract:
Advances in multimedia technology and its widespread application in education have made multimedia learning increasingly important. Knowledge Tracing (KT) is the key technology for achieving adaptive multimedia learning, aiming to monitor the degree of knowledge acquisition and predict students' performance during the learning process. Current KT research is dedicated to enhancing the performance of KT problems by integrating the most advanced deep learning techniques. However, this has led to increasingly complex models, which reduce model usability and divert researchers' attention away from exploring the core issues of KT. This paper aims to tackle the fundamental challenges of KT tasks, including the knowledge state representation and the core architecture design, and to investigate a novel KT model that is both simple and powerful. We revisit the KT task and propose the ReKT model. First, taking inspiration from the decision-making process of human teachers, we model the knowledge state of students from three distinct perspectives: questions, concepts, and domains. Second, building upon human cognitive development models, such as constructivism, we design a Forget-Response-Update (FRU) framework to serve as the core architecture for the KT task. The FRU is composed of just two linear regression units, making it an extremely lightweight framework. Extensive comparisons were conducted with 22 state-of-the-art KT models on 7 publicly available datasets. The experimental results demonstrate that ReKT outperforms all the comparative methods in question-based KT tasks, and consistently achieves the best (in most cases) or near-best performance in concept-based KT tasks. Furthermore, in comparison to other KT core architectures like Transformers or LSTMs, the FRU achieves superior prediction performance with only approximately 38% of the computing resources. We hope that this exploration of a simple yet powerful KT model can offer new insights to future KT research. The code is in the supplementary materials.
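The abstract states only that the FRU core is built from two linear units arranged in a forget-response-update pattern. The sketch below is our own guess at how such a lightweight recurrence might be wired, with assumed dimensions and gating; it is not the ReKT implementation.

```python
# Hedged sketch of a lightweight forget/update recurrence built from two linear
# units, in the spirit of the FRU described above. The gating, dimensions, and
# interaction encoding are assumptions, not the paper's design.
import torch
import torch.nn as nn

class TinyFRU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.forget = nn.Linear(2 * dim, dim)   # decides what to drop from the state
        self.update = nn.Linear(2 * dim, dim)   # folds the new response into the state

    def forward(self, state, interaction):
        """state: (B, D) knowledge state; interaction: (B, D) encoded question+response."""
        joint = torch.cat([state, interaction], dim=-1)
        gate = torch.sigmoid(self.forget(joint))        # forget
        candidate = torch.tanh(self.update(joint))      # response-driven update
        return gate * state + (1.0 - gate) * candidate  # updated knowledge state

# Toy usage over a short interaction sequence for a batch of 4 students.
cell, dim = TinyFRU(32), 32
state = torch.zeros(4, dim)
for _ in range(10):                      # 10 interactions per student
    state = cell(state, torch.randn(4, dim))
```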



Paperid:82 Oral
Authors:Chihaya Matsuhira,Marc A. Kastner,Takahiro Komamizu,Takatsugu Hirayama,Ichiro Ide
Abstract:
Text-to-image diffusion models sometimes depict blended concepts in generated images. One promising use case of this effect would be the nonword-to-image generation task which attempts to generate images intuitively imaginable from a non-existing word (nonword). To realize nonword-to-image generation, an existing study focused on associating nonwords with similar-sounding words. Since each nonword can have multiple similar-sounding words, generating images containing their blended concepts would increase intuitiveness, facilitating creative activities and promoting computational psycholinguistics. Nevertheless, no existing study has quantitatively evaluated this effect in either diffusion models or the nonword-to-image generation paradigm. Therefore, this paper first analyzes the conceptual blending in one of the pretrained diffusion models called Stable Diffusion. The analysis reveals that a high percentage of generated images depict blended concepts when inputting an embedding interpolating between the text embeddings of two text prompts referring to different concepts. Next, this paper explores the best text embedding space conversion method of an existing nonword-to-image generation framework to ensure both the occurrence of conceptual blending and image generation quality. We compare the conventional direct prediction approach with the proposed method that combines $k$-nearest neighbor search and linear regression. Evaluation reveals that the enhanced accuracy of the embedding space conversion by the proposed method improves the image generation quality, while the emergence of conceptual blending could be attributed mainly to the specific dimensions of the high-dimensional text embedding space.
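The probe described above, interpolating between the text embeddings of two prompts, can be sketched as follows. `encode_text` is a hypothetical stand-in for the diffusion model's text encoder, and plain linear mixing is an illustrative choice rather than the paper's exact procedure.

```python
# Hedged sketch: sliding between two prompt embeddings to elicit conceptual
# blending. `encode_text` is a placeholder, not a real library call.
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    # Placeholder: in practice this would call the diffusion model's text encoder.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(768)

def interpolate_embeddings(prompt_a, prompt_b, num_steps=5):
    """Return embeddings sliding from concept A to concept B."""
    e_a, e_b = encode_text(prompt_a), encode_text(prompt_b)
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1 - a) * e_a + a * e_b for a in alphas]

# Each interpolated embedding would then condition image generation; images near
# alpha = 0.5 are the ones expected to depict blended concepts.
blend_path = interpolate_embeddings("a photo of a lion", "a photo of a tiger")
print(len(blend_path), blend_path[0].shape)
```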



Paperid:83 Oral
Authors:Jiaqi Zhu,Shaofeng Cai,Fang Deng,WuJunran
Abstract:
Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA compared to state-of-the-art zero-shot VAD approaches.



Paperid:84 Oral
Authors:Haonan Zheng,Xinyang Deng,Wen Jiang,Wenrui Li
Abstract:
With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains such as CV and NLP, but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain a challenge. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacking two modalities, FGA-T achieves superior attack effects against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.
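To make the idea of text-guided image perturbation concrete, the sketch below shows a generic PGD-style loop that lowers image-text feature similarity under an L-infinity budget. The encoder, step sizes, and objective are assumptions; the paper's exact FGA formulation and its FGA-T, momentum, and augmentation variants are not reproduced.

```python
# Hedged sketch of a text-feature-guided perturbation in the spirit of FGA.
# `image_encoder` and the toy tensors are stand-ins, not a real VLP model.
import torch
import torch.nn.functional as F

def feature_guidance_attack(image, text_feat, image_encoder,
                            eps=8 / 255, step=2 / 255, iters=10):
    adv = image.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        img_feat = image_encoder(adv)
        # Lower cosine similarity => weaker image-text alignment.
        loss = F.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - step * grad.sign()                 # descend on similarity
            adv = image + (adv - image).clamp(-eps, eps)   # project to the budget
            adv = adv.clamp(0.0, 1.0)                      # keep a valid image
        adv = adv.detach()
    return adv

# Toy usage with a random "encoder" that flattens and projects the image.
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
x = torch.rand(2, 3, 32, 32)
t = torch.randn(2, 64)
x_adv = feature_guidance_attack(x, t, enc)
```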



Paperid:85 Oral
Authors:Chaoqun Niu,Dongdong Chen,Ji-Zhe Zhou,Jian Wang,Xiang Luo,Quan-Hui Liu,YUAN LI,Jiancheng Lv
Abstract:
Forensic person identification is of paramount importance in accidents and criminal investigations. Existing methods based on soft tissue or DNA can be unavailable if the body is badly decomposed, white-ossified, or charred. However, bones last a long time. This raises a natural question: can we learn to identify a person using bone data? We present a novel feature of bones, called Neural Boneprint, for personal identification. In particular, we exploit the thoracic skeletal data including chest radiographs (CXRs) and computed tomography (CT) images enhanced by the volume rendering technique (VRT) as an example to explore the availability of the neural boneprint. We then represent the neural boneprint as a joint latent embedding of VRT images and CXRs through a bidirectional cross-modality translation and contrastive learning. Preliminary experimental results on real skeletal data demonstrate the effectiveness of the Neural Boneprint for identification. We hope that this approach will provide a promising alternative for challenging forensic cases where conventional methods are limited. The code will be available at ***.



Paperid:86 Oral
Authors:Jinyue Chen,Lingyu Kong,Haoran Wei,Chenglong Liu,Zheng Ge,Liang Zhao,Jianjian Sun,Chunrui Han,Xiangyu Zhang
Abstract:
Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Similar to popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the total tokens along with an additional decoder. The numerically optimized (auxiliary) token allows subsequent tokens for chart parsing to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we have devised a self-evaluation mechanism that enables the model to gauge the reliability of its chart parsing results by providing confidence scores for the generated content. Compared to current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, ChartAst, OneChart significantly outperforms in Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite enjoying only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings 10%+ accuracy gains for the popular LVLM (LLaVA-1.6) in the downstream ChartQA benchmark.



Paperid:87 Oral
Authors:Xiao Han,Yiming Ren,Peishan Cong,Yujing Sun,Jingya Wang,Lan Xu,Yuexin Ma
Abstract:
Human gait recognition is crucial in multimedia, enabling identification through walking patterns without direct interaction, enhancing the integration across various media forms in real-world applications like smart homes, healthcare, and non-intrusive security. LiDAR's ability to capture depth makes it pivotal for robotic perception and holds promise for real-world gait recognition. In this paper, based on a single LiDAR, we present the Hierarchical Multi-representation Feature Interaction Network (HMRNet) for robust gait recognition. Prevailing LiDAR-based gait datasets primarily derive from controlled settings with predefined trajectories, leaving a gap with real-world scenarios. To facilitate LiDAR-based gait recognition research, we introduce FreeGait, a comprehensive gait dataset collected in large-scale, unconstrained settings and enriched with multi-modal and varied 2D/3D data. Notably, our approach achieves state-of-the-art performance on a prior dataset (SUSTech1K) and on FreeGait. Code and dataset will be released upon publication of this paper.



Paperid:88 Oral
Authors:Chunyi Li,Haoning Wu,Hongkun Hao,Zicheng Zhang,Tengchuan Kou,Chaofeng Chen,Xiaohong Liu,LEI BAI,Weisi Lin,Guangtao Zhai
Abstract:
With the evolution of Text-to-Image (T2I) models, the quality defects of AI-Generated Images (AIGIs) pose a significant barrier to their widespread adoption. In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module. Based on the mechanisms of the Human Visual System (HVS) and syntax trees, the first two indicators can respectively identify the perception and alignment deficiencies, and the last module can apply targeted quality enhancement accordingly. Extensive experimentation reveals that when compared to alternative optimization methods, AIGIs after G-Refine outperform in 10+ quality metrics across 4 datasets. This improvement significantly contributes to the practical application of contemporary T2I models, paving the way for their broader adoption.



Paperid:89 Oral
Authors:Tongtong Feng,Xin Wang,Feilin Han,Leping Zhang,Wenwu Zhu
Abstract:
Modern perception systems for autonomous flight are sensitive to occlusion and have limited long-range capability, which is a key bottleneck in improving low-altitude economic task performance. Recent research has shown that the UAV-to-UAV (U2U) cooperative perception system has great potential to revolutionize the autonomous flight industry. However, the lack of a large-scale dataset is hindering progress in this area. This paper presents U2UData, the first large-scale cooperative perception dataset for autonomous flight of swarm UAVs. The dataset was collected by three UAVs flying autonomously in U2USim, covering a 9 km$^2$ flight area. It comprises 315K LiDAR frames, 945K RGB and depth frames, and 2.41M annotated 3D bounding boxes for 3 classes. It also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. U2USim is the first swarm-UAV simulation environment mapped from the real world. It takes Yunnan Province as its prototype and includes 4 terrains, 7 weather conditions, and 8 sensor types. U2UData introduces two perception tasks: cooperative 3D object detection and cooperative 3D object tracking. This paper provides comprehensive benchmarks of recent cooperative perception algorithms on these tasks.



Paperid:90 Oral
Authors:Shiyu Liu,Zibo Zhao,Yihao Zhi,Yiqun Zhao,Binbin Huang,Shuo Wang,Ruoyu Wang,Michael Xuan,Zhengxin Li,Shenghua Gao
Abstract:
Video generation and editing, particularly human-centric video editing, has seen a surge of interest in its potential to create immersive and dynamic content. A fundamental challenge is ensuring temporal coherence and visual harmony across frames, especially in handling large-scale human motion and maintaining consistency over long sequences. The previous methods, such as diffusion-based video editing, struggle with flickering and length limitations. In contrast, methods employing Video-2D representations grapple with accurately capturing complex structural relationships in large-scale human motion. Simultaneously, some patterns on the human body appear intermittently throughout the video, posing a knotty problem in identifying visual correspondence. To address the above problems, we present HeroMaker. This human-centric video editing framework manipulates the person's appearance within the input video and achieves inter-frame consistent results. Specifically, we propose to learn the motion priors, transformations from dual canonical fields to each video frame, by leveraging the body mesh-based human motion warping and neural deformation-based margin refinement in the video reconstruction framework to ensure the semantic correctness of canonical fields. HeroMaker performs human-centric video editing by manipulating the dual canonical fields and combining them with motion priors to synthesize temporally coherent and visually plausible results. Comprehensive experiments demonstrate that our approach surpasses existing methods regarding temporal consistency, visual quality, and semantic coherence.



Paperid:91 Oral
Authors:Yiming Li,Zhifang Guo,Xiangdong Wang,Hong Liu
Abstract:
Recent advances have been witnessed in audio-language joint learning, such as CLAP, which shows much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be ignored by the above paradigm, making it ill-suited for explainability and fine-grained text-audio challenges (e.g., text-to-audio grounding), which may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each internal codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Based on the above framework, a locality-aware block is involved to purify local patterns, and a hard-negative guided loss is devised to boost alignment effects. Extensive experiments on eleven zero-shot coarse- and fine-grained evaluation protocols suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works. The code and model will be released upon paper acceptance.
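The coarse-grained contrastive objective such models build on can be sketched as a symmetric InfoNCE loss over pooled audio and text embeddings, as below. The shared codebook, locality-aware block, and hard-negative guided loss described above are not included; the shapes and temperature are assumptions.

```python
# Hedged sketch of a CLAP-style symmetric contrastive loss over paired
# audio/text global embeddings; not the paper's full framework.
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (B, D) pooled global features of paired clips/captions."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))             # matched pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

loss = symmetric_info_nce(torch.randn(16, 512), torch.randn(16, 512))
```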



Paperid:92 Oral
Authors:Yizhang Liu,Weiwei Zhou,Yanping Li,Shengjie Zhao
Abstract:
Correspondence pruning has recently drawn considerable attention as a crucial step in image matching. Existing methods typically achieve this by constructing neighborhoods for each feature point and imposing neighborhood consistency. However, the nearest-neighbor matching strategy often results in numerous many-to-one correspondences, thereby reducing the reliability of neighborhood information. Furthermore, the smoothness constraint fails in cases of large-scale rotations, leading to misjudgments. To address the above issues, this paper proposes a novel robust correspondence pruning method termed RoSe, which is based on rotation-invariant sequence-aware consensus. We formulate the correspondence pruning problem as a mathematical optimization problem and derive a closed-form solution. Specifically, we devise a rectified local neighborhood construction strategy that effectively enlarges the distribution between inliers and outliers. Meanwhile, to accommodate large-scale rotation, we propose a relative sequence-aware consistency as an alternative to existing smoothness constraints, which can better characterize the topological structure of inliers. Experimental results on image matching and registration tasks demonstrate the effectiveness of our method. Robustness analysis involving diverse feature descriptors and varying rotation degrees further showcases the efficacy of our method.
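As background for the pruning problem described above, the sketch below shows plain nearest-neighbour matching with a mutual-consistency check, the kind of baseline whose many-to-one matches correspondence pruning aims to remove. It is not the rotation-invariant sequence-aware consensus of RoSe; descriptor sizes are assumptions.

```python
# Hedged sketch: nearest-neighbour matching plus a mutual check that discards
# many-to-one correspondences. Illustrative only; not the RoSe algorithm.
import numpy as np

def mutual_nearest_matches(desc_a: np.ndarray, desc_b: np.ndarray):
    """desc_a: (Na, D), desc_b: (Nb, D) feature descriptors; returns index pairs."""
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = dists.argmin(axis=1)            # best match in B for each point in A
    nn_ba = dists.argmin(axis=0)            # best match in A for each point in B
    # Keep only mutually consistent pairs, removing many-to-one correspondences.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

a = np.random.rand(100, 128).astype(np.float32)
b = np.random.rand(120, 128).astype(np.float32)
print(len(mutual_nearest_matches(a, b)))
```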



Paperid:93 Oral
Authors:Yiyang Jiang,Wengyu Zhang,Xulu Zhang,Xiaoyong Wei,Chang Wen Chen,Qing Li
Abstract:
In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. The LLM encoder's ability to refine concept relations can help the model achieve a balanced understanding of the foreground concepts (e.g., persons, faces) and background concepts (e.g., street, mountains) rather than focusing only on the visually dominant foreground concepts. Additionally, we introduce the concept of pseudo-events, obtained through event detection techniques, to guide the prediction of moments within event boundaries instead of crossing them, which can effectively avoid distractions from adjacent moments. The integration of semantic refinement using LLM encoders and pseudo-event regulation is designed as plug-in components that can be incorporated into existing VMR methods within the general framework. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at https://github.com/open_upon_acceptance.



Paperid:94 Oral
Authors:Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Qingming Huang
Abstract:
The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall rankings of relevant videos at the top of the list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and the evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) the current similarity measure and AP-based loss are suboptimal for video retrieval; b) the noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for multimedia applications.
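As a rough reading of the video-level similarity named above, the sketch below aggregates a frame-to-frame similarity matrix with a top-k Chamfer-style rule. The paper's exact TopK-Chamfer definition and the QuadLinear-AP loss may differ; this is an illustrative assumption only.

```python
# Hedged sketch of a top-k Chamfer-style aggregation of frame similarities into
# one video-level score; our own reading, not the paper's exact formulation.
import torch

def topk_chamfer_similarity(frames_a, frames_b, k=3):
    """frames_a: (Na, D), frames_b: (Nb, D) L2-normalized frame features."""
    sim = frames_a @ frames_b.t()                       # (Na, Nb) frame-level similarities
    k_ab = min(k, sim.size(1))
    k_ba = min(k, sim.size(0))
    # For each frame, keep its k best matches in the other video, then average.
    a_to_b = sim.topk(k_ab, dim=1).values.mean()
    b_to_a = sim.topk(k_ba, dim=0).values.mean()
    return 0.5 * (a_to_b + b_to_a)

a = torch.nn.functional.normalize(torch.randn(30, 256), dim=-1)
b = torch.nn.functional.normalize(torch.randn(45, 256), dim=-1)
print(topk_chamfer_similarity(a, b).item())
```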



Paperid:95 Oral
Authors:Zihan Zheng,Houqiang Zhong,Qiang Hu,Xiaoyun Zhang,Li Song,Ya Zhang,Yanfeng Wang
Abstract:
Volumetric video based on Neural Radiance Field (NeRF) holds vast potential for various 3D applications, but its substantial data volume poses significant challenges for compression and transmission. Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hierarchical progressive volumetric video coding framework achieving variable bitrate using a single model. Specifically, HPC introduces a hierarchical representation with a multi-resolution residual radiance field to reduce temporal redundancy in long-duration sequences while simultaneously generating various levels of detail. Then, we propose an end-to-end progressive learning approach with a multi-rate-distortion loss function to jointly optimize both hierarchical representation and compression. Our HPC trained only once can realize multiple compression levels, while the current methods need to train multiple fixed-bitrate models for different rate-distortion (RD) tradeoffs. Extensive experiments demonstrate that HPC achieves flexible quality levels with variable bitrate by a single model and exhibits competitive RD performance, even outperforming fixed-bitrate models across various datasets.



Paperid:96 Oral
Authors:Changmeng Zheng,DaYong Liang,Wengyu Zhang,Xiaoyong Wei,Tat-Seng Chua,Qing Li
Abstract:
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate that BDoG is able to achieve state-of-the-art results in ScienceQA and MMBench with significant improvements over previous methods. The source code can be accessed at https://github.com/open_upon_acceptance.



Paperid:97 Oral
Authors:Zhongxu Wang,Yujia Wang,Mingzhu Li,Hua Huang
Abstract:
We devise an articulatory representation-based text-to-speech (TTS) model, ArtSpeech, an explainable and effective network for human-like speech synthesis, by revisiting the sound production system. Current deep TTS models learn acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulation movement. ArtSpeech, on the contrary, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are utilized to represent airflow forced by articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We further design a multi-dimensional style mapping network to extract speaking styles from diverse articulatory representations. These speaking styles are utilized to guide the outputs of the respective articulatory variation predictors, and ultimately to predict the final mel-spectrogram output. Experiment results show that, compared to other open-source zero-shot TTS systems, ArtSpeech enhances synthesis quality and greatly boosts the similarity between the generated results and the target speaker’s voice and prosody.



Paperid:98 Oral
Authors:Yanglin Deng,Tianyang Xu,Chunyang Cheng,Xiaojun Wu,Josef Kittler
Abstract:
In recent years, Multi-Modality Image Fusion (MMIF) has been applied to many fields, which has attracted many scholars to endeavour to improve the fusion performance. However, the prevailing focus has predominantly been on architecture design rather than training strategies. As a low-level vision task, image fusion is supposed to quickly deliver output images for observation and for supporting downstream tasks. Thus, superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained by three carefully designed supervisions. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency to the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception clues from the source images. Third, an innovative dynamic refresh training strategy is used to coordinate historical parameters and current supervision during training, together with an adaptive adjustment function to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method exhibits promising advantages in terms of model efficiency and complexity, with superior performance in multiple image fusion tasks and the downstream pedestrian detection application.



Paperid:99 Oral
Authors:TongshunZhang,Pingping Liu,Ming Zhao,Haotian Lv
Abstract:
In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component's role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement.
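The amplitude/phase decomposition the discussion above builds on can be illustrated with a few lines of NumPy: boosting the amplitude while copying the phase is exactly the naive baseline the paper argues against. This sketch is not the DMFourLLIE network; the gain factor and toy image are assumptions.

```python
# Hedged sketch: Fourier amplitude/phase split of a grayscale image, amplifying
# the amplitude (luminance) while keeping the phase (structure). Baseline only.
import numpy as np

def brighten_via_amplitude(gray_image: np.ndarray, gain: float = 1.5) -> np.ndarray:
    """gray_image: 2-D float array in [0, 1]; returns an amplitude-boosted image."""
    spectrum = np.fft.fft2(gray_image)
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    boosted = (gain * amplitude) * np.exp(1j * phase)   # amplify amplitude, keep phase
    out = np.real(np.fft.ifft2(boosted))
    return np.clip(out, 0.0, 1.0)

low_light = np.random.rand(64, 64) * 0.2    # toy dark image
enhanced = brighten_via_amplitude(low_light)
print(low_light.mean(), enhanced.mean())
```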



Paperid:100 Oral
Authors:Minghe Gao,Juncheng Li,Hao Fei,Liang Pang,Wei Ji,Guoming Wang,Zheqi Lv,Wenqiao Zhang,Siliang Tang,Yueting Zhuang
Abstract:
Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it has the advantage of performing visual processing and reasoning in an unsupervised manner. Current visual programming methods generate programs in a single pass for each task; the ability to evaluate and optimize based on feedback, unfortunately, is lacking, which consequently limits their effectiveness for complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more accurate and robust programs, setting new benchmarks in the field. The anonymous project is available at https://anonymous.4open.science/r/De-fine_Program-FE15



Paperid:101 Oral
Authors:Daoming Zong,Chaoyue Ding,Kaitao Chen
Abstract:
For AI systems to be safely and reliably grounded in the real world, they should possess the ability of physical commonsense reasoning, i.e. they are desired to understand the physical properties, affordances, and maneuverability of objects in everyday life. Physical commonsense reasoning is essentially a multisensory task as physical properties of objects are manifested through multiple perception modalities, including both visual and auditory. In this study, we constructed two new benchmarks, called PACS-Reason and PACS-Reason+, for explainable physical audiovisual commonsense reasoning (EPACS), in which each datapoint is accompanied by a golden detailed rationale (intermediate reasoning path) to explain the answer selection. Moreover, we present PAVC-Reasoner, a multimodal large language model (LLM) designed to reason about physical commonsense attributes. The model aligns different modalities with the language modality by integrating three different perceivers for cross-modal pretraining and instruction finetuning at multiple granularities. It utilizes an LLM as a cognitive engine to process multimodal inputs and output convincing intermediate reasoning paths as justification for inferring answers. Numerous experiments have demonstrated the effectiveness and superiority of PAVC-Reasoner as a baseline model for studying EPACS. Most attractively, PAVC-Reasoner is capable of reasoning and obtaining strong interpretable explicit reasoning paths, signifying a significant stride towards real-world physical commonsense reasoning.



Paperid:102 Oral
Authors:Zhiqi Pang,Lingling Zhao,Chunyu Wang
Abstract:
Cross-resolution person re-identification (CR-ReID) aims to match images of the same person with different resolutions in different scenarios. Existing CR-ReID methods achieve promising performance by relying on large-scale manually annotated identity labels. However, acquiring manual labels requires considerable human effort, greatly limiting the flexibility of existing CR-ReID methods. To address this issue, we propose a dual-resolution fusion modeling (DRFM) framework to tackle the CR-ReID problem in an unsupervised manner. Firstly, we design a cross-resolution pseudo-label generation (CPG) method, which initially clusters high-resolution images and then obtains reliable identity pseudo-labels by fusing class vectors in both resolution spaces. Subsequently, we develop a cross-resolution feature fusion (CRFF) module to fuse features from both high-resolution and low-resolution spaces. The fusion features have the potential to serve as a new form of resolution-invariant features. Finally, we introduce cross-resolution contrastive loss and probability sharpening loss in DRFM to facilitate resolution-invariant learning and effectively utilize ambiguous samples for optimization. Experimental results on multiple CR-ReID datasets demonstrate that the proposed DRFM not only outperforms existing unsupervised methods but even achieves competitive performance with supervised methods.



Paperid:103 Oral
Authors:Bo Xu,Junzhe Zheng,Jiayuan He,Yuxuan Sun,Hongfei Lin,Liang Zhao,Feng Xia
Abstract:
Understanding a meme is a challenging task, due to the metaphorical information contained in the meme that requires intricate interpretation to grasp its intended meaning fully. In previous works, attempts have been made to facilitate computational understanding of memes through introducing human-annotated metaphors as extra input features into machine learning models. However, these approaches mainly focus on formulating linguistic representation of a metaphor (extracted from the texts appearing in memes), while ignoring the connection between the metaphor and corresponding visual features (e.g., objects in meme images). In this paper, we argue that a more comprehensive understanding of memes can only be achieved through a joint modelling of both visual and linguistic features of memes. To this end, we propose an approach to generate Multimodal Metaphorical feature for Meme Classification, named MMMC. MMMC derives visual characteristics from linguistic attributes of metaphorical concepts, which more effectively convey the underlying metaphorical concept, leveraging a text-conditioned generative adversarial network. The linguistic and visual features are then integrated into a set of multimodal metaphorical features for classification purpose. We perform extensive experiments on a benchmark metaphorical meme dataset, MET-Meme. Experimental results show that MMMC significantly outperforms existing baselines on the task of emotion classification and intention detection. Our code and dataset are available at https://anonymous.4open.science/r/MMMC-C37B.



Paperid:104 Oral
Authors:Jiawei Lin,Zhaoyun Jiang,Jiaqi Guo,Shizhao Sun,Ting Liu,Zijiang James Yang,Jian-Guang Lou,Dongmei Zhang
Abstract:
Icons are ubiquitous visual elements in graphic design. However, their creation is non-trivial and time-consuming. To this end, we draw inspiration from the booming text-to-image field and propose Text-Guided Icon Set Expansion, a task that allows users to create novel and style-preserving icons using textual descriptions and a few handmade icons as style reference. Despite its usefulness, this task poses two unique challenges. (i) Abstract Concept Visualization. Abstract concepts like technology and health are frequently encountered in icon creation, but their visualization requires a mental grounding process that connects them to physical and easy-to-draw concepts. (ii) Fine-grained Style Transfer. Unlike ordinary images, icons exhibit far richer fine-grained stylistic elements, including tones, line widths, shapes, shadow effects, etc., setting a higher demand on capturing and preserving them during generation. To address the challenges, we propose IconDM, a method based on pre-trained text-to-image (T2I) diffusion models. It involves a one-shot domain adaptation process and an online style transfer process. The domain adaptation aims to improve the pre-trained T2I model in understanding abstract concepts by finetuning on high-quality icon-text pairs. To do so, we construct IconBank, a large-scale dataset of 2.3 million icon-text pairs, where the texts are generated by the state-of-the-art vision-language model from icons. In style transfer, we introduce a Style Enhancement Module into the T2I model. It explicitly extracts the fine-grained style features from the given reference icons, and is jointly optimized with the T2I model during DreamBooth tuning. To assess IconDM, we present IconBench, a structured suite with 30 icon sets and 100 concepts (including 50 abstract concepts) for generation. Quantitative results, qualitative analysis, and extensive ablation studies demonstrate the effectiveness of IconDM.



Paperid:105 Oral
Authors:Bowen Zhao,Tianhao Cheng,Yuejie Zhang,Ying Cheng,Rui Feng,Xiaobo Zhang
Abstract:
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing research in MMQA focuses on only two modalities, such as image-text QA, table-text QA, and chart-text QA, and there remains a notable scarcity of studies that investigate the joint analysis of text, tables, and charts. In this paper, we present C$\text{T}^2$C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a strong test of a model's capability to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.



Paperid:106 Oral
Authors:Jun Ma,Tuukka Ruotsalo
Abstract:
Understanding human assessment of semantically salient parts of multimedia content is crucial for developing human-centric applications, such as annotation tools, search and recommender systems, and systems able to generate new media matching human interests. However, the challenge of acquiring suitable supervision signals to detect semantic saliency without extensive manual annotation remains significant. Here, we explore a novel method that utilizes signals measured directly from human cognition via electroencephalogram (EEG) in response to natural visual perception. These signals are used for supervising representation learning to capture semantic saliency. Through a contrastive learning framework, our method aligns EEG data with visual stimuli, capturing human cognitive responses without the need for any manual annotation. Our approach demonstrates that the learned representations closely align with human-centric notions of visual saliency and achieve competitive performance in several downstream tasks, such as image classification and generation. As a contribution, we introduce an open EEG/image dataset from 30 participants, to facilitate further research in utilizing cognitive signals for multimodal data analysis, studying perception, and developing models for cross-modal representation learning.



Paperid:107 Oral
Authors:Xihua Wang,Yuyue Wang,Yihan Wu,Ruihua Song,Xu Tan,Zehua Chen,Hongteng Xu,Guodong Sui
Abstract:
Video-to-audio generation is crucial for autonomous video editing and post-processing, which aims to generate high-quality audio for silent videos with semantic similarity and temporal synchronization. However, most existing methods mainly focus on matching the semantics of the visual and acoustic modalities while merely considering their temporal alignment in a coarse granularity, thus failing to achieve precise synchronization on time. In this study, we propose a novel time-aligned video-to-audio framework, called TiVA, to achieve semantic matching and temporal synchronization jointly when generating audio. Given a silent video, our method encodes its visual semantics and predicts an audio layout separately. Then, leveraging the semantic latent embeddings and the predicted audio layout as condition information, it learns a latent diffusion-based audio generator. Comprehensive objective and subjective experiments demonstrate that our method consistently outperforms state-of-the-art methods on semantic matching and temporal synchronization precision.



Paperid:108 Oral
Authors:Zhihua Xu,Tianshui Chen,Zhijing Yang,Chunmei Qing,Yukai Shi,Liang Lin
Abstract:
Speech-preserving Facial Expression Manipulation (SPFEM) aims to alter facial emotions in video content while preserving the facial movements associated with speech. Current works often fall short due to the inadequate representation of emotion as well as the absence of time-aligned paired data—two corresponding frames from the same speaker that showcase the same speech content but differ in emotional expression. In this work, we introduce a novel framework named Self-Supervised Emotion Representation Disentanglement (SSERD), to disentangle emotion representation for accurate emotion transfer while implementing a paired data construction module to facilitate automated, photorealistic facial animations. Specifically, we developed a module for learning emotion latent codes using StyleGAN's latent space, employing a cross-attention mechanism to extract and predict emotion editing codes, with contrastive learning to differentiate emotions. To overcome the lack of strictly paired data in the SPFEM task, we exploit a pretrained StyleGAN to generate paired data, focusing on expression vectors unrelated to mouth shape. Additionally, we employed a hybrid training strategy using both synthetic paired and real unpaired data to enhance the realism of the SPFEM model's generated images. Extensive experiments conducted on benchmark datasets, including MEAD and RAVDESS, have validated the effectiveness of our framework, demonstrating its superior capability in generating photorealistic and expressive facial animations.



Paperid:109 Oral
Authors:Kun Dong,Jian Xue,Zehai Niu,Xing Lan,Ke Lv,Qingyuan Liu,Xiaoyu Qin
Abstract:
In the domain of generative multimedia and interactive experiences, generating realistic and accurate full-body poses from sparse tracking is crucial for many real-world applications, while achieving sequence modeling and efficient motion generation remains challenging. Recently, state space models (SSMs) with efficient hardware-aware designs (i.e., Mamba) have shown great potential for sequence modeling, particularly in temporal contexts. However, processing motion data is still challenging for SSMs. Specifically, the sparsity of input conditions makes motion generation an ill-posed problem. Moreover, the complex structure of the human body further complicates this task. To address these issues, we present Motion Mamba Diffusion (MMD), a novel conditional diffusion model, which effectively utilizes the sequence modeling capability of SSMs and the robust generation ability of diffusion models to track full-body poses accurately. In particular, we design a bidirectional Temporal Mamba Module (TMM) to model motion sequence. Additionally, a Spatial Mamba Module (SMM) is further proposed for feature enhancement within a single frame. Extensive experiments on the large motion capture dataset (AMASS) demonstrate that our proposed approach outperforms the latest methods in terms of accuracy and smoothness and achieves new state-of-the-art performance. Moreover, our approach runs in real-time, making it ideal for employment in practical applications. The source code will be made public upon acceptance of this paper.



Paperid:110 Oral
Authors:Fei Gao,Yuhao Lin,Jiaqi Shi,Maoying Qiao,Nannan Wang
Abstract:
Image Aesthetic Assessment (IAA) aims to objectively predict generic or personalized evaluations of aesthetic or fine-grained multi-attributes, based on visual or multimodal inputs. Previously, researchers have designed diverse and specialized methods for specific IAA tasks, based on different input-output situations. Is it possible to design a universal IAA framework applicable to the whole IAA task taxonomy? In this paper, we explore this issue and propose a modular IAA framework, dubbed AesMamba. Specifically, we use the Visual State Space Model (VMamba), instead of CNNs or ViTs, to learn comprehensive representations of aesthetic-related attributes, because VMamba can efficiently achieve both global and local effective receptive fields. Afterward, a modal-adaptive module is used to automatically produce integrated representations, conditioned on the type of input. In the prediction module, we propose a Multitask Balanced Adaptation (MBA) module to boost task-specific features, with emphasis on the tail instances. Finally, we formulate the personalized IAA task as a multimodal learning problem by converting a user's anonymous subject characteristics to a text prompt. This prompting strategy effectively employs the semantics of flexibly selected characteristics for inferring individual preferences. AesMamba can be applied to diverse IAA tasks through flexible combination of these modules. Extensive experiments on numerous benchmark datasets demonstrate that our AesMamba models consistently achieve superior or highly competitive performance on all IAA tasks, in comparison with state-of-the-art methods. The code and models will be released after peer review.



Paperid:111 Oral
Authors:Zejun Zhang,Xiao Zhu,Anlan Zhang,Feng Qian
Abstract:
Video Conferencing Applications (VCAs) are indispensable for real-time communication in remote work and education by enabling simultaneous transmission of audio, video, and screen-sharing content. Despite their ubiquity, there is a noticeable lack of research on how these platforms allocate resources, especially under limited bandwidth constraints, and how these resource allocation strategies affect the Quality of Experience (QoE). This paper addresses this research gap by conducting an in-depth analysis of bandwidth allocation strategies among prominent VCAs, including Zoom, Webex, and Google Meet, with an emphasis on their implications for QoE. To assess QoE effectively, we propose a general QoE model based on data collected from a user study involving over 800 participants. This study marks a pioneering effort in the extensive evaluation of multimedia transmissions across diverse scenarios for VCAs, representing a significant advancement over prior research that predominantly concentrated on the quality assessment of singular media types. The promising outcomes highlight the model's effectiveness and generalization in accurately predicting Quality of Experience (QoE) across various scenarios among VCAs.



Paperid:112 Oral
Authors:YiChang Qu,Bing Li,Jie Huang,Feng Zhao
Abstract:
Pan-sharpening is an important technique for remote sensing imaging systems to obtain high-resolution multispectral images. Existing deep learning-based methods mostly rely on using pseudo-groundtruth multispectral images for supervised learning. The whole training process remains at reduced resolution, which means that the impact of the degradation process is ignored and high-quality images cannot be guaranteed at full resolution. To address the challenge, we propose a new unsupervised framework that does not rely on pseudo-groundtruth but uses the invariance of the degradation process to build a consistent loss function on the original scale for network training. Specifically, first, we introduce the operator learning method to build an exact mapping function from multispectral to panchromatic images and decouple spectral features and texture features. Then, through joint training, operators and convolutional networks can learn the spatial degradation process and spectral degradation process at full resolution, respectively. By introducing them to build consistency constraints, we can train the pansharpening network at the original full resolution. Our approach can be applied to existing pansharpening methods, improving their usability on original data, which matches practical application requirements. The experimental results on different kinds of satellite datasets demonstrate that the new network outperforms state-of-the-art methods both visually and quantitatively.



Paperid:113 Oral
Authors:Qian Huang,Cheng Xu,Guiqing Li,Ziheng Wu,Shengxin Liu,Shengfeng He
Abstract:
We introduce the Self-Exemplar Illumination Equalization Network, designed specifically for effective portrait shadow removal. The core idea of our method is that partially shadowed portraits can find ideal exemplars within their non-shadowed facial regions. Rather than directly fusing two distinct classes of facial features, our approach utilizes non-shadowed regions as an illumination indicator to equalize the shadowed regions, generating deshadowed results without boundary-merging artifacts. Our network comprises cascaded Self-Exemplar Illumination Equalization Blocks (SExmBlock), each containing two modules: a self-exemplar feature matching module and a feature-level illumination rectification module. The former identifies and applies internal illumination exemplars to shadowed areas, producing illumination-corrected features, while the latter adjusts shadow illumination by reapplying the illumination factors from these features to the input face. Applying this series of SExmBlocks to shadowed portraits incrementally eliminates shadows and preserves clear, accurate facial details. The effectiveness of our method is demonstrated through evaluations on two public shadow portrait datasets, where it surpasses existing state-of-the-art methods in both qualitative and quantitative assessments.



Paperid:114 Oral
Authors:Ke Zhu,Liang Zhao,Zheng Ge,Xiangyu Zhang
Abstract:
This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% score on complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user-intentions. A series of ablations is conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling.
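
A minimal sketch of the direct preference optimization (DPO) objective over chosen and rejected responses, assuming summed token log-likelihoods from the policy and a frozen reference model are already available; this illustrates the general loss form, not the paper's exact pipeline.

```python
# DPO loss: prefer the chosen response relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are per-sample sequence log-likelihoods of shape (batch,)."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random per-sample log-likelihoods.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```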



Paperid:115 Oral
Authors:Yinxuan Gui,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
Abstract:
Current research in food analysis primarily concentrates on tasks such as food recognition, recipe retrieval and nutrition estimation from a single image. Nevertheless, there is a significant gap in exploring the impact of food intake on physiological indicators (e.g., weight) over time. This paper addresses this gap by introducing the DietDiary dataset, which encompasses daily dietary diaries and corresponding weight measurements of real users. Furthermore, we propose a novel task of weight prediction with a dietary diary that aims to leverage historical food intake and weight to predict future weights. To tackle this task, we propose a model-agnostic time series forecasting framework. Specifically, we introduce a Unified Meal Representation Learning (UMRL) module to extract representations for each meal. Additionally, we design a diet-aware loss function to associate food intake with weight variations. By conducting experiments on the DietDiary dataset with two state-of-the-art time series forecasting models, NLinear and iTransformer, we demonstrate that our proposed framework achieves superior performance compared to the original models. We will make our dataset, code, and models publicly available.



Paperid:116 Oral
Authors:Zhiyu Zhang,Guo Lu,Huanxiong Liang,Zhengxue Cheng,Anni Tang,Li Song
Abstract:
The neural radiance fields (NeRF) have advanced the development of 3D volumetric video technology, but the large data volumes they involve pose significant challenges for storage and transmission. To address these problems, the existing solutions typically compress these NeRF representations after the training stage, leading to a separation between representation training and compression. In this paper, we try to directly learn a compact NeRF representation for volumetric video in the training stage based on the proposed rate-aware compression framework. Specifically, for volumetric video, we use a simple yet effective modeling strategy to reduce temporal redundancy for the NeRF representation. Then, during the training phase, an implicit entropy model is utilized to estimate the bitrate of the NeRF representation. This entropy model is then encoded into the bitstream to assist in the decoding of the NeRF representation. This approach enables precise bitrate estimation, thereby leading to a compact NeRF representation. Furthermore, we propose an adaptive quantization strategy and learn the optimal quantization step for the NeRF representations. Finally, the NeRF representation can be optimized by using the rate-distortion trade-off. Our proposed compression framework can be used for different representations and experimental results demonstrate that our approach significantly reduces the storage size with marginal distortion and achieves state-of-the-art rate-distortion performance for volumetric video on the HumanRF and ReRF datasets. Compared to the previous state-of-the-art method TeTriRF, we achieved an approximately -80% BD-rate on the HumanRF dataset and -60% BD-rate on the ReRF dataset.
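
The rate-aware idea of jointly optimizing distortion and an entropy-model bitrate estimate can be sketched generically as below. The additive-noise quantization surrogate, factorized Gaussian entropy model, and toy linear "renderer" are assumptions for illustration, not the authors' framework.

```python
# Rate-distortion training with a differentiable bitrate estimate.
import math
import torch
import torch.nn as nn

class FactorizedGaussianEntropy(nn.Module):
    """Estimates per-element bits assuming an independent Gaussian prior."""
    def __init__(self, num_channels):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(num_channels))
        self.log_scale = nn.Parameter(torch.zeros(num_channels))

    def bits(self, y_hat):
        scale = self.log_scale.exp()
        # Probability mass of the quantization bin [y - 0.5, y + 0.5].
        upper = 0.5 * (1 + torch.erf((y_hat + 0.5 - self.mean) / (scale * math.sqrt(2))))
        lower = 0.5 * (1 + torch.erf((y_hat - 0.5 - self.mean) / (scale * math.sqrt(2))))
        p = (upper - lower).clamp_min(1e-9)
        return (-torch.log2(p)).sum()

def rate_distortion_loss(params, target, render_fn, entropy_model, lam=0.01):
    y_hat = params + torch.rand_like(params) - 0.5     # quantization surrogate
    distortion = ((render_fn(y_hat) - target) ** 2).mean()
    rate = entropy_model.bits(y_hat)
    return distortion + lam * rate                     # rate-distortion trade-off

# Toy usage: "rendering" is a fixed linear map over 16-channel parameters.
params = torch.randn(8, 16, requires_grad=True)
proj = torch.randn(16, 3)
loss = rate_distortion_loss(params, torch.rand(8, 3), lambda y: y @ proj,
                            FactorizedGaussianEntropy(16))
loss.backward()
```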



Paperid:117 Oral
Authors:Fangyi Liu,Mang Ye,Bo Du
Abstract:
Person re-identification (ReID) is crucial in video surveillance, aiming to match individuals across different camera views while cloth-changing person re-identification (CC-ReID) focuses on pedestrians changing attire. Many existing CC-ReID methods overlook generalization, crucial for universality across cloth-consistent and cloth-changing scenarios. This paper pioneers exploring the cloth-generalized person re-identification (CG-ReID) task and introduces the Cloth-aware Augmentation (CaAug) strategy. Comprising domain augmentation and feature augmentation, CaAug aims to learn identity-relevant features adaptable to both scenarios. Domain augmentation involves creating diverse fictitious domains, simulating various clothing scenarios. Supervising features from different cloth domains enhances robustness and generalization against clothing changes. Additionally, for feature augmentation, element exchange introduces diversity concerning clothing changes. Regularizing the model with these augmented features strengthens resilience against clothing change uncertainty. Extensive experiments on cloth-changing datasets demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods. Our codes will be publicly released soon.



Paperid:118 Oral
Authors:Linmei Hu,Duokang Wang,Yiming Pan,Jifan Yu,Yingxia Shao,Chong Feng,Liqiang Nie
Abstract:
Multimodal Large Language Models (MLLMs) have shown significant potential for chart understanding and generation. However, they are still far from achieving the desired effectiveness in practical applications. This could be due to the limitations of the used training chart data. Existing chart datasets suffer from scarcity of chart types, limited coverage of tasks, and insufficient scalability, making them incapable of effectively enhancing the chart-related capabilities of MLLMs. To tackle these obstacles, we construct NovaChart, a large-scale dataset for chart understanding and generation of MLLMs. NovaChart contains 47K high-resolution chart images and 856K chart-related instructions, covering 18 different chart types and 15 unique tasks of chart understanding and generation. To build NovaChart, we propose a data generation engine for metadata curation, chart visualization and instruction formulation. Chart metadata in NovaChart contains detailed annotations, i.e., data points, visual elements, source data and the visualization code of every chart. This additional information endows NovaChart with considerable scalability, as it can facilitate the extension of chart instruction data to a larger scale and greater diversity. We utilize NovaChart to train several open-source MLLMs. Experimental results demonstrate that NovaChart empowers MLLMs with stronger capabilities in 15 chart understanding and generation tasks by a large margin (35.47%-619.47%), bringing them a step closer to smart chart assistants. Our dataset is now available at https://github.com/Elucidator-V/NovaChart.



Paperid:119 Oral
Authors:Miao Zhang,Jiaxing Li,Haoyuan Zhao,Linfeng Shen,Jiangchuan Liu
Abstract:
Streaming videos from resource-constrained front-end devices over networks to resource-rich cloud servers has long been a common practice for surveillance and analytics. Most existing live video analytics (LVA) systems, however, are built over terrestrial networks, limiting their applications during natural disasters and in remote areas that desperately call for real-time visual data delivery and scene analysis. With the recent advent of space networking, in particular, low Earth orbit (LEO) satellite constellations such as Starlink, high-speed truly global Internet access is becoming available and affordable. This paper examines the challenges and potentials of LVA over modern LEO satellite networking (LSN). Using Starlink as the testbed, we have carried out extensive in-the-wild measurements to gain insights into its achievable performance for LVA. The results reveal that the uplink bottleneck in today's LSN, together with the volatile network conditions, can significantly affect the service quality of LVA and necessitate prompt adaptation. We accordingly develop StarStream, a novel LSN-adaptive streaming framework for LVA. At its core, StarStream is empowered by a transformer-based network performance predictor tailored for LSN and a content-aware configuration optimizer. We discuss a series of key design and implementation issues of StarStream and demonstrate its effectiveness and superiority through trace-driven experiments with real-world network and video processing data.



Paperid:120 Oral
Authors:Ruibin Li,Jingcai Guo,Qihua Zhou,Song Guo
Abstract:
This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require either training auxiliary networks or fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of multi-scale features to enforce the consistency of the content and stability of the foreground objects in the latent space, while aligning both fore-/back-grounds with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on COCO and LAION 5B datasets demonstrate that our method can surpass representative baselines by large margins.



Paperid:121 Oral
Authors:Zhedong Zhang,Liang Li,Gaoxiang Cong,Haibing YIN,Yuhan Gao,Chenggang Yan,Anton van den Hengel,Yuankai Qi
Abstract:
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of one brief reference audio. The wide variations in emotion, pace, and environment that dubbed speech must exhibit to achieve real alignment make dubbing a complex task. Considering the limited scale of the movie dubbing datasets (due to copyright) and the interference from background noise, directly learning from movie dubbing datasets limits the pronunciation quality of learned models. To address this problem, we propose a two-stage dubbing method that allows the model to first learn pronunciation knowledge before practicing it in movie dubbing. In the first stage, we introduce a multi-task approach to pre-train a phoneme encoder on a large-scale text-speech corpus for learning clear and natural phoneme pronunciations. For the second stage, we devise a prosody consistency learning module to bridge the emotional expression with the phoneme-level dubbing prosody attributes (pitch and energy). Finally, we design a duration consistency reasoning module to align the dubbing duration with the lip movement. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The source code and model checkpoints will be released to the public. The demos are available at https://speaker2dubber.github.io/.



Paperid:122 Oral
Authors:Bochao Liu,Pengju Wang,Weijia Guo,Yong Li,Liansheng Zhuang,Weiping Wang,Shiming Ge
Abstract:
While generative models have proved successful in a wide range of domains, they may pose a privacy leakage risk in practical deployment. To address this challenge, differentially private generative model learning has emerged as a solution to train private generative models for different downstream tasks. However, existing private generative modeling approaches are limited when it comes to generating high-dimensional data due to the complexity of modeling high-dimensional data. In this work, we introduce a new private generative modeling approach where samples are generated via Hamiltonian dynamics using gradients of the private dataset. To protect data privacy, we achieve differential privacy by perturbing the projection vectors in the estimation of gradients with sliced score matching. In addition, we enhance the reconstruction ability of the model by incorporating a residual enhancement module during the score matching. For sampling, we perform Hamiltonian dynamics with gradients estimated by the well-trained model, allowing the sampled images to approach the manifold of the private dataset step by step. In this way, our model is capable of generating images with a resolution of 256x256. Extensive experiments and analysis clearly demonstrate the effectiveness and rationality of the proposed approach.
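
A minimal sketch of drawing samples with Hamiltonian dynamics driven by an estimated score (gradient of the log-density). The closed-form Gaussian score stands in for a trained score network, and no privacy mechanism is included; the snippet only illustrates the sampling loop.

```python
# Hamiltonian-dynamics sampling using a score function in place of gradients.
import torch

def gaussian_score(x):
    return -x  # score of N(0, I); a trained score network would go here

def hamiltonian_sampling(score_fn, x, n_steps=50, leapfrog=10, step=0.05):
    for _ in range(n_steps):
        v = torch.randn_like(x)                  # resample momentum
        v = v + 0.5 * step * score_fn(x)         # half step on momentum
        for _ in range(leapfrog - 1):
            x = x + step * v                     # full step on position
            v = v + step * score_fn(x)           # full step on momentum
        x = x + step * v
        v = v + 0.5 * step * score_fn(x)
    return x

samples = hamiltonian_sampling(gaussian_score, torch.randn(16, 8))
```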



Paperid:123 Oral
Authors:Weilun Feng,Chuanguang Yang,Zhulin An,Libo Huang,Boyu Diao,Fei Wang,Yongjun Xu
Abstract:
Although the diffusion model has achieved remarkable performance in the field of image generation, its high inference delay hinders its wide application in edge devices with scarce computing resources. Therefore, many training-free sampling methods have been proposed to reduce the number of sampling steps required for diffusion models. However, they perform poorly under a very small number of sampling steps. Thanks to the emergence of knowledge distillation, existing training-based methods have achieved excellent results at very low step counts. However, the current methods mainly focus on designing novel diffusion model sampling methods with knowledge distillation. How to transfer better diffusion knowledge from teacher models is a more valuable problem but it is rarely studied. Therefore, we propose Relational Diffusion Distillation (RDD), a novel distillation method tailored specifically for distilling diffusion models. Unlike existing methods that simply align teacher and student models at the pixel level or feature distributions, our method introduces cross-sample relationship interaction during the distillation process and alleviates the memory constraints induced by multiple sample interactions. Our RDD significantly enhances the effectiveness of the progressive distillation framework within the diffusion model. Extensive experiments on several datasets (e.g., CIFAR-10 and ImageNet) demonstrate that our proposed RDD leads to a 1.47 FID decrease and a 256x speed-up, compared to state-of-the-art diffusion distillation methods. Our code will be attached to the supplementary material.
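
One generic way to realize cross-sample relational distillation is to match the teacher's and student's pairwise sample similarities instead of raw features. The sketch below illustrates that idea under simple assumptions (pooled per-sample features, a KL objective); it is not the RDD loss itself.

```python
# Relation-level distillation: student mimics the teacher's sample relations.
import torch
import torch.nn.functional as F

def relational_distillation_loss(student_feats, teacher_feats, tau=1.0):
    """student_feats, teacher_feats: (batch, dim) pooled features."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    s_rel = F.log_softmax(s @ s.t() / tau, dim=-1)   # student sample relations
    t_rel = F.softmax(t @ t.t() / tau, dim=-1)       # teacher sample relations
    return F.kl_div(s_rel, t_rel, reduction="batchmean")

loss = relational_distillation_loss(torch.randn(16, 128), torch.randn(16, 128))
```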



Paperid:124 Oral
Authors:Liyuan Ma,Xueji Fang,Guo-Jun Qi
Abstract:
Image customization involves learning the subject from provided concept images and generating it within textual contexts, typically yielding alterations of attributes such as style or background. Prevailing methods primarily rely on fine-tuning techniques, wherein a unified latent embedding is employed to characterize various concept attributes. However, attribute entanglement makes it challenging for the customized result to avoid the influence of subject-irrelevant attributes (e.g., style and background). To overcome these issues, we propose Equilibrated Diffusion, an innovative method that achieves equilibrated image customization by decoupling entangled concept attributes from a frequency-aware perspective, thus harmonizing textual and visual consistency. Unlike conventional approaches that employ a shared latent embedding and tuning process to learn the concept, our Equilibrated Diffusion draws inspiration from the correlation between high- and low-frequency components with image style and content, decomposing the concept accordingly in the frequency domain. Through independently optimizing concept embeddings in the frequency domain, the denoising model not only enriches its comprehension of the style attribute irrelevant to subject identity but also inherently augments its aptitude for accommodating novel stylized descriptions. Furthermore, by combining different frequency embeddings, our model retains the spatially original customization capability. We further design a diffusion process guided by subject masks to alleviate the influence of the background attribute, thereby strengthening text alignment. To ensure subject-related information consistency, Residual Reference Attention (RRA) is incorporated into the denoising model of spatial attention computation, effectively preserving structural details. Experimental results demonstrate that Equilibrated Diffusion surpasses other competitors with better subject consistency while closely adhering to text descriptions, thus validating the superiority of our approach.
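
A frequency-domain decomposition of the general kind the abstract alludes to can be illustrated with a simple FFT low-/high-pass split; the radial cutoff used here is an arbitrary assumption, and the snippet is not the paper's decomposition of concept embeddings.

```python
# Split an image into low- and high-frequency components with an FFT mask.
import torch

def frequency_split(img, radius=0.1):
    """img: (C, H, W) tensor; returns (low_freq, high_freq) images."""
    c, h, w = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() <= radius).float()   # low-pass mask
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    high = img - low
    return low, high

low, high = frequency_split(torch.rand(3, 64, 64))
```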



Paperid:125 Oral
Authors:Xiao Han,Yiming Ren,Yichen Yao,Yujing Sun,Yuexin Ma
Abstract:
Human motion prediction is crucial for human-centric multimedia understanding and interaction. Current methods typically rely on ground truth human poses as observed input, which is not practical for real-world scenarios where only raw visual sensor data is available. To implement these methods in practice, a preliminary pose estimation stage is essential. However, such a two-stage approach often leads to performance degradation due to the accumulation of errors. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes the density of information, resulting in the loss of fine-grained features. In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which receives the raw LiDAR point cloud as input and forecasts future 3D human poses directly. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motions for further refinement of prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments.



Paperid:126 Oral
Authors:Meng Luo,Hao Fei,Bobo Li,Shengqiong Wu,Qian Liu,Soujanya Poria,Erik Cambria,Mong-Li Lee,Wynne Hsu
Abstract:
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale (20,000 dialogues), multimodality (text, image, audio and video), multilingualism (English, Chinese and Spanish), multi-scenarios (over 100 domains), and covering both implicit and explicit sentiment elements. Further, to effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data will be open.



Paperid:127 Oral
Authors:Yili Li,Jing Yu,Keke Gai,Bang Liu,Gang Xiong,Qi Wu
Abstract:
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This paradigm considers the matching between each candidate video and the query, but it incurs a significant time cost that increases notably with the number of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://anonymous.4open.science/r/T2VIndexer-40BE.
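
Generative retrieval typically guarantees valid identifiers by constraining autoregressive decoding to prefixes that exist in the corpus, e.g., with a trie. The sketch below shows that generic mechanism with placeholder random scores in place of a real sequence-to-sequence model; it is not the T2VIndexer decoder.

```python
# Trie-constrained greedy decoding of a video identifier.
import random

def build_trie(identifiers):
    trie = {}
    for ident in identifiers:
        node = trie
        for tok in ident:
            node = node.setdefault(tok, {})
        node["<end>"] = {}          # mark a complete identifier
    return trie

def constrained_decode(trie, score_fn):
    node, output = trie, []
    while True:
        candidates = list(node.keys())          # only tokens allowed by the trie
        tok = max(candidates, key=score_fn)     # greedy over allowed tokens
        if tok == "<end>":
            return tuple(output)
        output.append(tok)
        node = node[tok]

corpus_ids = [("a", "3", "7"), ("a", "3", "9"), ("b", "1", "4")]
trie = build_trie(corpus_ids)
print(constrained_decode(trie, lambda tok: random.random()))
```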



Paperid:128 Oral
Authors:Wei Qian,Kun Li,Dan Guo,Bin Hu,Meng Wang
Abstract:
Remote photoplethysmography (rPPG) measurement aims to estimate physiological signals by analyzing subtle skin color changes induced by heartbeats in facial videos. Existing methods primarily rely on the fundamental video frame features or vanilla facial ROI (region of interest) features. Recognizing the varying light absorption and reactions of different facial regions over time, we adopt a new perspective to conduct a more fine-grained exploration of the key clues present in different facial regions within each frame and across temporal frames. Concretely, we propose a novel clustering-driven remote physiological measurement framework called Cluster-Phys, which employs a facial ROI prototypical clustering module to adaptively cluster the representative facial ROI features as facial prototypes and then update facial prototypes with highly semantic correlated base ROI features. In this way, our approach can mine facial clues from a more compact and informative prototype level rather than the conventional video/ROI level. Furthermore, we also propose a spatial-temporal prototype interaction module to learn facial prototype correlation from both spatial (across prototypes) and temporal (within prototype) perspectives. Extensive experiments are conducted on both intra-dataset and cross-dataset tests. The results show that our Cluster-Phys achieves significant performance improvement with less computation consumption. The source code will be released after the double-blind review.



Paperid:129 Oral
Authors:Zhenyu Zhang,Guangyao Chen,Yixiong Zou,Zhimeng Huang,Yuhua Li,Ruixuan Li
Abstract:
Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines. We provide the source code in the supplementary materials for reproducibility.



Paperid:130 Oral
Authors:Yunze Liu,Changxi Chen,Chenjing Ding,Li Yi
Abstract:
Humanoid Reaction Synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live, work, and communicate. However, it is difficult to learn the diverse interaction patterns of multiple humans and generate physically plausible reactions. Currently, the predominant approaches involve kinematics-based and physics-based methods. The kinematics-based methods lack physical priors, limiting their capacity to generate convincingly realistic motions. The physics-based methods often rely on kinematics-based methods to generate reference states, which struggle with the challenges posed by kinematic noise during action execution. Moreover, these methods are unable to achieve real-time inference constrained by their reliance on diffusion models. In this work, we propose a Forward Dynamics Guided 4D Imitation method to generate physically plausible human-like reactions. The learned policy is capable of generating physically plausible and human-like reactions in real-time, significantly improving inference speed (x33) and reaction quality compared with existing methods. Our experiments on the InterHuman and Chi3D datasets, along with ablation studies, demonstrate the effectiveness of our approach. More visualizations are available in supplementary materials.



Paperid:131 Oral
Authors:Jingjing Liu,Youyi Zheng,Kun Zhou
Abstract:
When people use agent characters to travel through different spaces (such as virtual scenes and real scenes, or different game spaces), it is important to reasonably position the characters in the new scene according to their personal characteristics. In this paper, we propose a novel pipeline for relocating virtual agents in new scenarios based on their personal characteristics. We extract the characteristics of the characters (including figure, posture, and social distance). Then a cost function is designed to evaluate the agent's position in the scene, which consists of a spatial term and a personalized term. Finally, a Markov Chain Monte Carlo optimization method is applied to search for the optimized solution. The results generated by our approach are evaluated through extensive user study experiments, verifying its effectiveness compared with alternative approaches.
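
The optimization described above, minimizing a cost with a spatial term and a personalized term via Markov Chain Monte Carlo search, can be sketched generically as follows; the two toy cost terms (distance to an anchor and a preferred social distance) are assumptions for illustration only.

```python
# Metropolis-style MCMC search over a 2D agent position.
import math
import random

def cost(pos, wall_anchor=(0.0, 0.0), other=(3.0, 0.0), preferred_dist=1.5):
    spatial = math.dist(pos, wall_anchor)                       # spatial term
    personalized = abs(math.dist(pos, other) - preferred_dist)  # personalized term
    return spatial + personalized

def mcmc_place(n_iters=5000, temperature=0.1, step=0.3):
    pos = (random.uniform(-5, 5), random.uniform(-5, 5))
    best, best_cost = pos, cost(pos)
    for _ in range(n_iters):
        cand = (pos[0] + random.gauss(0, step), pos[1] + random.gauss(0, step))
        delta = cost(cand) - cost(pos)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            pos = cand                                           # accept move
            if cost(pos) < best_cost:
                best, best_cost = pos, cost(pos)
    return best

print(mcmc_place())
```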



Paperid:132 Oral
Authors:Yawen Luo,Min Shi,Liao Shen,Yachuan Huang,Zixuan Ye,Juewen Peng,Zhiguo Cao
Abstract:
Bokeh is a wide-aperture optical effect that creates aesthetic blurring in photography. However, achieving this effect typically demands expensive professional equipment and expertise. To make such cinematic techniques more accessible, bokeh rendering aims to generate the desired bokeh effects from all-in-focus inputs captured by smartphones. Previous efforts in bokeh rendering primarily focus on static images. However, when extended to video inputs, these methods exhibit flicker and artifacts due to a lack of temporal consistency modeling. Meanwhile, they cannot utilize information like occluded objects from adjacent frames, which are necessary for bokeh rendering. Moreover, the difficulties of capturing all-in-focus and bokeh video pairs result in a shortage of data for training video bokeh models. To tackle these challenges, we propose the Video Bokeh Renderer (VBR), the model designed specifically for video bokeh rendering. VBR leverages implicit feature space alignment and aggregation to model temporal consistency and exploit complementary information from adjacent frames. On the data front, we introduce the first Synthetic Video Bokeh (SVB) dataset, synthesizing authentic bokeh effects using ray-tracing techniques. Furthermore, to improve the robustness of the model to inaccurate disparity maps, we employ a set of augmentation strategies to simulate corrupted disparity inputs during training. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our method. Code and dataset will be released.



Paperid:133 Oral
Authors:Jinghao Zhang,Guofan Liu,Qiang Liu,Shu Wu,Liang Wang
Abstract:
Multimedia content is of predominance in the modern Web era. Many recommender models have been proposed to investigate how to incorporate multimodal content information into traditional collaborative filtering framework effectively. The use of multimodal information is expected to provide more comprehensive information and lead to superior performance. However, the integration of multiple modalities often encounters the modal imbalance problem: since the information in different modalities is unbalanced, optimizing the same objective across all modalities leads to the under-optimization problem of the weak modalities with a slower convergence rate or lower performance. Even worse, we find that in multimodal recommendation models, all modalities suffer from the problem of insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation (CKD) method which could solve the imbalance problem and make the best use of all modalities. Through modality-specific knowledge distillation, CKD could guide the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers. Additionally, to adaptively recalibrate the focus of the multimodal model towards weaker modalities during training, we estimate the causal effect of each modality on the training objective using counterfactual inference techniques, through which we could determine the weak modalities, quantify the imbalance degree and re-weight the distillation loss accordingly. Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that our proposed method can improve the performance by a large margin.
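
The re-weighting step can be illustrated generically: given per-modality distillation losses and estimated causal effects, weaker modalities receive larger weights. The softmax weighting and the placeholder effect values below are assumptions, not the paper's estimator.

```python
# Re-weight per-modality distillation losses by estimated causal effects.
import torch

def reweight_distillation(distill_losses, causal_effects, temperature=1.0):
    """Weaker modalities (smaller causal effect) receive larger weights."""
    effects = torch.tensor([causal_effects[m] for m in distill_losses])
    weights = torch.softmax(-effects / temperature, dim=0)
    total = sum(w * distill_losses[m] for w, m in zip(weights, distill_losses))
    return total, dict(zip(distill_losses, weights.tolist()))

losses = {"visual": torch.tensor(0.8), "text": torch.tensor(0.5)}
effects = {"visual": 1.2, "text": 0.3}   # hypothetical causal-effect estimates
total_loss, weights = reweight_distillation(losses, effects)
```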



Paperid:134 Oral
Authors:Junhao Xu,Jingjing Chen,Xue Song,Feng Han,Haijun Shan,Yu-Gang Jiang
Abstract:
Recent technological advancements, such as the "deepfake" techniques, have paved the way for generating various media forgeries. In response to the potential hazards of these media forgeries, many researchers engage in exploring detection methods, increasing the demand for high-quality media forgery datasets. Despite this, existing datasets have certain limitations. Firstly, most datasets focus on manipulating visual modality and usually lack diversity, as only a few forgery approaches are considered. Secondly, the quality of media is often inadequate in clarity and naturalness. Meanwhile, the size of the dataset is also limited. Thirdly, it is commonly observed that real-world forgeries are motivated by identity, yet the identity information of the individuals portrayed in these forgeries within existing datasets remains under-explored. For detection, identity information could be an essential clue to boost performance. Moreover, official media concerning relevant identities on the Internet can serve as prior knowledge, aiding both the audience and forgery detectors in determining the true identity. Therefore, we propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots. All video shots are sourced from 324 wild videos of 54 celebrities collected from the Internet. The fake video shots involve 9 types of manipulation across visual, audio, and textual modalities. Additionally, IDForge provides extra 214,438 real video shots as a reference set for the 54 celebrities. Correspondingly, we design an effective multimedia detection network termed the Reference-assisted Multimodal Forgery Detection Network (R-MFDN). Through extensive experiments on the proposed dataset, we demonstrate the effectiveness of R-MFDN on the multimedia detection task.



Paperid:135 Oral
Authors:Haijie Yang,Zhenyu Zhang,Hao Tang,Jianjun Qian,Jian Yang
Abstract:
Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate a temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency.



Paperid:136 Oral
Authors:Jiaye Lin,Qing Li,Guorui Xie,Zhongxu Guan,Yong Jiang,Ting Xu,Zhong Zhang,Peilin Zhao
Abstract:
Industrial multimedia recommendation systems extensively utilize cascade architectures to deliver personalized content for users, generally consisting of multiple stages like retrieval and ranking. However, retrieval models have long suffered from Sample Selection Bias (SSB) due to the distribution discrepancy between the exposed items used for model training and the candidates (almost unexposed) during inference, affecting recommendation performance. Traditional methods utilize retrieval candidates as augmented training data, indiscriminately treating unexposed data as negative samples, which leads to inaccuracies and noise. Some efforts rely on unbiased datasets, while they are costly to collect and insufficient for industrial models. In this paper, we propose a debiasing framework named DAMCAR, which introduces Domain Adaptation to mitigate SSB in Multimedia CAscade Recommendation systems. Firstly, we sample hard-to-distinguish samples from unexposed data to serve as the target domain, optimizing data quality and resource utilization. Secondly, adversarial domain adaptation is employed to generate pseudo-labels for each sample. To enhance robustness, we utilize Exponential Moving Average (EMA) to create a teacher model that supervises the generation of pseudo-labels via self-distillation. Finally, we obtain a retrieval model that maintains stable performance during inference through a hybrid training mechanism. We conduct offline experiments on two real-world datasets and deploy our approach in the retrieval model of a multimedia video recommendation system for online A/B testing. Comprehensive experimental results demonstrate the effectiveness of DAMCAR in practical applications.
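
A minimal sketch of the Exponential Moving Average (EMA) teacher pattern used for self-distillation: the teacher's parameters track the student's after each update. The linear layers are placeholders; only the EMA mechanism itself is the standard technique.

```python
# Maintain an EMA teacher whose weights track the student.
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)

student = nn.Linear(16, 4)
teacher = copy.deepcopy(student)      # initialize teacher from the student
for p in teacher.parameters():
    p.requires_grad_(False)

# After each optimizer step on the student:
ema_update(teacher, student)
```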



Paperid:137 Oral
Authors:Yang Liu,Daizong Liu,Zongming Guo,Wei Hu
Abstract:
3D visual grounding is a fundamental but important task in multimedia understanding, which aims to locate a specific object in a complicated 3D scene semantically according to a text description. However, this task requires a large number of annotations of labeled text-object pairs for training, and the scarcity of annotated data has been a key obstacle in this task. To this end, this paper makes the first attempt to introduce and address a new semi-supervised setting, where only a few text-object labels are provided during training. Considering most scene data has no annotation, we explore a new solution for unlabeled 3D grounding by additionally training and transferring samples from a correlated task, i.e., 3D captioning. Our main insight is that 3D grounding and captioning are complementary and can be iteratively trained with unlabeled data to provide object and text contexts for each other with pseudo-label learning. Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. We first pre-train the two modules of the teacher branch with the limited labeled data for warm-up. Then, we train the student branch to mimic the ability of the teacher model and iteratively update both branches with the unlabeled data. In particular, we transfer the learned knowledge between the grounding and captioning modules across two branches to generate and refine the pseudo labels of unlabeled data for providing reliable supervision. To further improve the pseudo-label quality, we design a cross-task pseudo-label generation scheme, filtering low-quality pseudo-labels at the detection, captioning, and grounding levels, respectively. Experimental results on various datasets show competitive performances in both tasks compared to previous fully- and weakly-supervised methods, demonstrating the proposed 3D-CTTSF can serve as an effective solution to overcome the data scarcity issue.



Paperid:138 Oral
Authors:Peiming Li,Ziyi Wang,Mengyuan Liu,Hong Liu,Chen Chen
Abstract:
Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches for hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook the fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). Particularly when synthesizing interactive grasps, the method enables the precise control of grasp synthesis through either user-specified or algorithmically predicted Semantic Contact Map. Specifically, to optimally utilize contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module utilizes contact maps alongside object point clouds to generate realistic grasps. We establish evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on GRAB and ARCTIC datasets verify the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects. Our code is available at https://anonymous.4open.science/r/ClickDiff.



Paperid:139 Oral
Authors:Shipeng Zhu,Hui Xue,Na Nie,Chenjie Zhu,Haiyue Liu,Pengfei Fang
Abstract:
Inscriptions on ancient steles, as carriers of culture, encapsulate the humanistic thoughts and aesthetic values of our ancestors. However, these relics often deteriorate due to environmental and human factors, resulting in significant information loss. Since the advent of inscription rubbing technology over a millennium ago, archaeologists and epigraphers have devoted immense effort to manually restoring these cultural imprints, endeavoring to unlock the storied past within each rubbing. This paper approaches this challenge as a multi-modal task, aiming to establish a novel benchmark for the inscription restoration from rubbings. In doing so, we construct the Chinese Inscription Rubbing Image (CIRI) dataset, which includes a wide variety of real inscription rubbing images characterized by diverse calligraphy styles, intricate character structures, and complex degradation forms. Furthermore, we develop a synthesis approach to generate ``intact-degraded'' paired data, mirroring real-world degradation faithfully. On top of the datasets, we propose a baseline framework that achieves visual consistency and textual integrity through global and local diffusion-based restoration processes and explicit incorporation of domain knowledge. Comprehensive evaluations confirm the effectiveness of our pipeline, demonstrating significant improvements in visual presentation and textual integrity. The code will be released.



Paperid:140 Oral
Authors:Junyan Wu,Wei Lu,Xiangyang Luo,Rui Yang,Qian Wang,Xiaochun Cao
Abstract:
Recently, a novel form of audio partial forgery has posed challenges to its forensics, requiring advanced countermeasures to detect subtle forgery manipulations within long-duration audio. However, existing countermeasures still serve a classification purpose and fail to perform meaningful analysis of the start and end timestamps of partial forgery segments. To address this challenge, we introduce a novel coarse-to-fine proposal refinement framework (CFPRF) that incorporates a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization. Specifically, the FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions. The PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN. To learn robust discriminative features, we devise a difference-aware feature learning (DAFL) module guided by contrastive representation learning to enlarge the sensitive differences between different frames induced by minor manipulations. We further design a boundary-aware feature enhancement (BAFE) module to capture the contextual information of multiple transition boundaries and guide the interaction between boundary information and temporal features via a cross-attention mechanism. Extensive experiments show that our CFPRF achieves state-of-the-art performance on various datasets, including LAV-DF, ASVS2019PS, and HAD.
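
A proposal refinement stage of the kind described above typically applies predicted center/length offsets and keeps proposals above a confidence threshold. The sketch below shows that generic operation with toy numbers; the offset parameterization is an assumption, not the PRN's.

```python
# Refine coarse temporal proposals with predicted offsets and confidences.
import torch

def refine_proposals(proposals, offsets, scores, score_thresh=0.5):
    """proposals: (N, 2) start/end seconds; offsets: (N, 2) center/length deltas."""
    center = proposals.mean(dim=1)
    length = proposals[:, 1] - proposals[:, 0]
    new_center = center + offsets[:, 0] * length      # shift relative to length
    new_length = length * torch.exp(offsets[:, 1])    # scale length
    refined = torch.stack([new_center - new_length / 2,
                           new_center + new_length / 2], dim=1)
    keep = scores >= score_thresh
    return refined[keep], scores[keep]

props = torch.tensor([[1.0, 3.0], [5.0, 9.0]])
refined, conf = refine_proposals(props, torch.tensor([[0.1, -0.2], [0.0, 0.3]]),
                                 torch.tensor([0.9, 0.4]))
```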



Paperid:141 Oral
Authors:Chaoya Jiang,Wei Ye,Mengfan Dong,Jia Hongrui,Haiyang Xu,Ming Yan,Ji Zhang,Shikun Zhang
Abstract:
Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations"—inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.



Paperid:142 Oral
Authors:Chaofeng Chen,Yang Sensen,Haoning Wu,Liang Liao,Zicheng Zhang,Annan Wang,Wenxiu Sun,Qiong Yan,Weisi Lin
Abstract:
Recent advances of large multi-modality models (LMM) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset, a novel resource containing 100k triplets of (image, quality text, distortion segmentation) to facilitate deep investigations into visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT4V, which helps improve the robustness of model training while also reducing the costs of data collection. With the QGround-100K dataset, we propose a LMM-based method equipped with multi-scale feature learning to learn models capable of performing both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to interactively respond to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis in a finer scale, establishing a new benchmark for future research in the area. Codes and dataset will be made available.



Paperid:143 Oral
Authors:LiLing,Wenrui Yang,Xinchun Yu,Junliang Xing,Xiao-Ping Zhang
Abstract:
Symbols play a pivotal role in the documentation and dissemination of art. For instance, we use musical scores and dance notation to document musical compositions and choreographic movements. Existing hand representations are not ideally suited for the documentation of hand movements. First, data-driven representations, like the coordinates of hand keypoints, are non-intuitive and vulnerable to noise. Second, sign language, although a prevalent system for hand movements, is solely focused on semantic interaction, not on encoding actions. In this paper, we introduce Hand Labanotation (HL), an innovative notation system for hand movement documentation. We first construct a new HL dataset comprising 4M annotated images. Thereon, we propose a novel multi-view transformer architecture for automatically translating hand movements to HL. Extensive experiments demonstrate the promising capacity of our method for representing hand movements. This makes our method a general tool for hand movement documentation, driving various downstream applications like using HL to control robotic hands. To promote this new stream of research, we will open-source the data and model to the community.



Paperid:144 Oral
Authors:Xuemei Zhou,Irene Viola,Yunlu Chen,Jiahuan Pei,Pablo Cesar
Abstract:
Point cloud contents represent one of the prevalent formats for 3D representations. Distortions introduced at various stages in the point cloud processing pipeline affect the visual quality, altering their geometric composition, texture information, or both. Understanding and quantifying the impact of the distortion domain on visual quality is vital to driving rate optimization and guiding postprocessing steps to improve the overall quality of experience. In this paper, we propose a multi-task guided multi-modality no-reference metric for measuring the quality of colored point clouds (M3-Unity), which utilizes 4 types of modalities across different attributes and dimensionalities to represent point clouds. An attention mechanism establishes inter/intra associations among 3D/2D patches, which can complement each other, yielding both local and global features, to fit the highly nonlinear property of the human vision system. A multi-task decoder involving distortion type classification selects the best combination among 4 modalities based on the specific distortion type, aiding the regression task and enabling the in-depth analysis of the interplay between geometrical and textural distortions. Furthermore, our framework design and attention strategy enable us to measure the impact of individual attributes and their combinations, providing insights into how these associations contribute particularly in relation to distortion type. Experimental results demonstrate that our method effectively predicts the visual quality of point clouds, achieving state-of-the-art performance on four benchmark datasets. The code will be released.



Paperid:145 Oral
Authors:Yuanbo Wen,Tao Gao,Ting Chen
Abstract:
Current unpaired image deraining approaches face challenges in accurately capturing the distinguishing characteristics between the rainy and clean domains, resulting in residual degradation and color distortion within the reconstructed images. To this end, we propose an energy-informed diffusion model for unpaired photo-realistic image deraining (UPID-EDM). Initially, we delve into the intricate visual-language priors embedded within the contrastive language-image pre-training model (CLIP), and demonstrate that the CLIP priors aid in the discrimination of rainy and clean images. Furthermore, we introduce a dual-consistent energy function (DEF) that retains the domain-consistent characteristics while eliminating the domain-related features. This energy function is trained by the non-corresponding rainy and clean images. In addition, we employ the domain-relevance discarding energy function (DDEF) and the domain-consistency preserving energy function (DPEF) to direct the reverse sampling procedure of a pre-trained diffusion model, effectively removing the rain streaks while preserving the image contents. Extensive experiments demonstrate that our energy-informed model surpasses the existing unpaired learning approaches in terms of both supervised and no-reference metrics.
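
Energy-guided sampling generally perturbs each reverse diffusion step with the gradient of an energy function. The sketch below shows that generic mechanism; the toy denoiser, quadratic energy, and guidance scale are placeholders rather than the paper's DDEF/DPEF formulation.

```python
# One energy-guided reverse diffusion step: denoise, then follow -dE/dx.
import torch

def toy_denoiser(x, t):
    return 0.9 * x                       # placeholder for a pretrained model

def toy_energy(x):
    return (x ** 2).sum()                # placeholder energy function

def energy_guided_step(x_t, t, denoise_fn, energy_fn, guidance_scale=0.1):
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(x), x)[0]    # dE/dx
    x_prev = denoise_fn(x_t, t) - guidance_scale * grad   # pull toward low energy
    return x_prev + 0.01 * torch.randn_like(x_t)          # small stochastic term

x = torch.randn(4, 8)
for t in reversed(range(10)):
    x = energy_guided_step(x, t, toy_denoiser, toy_energy)
```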



Paperid:146 Oral
Authors:Zhe Huang,Shuo Wang,Yongcai Wang,Wanting Li,Deying Li,Lei Wang
Abstract:
Collaborative autonomous driving with multiple vehicles usually requires the data fusion from multiple modalities. To ensure effective fusion, the data from each individual modality shall maintain a reasonably high quality. However, in collaborative perception, the quality of object detection based on a modality is highly sensitive to the relative pose errors among the agents. It leads to feature misalignment and significantly reduces collaborative performance. To address this issue, we propose RoCo, a novel unsupervised framework to conduct iterative object matching and agent pose adjustment. To the best of our knowledge, our work is the first to model the pose correction problem in collaborative perception as an object matching task, which reliably associates common objects detected by different agents. On top of this, we propose a graph optimization process to adjust the agent poses by minimizing the alignment errors of the associated objects, and the object matching is re-done based on the adjusted agent poses. This process is iteratively repeated until convergence. Experimental study on both simulated and real-world datasets demonstrates that the proposed framework RoCo consistently outperforms existing relevant methods in terms of the collaborative object detection performance, and exhibits highly desired robustness when the pose information of agents contains high levels of noise. Ablation studies are also provided to show the impact of its key parameters and components. The code will be released.



Paperid:147 Oral
Authors:Xiang Fang,Wanlong Fang,Daizong Liu,Xiaoye Qu,Jianfeng Dong,Pan Zhou,Renfu Li,Zichuan Xu,Lixing Chen,Panpan Zheng,Yu Cheng
Abstract:
As a significant yet challenging multimedia task, Video Moment Retrieval (VMR) targets to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent respectable works have made remarkable progress in this task, they are implicitly rooted in the closed-set assumption that all the given queries are video-relevant. Given a video-irrelevant OOD query in open-set scenarios, they still utilize it for wrong retrieval, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we creatively explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on ID queries, but also reject OOD queries. In this paper, we make the first attempt to step toward OS-VMR and propose a novel model OpenVMR, which first distinguishes ID and OOD queries based on the normalizing flow technique, and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, and assume the ID query distribution obeys the multi-variate Gaussian distribution. Then, we introduce an uncertainty score to search the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling together ID query features. Besides, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three challenging datasets demonstrate the effectiveness of our OpenVMR.
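To make the Gaussian assumption on ID query features concrete, the following sketch fits a multivariate Gaussian to ID features and flags OOD queries with a Mahalanobis-style uncertainty score; the percentile rule for the separating boundary is a simplification, not the paper's boundary-refinement procedure.

    import numpy as np

    def fit_id_gaussian(id_feats):
        """Fit a multivariate Gaussian to in-distribution (ID) query features (N, d)."""
        mu = id_feats.mean(axis=0)
        cov = np.cov(id_feats, rowvar=False) + 1e-6 * np.eye(id_feats.shape[1])
        return mu, np.linalg.inv(cov)

    def uncertainty_score(feats, mu, cov_inv):
        """Mahalanobis-based uncertainty: larger values indicate likely OOD queries."""
        diff = feats - mu
        return np.einsum('nd,dk,nk->n', diff, cov_inv, diff)

    # hypothetical usage: choose the ID-OOD boundary from ID scores, then reject OOD queries
    id_feats = np.random.randn(500, 64)                      # stand-in for ID query features
    mu, cov_inv = fit_id_gaussian(id_feats)
    boundary = np.percentile(uncertainty_score(id_feats, mu, cov_inv), 95)
    is_ood = uncertainty_score(np.random.randn(10, 64) + 3.0, mu, cov_inv) > boundary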



Paperid:148 Oral
Authors:Zicheng Zhang,Haoning Wu,Yingjie Zhou,Chunyi Li,Wei Sun,Chaofeng Chen,Xiongkuo Min,Xiaohong Liu,Weisi Lin,Guangtao Zhai
Abstract:
Although large multi-modality models (LMMs) have seen extensive exploration and application in various quality assessment studies, their integration into Point Cloud Quality Assessment (PCQA) remains unexplored. Given LMMs' exceptional performance and robustness in low-level vision and quality assessment tasks, this study aims to investigate the feasibility of imparting PCQA knowledge to LMMs through text supervision. To achieve this, we transform quality labels into textual descriptions during the fine-tuning phase, enabling LMMs to derive quality rating logits from 2D projections of point clouds. To compensate for the loss of perception in the 3D domain, structural features are extracted as well. These quality logits and structural features are then combined and regressed into quality scores. Our experimental results affirm the effectiveness of our approach, showcasing a novel integration of LMMs into PCQA that enhances model understanding and assessment accuracy. We hope our contributions can inspire subsequent investigations into the fusion of LMMs with PCQA, fostering advancements in 3D visual quality analysis and beyond.
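One plausible way to realize the described combination of rating logits and structural features is sketched below: the LMM's logits over a few textual rating levels are converted into a soft score and regressed together with 3D structural features. The five-level scale, shapes, and module names are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class QualityRegressor(nn.Module):
        """Sketch: fuse LMM rating-level logits with structural features into a quality score."""
        def __init__(self, num_levels=5, struct_dim=32):
            super().__init__()
            # assumed textual levels, e.g. {bad, poor, fair, good, excellent} -> 1..5
            self.register_buffer('level_values', torch.linspace(1.0, 5.0, num_levels))
            self.head = nn.Linear(1 + struct_dim, 1)

        def forward(self, level_logits, struct_feats):
            # level_logits: (B, num_levels) logits over rating-level tokens
            # struct_feats: (B, struct_dim) structural features from the 3D domain
            probs = level_logits.softmax(dim=-1)
            soft_score = (probs * self.level_values).sum(dim=-1, keepdim=True)
            return self.head(torch.cat([soft_score, struct_feats], dim=-1)).squeeze(-1)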



Paperid:149 Oral
Authors:Jiaxu Li,Songsong Yu,Yifan Wang,Lijun Wang,Huchuan Lu
Abstract:
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in videos according to associated audio cues, where both modalities are affected by noise to different extents, such as the blending of background noises in audio or the presence of distracting objects in video. Most existing methods focus on learning interactions between modalities at high semantic levels but are incapable of filtering low-level noise or achieving fine-grained representational interactions during the early feature extraction phase. Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. SelM employs a State Space model for noise reduction and robust feature selection. By imposing additional bidirectional constraints on audio and visual embeddings, it is able to precisely identify crucial features corresponding to sound-emitting targets. To fill the existing gap in early fusion within AVS, SelM introduces a dual alignment mechanism specifically engineered to facilitate intricate spatio-temporal interactions between audio and visual streams, achieving more fine-grained representations. Moreover, we develop a cross-level decoder for layered reasoning, significantly enhancing segmentation precision by exploring the complex relationships between audio and visual information. SelM achieves state-of-the-art performance in AVS tasks, especially in the challenging Audio-Visual Semantic Segmentation setting. Source code will be made publicly available.



Paperid:150 Oral
Authors:Hongtao Wu,Yijun Yang,Huihui Xu,Weiming Wang,JINNI ZHOU,Lei Zhu
Abstract:
The outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degrade the performance of visual tasks and multimedia applications. The nature of videos exhibits redundant temporal cues for rain removal with higher stability. Traditional video deraining methods heavily rely on optical flow estimation and kernel-based methods, which have a limited receptive field. Yet, transformer architectures, while enabling long-term dependencies, bring about a significant increase in computational complexity. Very recently, the linear-complexity operator of the state space models (SSMs) has instead facilitated efficient long-term temporal modeling, which is crucial for rain streaks and raindrops removal in videos. However, its uni-dimensional sequential processing of videos destroys the local correlations across the spatio-temporal dimension by distancing adjacent pixels. To address this, we present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. We also introduce a difference-guided dynamic contrastive locality learning strategy to enhance the patch-level self-similarity learning ability of the proposed network. Extensive experiments on four synthesized video deraining datasets and real-world rainy videos demonstrate the superiority of our network in the removal of rain streaks and raindrops. The code will be publicly available once accepted.
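The Hilbert scanning idea can be illustrated with the classic index-to-coordinate routine below: flattening a feature map along this curve keeps spatially adjacent pixels close in the 1-D sequence handed to the SSM. This is a generic 2D Hilbert-curve mapping for illustration only, not the paper's exact spatio-temporal scanning mechanism.

    def hilbert_d2xy(n, d):
        """Map index d along a Hilbert curve to (x, y) on an n x n grid (n a power of two)."""
        x = y = 0
        t = d
        s = 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                       # rotate/flip the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    def hilbert_scan_order(n):
        """Pixel ordering that preserves spatial locality in the flattened sequence."""
        return [hilbert_d2xy(n, d) for d in range(n * n)]

    order = hilbert_scan_order(8)   # e.g. flatten an 8x8 feature map along the curve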



Paperid:151 Oral
Authors:Hongcheng Li,Yucan Zhou,Xiaoyan Gu,Bo Li,Weiping Wang
Abstract:
Dataset distillation, also known as dataset condensation, offers a possibility for compressing a large-scale dataset into a small-scale one (i.e., distilled dataset) while achieving similar performance during model training. This method effectively tackles the challenges of training efficiency and storage cost posed by the large-scale dataset. Existing dataset distillation methods can be categorized into Optimization-Oriented (OO)-based and Distribution-Matching (DM)-based methods. Since OO-based methods require bi-level optimization to alternately optimize the model and the distilled data, they face challenges due to high computational overhead in practical applications. Thus, DM-based methods have emerged as an alternative by aligning the prototypes of the distilled data to those of the original data. Although efficient, these methods overlook the diversity of the distilled data, which will limit the performance of evaluation tasks. In this paper, we propose a novel Diversified Semantic Distribution Matching (DSDM) approach for dataset distillation. To accurately capture semantic features, we first pre-train models for dataset distillation. Subsequently, we estimate the distribution of each category by calculating its prototype and covariance matrix, where the covariance matrix indicates the direction of semantic feature transformations for each category. Then, in addition to the prototypes, the covariance matrices are also matched to obtain more diversity for the distilled data. However, since the distilled data are optimized by multiple pre-trained models, the training process will fluctuate severely. Therefore, we match the distilled data of the current pre-trained model with the historical integrated prototypes. Experimental results demonstrate that our DSDM achieves state-of-the-art results on both image and speech datasets. Codes will be released soon.



Paperid:152 Oral
Authors:Xingyu Zhu,Beier Zhu,Yi Tan,Shuo Wang,Yanbin Hao,Hanwang Zhang
Abstract:
Vision-language models such as CLIP are capable of mapping the different modality data into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts. However, most existing methods overlook modality gaps in CLIP's encoded features, which manifest as text and image features lying far apart from each other, resulting in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors utilize local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into their respective subspaces to achieve alignment. Moreover, our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks. Extensive experiments on 11 datasets have demonstrated SSP's superior text-image alignment capabilities, outperforming the state-of-the-art alignment methods. The code is available at: https://anonymous.4open.science/r/SSP-D3EC/main_our.py
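A minimal, training-free sketch of the projection idea: build a projector onto the subspace spanned by local image features and map both image and text features through it before cosine scoring. The shapes and the pseudo-inverse construction are assumptions for illustration; the paper's selective projector may be built differently.

    import numpy as np

    def subspace_projector(local_feats):
        """Projection matrix onto the span of local image features, shape (k, d) with k <= d."""
        L = local_feats.T                         # (d, k)
        return L @ np.linalg.pinv(L)              # (d, d) orthogonal projector

    def project_and_score(image_feat, text_feats, local_feats):
        """Project a global image feature (d,) and class text features (C, d), then score."""
        P = subspace_projector(local_feats)
        img = P @ image_feat
        txt = text_feats @ P                      # P is symmetric
        img = img / np.linalg.norm(img)
        txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
        return txt @ img                          # cosine similarity per class prompt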



Paperid:153 Oral
Authors:Daiqing Wu,Dongbao Yang,Yu Zhou,Can Ma
Abstract:
Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the ''affective gap'', limits the applicability of pre-training knowledge for VER tasks. On the contrary, the explicit emotional expression and high information density in textual modality eliminate the ''affective gap''. Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we manage to separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the ''affective gap'' significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks.



Paperid:154 Oral
Authors:Tengchuan Kou,Xiaohong Liu,Zicheng Zhang,Chunyi Li,Haoning Wu,Xiongkuo Min,Guangtao Zhai,Ning Liu
Abstract:
With the rapid development of generative models, AI-Generated Content (AIGC) has exponentially increased in daily life. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models, along with each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then it leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released upon publication.



Paperid:155 Oral
Authors:Xueli Hu,Huan Liu,Haocheng Yuan,Zhiyang Fu,Yizhi Luo,Ning Zhang,Hang Zou,Gan Jianwen,Yuan Zhang
Abstract:
There has been an increasing focus from researchers on Domain-Generalized (DG) Face Anti-Spoofing (FAS). However, existing methods aim to project a shared visual space through adversarial training, making it difficult to explore the space without losing semantic information. We investigate the inadequacies of DG that result from classifier overfitting to a significantly different domain distribution. To address this issue, we propose a novel Fine-Grained Prompt Learning (FGPL) based on Vision-Language Models (VLMs), such as CLIP, which can adaptively adjust weights for classifiers with text features to mitigate overfitting. Specifically, FGPL first motivates the prompts to learn content and domain semantic information by capturing Domain-Agnostic and Domain-Specific features. Furthermore, our prompts are designed to be category-generalized by diversifying the Domain-Specific prompts. Additionally, we design an Adaptive Convolutional Adapter (AC-adapter), which is implemented through an adaptive combination of Vanilla Convolution and Central Difference Convolution, to be inserted into the image encoder for quickly bridging the gap between general image recognition and FAS task. Extensive experiments demonstrate that the proposed FGPL is effective and outperforms state-of-the-art methods on several cross-domain datasets.



Paperid:156 Oral
Authors:Shuo Ma,Yingwei Zhang,Zhang Qiqi,Yiqiang Chen,Wang Haoran,Ziyu Jia
Abstract:
Sleep staging is crucial for sleep tracking and health assessment. Polysomnography (PSG), containing multiple modalities such as electroencephalography, electrooculography, electromyography, and electrocardiography, is the fundamental means of sleep staging. However, due to performance differences in both classification and domain discrimination across modalities in PSG, existing domain generalization methods face a dilemma of modal imbalance. To balance inter-modal differences and achieve highly accurate cross-domain sleep staging, we propose SleepMG, a Multimodal Generalizable Sleep staging method. SleepMG assesses the classification and domain discrimination performances of each modality and further defines the modal performance metrics by calculating the variance between the performance score and the average performance of each modality. Guided by these metrics, the gradients of the classifier and domain discriminator are adaptively adjusted, placing greater emphasis on poorly-balanced modalities while reducing emphasis on well-balanced modalities. Experimental results on public sleep staging datasets demonstrate that SleepMG outperforms state-of-the-art sleep staging methods, effectively balancing multiple modalities as evidenced by the visual experiment of modal imbalance degree. Our code will be released after formal publication.
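The balancing idea can be sketched as follows: score each modality on classification and domain discrimination, measure its deviation from the average, and scale its gradients so poorly balanced modalities receive more emphasis. The balance proxy, the exponential scaling, and the example scores are assumptions, not the paper's exact metric.

    import numpy as np

    def modality_gradient_coefs(cls_scores, dom_scores, temperature=1.0):
        """Sketch: per-modality gradient scales from classification and
        domain-discrimination scores (dicts mapping modality -> score in [0, 1])."""
        mods = list(cls_scores)
        perf = np.array([cls_scores[m] - dom_scores[m] for m in mods])   # balance proxy
        deviation = perf - perf.mean()       # > 0: well balanced, < 0: poorly balanced
        scales = np.exp(-deviation / temperature)                        # emphasize laggards
        return dict(zip(mods, scales / scales.mean()))

    # hypothetical scores for three PSG modalities
    coefs = modality_gradient_coefs({'EEG': 0.82, 'EOG': 0.74, 'EMG': 0.60},
                                    {'EEG': 0.55, 'EOG': 0.70, 'EMG': 0.65})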



Paperid:157 Oral
Authors:Changli Wu,Yihang Liu,Yiwei Ma,Haowei Wang,Gen Luo,Jiayi Ji,Henghui Ding,Xiaoshuai Sun,Rongrong Ji
Abstract:
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension.



Paperid:158 Oral
Authors:Shizong Yan,Shan Chang,Hongzi Zhu,Huixiang Wen,Luo Zhou
Abstract:
3D face recognition is subject to frequent spoofing attacks, in which 3D face presentation attack is one of the most notorious attacks. The attacker takes advantage of 3D scanning and printing techniques to generate masks of targets, which has found success in numerous real-life examples. The salient feature in such attacks is to obtain 3D face models through 3D scanning, though this is relatively more expensive and inconvenient compared with 2D photos. In this work, we propose a new method, DREAM, to recover 3D face models from a single 2D image. Specifically, we adopt a black-box approach, which recovers ‘sufficient’ depths to defeat target recognition models (e.g., face identification and face authentication models) by accessing the model's output and the corresponding RGB photo. The key observation is that it is not necessary to restore the true value of depths, but only to recover the essential features relevant to the target model. We used four public 3D face datasets to verify the effectiveness of DREAM. The experimental results show that DREAM can achieve a success rate of 94% on the face authentication model, even in cross-dataset testing, and a success rate of 36% on the face identification model.



Paperid:159 Oral
Authors:Minghui Wu,Chenxu Zhao,Anyang Su,Donglin Di,Tianyu Fu,Da An,Min He,Ya Gao,Meng Ma,Kun Yan,Ping Wang
Abstract:
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders. There is currently a lack of research in this area, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers with restrictive length; 2) the content and scenarios within the videos are excessively monotonous, transmitting allegories and emotions that are overly simplistic. To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset, namely SRI-ADV. Specifically, we collected real changes in Electroencephalographic (EEG) and eye-tracking regions from different demographics while they viewed identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different individuals, video elements, EEG and eye-tracking indicators. HMLLM could bridge semantic gaps across rich modalities and integrate information beyond different modalities to perform logical reasoning. Extensive experimental evaluations on SRI-ADV and other video-based generative performance benchmarks demonstrate the effectiveness of our method. The codes and dataset will be released soon.



Paperid:160 Oral
Authors:Desen Yuan,Lei Wang
Abstract:
This paper introduces a novel approach to Image Quality Assessment (IQA) by presenting a new loss function, Dual-Criterion Quality (DCQ) Loss, which integrates the Mean Squared Error (MSE) framework with a Relative Perception Constraint (RPC). The RPC is comprised of two main components: the Quantitative Discrepancy Constraint (QDC) and the Qualitative Alignment Constraint (QAC). The QDC focuses on capturing the numerical relationships of relative differences by minimizing the mean squared error between the differences in predicted scores among samples within a batch size and the differences in Mean Opinion Scores (MOS). Meanwhile, the QAC aims to capture the ordinal relationships between these differences. This method is designed to closely align with human subjective assessments of image quality, which are frequently quantified using the MOS, and to enhance the interpretability and reliability of IQA. Unlike existing ranking methods that suffer from complex pipelines and the introduction of errors through the generation of pair-wise or ordering data, DCQ Loss provides a more straightforward and efficient approach. Moreover, the loss function outperforms current rank-based IQA methods in terms of convergence, stability, and the ability to emulate human perception of visual quality. The effectiveness of this approach is validated through extensive experiments on various mainstream datasets and IQA network architectures, demonstrating significant performance gains over traditional rank loss approaches and contributing to the ongoing development of IQA.
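The described loss can be sketched as below: the quantitative discrepancy constraint is an MSE over within-batch pairwise score differences, and the qualitative alignment constraint penalizes pairs whose predicted ordering disagrees with the MOS ordering. The weights and the exact form of the ordinal term are assumptions, not the authors' released formulation.

    import torch
    import torch.nn.functional as F

    def dcq_loss(pred, mos, lambda_qdc=1.0, lambda_qac=1.0):
        """Sketch of a Dual-Criterion Quality loss: MSE plus relative-perception terms.

        pred, mos: (B,) predicted quality scores and mean opinion scores.
        """
        base = F.mse_loss(pred, mos)

        # all pairwise differences within the batch
        d_pred = pred.unsqueeze(0) - pred.unsqueeze(1)    # (B, B)
        d_mos = mos.unsqueeze(0) - mos.unsqueeze(1)

        qdc = F.mse_loss(d_pred, d_mos)                   # quantitative discrepancy
        qac = F.relu(-d_pred * torch.sign(d_mos)).mean()  # penalize wrong orderings

        return base + lambda_qdc * qdc + lambda_qac * qac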



Paperid:161 Oral
Authors:Haipeng Zhou,Hongqiu Wang,Tian Ye,Zhaohu Xing,Jun Ma,Ping Li,Qiong Wang,Lei Zhu
Abstract:
Video Shadow Detection (VSD) aims to detect shadow masks in frame sequences. Existing works suffer from inefficient temporal learning and pay no attention to future video frames since such information is agnostic. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD, in which we jointly take into account past-future temporal guidance and boundary information. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD, in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can capture not only the temporal information but also the shadow properties. Extensive experiments show that the performance of our approach overtakes the state-of-the-art methods, verifying the effectiveness of our components. The codes and weights will be released.



Paperid:162 Oral
Authors:Jiale Yu,Baopeng Zhang,Zhu Teng,Jianping Fan
Abstract:
Audio-Visual Event (AVE) Localization aims to identify and classify video segments that are both audible and visible, a field that has seen substantial progress in recent years. Existing methods operate under a closed-set assumption and struggle to recognize unknown events in open-world scenarios. To better adapt to real-life applications, we introduce the Open Set Audio-Visual Event Localization task and propose a novel and effective network called OpenAVE based on evidential deep learning. To the best of our knowledge, this is the first effort to address this challenge. Our approach encompasses deep evidential AVE classification and event-relevant prediction, targeting the nuanced demands of open-set environments. The deep evidential AVE classification manages event classification uncertainty by extracting class evidence from segment-specific representations enriched with multi-scale context. To effectively distinguish between unknown events and background segments, event-relevant prediction utilizes positive-unlabeled learning. Furthermore, a learnable Gaussian-prior prediction branch is adopted to enhance the performance of event-relevant prediction. Experimental results demonstrate that OpenAVE significantly outperforms state-of-the-art models on the Audio-Visual Event dataset, confirming the effectiveness of our proposed method.



Paperid:163 Oral
Authors:Jin Liu,Bo Wang,Chuanming Wang,Huiyuan Fu,Huadong Ma
Abstract:
Exposure correction aims to enhance visual data suffering from improper exposures, which can greatly improve the visual experience. However, previous methods mainly focus on the image modality, and the video counterpart is less explored in the literature. Directly applying prior image-based methods to videos results in temporal incoherence with low visual quality. Through thorough investigation, we find that the development of relevant communities is limited by the absence of a benchmark dataset. Therefore, in this paper, we construct the first real-world paired video dataset, including both underexposure and overexposure dynamic scenes. To achieve spatial alignment, we utilize two DSLR cameras and a beam splitter to simultaneously capture improper and normal exposure videos. Additionally, we propose an end-to-end Video Exposure Correction Network (VECNet), in which a dual-stream module is designed to deal with both underexposure and overexposure factors, enhancing the illumination based on Retinex theory. Experimental results based on various metrics and user studies demonstrate the significance of our dataset and the effectiveness of our method. The code and dataset will be available soon.



Paperid:164 Oral
Authors:Yi Dong,Yuxi Wang,ZHENG FANG,Wenqi Ouyang,Xianhui Lin,Zhiqi Shen,Peiran Ren,Xuansong Xie,Qingming Huang
Abstract:
Fine-grained video color enhancement delivers superior visual results by making precise adjustments to specific areas of the frame, maintaining more natural color relationships compared to global enhancement techniques. However, dynamically applying these specific enhancements can lead to flickering artifacts and unsatisfactory color blending at object boundaries, issues caused by the coarse and unstable masks produced by current video segmentation algorithms. To overcome these challenges, we introduce MovingColor, featuring a novel self-supervised training approach that leverages large-scale video datasets. This approach redefines color fusion as a generation process using original full-frame textures and color editing information from non-edge areas. We address spatio-temporal inconsistencies with a spectral-spatial hybrid encoder that captures multi-scale spatial and frequency features, thus enhancing color adjustments in complex scenes. Additionally, our global-local feature propagation module, incorporating Transformer blocks, consolidates spatio-temporal contexts to ensure consistency among frames. Both quantitative and subjective evaluations validate the effectiveness of MovingColor in delivering state-of-the-art spatio-temporal consistency for video color enhancements, adhering closely to the intended color editing operations. These results demonstrate that MovingColor can effectively enhance fine-grained video color grading, making it more efficient and accessible to a wider range of users. We will release the code to support further research and practical applications.



Paperid:165 Oral
Authors:Mu Chen,Zhedong Zheng,Yi Yang
Abstract:
Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from both the source domain and target domain by simply copying and pasting the pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios. Real-world scenarios have an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions, and could be clearly distinguished in a depth map. Based on such observation, we propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates the real-world layouts, while the cross-task encoder further adaptively fuses the complementary features between two tasks. Besides, it is worth noting that several public datasets do not provide depth annotation. Therefore, we leverage the off-the-shelf depth estimation network to generate the pseudo depth. Extensive experiments show that our proposed methods, even with pseudo depth, achieve competitive performance on two widely-used benchmarks, i.e., 77.7 mIoU on GTA→Cityscapes and 69.3 mIoU on Synthia→Cityscapes.



Paperid:166 Oral
Authors:Jiyang Li,Lechao Cheng,Zhangye Wang,Tingting Mu,Jingxuan He
Abstract:
Cinemagraph is a unique form of visual media that combines elements of still photography and subtle motion to create a captivating experience. However, the majority of videos generated by recent works lack depth information and are confined to the constraints of 2D image space. In this paper, inspired by significant progress in the field of novel view synthesis (NVS) achieved by 3D Gaussian Splatting (3D-GS), we propose LoopGaussian to elevate cinemagraph from 2D image space to 3D space using 3D Gaussian modeling. To achieve this, we first employ the 3D-GS method to reconstruct 3D Gaussian point clouds from multi-view images of static scenes, incorporating shape regularization terms to prevent blurring or artifacts caused by object deformation. We then adopt an autoencoder tailored for 3D Gaussian to project it into feature space. To maintain the local continuity of the scene, we devise SuperGaussian for clustering based on the acquired features. By calculating the similarity between clusters and employing a two-stage estimation method, we derive an Eulerian motion field to describe velocities across the entire scene. The 3D Gaussian points then move within the estimated Eulerian motion field. Through bidirectional animation techniques, we ultimately generate a 3D Cinemagraph that exhibits natural and seamlessly loopable dynamics. Experiment results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.



Paperid:167 Oral
Authors:Changcheng Xiao,Qiong Cao,Zhigang Luo,Long Lan
Abstract:
Tracking by detection has been the prevailing paradigm in the field of Multi-object Tracking (MOT). These methods typically rely on the Kalman Filter to estimate the future locations of objects, assuming linear object motion. However, they fall short when tracking objects exhibiting nonlinear and diverse motion in scenarios like dancing and sports. In addition, there has been limited focus on utilizing learning-based motion predictors in MOT. To address these challenges, we resort to exploring data-driven motion prediction methods. Inspired by the great potential of state space models (SSMs), such as Mamba, for long-term sequence modeling with near-linear complexity, we introduce a Mamba-based motion model named Mamba moTion Predictor (MTP). MTP is designed to model the complex motion patterns of objects like dancers and athletes. Specifically, MTP takes the spatial-temporal location dynamics of objects as input, captures the motion pattern using a bi-Mamba encoding layer, and predicts the next motion. In real-world scenarios, objects may be missed due to occlusion or motion blur, leading to premature termination of their trajectories. To tackle this challenge, we further expand the application of MTP. We employ it in an autoregressive way to compensate for missing observations by utilizing its own predictions as inputs, thereby contributing to more consistent trajectories. Our proposed tracker, MambaTrack, demonstrates advanced performance on benchmarks such as Dancetrack and SportsMOT, which are characterized by complex motion and severe occlusion.
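The autoregressive compensation step can be pictured with the short sketch below, where predictor is a placeholder for the Mamba-based motion model and missing detections are filled with the model's own predictions.

    def rollout_with_compensation(history, observations, predictor):
        """Sketch: autoregressive motion prediction that fills in missed detections.

        history      : list of past box states, e.g. [x, y, w, h]
        observations : new per-frame observations, with None where the track was missed
        predictor    : callable mapping a state history to the next predicted state
        """
        track = list(history)
        for obs in observations:
            pred = predictor(track)
            # use the real detection when available, otherwise the model's own prediction
            track.append(obs if obs is not None else pred)
        return track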



Paperid:168 Oral
Authors:Zeyu Li,Ruitong Gan,Chuanchen Luo,Yuxi Wang,Jiaheng Liu,Ziwei Zhu,Man Zhang,Qing Li,Zhaoxiang Zhang,Junran Peng,Xu-Cheng Yin
Abstract:
Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting 2D generative prior to the 3D space. However, such a 2D generative image prior bakes the effect of illumination and shadow into the texture. As a result, material maps optimized by SDS inevitably involve spurious correlated components. The absence of precise material definition makes it infeasible to relight the generated assets reasonably in novel scenes, which limits their application in downstream scenarios. In contrast, humans can effortlessly circumvent this ambiguity by deducing the material of the object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework to infer underlying material from the 2D semantic prior. Based on such a prior model, we devise a mechanism to parse material in 3D space. We maintain a UV stack, each map of which is unprojected from a specific viewpoint. After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then employ region unification to ensure the coherence of the object parts. To fuel the learning of semantics prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.



Paperid:169 Oral
Authors:Peiwen Sun,Honggang Zhang,Di Hu
Abstract:
Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions as a simpler signal for learning than the complex audio-visual grounding, which leads to the disregard of important modality information. Generally, the anomalous phenomena are often complex and cannot be directly observed systematically. In this study, we made a pioneering effort with the proper synthetic data to categorize and analyze the phenomena as two types, "audio priming bias" and "visual prior", according to the source of anomalies. For audio priming bias, to enhance audio sensitivity to different intensities and semantics, a perception module specifically for audio perceives the latent semantic information and incorporates information into a limited set of queries, namely active queries. Moreover, the interaction mechanism related to such active queries in the transformer decoder is customized to adapt to the need of regulating interactions among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without even changing the structure of the model. During experiments, our observations demonstrate the presence and impact of these biases in existing models. Finally, through experimental evaluation on AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of biases, achieving competitive performance across all three subsets.



Paperid:170 Oral
Authors:Weiqi Li,Shijie Zhao,Bin Chen,Xinhua Cheng,Junlin Li,Li zhang,Jian Zhang
Abstract:
With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced for reducing transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, which overlooks the fact that the content viewed on head mounted displays (HMDs) is actually a rendered viewport instead of an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, which is the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR allows obtaining LR ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In our ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and ERP, enabling end-to-end training of ResVR pipeline. Furthermore, a spherical pixel shape representation technique is innovatively derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that our ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead.
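The viewport-rendering side hinges on the mapping from viewport pixels to ERP coordinates. The sketch below computes that mapping with a standard rectilinear-to-equirectangular projection (yaw and pitch only, no roll); it is a generic geometric routine for orientation, not the paper's discrete pixel sampling strategy or spherical pixel shape representation.

    import numpy as np

    def viewport_to_erp_coords(h, w, fov_deg, yaw_deg, pitch_deg, erp_h, erp_w):
        """Map each viewport pixel to fractional (row, col) sampling coordinates on an ERP image."""
        f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)            # focal length in pixels
        u, v = np.meshgrid(np.arange(w) - (w - 1) / 2,
                           np.arange(h) - (h - 1) / 2)
        dirs = np.stack([u, -v, np.full_like(u, f)], axis=-1)    # camera rays: x right, y up, z forward
        dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

        yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
        r_pitch = np.array([[1, 0, 0],
                            [0, np.cos(pitch), -np.sin(pitch)],
                            [0, np.sin(pitch), np.cos(pitch)]])
        r_yaw = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                          [0, 1, 0],
                          [-np.sin(yaw), 0, np.cos(yaw)]])
        d = dirs @ (r_yaw @ r_pitch).T                           # rotate rays to the world frame

        lon = np.arctan2(d[..., 0], d[..., 2])                   # [-pi, pi]
        lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))           # [-pi/2, pi/2]
        col = (lon / np.pi + 1) / 2 * (erp_w - 1)
        row = (0.5 - lat / np.pi) * (erp_h - 1)
        return row, col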



Paperid:171 Oral
Authors:Zixuan Gong,Qi Zhang,Guangyin Bao,Lei Zhu,Yu Zhang,KE LIU,Liang Hu,Duoqian Miao
Abstract:
The limited data availability and the low signal-to-noise ratio of fMRI signals lead to the challenging task of fMRI-to-image retrieval. State-of-the-art MindEye remarkably improves fMRI-to-image retrieval performance by leveraging a large model, i.e., a 996M MLP Backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP’s Vision Transformer (ViT). However, significant individual variations exist among subjects, even under identical experimental setups, mandating the training of large subject-specific models. The substantial parameters pose significant challenges in deploying fMRI decoding on practical devices. To this end, we propose Lite-Mind, a lightweight, efficient, and robust brain representation learning paradigm based on the Discrete Fourier Transform (DFT), which efficiently aligns fMRI voxels to fine-grained information of CLIP. We elaborately design a DFT backbone with Spectrum Compression and Frequency Projector modules to learn informative and robust voxel embeddings. Our experiments demonstrate that Lite-Mind achieves an impressive 94.6% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind is also shown to transfer to smaller fMRI datasets and establishes a new state-of-the-art for zero-shot classification on the GOD dataset.
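A toy illustration of frequency-domain compression of a voxel vector is given below; the rule of keeping the lowest-frequency coefficients and the sizes used are assumptions, whereas the paper's Spectrum Compression and Frequency Projector modules are learned components.

    import numpy as np

    def dft_spectrum_compress(voxels, keep=256):
        """Sketch: compress a flattened fMRI voxel vector in the frequency domain.

        voxels : (N,) voxel activations for one stimulus
        keep   : number of low-frequency complex coefficients retained (assumption)
        """
        spec = np.fft.rfft(voxels)[:keep]
        # real and imaginary parts form a compact embedding for a downstream projector
        return np.concatenate([spec.real, spec.imag])

    embedding = dft_spectrum_compress(np.random.randn(16000), keep=256)   # -> shape (512,)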



Paperid:172 Oral
Authors:Xinji Mai,Junxiong Lin,Haoran Wang,Zeng Tao,Yan Wang,Shaoqi Yan,Xuan Tong,Jiawen Yu,Boyang Wang,Ziheng Zhou,Qing Zhao,Shuyong Gao,Wenqiang Zhang
Abstract:
In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses intrinsic prompts with a Prompt Pool and a Sparse Feature Fusion (SFF) module. The design of the Prompt Pool is aimed at integrating information from different modalities, while intrinsic prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the Sparse Feature Fusion module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and intrinsic prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, have proven that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of modality absence and multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information.



Paperid:173 Oral
Authors:Shao-Kui Zhang,Hanxi Zhu,Xuebin Chen,Jinghuan Chen,Zhike Peng,Ziyang Chen,Yong-Liang Yang,Song-Hai Zhang
Abstract:
Humans understand digital 3D scenes by observing them from reasonably placed virtual cameras. Selecting camera views is fundamental for 3D scene applications but is typically manual. Existing literature on selecting views is based on regular or polygonal room shapes without focusing on the objects in the scene, resulting in poorly composed views concerning objects. This paper introduces ScenePhotographer, an object-oriented framework for automatic view selection in residential scenes. Potential object-oriented views are yielded by a learning-based method, which clusters objects into groups according to objects' functional and spatial relationships. We propose four criteria to evaluate the views and recommend the best batch, including room information, visibility, composition balance, and line dynamics. Each criterion measures the view according to its corresponding photography rule. Experiments on various room types and layouts demonstrate that our method can generate views focusing on coherent objects while preserving aesthetics, leading to more visually pleasing results.



Paperid:174 Oral
Authors:Zhixi Cai,Shreya Ghosh,Aman Pankaj Adatia,Munawar Hayat,Abhinav Dhall,Tom Gedeon,Kalin Stefanov
Abstract:
The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code will be made public.



Paperid:175 Poster
Authors:Yuzhuo Wang,Junwei He,Hongzhi Wang
Abstract:
KG completion via link prediction has long been a focus of researchers. However, the overwhelming majority of existing methods can only serve 2-ary KGs. In practice, knowledge hypergraphs (KHs) covering facts beyond binary relations are far more ubiquitous yet receive little attention, and most KG studies adapt poorly to them. The few works targeting N-ary KHs generally extend KG methods directly, usually transforming N-ary knowledge into role-value pairs or triples and thereby largely simplifying the inherent associations within each piece of knowledge. Furthermore, previous models study each piece of N-ary knowledge independently, so the structural correlations among them are completely neglected. Motivated by these observations, and avoiding breaking the knowledge structure in KHs as previous studies do, we build on the original knowledge format and propose RHKH, the first KH reasoning model based on an innovative relational hypergraph neural network (RHNN). To handle the complicated compositions implied by the original format of N-ary tuples, RHNN discovers associations both within and among pieces of knowledge, and also considers the complex interactions between the relation and the entities involved in the same piece of knowledge. To refine such interactions, semantic components at each arity-position of relations are distinguished, along with a position-specific shift. Extensive experiments demonstrate the effectiveness of our RHKH.



Paperid:176 Poster
Authors:Siqi Wang,Chao Liang,Yunfan Gao,Liu Yang,Jing Li,Haofen Wang
Abstract:
Industrial parks are critical to urban economic growth, blending technology and urban life to foster innovation. Yet, their development often faces challenges due to imbalances between industrial needs and urban services, necessitating strategic planning and operation. This paper presents IndustryScopeKG, a pioneering multi-modal, multi-level large-scale industrial park knowledge graph, and the IndustryScopeGPT framework. By leveraging vast datasets, including corporate, socio-economic, and geospatial information, IndustryScopeKG captures the intricate relationships and semantics of industrial parks, facilitating comprehensive analysis and planning. The IndustryScopeGPT framework, integrating LLMs with Monte Carlo Tree Search, enhances decision-making capabilities, enabling dynamic and adaptable responses to the diverse needs of industrial park planning and operation (IPPO) tasks. Our contributions include the release of the first open-source industrial park knowledge graph, IndustryScopeKG, and the demonstration of the IndustryScopeGPT framework's efficacy in site selection and planning tasks through the IndustryScopeQA benchmark. Our findings highlight the potential of combining LLMs with extensive datasets and innovative frameworks, setting a new standard for research and practice in the field.



Paperid:177 Poster
Authors:Lize Zhou,Xiaoqi Wang,Jian Xiong,Xianzhong Long,Hao Gao
Abstract:
Existing blind image quality assessment (BIQA) models are susceptible to biases related to distortion intensity and domain. Intensity bias manifests as an over-sensitivity to severe distortions and under-estimation of minor ones, while domain bias stems from the discrepancies between synthetic and authentic distortion properties. This work introduces a unified learning framework to address these distortion biases. We integrate distortion perception and restoration modules to address intensity bias. The restoration module uses a combined image-level and feature-level denoising method to restore distorted images, where easily restorable minor distortions serve as references for mildly distorted images, and severe distortions benefit directly from distortion perception. Finally, a distortion intensity matrix is calculated via intensity-aware cross-attention to adaptively handle intensity bias. To tackle domain bias, we introduce a distortion domain recognition task, leveraging inherent differences between synthetic and authentic distortions for adaptive quality score weighting. Experimental results show that our proposed method achieves state-of-the-art performance on a multitude of synthetic and authentic IQA benchmark datasets. The code and models will be available.



Paperid:178 Poster
Authors:Mamta Mamta,gopendra Vikram singh,Deepak Raju Kori,Asif Ekbal
Abstract:
Sentiment analysis and complaint identification are key tools in mining user preferences by measuring the polarity and breach of expectations. Recent works on complaint identification identify aspect categories and classify them into complaint or not-complaint classes. However, aspect category-based complaint identification provides only high-level information about the features of products. In addition, it is also observed that the user sometimes does not complain about a specific aspect but expresses concern about specific aspects in a respectful way. Currently, uni-modal and multimodal studies do not differentiate between this thin line between complaint and concern. In this work, we propose the task of multimodal aspect term-based analysis beyond sentiments and complaints. It comprises two sub-tasks: (i) classification of the given aspect term into one of four classes, viz. praise, concern, complaint, and others; (ii) identification of the cause of the praise, concern, and complaint classes. We propose a first benchmark explainable multimodal corpus annotated for aspect term-based complaints, praises, concerns, their corresponding causes, and sentiments. Further, we propose an effective technique for the joint learning of aspect term-based complaint/concern/praise identification and cause extraction tasks (primary tasks), where sentiment analysis is used as a secondary task to assist the primary tasks, and establish them as baselines for further research in this direction. A sample dataset has been made available at: https://anonymous.4open.science/r/MAspectX-327E/README.md. The whole dataset will be made publicly available for research after acceptance of the paper.



Paperid:179 Poster
Authors:Jing Yang,XiaowenJiang,Yuan Gao,Laurence Tianruo Yang,JieMing Yang
Abstract:
Inductive link prediction aims to infer missing triples on unseen graphs, which contain unseen entities and relations during training. The performance of existing inductive inference methods is hindered by their limited generalization capability on fully unseen graphs, which is rooted in the neglect of the intrinsic graph structure. In this paper, we aim to enhance the model's generalization ability to unseen graphs and thus propose a novel Hyper-Relation aware multi-view model HyRel for learning the global transferable structure of graphs. Distinct from existing studies, we introduce a novel perspective focused on learning the inherent hyper-relation structure consisting of the relation positions and affinity. The hyper-relation structure is independent of specific entities, relations, or features, thus allowing for transferring the learned knowledge to any unseen graphs. We adopt a multi-view approach to model the hyper-relation structure. HyRel incorporates neighborhood learning on each view, capturing nuanced semantics of relative relation position. Meanwhile, dual-view contrastive constraints are designed to enforce the robustness of transferable structural knowledge. To the best of our knowledge, our work makes one of the first attempts to generalize the learning of hyper-relation structures, offering high flexibility and ease of use without reliance on any external resources. HyRel demonstrates SOTA performance compared to existing methods under extensive inductive settings, particularly on fully unseen graphs, and validates the efficacy of learning hyper-relation structures for improving generalization. The code is available online at https://github.com/hncps6/HyRel.



Paperid:180 Poster
Authors:Yiying Bao,Hao Zhou,Chao Peng,Chenyang Xu,Shuo Shi,Kecheng Cai
Abstract:
In various domains such as transportation, resource management, and weather forecasting, there is an urgent need for methods that can provide predictions over a sufficiently long time horizon to encompass the period required for decision-making and implementation. Compared to traditional time series forecasting, ultra-long time series forecasting requires enhancing the model’s ability to infer long-term series, while maintaining inference costs within an acceptable range. To address this challenge, we propose the Boundary-Aware Periodicity-based sparsification strategy for Ultra-Long time series forecasting (BAP-UL). The periodicity-based sparsification strategy is a general lightweight data sparsification framework that captures periodic features in time series and reorganizes inputs and outputs into shorter sub-sequences for model prediction. The boundary-aware method, combined with the bounded nature of time series, improves the model’s capability to predict extreme peaks and irregular time series by adjusting the prediction results. We conducted extensive experiments on benchmark datasets, and the BAP-UL model achieved nearly 90% state-of-the-art results under various experimental conditions. Moreover, the periodicity-based data sparsification method proposed in this paper exhibits broad applicability. It enhances the upper limit of sequence length for mainstream time series forecasting models and achieves state-of-the-art prediction results.
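The periodicity-based sparsification can be pictured with the sketch below: estimate a dominant period from the magnitude spectrum, then reorganize the long series into shorter period-aligned sub-sequences for the forecasting model. The period-detection rule and stride are simplifications for illustration, not the paper's exact procedure.

    import numpy as np

    def dominant_period(series):
        """Estimate the dominant period from the magnitude spectrum (ignoring the DC term)."""
        spec = np.abs(np.fft.rfft(series - series.mean()))
        freq = np.argmax(spec[1:]) + 1          # index of the strongest non-DC frequency
        return max(1, len(series) // freq)

    def periodic_sparsify(series, stride=None):
        """Reshape a long series into (num_subseq, period) sub-sequences for the model."""
        p = dominant_period(series)
        stride = stride or p
        n = (len(series) - p) // stride + 1
        return np.stack([series[i * stride: i * stride + p] for i in range(n)])

    # e.g. a daily cycle sampled every 15 minutes (period 96)
    subseqs = periodic_sparsify(np.sin(np.arange(9600) * 2 * np.pi / 96))   # -> (100, 96)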



Paperid:181 Poster
Authors:Shuiping Gou,Xin Wang,Xinlin Wang,Yunzhi Chen
Abstract:
Driven by the complementary information fusion of optical and synthetic aperture radar (SAR) images, optical-SAR image matching has drawn much attention. However, the significant radiometric differences between them impose great challenges on accurate matching. Most existing approaches convert SAR and optical images into a shared feature space to perform the matching, but these methods often fail to achieve robust matching since the feature spaces are unknown and uninterpretable. Motivated by the interpretable latent space of diffusion models, this paper formulates an optical-SAR image translation and matching framework via a dynamically conditioned diffusion model (DCDM) to achieve interpretable and robust optical-SAR cross-modal image matching. Specifically, in the denoising process, to filter out outlier matching regions, a gated dynamic sparse cross-attention module is proposed to facilitate efficient and effective long-range interactions of multi-grained features between the cross-modal data. In addition, a spatial position consistency constraint is designed to promote the cross-attention features to perceive the spatial corresponding relation in different modalities, improving the matching precision. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in terms of both the matching accuracy and the interpretability.



Paperid:182 Poster
Authors:Yiyang Luo,Ke Lin,Chao Gu
Abstract:
Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our model revolutionizes scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.



Paperid:183 Poster
Authors:Zhida Zhao,Jia Li,Lijun Wang,Yifan Wang,Huchuan Lu
Abstract:
Existing RGB-D semantic segmentation methods struggle to handle modality missing input, where only RGB images or depth maps are available, leading to degenerated segmentation performance. We tackle this issue using MaskMentor, a new pre-training framework for modality missing segmentation, which advances its counterparts via two novel designs: Masked Modality and Image Modeling (M2IM), and Self-Teaching via Token-Pixel Joint reconstruction (STTP). M2IM simulates modality missing scenarios by combining both modality- and patch-level random masking. Meanwhile, STTP offers an effective self-teaching strategy, where the trained network assumes a dual role, simultaneously acting as both the teacher and the student. The student with modality missing input is supervised by the teacher with complete modality input through both token- and pixel-wise masked modeling, closing the gap between missing and complete input modalities. By integrating M2IM and STTP, MaskMentor significantly improves the generalization ability of the trained model across diverse input conditions, and outperforms state-of-the-art methods on two popular benchmarks by a considerable margin. Extensive ablation studies further verify the effectiveness of the above contributions.
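
The modality- and patch-level random masking described for M2IM can be pictured with the minimal sketch below; the masking ratios, token shapes, and function name are hypothetical and only illustrate the general idea, not the paper's exact recipe.

```python
import torch

def m2im_mask(rgb_tokens, depth_tokens, p_modality=0.3, p_patch=0.5):
    """Sketch of modality- and patch-level random masking: with probability
    p_modality an entire modality is dropped for a sample, otherwise a random
    subset of its patch tokens is masked (masked positions are zeroed out).

    rgb_tokens, depth_tokens: (batch, num_patches, dim)
    """
    def mask_one(tokens):
        b, n, _ = tokens.shape
        drop_modality = torch.rand(b, 1, 1, device=tokens.device) < p_modality
        patch_mask = torch.rand(b, n, 1, device=tokens.device) < p_patch
        keep = ~(drop_modality | patch_mask)       # broadcast to (b, n, 1)
        return tokens * keep
    return mask_one(rgb_tokens), mask_one(depth_tokens)
```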



Paperid:184 Poster
Authors:Rui Liu,Yifan Hu,Yi Ren,Xiang Yin,Haizhou Li
Abstract:
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence of the agent's response, which includes both semantic and style knowledge. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, which includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/GPT-Talker/GPT-Talker.



Paperid:185 Poster
Authors:Jing Yang,ShunDong Yang,Yuan Gao,JieMing Yang,Laurence Tianruo Yang
Abstract:
Link prediction aims to infer missing valid triplets to complete knowledge graphs, with the recent inclusion of multimodal information to enrich entity representations. Existing methods project multimodal information into a unified embedding space or learn modality-specific features separately for later integration. However, the performance of such studies was limited because they neglect the modality compatibility and conflict semantics carried by entities in valid and invalid triplets. In this paper, we aim at modeling inter-entity modality interactions and thus propose a novel modality circular fusion approach (MoCi), which interweaves the multimodal contexts of entities. Firstly, unlike most methods in this task that directly fuse modalities, we design a triplets-prompt modality contrastive pre-training to align modality semantics beforehand. Moreover, we propose a modality circular fusion model using a simple yet efficient multilinear transformation strategy. This allows explicit inter-entity modality interactions, distinguishing it from methods confined to fusion within individual entities. To the best of our knowledge, MoCi presents one of the pioneering frameworks tailored to grasp inter-entity modality semantics for better link prediction. Extensive experiments on seven datasets demonstrate that our model yields SOTA performance, confirming the efficacy of MoCi in modeling inter-entity modality interactions. Our code is released at https://github.com/MoCiGitHub/MoCi.



Paperid:186 Poster
Authors:Rongjie Huang,Yongqi Wang,Ruofan Hu,Xiaoshan Xu,Zhiqing Hong,Dongchao Yang,Xize Cheng,Zehan Wang,Ziyue Jiang,Zhenhui Ye,Luping Liu,Siqi Zheng,Zhou Zhao
Abstract:
Voice large language models (LLMs) cast voice synthesis as a language modeling task in a discrete space, and have demonstrated significant progress to date. Despite the recent success, the current development of voice LLMs in low-resource applications is hampered by data scarcity and high computational cost. In this work, we propose VoiceTuner, a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. Specifically, 1) to mitigate data scarcity, we leverage a large-scale unlabeled dataset and pre-train VoiceTuner-SSL without pre-defined applications, which can be fine-tuned on downstream tasks; 2) to further reduce the high training cost of complete fine-tuning, we introduce a multiscale transformer adapter that effectively updates only around 1% of the parameters as a plug-and-play module. Experimental results demonstrate that VoiceTuner-SSL presents strong acoustic continuations, and VoiceTuner achieves state-of-the-art results in rich-resource TTS evaluation compared with competitive baseline models. Low-resource (1h, 10h, 30h) downstream applications including zero-shot TTS, instruction TTS, and singing voice synthesis demonstrate VoiceTuner's superior audio quality and style similarity with reduced data requirements and computational cost. Audio samples are available at https://VoiceTuner.github.io
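
The roughly 1%-parameter fine-tuning idea can be illustrated with a generic bottleneck adapter on a frozen backbone; VoiceTuner's multiscale transformer adapter is more elaborate, and the dimensions and names below are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Tiny residual adapter inserted after a frozen transformer sub-layer.

    Only the adapter's down/up projections are trainable, so the number of
    updated parameters stays around d_model * bottleneck * 2 per layer.
    """
    def __init__(self, d_model: int = 1024, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):                     # hidden: (batch, seq, d_model)
        return hidden + self.up(self.act(self.down(hidden)))

# Freeze the backbone and train only the adapters (a few percent of parameters).
backbone = nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = BottleneckAdapter()
```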



Paperid:187 Poster
Authors:Zhi Zhou,Junke Zhu,ZhangJin Huang
Abstract:
The 3D Gaussian Splatting(3D-GS) method has recently sparked a new revolution in novel view synthesis with its remarkable visual effects and fast rendering speed. However, its reliance on simple spherical harmonics for color representation leads to subpar performance in complex scenes, struggling with effects like specular highlights, light refraction, etc. Also, 3D-GS adopts a periodic split strategy, significantly increasing the model's disk space and hindering rendering efficiency. To tackle these challenges, we introduce Gaussian Splatting with Neural Basis Extension (GSNB), a novel approach that substantially improves the performance of 3D-GS in demanding scenes while reducing storage consumption. Drawing inspiration from basis function, GSNB employs a light-weight MLP to share feature coefficients with spherical harmonics and extends the color calculation of 3D Gaussians for more precise visual effect modeling. This combination enables GSNB to achieve impressive results in scenes with challenging lighting and reflection conditions. Moreover, GSNB utilizes pre-computation to bake the network's output, thereby alleviating inference workload and subsequent speed loss. Furthermore, to leverage the capabilities of Neural Basis Extension and eliminate redundant Gaussians, we propose a new importance criterion to prune the converged Gaussian model and obtain a more compact representation through re-optimization. Experimental results demonstrate that our method delivers high-quality rendering in most scenarios and effectively reduces redundant Gaussians without compromising rendering speed. Our code and real-time demos will be released soon.



Paperid:188 Poster
Authors:Yuting Zhang,Zhao Zhang,Yiqing Wu,Ying Sun,Fuzhen Zhuang,Wenhui Yu,Lantao Hu,Han Li,Kun Gai,Zhulin An,Yongjun Xu
Abstract:
Multi-Domain Recommendation (MDR) aims to leverage data from multiple domains to enhance recommendations through overlapping users or items. However, extreme overlap sparsity in some applications makes it challenging for existing multi-domain models to capture domain-shared information. Moreover, the sparse overlapping users or items result in a cold start problem in every single domain and hinder feature space alignment of different domains, posing a challenge for joint optimization across domains. Nevertheless, in multi-domain short video recommendation, we identify two key characteristics that can greatly alleviate the overlapping sparsity issue and enable domain alignment. (1) The following relations between users and publishers exhibit strong preferences and a concentration effect, as popular video publishers, who constitute a small portion of all users, are followed by a majority of users across various domains. (2) The tag tree structure shared by all videos can help facilitate multi-grained alignment across multiple domains. Based on these characteristics, we propose tag tree-guided multi-grained alignment with publisher enhancement for multi-domain video recommendation. Our model integrates publisher and tag nodes into the user-video bipartite graph as central nodes, enabling user and video alignment across all domains via graph propagation. Then, we propose a tag tree-guided decomposition method to obtain hierarchical graphs for multi-grained alignment. Further, we design tree-guided contrastive learning methods to capture the intra-level and inter-level node relations, respectively. Finally, extensive experiments on two real-world short video recommendation datasets demonstrate the effectiveness of our model.



Paperid:189 Poster
Authors:Chaoxiang He,Xiaofan Bai,Xiaojing Ma,Bin Benjamin Zhu,Pingyi Hu,Jiayun Fu,Hai Jin,Dongmei Zhang
Abstract:
Cloud-based machine learning services are attractive but expose a cloud-deployed DNN model to the risk of tampering. Black-box integrity verification (BIV) enables the owner or end-users to ascertain whether a cloud-deployed DNN model has been tampered with via returned responses of only top-1 labels. Fingerprinting generates fingerprint samples to query the model to achieve BIV of the model with no impact on the model's accuracy. In this paper, we introduce BIVBench, the first benchmark for BIV of DNN models, encompassing 16 types of practical modifications covering typical tampering scenarios. We reveal that existing fingerprinting methods, which focus on a limited range of tampering types, lack sensitivity in detecting subtle, yet common and potentially severe, tampering effectively. To fill this gap, we propose MiSentry (Model integrity Sentry), a novel fingerprinting method that strategically incorporates only a few crucial subtly tampered models into a model zoo, leverages meta-learning, and maximizes the divergence of the output predictions between the untampered targeted model and those models in the model zoo to generate highly sensitive, generalizable, and effective fingerprint samples. Extensive evaluations using BIVBench demonstrate that MiSentry substantially outperforms existing state-of-the-art fingerprinting methods, particularly in detecting subtle tampering.
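
A minimal sketch of the core divergence-maximization step one might use to craft fingerprint samples is shown below, assuming PyTorch classifiers for the target model and the model zoo; MiSentry additionally relies on meta-learning and careful zoo construction, which are omitted here, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def craft_fingerprint(x, target_model, model_zoo, steps=100, lr=0.01):
    """Optimize an input so that the untampered target model and the subtly
    tampered models in the zoo disagree as much as possible on it."""
    x = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        log_p_target = F.log_softmax(target_model(x), dim=-1)
        loss = torch.zeros((), device=x.device)
        for m in model_zoo:
            p_tampered = F.softmax(m(x), dim=-1)
            # negative KL(p_tampered || p_target); minimizing this maximizes disagreement
            loss = loss - F.kl_div(log_p_target, p_tampered, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```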



Paperid:190 Poster
Authors:Andreea-Maria Oncescu,Joao F. Henriques,A. Sophia Koepke
Abstract:
Recent advancements in machine learning have fueled research on multimodal interactions, such as text-to-video and text-to-audio retrieval tasks. These tasks require models to understand the semantic content of input videos, including objects, sounds, and characters. The models also need to learn their spatial arrangement and the temporal relationships of sounds. In this work, we tackle the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps dataset. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal understanding of recent models. Lastly, we investigate a new loss function that encourages text-audio models to focus on the temporal ordering of events.



Paperid:191 Poster
Authors:Federico Espositi,Andrea Bonarini
Abstract:
This paper delves into the exploration of spaces as non-anthropomorphic avatars. We are investigating interaction with entities showing features different from humans’, to understand how they can be embodied as avatars and perceived as living, social beings. To push this investigation to its limit, we have designed as an avatar an interactive space (the Room), that challenges both the anthropomorphic structure, and most of the social interaction mechanisms we are used to. We introduce a pilot framework for the Room design, addressing challenges related to its body, perception, and interaction process. We present an implementation of the framework as an interactive installation, namely a real-time, two-player, VR experience, featuring the Room avatar, with a focus on haptic feedback as the main means of perception for the subject embodying the Room. By radically challenging anthropomorphism, we seek to investigate the most basic aspects of embodiment and social cognition.



Paperid:192 Poster
Authors:Yong Yang,Aoqi Zhao,Shuying Huang,Xiaozheng Wang,Yajing Fan
Abstract:
Single hyperspectral image super-resolution (HSSR) aims to reconstruct a high-resolution hyperspectral image (HRHSI) from an observed low-resolution hyperspectral image (LRHSI). Most current methods combine CNN and Transformer structures to directly extract features of all channels in the LRHSI for image reconstruction, but they do not consider the interference of redundant information in adjacent bands, resulting in spectral and spatial distortions in the reconstruction results, as well as an increase in model computational complexity. To address this issue, this paper proposes a spectral clustering-based pyramid super-resolution network (SCPSN) to progressively reconstruct the HRHSI at different scales. In each layer of the pyramid network, a clustering super-resolution block consisting of a spectral clustering block (SCB), a patch non-local attention block (PNAB), and a dynamic fusion block (DFB) is designed to reconstruct the detail features of the LRHSI. Specifically, given the high correlation between adjacent spectral bands in the LRHSI, an SCB is first constructed to cluster the spectral channels and filter the hyperchannels. This reduces the interference of redundant spectral information and the computational complexity of the model. Then, by utilizing the non-local similarity of features within a channel, a PNAB is constructed to enhance the features in the hyperchannels. Next, a DFB is designed to reconstruct the features of all channels in the LRHSI by establishing correlations between the enhanced hyperchannels and the other channels. Finally, the reconstructed channels are upsampled and added to the upsampled LRHSI to obtain the reconstructed HRHSI. Extensive experiments validate that the performance of SCPSN is superior to that of some state-of-the-art methods in terms of visual effects and quantitative metrics. In addition, our model does not require training on large-scale datasets, in contrast to other methods. The dataset and code will be released on GitHub.



Paperid:193 Poster
Authors:Rui Xu,Gaolei Li,Changze Li,Zhaohui Yang,Yuchen Liu,Mingzhe Chen
Abstract:
By leveraging multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a prominent technique in the realm of 3D object reconstruction. However, existing methods primarily focus on global scene reconstruction using large datasets, which necessitate substantial computational resources and impose high-quality requirements on input images. In practical applications, however, users prioritize the 3D reconstruction of on-demand specific objects (OSOs) based on their individual demands. Furthermore, collected images transmitted through a high-interference wireless environment (HIWE) negatively impact the accuracy of NeRF reconstruction, thereby limiting its scalability. In this paper, we propose a novel on-demand Semantic Neural Radiance Fields (OSNeRF) scheme, which offers fast and robust 3D object reconstruction for diverse tasks. Within OSNeRF, a semantic encoder is employed to extract core semantic features of OSOs from the collected scene images, a semantic decoder is utilized to facilitate robust image recovery under HIWE conditions, and a lightweight renderer is employed for fast and efficient object reconstruction. Moreover, a semantic control unit (SCU) is introduced to guide the above components, thereby enhancing the efficiency of reconstruction. Experiments demonstrate that the proposed OSNeRF enables fast and robust object reconstruction in HIWE, surpassing the performance of state-of-the-art (SOTA) methods in terms of reconstruction quality.



Paperid:194 Poster
Authors:Tianshuo Peng,Zuchao Li,Lefei Zhang,hai zhao,Ping Wang,Bo Du
Abstract:
Large Language Models (LLMs), benefiting from the auto-regressive modelling performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, when extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs), a key difficulty arises: image information is processed in the LMM as continuous visual embeddings, which cannot provide discrete supervised labels for classification. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual tokens, which maps the visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
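
A minimal sketch of the "visual token" idea as we read it from the abstract is shown below, assuming a simple linear projection onto the LLM vocabulary; the actual mapping and supervision used in the paper may differ, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTokenHead(nn.Module):
    """Map continuous visual features to probability distributions over the
    LLM vocabulary, so that visual positions can be supervised like text tokens."""
    def __init__(self, visual_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(visual_dim, vocab_size)

    def forward(self, visual_feats):               # (batch, num_patches, visual_dim)
        return F.softmax(self.proj(visual_feats), dim=-1)

# A soft target distribution (e.g., derived from aligned text embeddings) could
# then supervise each visual position with a cross-entropy or KL objective.
head = VisualTokenHead(visual_dim=1024, vocab_size=32000)
probs = head(torch.randn(2, 256, 1024))            # (2, 256, 32000)
```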



Paperid:195 Poster
Authors:Shengzhang,Xi Yang
Abstract:
Fine-grained remote sensing object detection aims to locate and identify specific targets with variable scale and orientation against complex backgrounds in high-resolution, wide-swath images, which requires both high precision and real-time processing. Although traditional knowledge distillation technology shows its effectiveness in model compression and accuracy preservation for natural images, the challenges of heavy background noise and intra-class similarity faced by remote sensing images limit the knowledge quality of the teacher model and the learning ability of the student model. To address these issues, we propose an Information Fusion with Knowledge Distillation (IFKD) method that enhances the student model's performance by integrating information from external images, the frequency domain, and hyperbolic space. Firstly, we propose an external interference enhancement (EDE) module, which utilizes MobileSAM to introduce external information, enriching the teacher's knowledge set, competing with the teacher for the right to cultivate the student, and weakening the student's dependence on the teacher. Secondly, to strengthen the representation of key features and improve the quality of knowledge, a frequency domain reconstruction (FDR) module is proposed, which resamples the low-frequency background components to suppress the interference of background noise. Finally, aiming at the problem of intra-class similarity, a hyperbolic similarity mask (HSM) module is designed to magnify intra-class differences and guide the student to analyze the teacher's knowledge, exploiting the exponentially growing representational capacity of hyperbolic space. Experiments on the optical ShipRSImageNet and SAR Aircraft-1.0 datasets verify that the IFKD method significantly enhances performance in fine-grained recognition tasks compared to existing distillation techniques: the 65.8% $AP_{50}$ on ShipRSImageNet is improved by 2.6%, and the 81.4% $AP_{50}$ on SAR Aircraft-1.0 is improved by 1.4%.



Paperid:196 Poster
Authors:Jiaxin Gao,Yaohua Liu
Abstract:
Due to device constraints and lighting conditions, captured images frequently exhibit coupled low-resolution and ultra-dark degradations. Enhancing the visibility and resolution of ultra-dark images simultaneously is crucial for practical applications. Current approaches often address both tasks in isolation or through simplistic cascading strategies, while also relying heavily on empirical and manually designed composite loss constraints, which inevitably results in compromised training efficacy, increased artifacts, and diminished detail fidelity. To address these issues, we propose TriCo, the first to adopt a Tri-level learning framework that explicitly formulates the bidirectional Cooperative relationship and devises algorithms to tackle coupled degradation factors. In the optimization across Upper (U)-Middle (M)-Lower (L) levels, we model the synergistic dependencies between illumination learning and super-resolution tasks within the M-L levels. Moving to the U-M levels, we introduce hyper-variables to automate the learning of beneficial constraints for both learning tasks, moving beyond the traditional trial-and-error pitfalls of the learning process. Algorithmically, we establish a Phased Gradient-Response (PGR) algorithm as our training mechanism, which facilitates a dynamic, inter-variable gradient feedback and ensures efficient and rapid convergence. Moreover, we present the Integrated Hybrid Expert Modulator (IHEM), which merges inherent illumination priors with universal semantic model features to adaptively guide pixel-level high-frequency detail recovery. Extensive experimentation validates the framework's broad generalizability across challenging ultra-dark scenarios, outperforming current state-of-the-art methods across 4 real and synthetic benchmark datasets over 8 metrics (e.g., 5.8%$\uparrow$ in PSNR, 26.6%$\uparrow$ in LPIPS, and 13.9%$\uparrow$ in RMSE).



Paperid:197 Poster
Authors:Yuanfeng Pan,Wenkang Su,Jiangqun Ni,Qingliang Liu,Yulin Zhang,Donghua Jiang
Abstract:
Recent achievements have shown that model-based steganographic schemes hold promise for better security than heuristic-based ones, as they can provide theoretical guarantees on secure steganography under a given statistical model. However, it remains a challenge to exploit the correlations between DCT coefficients for secure steganography in practical scenarios where only a single compressed JPEG image is available. To cope with this, we propose a novel model-based steganographic scheme using the Conditional Random Field (CRF) model with a four-element cross-neighborhood to capture the dependencies among DCT coefficients for JPEG steganography with symmetric embedding. Specifically, the proposed CRF model is characterized by a delicately designed energy function, which is defined as the weighted sum of a series of unary and pairwise potentials, where the potentials associated with the statistical detectability of steganography are formulated as the KL divergence between the statistical distributions of cover and stego. By optimizing the constructed energy function under the given payload constraint, the non-independent distortion cost corresponding to the least detectability can be accordingly obtained. Extensive experimental results validate the effectiveness of our proposed scheme, especially outperforming J-MiPOD, the previous state-of-the-art scheme based on independent embedding.
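
Schematically, the energy function is a weighted sum of unary and pairwise potentials over the four-element cross-neighborhood; the toy sketch below illustrates only this structure, with placeholder potential functions standing in for the paper's KL-divergence-based terms.

```python
import numpy as np

def crf_energy(states, unary_fn, pairwise_fn, w_u=1.0, w_p=1.0):
    """Energy of a grid CRF with a four-element cross-neighborhood:
    a weighted sum of unary and pairwise potentials.

    `states` is a 2-D array of per-site variables (e.g., embedding changes of
    DCT coefficients); the placeholder potentials are illustrative only.
    """
    h, w = states.shape
    energy = 0.0
    for i in range(h):
        for j in range(w):
            energy += w_u * unary_fn(states[i, j])
            if i + 1 < h:                          # vertical neighbor
                energy += w_p * pairwise_fn(states[i, j], states[i + 1, j])
            if j + 1 < w:                          # horizontal neighbor
                energy += w_p * pairwise_fn(states[i, j], states[i, j + 1])
    return energy

# Toy usage with placeholder potentials.
states = np.random.randint(-1, 2, size=(8, 8))
e = crf_energy(states, unary_fn=abs, pairwise_fn=lambda a, b: float(a != b))
```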



Paperid:198 Poster
Authors:Yi Bin,Junrong Liao,Yujuan Ding,Haoxuan Li,Yang Yang,See-Kiong Ng,Heng Tao Shen
Abstract:
Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information, thereby understanding and creating content of the physical world as coherently as human beings do. Previous work on cross-modal coherence modeling attempted to leverage order information from another modality to assist the coherence recovery of the target modality. Despite the effectiveness, labeled coherency information is not always available and might be costly to acquire, making cross-modal guidance hard to leverage. To tackle this challenge, this paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency, and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. More specifically, it leverages high-confidence predicted pairwise order in one modality as reference information to guide the coherence modeling in the other. An iterative learning paradigm is further designed to jointly optimize the coherence modeling in two modalities with selected guidance from each other. The iterative cross-modal boosting also functions at inference time to further enhance coherence prediction in each modality. Experimental results on two public datasets demonstrate that the proposed method outperforms existing methods for cross-modal coherence modeling tasks. Major technical modules have been shown to be effective through ablation studies.



Paperid:199 Poster
Authors:Bingzhi Chen,Ruihan Liu,Yishu Liu,Xiaozhao Fang,Jiahui Pan,Guangming Lu,Zheng Zhang
Abstract:
Due to the inherent vulnerability of neural networks, adversarial attacks present formidable challenges to the robustness and reliability of deep learning models. In contrast to traditional adversarial training (AT) methods that prioritize semantic distillation and purification, our work pioneers a novel discovery attributing the insufficient adversarial robustness of models to the challenges of spatial attention shift and channel activation disarray. To mitigate these issues, we propose a robust spatial-aligned and channel-adapted learning paradigm, termed StayFocused, which integrates spatial alignment and channel adaptation to enhance the focus region against adversarial attacks by adaptively recalibrating the spatial attention and channel responses. The proposed StayFocused mainly benefits from two flexible mechanisms, i.e., the Spatial-aligned Hypersphere Constraint (SHC) and Channel-adapted Prompting Calibration (CPC). Specifically, SHC aims to enhance intra-class compactness and inter-class separation between adversarial and natural samples by measuring the angular margins and distribution distance within the hypersphere space. Inspired by the top-$K$ candidate prompts from the clean sample, CPC is designed to dynamically recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. To comprehensively learn feature representations, the StayFocused framework can be easily extended with additional branches in a multi-head training manner, further enhancing the model's robustness and adaptability. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and superiority of our StayFocused over state-of-the-art baselines.
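
SHC is described as imposing angular margins on a hypersphere; the sketch below shows only a generic additive-angular-margin loss applied to natural and adversarial features, assuming learnable class centers. It omits the distribution-distance term and is not the paper's exact formulation; the margin, scale, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def hypersphere_margin_loss(feat_nat, feat_adv, class_centers, labels,
                            margin=0.3, scale=16.0):
    """Generic angular-margin loss on the unit hypersphere: pull natural and
    adversarial features of the same class toward their class center, which
    tightens intra-class compactness and enlarges inter-class separation."""
    def angular_ce(feat):
        feat = F.normalize(feat, dim=-1)
        centers = F.normalize(class_centers, dim=-1)
        cos = feat @ centers.t()                           # (batch, num_classes)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target_logit = torch.cos(theta.gather(1, labels[:, None]) + margin)
        logits = cos.scatter(1, labels[:, None], target_logit)
        return F.cross_entropy(scale * logits, labels)
    return angular_ce(feat_nat) + angular_ce(feat_adv)
```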



Paperid:200 Poster
Authors:Hongyun Yu,Zhan Qu,Qihang Yu,Jianchuan Chen,Zhonghua Jiang,Zhiwen Chen,Shengyu Zhang,Jimin Xu,Fei Wu,chengfei lv,Gang Yu
Abstract:
Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.



Paperid:201 Poster
Authors:Zeyu Jin,Jia Jia,Qixin Wang,Kehan Li,Shuoyi Zhou,Songtao Zhou,Xiaoyu Qin,Zhiyong Wu
Abstract:
Speech-language multi-modal learning presents a significant challenge due to the finely nuanced information inherent in speech styles. Therefore, a large-scale dataset providing an elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech clips are processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/template-based annotation frameworks with limited information and diversity, our system provides an in-depth understanding of speech style through tailored natural language descriptions, thereby enabling accurate and voluminous data generation for large model training. With this system, we create SpeechCraft, a fine-grained bilingual expressive speech dataset. It is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips. Extensive experiments demonstrate that the proposed dataset significantly boosts speech-language task performance in both stylistic speech synthesis and speech style understanding.



Paperid:202 Poster
Authors:Zhi Zeng,Minnan Luo,Xiangzheng Kong,Huan Liu,Hao Guo,Hao Yang,Zihan Ma,Xiang Zhao
Abstract:
Short videos have become an important channel for public sharing, as well as a fertile ground for fake news. Fake news video detection aims to judge the veracity of news based on its different modal information, such as video, audio, text, image, and social context information. Current detection models tend to learn the multimodal dataset biases within spurious correlations between news modalities and veracity labels as shortcuts, rather than learning how to integrate the multimodal information behind them to reason, seriously degrading their detection and generalization capabilities. To address these issues, we propose a Multimodal Multi-View Debiasing (MMVD) framework, which makes the first attempt to mitigate various multimodal biases for fake news video detection. Inspired by how people are misled by multimodal short videos, we summarize three cognitive biases: static, dynamic, and social biases. MMVD puts forward a multi-view causal reasoning strategy to learn unbiased dependencies within the cognitive biases, thus enhancing the unbiased prediction of multimodal videos. Extensive experimental results show that MMVD improves the detection performance on multimodal fake news videos. Studies also confirm that MMVD can mitigate multiple biases in complex real-world scenarios and improve the generalization ability of multimodal models.



Paperid:203 Poster
Authors:Fan Nie,Jiangqun Ni,Jian Zhang,Bin Zhang,Weizhe Zhang
Abstract:
Nowadays, the abuse of AI-generated content (AIGC), especially the facial images known as deepfake, on social networks has raised severe security concerns, which might involve the manipulations of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision transformers (ViTs) and perform cross-modal interaction to expose the audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., the Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which could bridge the domain gap between audio and visual modalities. The ACI is then used to well learn the inherent audio-visual relationships by cross-modal interaction. Extensive experiments demonstrate that the proposed framework could outperform other state-of-the-art multimodal deepfake detection methods under various circumstances.



Paperid:204 Poster
Authors:Xiaojun Chen,Jimeng Lou,Wenxi Huang,Ting Wan,Qin Zhang,Min Yang
Abstract:
Image-text retrieval stands as a pivotal task within information retrieval, gaining increasing importance with the rapid advancements in Visual-Language Pretraining models. However, current benchmarks for evaluating these models face limitations, exemplified by instances such as BLIP2 achieving near-perfect performance on existing benchmarks. In response, this paper advocates for a more robust evaluation benchmark for image-text retrieval, one that embraces several essential characteristics. Firstly, a comprehensive benchmark should cover a diverse range of tasks in both perception and cognition-based retrieval. Recognizing this need, we introduce ReCoS, a novel benchmark specifically designed for cross-modal image-text retrieval in complex real-life scenarios. Unlike existing benchmarks, ReCoS encompasses 12 retrieval tasks, with a particular focus on three cognition-based tasks, providing a more holistic assessment of model capabilities. To ensure the novelty of the benchmark, we emphasize the use of original data sources, steering clear of reliance on existing publicly available datasets to minimize the risk of data leakage. Additionally, to strike a balance between the complexity of the real world and benchmark usability, ReCoS includes text descriptions that are neither overly detailed, making retrieval overly simplistic, nor under-detailed to the point where retrieval becomes impossible. Our evaluation results shed light on the challenges faced by existing methods, especially in cognition-based retrieval tasks within ReCoS. This underscores the necessity for innovative approaches in addressing the complexities of image-text retrieval in real-world scenarios.



Paperid:205 Poster
Authors:Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
Abstract:
Knowledge distillation based on student-teacher network is one of the mainstream solution paradigms for the challenging unsupervised Anomaly Detection task, utilizing the difference in representation capabilities of the teacher and student networks to implement anomaly localization. However, over-generalization of the student network to the teacher network may lead to negligible differences in representation capabilities of anomaly, thus affecting the detection effectiveness. Existing methods address the possible over-generalization by using differentiated students and teachers from the structural perspective or explicitly expanding distilled information from the content perspective, which inevitably result in an increased likelihood of underfitting of the student network and poor anomaly detection capabilities in anomaly center or edge. In this paper, we propose Dual-Modeling Decouple Distillation (DMDD) for the unsupervised anomaly detection. In DMDD, a Decouple Student-Teacher Network is proposed to decouple the initial student features into normality and abnormality features. We further introduce Dual-Modeling Distillation based on normal-anomaly image pairs, fitting normality features of anomalous image and the teacher features of the corresponding normal image, widening the distance between abnormality features and the teacher features in anomalous regions. Synthesizing these two distillation ideas, we achieve anomaly detection which focuses on both edge and center of anomaly. Finally, a Multi-perception Segmentation Network is proposed to achieve focused anomaly map fusion based on multiple attention. Experimental results on MVTec AD show that DMDD surpasses SOTA localization performance of previous knowledge distillation-based methods, reaching 98.85% on pixel-level AUC and 96.13% on PRO.



Paperid:206 Poster
Authors:Baorui Ma,Yu-Shen Liu,Matthias Zwicker,Zhizhong Han
Abstract:
Implicit 3D representations have shown great promise in deep learning-based 3D reconstruction. With differentiable renderers, current methods are able to learn implicit occupancy fields without 3D supervision by minimizing the error between the images rendered from the learned occupancy fields and 2D ground truth images. In this paper, however, we hypothesize that a full rendering pipeline including visibility determination and evaluation of a shading model is not required for the learning of 3D shapes without 3D supervision. Instead, we propose to use implicit reasoning, that is, we reason directly on the implicit occupancy field without explicit rendering. This leads our method to reveal highly accurate 3D structures from low quality silhouette images. Our implicit reasoning infers a 3D occupancy field by evaluating how well it matches with multiple 2D occupancy maps, using occupancy clues rather than rendering the 3D occupancy field into images. We exploit the occupancy clues that indicate whether a viewing ray inside a 2D object silhouette hits at least one occupied 3D location, or whether a ray outside the silhouette hits no occupied location. In contrast to differentiable renderers whose losses do not distinguish between the inside and outside of objects, our novel loss function weights unoccupied clues more than occupied ones. Our results outperform recent state-of-the-art techniques, justifying that we can learn accurate occupancy fields only using sparse clues without an explicit rendering process.
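
A plausible reading of the occupancy-clue loss, assuming per-ray occupancy samples and a silhouette mask, is sketched below; the squared-error form and the specific weights are assumptions, with unoccupied (outside) clues weighted more heavily as the abstract states.

```python
import torch

def occupancy_clue_loss(occ_samples, inside_mask, w_out=2.0, w_in=1.0):
    """Along each viewing ray we only know whether it should hit occupied
    space (ray passes inside the 2D silhouette) or must stay empty (outside).

    occ_samples: (num_rays, samples_per_ray) predicted occupancies in [0, 1]
    inside_mask: (num_rays,) bool, True if the ray lies inside the silhouette
    """
    max_occ = occ_samples.max(dim=1).values            # strongest hit per ray
    inside_loss = ((1.0 - max_occ[inside_mask]) ** 2).mean() if inside_mask.any() else 0.0
    outside = ~inside_mask
    outside_loss = (occ_samples[outside] ** 2).mean() if outside.any() else 0.0
    # unoccupied clues are weighted more than occupied ones
    return w_in * inside_loss + w_out * outside_loss
```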



Paperid:207 Poster
Authors:Qiuyu Kong,Jiangming Chen,Jiang Jie,Zanxi Ruan,KANG Lai
Abstract:
Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) aims to achieve pixel-level segmentation of novel categories across various domains by transferring knowledge from the source domain with limited samples. The main challenge in CD-FSS is bridging the inter-domain gap and addressing the scarcity of labeled samples in the target domain to enhance both generalization and discriminative abilities. Current methods usually resort to additional networks and complex strategies to embrace domain variability, which inevitably increases the training costs. This paper proposes a Dual-Branch Fusion with Style Modulation (DFSM) method to tackle these issues. We specifically deploy a parameter-free Grouped Style Modulation (GSM) layer that captures and adjusts a wide spectrum of potential feature distribution changes, thus improving the model's solution efficiency. Additionally, to overcome data limitations and enhance adaptability in the target domain, we develop a Dual-Branch Fusion (DBF) strategy which achieves accurate pixel-level prediction results by combining predicted probability maps through weighted fusion, thereby enhancing the discriminative ability of the model. We evaluate the proposed method on multiple widely-used benchmark datasets, including FSS-1000, ISIC, Chest X-Ray, and Deepglobe, and demonstrate superior performance compared to state-of-the-art methods in CD-FSS tasks.
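
Since GSM is described as parameter-free, one way to picture it is an AdaIN-style, group-wise statistics perturbation; the grouping, jitter magnitude, and names below are assumptions rather than the paper's exact layer, and only illustrate the style-modulation mechanism.

```python
import torch

def grouped_style_modulation(feat, num_groups=4, eps=1e-5):
    """Parameter-free sketch: per-group feature statistics are jittered so the
    network sees a spectrum of possible target-domain styles during training.

    feat: (batch, channels, h, w); channels is assumed divisible by num_groups.
    """
    b, c, h, w = feat.shape
    g = feat.view(b, num_groups, c // num_groups, h, w)
    mu = g.mean(dim=(2, 3, 4), keepdim=True)
    sigma = g.std(dim=(2, 3, 4), keepdim=True) + eps
    normalized = (g - mu) / sigma
    # sample new "styles" by perturbing the observed statistics
    mu_new = mu * (1 + 0.1 * torch.randn_like(mu))
    sigma_new = sigma * (1 + 0.1 * torch.randn_like(sigma))
    return (normalized * sigma_new + mu_new).view(b, c, h, w)
```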



Paperid:208 Poster
Authors:Ziyi Wang,Yiming Rong,Deyang Jiang,Haoran Wu,Shiyu Zhou,Bo XU
Abstract:
Automatic Speech Recognition (ASR) models pre-trained on large-scale speech datasets have achieved significant breakthroughs compared with traditional methods. However, mainstream pre-trained ASR models encounter challenges in distinguishing homophones, which have close or identical pronunciations. Previous studies have introduced visual auxiliary cues to address this challenge, yet the sophisticated use of lip movements falls short in correcting homophone errors. On the other hand, the fusion and utilization of scene images remain in an exploratory stage, with performance still inferior to the pre-trained speech model. In this paper, we introduce Contextual Image-Enhanced Automatic Speech Recognition (CIEASR), a novel multimodal speech recognition model that incorporates a new cue fusion method, using scene images as soft prompts to correct homophone errors. To mitigate data scarcity, we refine and expand the VSDial dataset for extensive experiments, illustrating that scene images contribute to the accurate recognition of entity nouns and personal pronouns. Our proposed CIEASR achieves state-of-the-art results on VSDial and Flickr8K, significantly reducing the Character Error Rate (CER) on VSDial from 3.61% to 0.92%.



Paperid:209 Poster
Authors:Kang Xia,Wenzhong Li,Yimiao Shao,Sanglu Lu
Abstract:
Human Activity Recognition (HAR) as an emerging research field has attracted widespread academic attention due to its wide range of practical applications in areas such as healthcare, environmental monitoring, and sports training. Given the high cost of annotating sensor data, many unsupervised and semi-supervised methods have been applied to HAR to alleviate the problem of limited data. In this paper, we propose a novel video-enhanced cross-modal collaborative learning method, to address the issue of few-shot HAR. We introduce a new data augmentation approach that utilizes a text-to-video generation model to generate class-related videos. Subsequently, a large quantity of video semantic representations are obtained through fine-tuning the video encoder for cross-modal co-learning. Furthermore, to effectively align video semantic representations and time series representations, we enhance HAR at the representation-level using conditional Generative Adversarial Nets (cGAN). We also design a novel Representation Conditional Discriminator that is trained to assess samples as originating from video representations rather than those generated by the time series encoder as accurately as possible. We conduct extensive experiments on four commonly used HAR datasets. The experimental results demonstrate that our method outperforms other baseline models in all few-shot scenarios.



Paperid:210 Poster
Authors:Jiawei Zhu,Yishu Liu,Huanjia Zhu,Hui Lin,Yuncheng Jiang,Zheng Zhang,Bingzhi Chen
Abstract:
The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Various intricate bias dependencies, such as modalities and data imbalances, can cause semantic ambiguities to generate shifts in the feature space of VQA instances. This phenomenon is referred to as VQA Hallucinations. Such distortions can cause hallucination distributions that deviate significantly from the true data, resulting in the model producing factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which effectively mitigates bias-induced instance and distribution shifts in multi-space under a unified paradigm. Specifically, we design bias-aware and prior-aware debias constraints by utilizing the angle and angle margin of the spherical space to construct bias-prior-instance constraints, thereby refining the manifold representation of instance de-bias and distribution de-dependence. Moreover, we leverage the inherent overfitting characteristics of Euclidean space to introduce bias components from biased examples and modal counterexample injection, further assisting in multi-space robust learning. By integrating homeomorphic instances in different spaces, MSCD could enhance the comprehension of structural relationships between semantics and answer classes, yielding robust representations that are not solely reliant on training priors. In this way, our co-debias paradigm generates more robust representations that effectively mitigate biases to combat hallucinations. Extensive experiments on multiple benchmark datasets consistently demonstrate that the proposed MSCD method outperforms state-of-the-art baselines.



Paperid:211 Poster
Authors:Jihoon Lee,Yunhong Min,Hwidong Kim,Sangtae Ahn
Abstract:
In recent years, there has been a significant focus on research related to text-guided image inpainting, which holds a pivotal role in the domain of multimedia processing. This has resulted in notable enhancements in the quality and performance of the generated images. However, the task remains challenging due to several constraints, such as ensuring alignment between the generated images and the accompanying text and maintaining consistency in distribution between corrupted and uncorrupted regions, both of which are needed for natural and fine-grained image generation. To address these challenges, previous studies developed novel architectures, inpainting techniques, or objective functions, but they still lack semantic consistency between the text and generated images. In this paper, we therefore propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency for text-guided inpainting. DAFT-GAN integrates two affine transformation networks to combine text and image features gradually in each decoding block. The first affine transformation network leverages global features of the text to generate coarse results, while the second affine network utilizes attention mechanisms and spatial information of the text to refine the coarse results. By connecting the features generated from these dual paths through residual connections in the subsequent block, the model retains information at each scale while enhancing the quality of the generated image. Moreover, we minimize information leakage of uncorrupted features for fine-grained image generation by encoding the corrupted and uncorrupted regions of the masked image separately. Through extensive experiments, we observe that our proposed model outperforms the existing models in both qualitative and quantitative assessments on three benchmark datasets (MS-COCO, CUB, and Oxford) for text-guided image inpainting.
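
The basic text-conditioned affine modulation step underlying such dual affine transformation blocks can be sketched as follows; the module name, feature shapes, and the way the text feature is injected are assumptions and only show the per-channel scale-and-shift mechanism.

```python
import torch
import torch.nn as nn

class TextAffine(nn.Module):
    """Text-conditioned affine transformation: the text feature predicts a
    per-channel scale (gamma) and shift (beta) that modulate the image feature
    map. A dual-path design would stack two such modulations per decoding
    block (global text first, then attention-refined text)."""
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)

    def forward(self, img_feat, text_feat):        # (b, c, h, w), (b, text_dim)
        gamma = self.to_gamma(text_feat)[:, :, None, None]
        beta = self.to_beta(text_feat)[:, :, None, None]
        return gamma * img_feat + beta

mod = TextAffine(text_dim=256, channels=64)
out = mod(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
```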



Paperid:212 Poster
Authors:Pengfei Luo,Tong Xu,Che Liu,Suojuan Zhang,Linli Xu,Minglei Li,Enhong Chen
Abstract:
Multimodal Entity Linking (MEL) aims to address the ambiguity in multimodal mentions and associate them with Multimodal Knowledge Graphs (MMKGs). Existing works primarily focus on designing multimodal interaction and fusion mechanisms to enhance the performance of MEL. However, these methods still overlook two crucial gaps within the MEL task. One is the content discrepancy between mentions and entities, manifested as uneven information density. The other is the knowledge gap, indicating insufficient knowledge extraction and reasoning during the linking process. To bridge these gaps, we propose a novel framework FissFuse, as well as a plug-and-play knowledge-aware re-ranking method KAR. Specifically, FissFuse collaborates with the Fission and Fusion branches, establishing dynamic features for each mention-entity pair and adaptively learning multimodal interactions to alleviate content discrepancy. Meanwhile, KAR is endowed with carefully crafted instruction for intricate knowledge reasoning, serving as re-ranking agents empowered by Large Language Models (LLMs). Extensive experiments on two well-constructed MEL datasets demonstrate outstanding performance of FissFuse compared with various baselines. Comprehensive evaluations and ablation experiments validate the effectiveness and generality of KAR.



Paperid:213 Poster
Authors:Minghang Zheng,Jiahua Zhang,Qingchao Chen,Yuxin Peng,Yang Liu
Abstract:
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ReSVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of target objects described in queries. Secondly, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category and pseudo queries based on their spatial relationships. The proposed ReSVG model significantly improves the model's ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments to validate the effectiveness of our methods on five datasets.



Paperid:214 Poster
Authors:Yaqiang Wu,Zhen Xu,Yong Duan,Yanlai Wu,Qinghua Zheng,Hui Li,Xiaochen Hu,Lianwen Jin
Abstract:
The increasing use of smartphones for capturing documents in various real-world conditions has underscored the need for robust document localization technologies. Current challenges in this domain include handling diverse document types, complex backgrounds, and varying photographic conditions such as low contrast and occlusion. However, there currently are no publicly available datasets containing these complex scenarios and few methods demonstrate their capabilities on these complex scenes. To address these issues, we create a new comprehensive real-world document localization benchmark dataset which contains the complex scenarios mentioned above and propose a novel Real-world Document Localization Network (RDLNet) for locating targeted documents in the wild. The RDLNet consists of an innovative light-SAM encoder and a masked attention decoder. Utilizing light-SAM encoder, the RDLNet transfers the mighty generalization capability of SAM to the document localization task. In the decoding stage, the RDLNet exploits the masked attention and object query method to efficiently output the triple-branch predictions consisting of corner point coordinates, instance-level segmentation area and categories of different documents without extra post-processing. We compare the performance of RDLNet with other state-of-the-art approaches for real-world document localization on multiple benchmarks, the results of which reveal that the RDLNet remarkably outperforms contemporary methods, demonstrating its superiority in terms of both accuracy and practicability.



Paperid:215 Poster
Authors:Lishuang Zhan,Enting Ying,Jiabao Gan,Shihui Guo,BoYu Gao,Yipeng Qin
Abstract:
Estimating 3D human poses from monocular images is an important research area with many practical applications. However, the depth ambiguity of 2D solutions limits their accuracy in actions where occlusion exists or where slight centroid shifts can result in significant 3D pose variations. In this paper, we introduce a novel multimodal approach to mitigate the depth ambiguity inherent in monocular solutions by integrating spatial-aware pressure information. To achieve this, we first establish a data collection system with a pressure mat and a monocular camera, and construct a large-scale multimodal human activity dataset comprising over 600,000 frames of motion data. Utilizing this dataset, we propose a pressure image reconstruction network to extract pressure priors from monocular images. Subsequently, we introduce a Transformer-based multimodal pose estimation network to combine pressure priors with monocular images, achieving a world mean per joint position error (W-MPJPE) of 51.6 mm, outperforming state-of-the-art methods. Extensive experiments demonstrate the effectiveness of our multimodal 3D human pose estimation method across various actions and joints, highlighting the significance of spatial-aware pressure in improving the accuracy of monocular 3D pose estimation methods. Our dataset is available at: https://anonymous.4open.science/r/SATPose-51DD.



Paperid:216 Poster
Authors:Zhaoda Ye,Xinhan Zheng,Yang Liu,Yuxin Peng
Abstract:
Text-driven 3D indoor scene generation aims to automatically generate and arrange objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Recent works have shown the potential to generate 3D scenes guided by specific object categories and room layouts but lack a robust mechanism to maintain spatial relationships consistent with the provided text description during 3D scene generation. Besides, annotations of the objects and relationships in 3D scenes are usually time- and cost-consuming, and thus not easily obtained for model training. Therefore, in this paper, we construct a dataset and benchmark for assessing spatial relations in text-driven 3D scene generation, which contains a comprehensive collection of 3D scenes with textual descriptions, annotated object spatial relations, and both template and free-form natural language descriptions. We also provide a pseudo description feature generation method to handle 3D scenes without language annotations. We design an aligned latent space for spatial relations in 3D scenes and text descriptions, in which we can sample features according to the spatial relation for few-shot learning. We also propose new metrics to investigate the ability of the approach to generate correct spatial relationships among objects.



Paperid:217 Poster
Authors:Xingqi Wang,Xiaoyuan Yi,Xing Xie,Jia Jia
Abstract:
Recent advancements in diffusion models trained on large-scale data have enabled the generation of indistinguishable human-level images, yet they often produce harmful content misaligned with human values, e.g., social bias and offensive content. Despite extensive research on Large Language Models (LLMs), the challenge of Text-to-Image (T2I) model alignment remains largely unexplored. Addressing this problem, we propose LiVO (Lightweight Value Optimization), a novel lightweight method for aligning T2I models with human values. LiVO only optimizes a plug-and-play value encoder to integrate a specified value principle with the input prompt, allowing the control of generated images over both semantics and values. Specifically, we design a diffusion model-tailored preference optimization loss, which theoretically approximates the Bradley-Terry model used in LLM alignment but provides a more flexible trade-off between image quality and value conformity. To optimize the value encoder, we also develop a framework to automatically construct a text-image preference dataset of 86k (prompt, aligned image, violating image, value principle) samples. Without updating most model parameters and through adaptive value selection from the input prompt, LiVO significantly reduces harmful outputs and achieves faster convergence, surpassing several strong baselines and taking an initial step towards ethically aligned T2I models. Warning: This paper involves descriptions and images depicting discriminatory, pornographic, bloody, and horrific scenes, which some readers may find offensive or disturbing.
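
For intuition, a Bradley-Terry / DPO-style preference loss adapted to diffusion models (using denoising errors as a likelihood surrogate) is sketched below; LiVO's actual diffusion-tailored loss and its quality-conformity trade-off differ, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def bradley_terry_preference_loss(err_w, err_l, err_w_ref, err_l_ref, beta=0.1):
    """Generic DPO-style preference loss for diffusion models.

    err_*: per-sample mean-squared denoising errors of the policy / reference
    model on the value-aligned (winner, w) and value-violating (loser, l)
    images; a lower denoising error plays the role of a higher log-likelihood.
    """
    logits = beta * ((err_l - err_w) - (err_l_ref - err_w_ref))
    return -F.logsigmoid(logits).mean()

# Toy usage with random error values.
loss = bradley_terry_preference_loss(torch.rand(4), torch.rand(4),
                                     torch.rand(4), torch.rand(4))
```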



Paperid:218 Poster
Authors:Dongding Lin,Jian Wang,Chak Tou Leong,Wenjie Li
Abstract:
Engaging in conversational recommendations within a specific scenario represents a promising paradigm in the real world. Scenario-relevant situations often affect conversations and recommendations from two closely related aspects: varying the appealingness of items to users, namely situated item representation, and shifting user interests in the targeted items, namely situated user preference. We highlight that considering the situational factors is crucial, as this aligns with the realistic conversational recommendation process in the physical world. However, it is under-explored in existing studies. In this work, we take a pioneering step to address this gap and introduce a novel setting: Situated Conversational Recommendation Systems (SCRS). We observe an emergent need for high-quality datasets, and building one from scratch requires tremendous human effort. To this end, we construct a new benchmark, named SCREEN, via a role-playing method using large language models. This benchmark comprises over 20k dialogues across 1.5k diverse situations, providing a rich foundation for exploring situational influences on conversational recommendations. Based on SCREEN, we propose three subtasks worth exploring and evaluate several representative baseline models. Our evaluations confirm that the benchmark is of high quality, establishing a robust experimental basis for future research in situated conversational recommendation.



Paperid:219 Poster
Authors:Ximing Wu,Kongyange Zhao,Teng Liang,Xu Chen
Abstract:
Multi-party Mobile Virtual Reality (MMVR) enables multiple mobile users to share virtual scenes for immersive multimedia experience in scenarios such as gaming, social interaction, and industrial mission collaboration. Dynamic 3D Point Cloud (DPCL) is an emerging representation form of MMVR that can be consumed as a free-viewpoint video with six degrees of freedom. Given that it is challenging to render DPCL at a satisfying frame rate with limited on-device resources, offloading rendering tasks to edge servers is recognized as a practical solution. However, repeated loading of DPCL scenes with a substantial amount of metadata introduces a significant redundancy overhead that cannot be overlooked when enabling multiple edge servers to support the rendering requirements of user groups. In this paper, we design PoClVR, an edge-assisted DPCL rendering system for MMVR applications, which breaks down the rendering process of the complete dynamic scene into multiple rendering tasks of individual dynamic objects. PoClVR significantly reduces the repetitive loading overhead of DPCL scenes on edge servers and periodically adjusts the rendering task allocation for edge servers during application runtime to accommodate rendering requirements. We deploy PoClVR based on a real-world implementation and the experimental evaluation results show that PoClVR can reduce GPU utilization by up to 15.1% and increase rendering frame rate by up to 34.6% compared to other baselines while ensuring that the image quality viewed by the user is virtually unchanged.



Paperid:220 Poster
Authors:Zhenyu Yang,Shengsheng Qian,Dizhan Xue,Jiahong Wu,Fan Yang,Weiming Dong,Changsheng Xu
Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) has attracted more attention in recent years, focusing on retrieving a specific image based on a query composed of a reference image and a relative text without training samples. Specifically, the relative text describes the differences between the two images. Prevailing ZS-CIR methods employ image-to-text (I2T) models to convert the query image into a single caption, which is further merged with the relative text by text-fusion approaches to form a composed text for retrieval. However, these methods neglect the fact that ZS-CIR entails considering not only the final similarity between the composed text and retrieved images but also the semantic increment during the compositional editing process. To address this limitation, this paper proposes a training-free method called Semantic Editing Increment for ZS-CIR (SEIZE) to retrieve the target image based on the query image and text without training. Firstly, we employ a pre-trained captioning model to generate diverse captions for the reference image and prompt Large Language Models (LLMs) to perform breadth compositional reasoning based on these captions and relative text, thereby covering the potential semantics of the target image. Then, we design a semantic editing search to incorporate the semantic editing increment contributed by the relative text into the retrieval process. Concretely, we comprehensively consider relative semantic increment and absolute similarity as the final retrieval score, which is subsequently utilized to retrieve the target image in the CLIP feature space. Extensive experiments on three public datasets demonstrate that our proposed SEIZE achieves new state-of-the-art performance. The code is publicly available at https://anonymous.4open.science/r/SEIZE-11BC.
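To make the idea of combining absolute similarity with a semantic editing increment concrete, here is a minimal sketch of one plausible scoring rule in CLIP feature space; the function name, the weighting alpha, and the exact combination are illustrative assumptions, not the paper's formula.

import torch
import torch.nn.functional as F

def retrieval_scores(img_feats, composed_txt, reference_txt, alpha=1.0):
    """Score gallery images by similarity to the composed description plus
    the increment the relative text adds over the reference-image caption.

    img_feats:     (N, d) CLIP image features of the gallery
    composed_txt:  (d,)   CLIP text feature of the LLM-composed description
    reference_txt: (d,)   CLIP text feature of the reference-image caption
    """
    img = F.normalize(img_feats, dim=-1)
    c = F.normalize(composed_txt, dim=-1)
    r = F.normalize(reference_txt, dim=-1)
    absolute = img @ c                 # similarity to the composed semantics
    increment = img @ (c - r)          # what the relative text contributes
    return absolute + alpha * increment

scores = retrieval_scores(torch.randn(100, 512), torch.randn(512), torch.randn(512))
top5 = scores.topk(5).indices          # retrieve the best-scoring candidates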



Paperid:221 Poster
Authors:Cam Van Thi Nguyen,Son Le The,Tuan Anh Mai,Duc-Trong Le
Abstract:
Multimodal Emotion Recognition in Conversations (ERC) is a typical multimodal learning task that exploits various data modalities concurrently. Prior studies on effective multimodal ERC encounter challenges in addressing modality imbalances and optimizing learning across modalities. To deal with these problems, we present a novel framework named Ada2I, which consists of two inseparable modules, namely Adaptive Feature Weighting (AFW) and Adaptive Modality Weighting (AMW), for feature-level and modality-level balancing, respectively, by leveraging both inter- and intra-modal interactions. Additionally, we introduce a refined disparity ratio as part of our training optimization strategy, a simple yet effective measure to assess the overall discrepancy of the model's learning process when handling multiple modalities simultaneously. Experimental results validate the effectiveness of Ada2I with state-of-the-art performance compared against baselines on three benchmark datasets including IEMOCAP, MELD, and CMU-MOSEI, particularly in addressing modality imbalances.



Paperid:222 Poster
Authors:Yuhang Su,Wei Hu,Fan Zhang,Qiming Xu
Abstract:
Audio Identification aims to precisely retrieve exact matches from a vast music repository given a query audio snippet. The need for specificity and granularity has traditionally led to representing music audio using numerous short fixed-duration overlapped segment/shingle features in fingerprinting approaches. However, fingerprinting imposes constraints on scalability and efficiency, as hundreds or even thousands of embeddings are generated to represent a typical music audio. In this paper, we present an innovative self-supervised approach called Angular Margin Guided Embedding (AMG-Embedding). AMG-Embedding is built on a traditional fingerprinting encoder and aims to represent variable-duration non-overlapped segments as embeddings through a two-stage embedding and class-level learning process. AMG-Embedding significantly reduces the number of generated embeddings while simultaneously achieving highly specific fragment-level audio identification. Experimental results demonstrate that AMG-Embedding achieves retrieval accuracy comparable to the underlying fingerprinting approach while consuming less than one-tenth of its storage and retrieval time. The efficiency gains of our approach position it as a promising solution for scalable and efficient audio identification systems.
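For readers unfamiliar with angular-margin training, the following is a generic ArcFace-style additive angular margin head, one common way to realize class-level, margin-guided embedding learning; the margin and scale values are generic defaults and the module is not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginHead(nn.Module):
    """Additive angular margin classifier head over L2-normalized embeddings."""
    def __init__(self, dim, num_classes, scale=30.0, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.scale, self.margin = scale, margin

    def forward(self, emb, labels):
        # cosine similarity between normalized embeddings and class centers
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # add the angular margin only to the target-class angle
        target = F.one_hot(labels, cos.size(1)).bool()
        cos_m = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * cos_m, labels)

head = AngularMarginHead(dim=128, num_classes=1000)
loss = head(torch.randn(16, 128), torch.randint(0, 1000, (16,)))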



Paperid:223 Poster
Authors:Wenyu Yin,Shuyuan Lin,Yang Lu,Hanzi Wang
Abstract:
Multi-model fitting aims to robustly estimate the parameters of various model instances in data contaminated by noise and outliers. Most previous works use only one type of consensus or implicit fusion model to model the correlation between data points and model hypotheses. This strategy often results in unrealistic and incorrect model fitting in the presence of noise and uncertainty. In this paper, we propose a novel diverse Consensuses paired with Motion estimation-based multi-Model Fitting (CMMF), which leverages three diverse consensuses along with inter-model collaboration to enhance the effectiveness of multi-model fusion. We design a Tangent Consensus Residual Reconstruction (TCRR) module to capture motion structure information of two points at the pixel level. Additionally, we introduce a Cross Consensus Affinity (CCA) framework to strengthen the correlation between data points and model hypotheses. To address the challenge of multi-body motion estimation, we propose a Nested Consensus Clustering (NCC) strategy, which formulates multi-model fitting as a motion estimation problem. It explicitly establishes motion collaboration between models and ensures that multiple models are well-fitted. Extensive quantitative and qualitative experiments are conducted on four public datasets (i.e., AdelaideRMF-F, Hopkins155, KITTI, MTPV62), and the results demonstrate that the proposed method outperforms several state-of-the-art methods.



Paperid:224 Poster
Authors:Yiding Li,Lingyun Yu,Li Wang,Hongtao Xie
Abstract:
In recent years, the field of talking head generation has made significant strides. However, the need for substantial computational resources for model training, coupled with a scarcity of high-quality video data, poses challenges for the rapid customization of models to specific individuals. Additionally, existing models usually only support single-modal control, lacking the ability to generate vivid facial expressions and controllable head poses based on multiple conditions such as audio, video, etc. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker to achieve rapid customization of identity in talking head models and high-quality generation based on multimodal conditions. Specifically, we divide the training process into two stages: a prior learning stage and an identity rapid-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust controllable facial prior. Meanwhile, we innovatively propose a high-frequency ControlNet structure to enhance the fidelity of the synthesized results. This structure adeptly extracts a high-frequency feature map from the source image, serving as a facial texture prior, thereby faithfully preserving the facial texture of the source image. 2) In the identity rapid-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on merely several images of a specific individual. The entire fine-tuning process for identity customization can be completed within approximately ten minutes, thereby significantly reducing training costs. Further, we propose a unified driving method for both audio and video, utilizing FLAME-3DMM as an intermediary representation. This method equips the model with the ability to precisely control expressions, poses, and lighting under multiple conditions, significantly broadening the application fields of the talking head model. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models. Additionally, our model demonstrates reduced training costs and lower data requirements.



Paperid:225 Poster
Authors:Junqi Shi,Mingyi Jiang,Ming Lu,Tong Chen,Xun Cao,Zhan Ma
Abstract:
Hyperspectral images (HSI) are a prevalent scientific data format with extensive applications; efficiently compressing them while ensuring high-quality downstream tasks has garnered significant attention. This paper introduces HINER, a novel approach for compressing HSI using Neural Representation. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angle Mapper with the L1 loss, we can supervise the global and local information within each spectral band, thereby enhancing the overall reconstruction quality. For downstream classification on compressed HSI, we theoretically demonstrate that the task accuracy is related not only to the classification loss but also to the reconstruction fidelity through a first-order expansion of the accuracy degradation, and accordingly adapt the reconstruction by introducing Adaptive Spectral Weighting. Owing to the inherent capability of HINER to implicitly reconstruct spectral bands using input wavelengths, it can generate arbitrary continuous spectra, even those absent in the original input. Consequently, we propose utilizing Implicit Spectral Interpolation for data augmentation during classification model training, thereby improving overall task accuracy on compressed data. Experimental results on various HSI datasets demonstrate the superior compression performance of our HINER compared to existing learned methods and traditional codecs. Our model is lightweight and computationally efficient, and maintains high accuracy for the downstream classification task even on decoded HSIs at high compression ratios.
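As an illustration of an angle-plus-L1 reconstruction objective of the kind described above, the sketch below compares per-pixel spectra as vectors along the band dimension and adds an L1 term; the weight lam and the exact form are assumptions, not the paper's Content Angle Mapper definition.

import torch

def angle_plus_l1_loss(pred, target, lam=0.5, eps=1e-8):
    """Reconstruction loss combining a per-pixel spectral-angle term with L1.

    pred, target: (B, C, H, W) reconstructed / ground-truth HSI,
    where C is the number of spectral bands.
    """
    dot = (pred * target).sum(dim=1)
    norm = pred.norm(dim=1) * target.norm(dim=1) + eps
    angle = torch.acos((dot / norm).clamp(-1 + 1e-7, 1 - 1e-7)).mean()
    l1 = (pred - target).abs().mean()
    return l1 + lam * angle

loss = angle_plus_l1_loss(torch.rand(2, 31, 64, 64), torch.rand(2, 31, 64, 64))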



Paperid:226 Poster
Authors:Shuqi Dai,Ming-Yu Liu,Rafael Valle,Siddharth Gururani
Abstract:
Singing Voice Synthesis (SVS) has significantly advanced with deep generative models, achieving high audio quality but still struggling with musicality, mainly due to the lack of performance control over timing, dynamics, and pitch, which are essential for music expression. Additionally, integrating data and supporting diverse languages and styles in SVS remain challenging. To tackle these issues, this paper presents ExpressiveSinger, an SVS framework that leverages a cascade of diffusion models to generate realistic singing across multiple languages, styles, and techniques from scores and lyrics. Our approach begins with consolidating, cleaning, annotating, and processing public singing datasets, developing a multilingual phoneme set, and incorporating different musical styles and techniques. We then design methods for generating expressive performance control signals including phoneme timing, F0 curves, and amplitude envelopes, which enhance musicality and model consistency, introduce more controllability, and reduce data requirements. Finally, we generate mel-spectrograms and audio from performance control signals with style guidance and singer timbre embedding. Our models also enable trained singers to sing in new languages and styles. A listening test reveals both high musicality and audio quality of our generated singing compared with existing works and human singing. We release the data for future research. Demo: https://expressivesinger.github.io/ExpressiveSinger.



Paperid:227 Poster
Authors:Wenhao Shen,Wanqi Yin,Hao Wang,Chen Wei,Zhongang Cai,Lei Yang,Guosheng Lin
Abstract:
Expressive Human Mesh Recovery (HMR) involves reconstructing the 3D human body, including hands and face, from RGB images. It is difficult because humans are highly deformable, and hands are small and frequently occluded. Recent approaches have attempted to mitigate these issues using large datasets and models, but these solutions remain imperfect. Specifically, whole-body estimation models often inaccurately estimate hand poses, while hand expert models struggle with severe occlusions. To overcome these limitations, we introduce a dual-path cross augmentation framework with a novel adaptation approach called HMR-Adapter that enhances the decoding module of large HMR models. HMR-Adapter significantly improves expressive HMR performance by injecting additional guidance from other body parts. This approach refines hand pose predictions by incorporating body pose information and uses additional hand features to enhance body pose estimation in whole-body models. Remarkably, an HMR-Adapter with only about 27M parameters achieves better performance in fine-tuning the large model on a target dataset. Furthermore, HMR-Adapter significantly improves expressive HMR results by combining the adapted large whole-body and hand expert models. We present extensive experiments and analyses to demonstrate the efficacy of our method.
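A generic bottleneck-adapter sketch of the idea of injecting guidance from another body part into a decoder is shown below; the layer sizes, the additive fusion, and the module name are illustrative assumptions rather than the paper's HMR-Adapter design.

import torch
import torch.nn as nn

class CrossPartAdapter(nn.Module):
    """Bottleneck adapter that refines one part's decoder features using
    features from another body part (e.g., body features guiding the hand)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.guide_proj = nn.Linear(dim, dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, part_feat, other_part_feat):
        h = part_feat + self.guide_proj(other_part_feat)   # inject cross-part guidance
        h = self.up(self.act(self.down(h)))                # lightweight bottleneck
        return part_feat + h                               # residual update

adapter = CrossPartAdapter(dim=256)
refined_hand = adapter(torch.randn(4, 256), torch.randn(4, 256))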



Paperid:228 Poster
Authors:Shiqin Liu,Chaozhuo Li,Xi Zhang,Minjun Zhao,yuanbo xu,Jiajun Bu
Abstract:
Learning item representation is crucial for a myriad of on-line e-commerce applications. The nucleus of retail item representation learning is how to properly fuse the semantics within a single item, and the interactions across different items generated by user behaviors (e.g., co-click or co-view). Product semantics depict the intrinsic characteristics of the item, while the interactions describe the relationships between items from the perspective of human perception. Existing approaches either solely rely on a single type of information or loosely couple them together, leading to hindered representations. In this work, we propose a novel model named TESPA to reinforce semantic modeling and interaction modeling mutually. Specifically, collaborative filtering signals in the interaction graph are encoded into the language models through fine-grained topological pre-training, and the interaction graph is further enriched based on semantic similarities. After that, a novel multi-channel co-training paradigm is proposed to deeply fuse the semantics and interactions under a unified framework. In a nutshell, TESPA is capable of enjoying the merits of both sides to facilitate item representation learning. Experimental results of on-line and off-line evaluations demonstrate the superiority of our proposal.



Paperid:229 Poster
Authors:zekun Ai,Xiaotong Luo,Yanyun Qu,Yuan Xie
Abstract:
Deep neural networks have revealed enormous potential in video super-resolution (VSR), yet their heavy computational cost limits deployment on resource-limited devices and in real-world scenarios, especially for restoring multiple frames simultaneously. Existing VSR models contain considerable redundant filters, which drag down the inference efficiency. To accelerate the inference of VSR models, we propose a scalable method based on adaptive patch routing to achieve more practical speedup. Specifically, we design a confidence estimator to predict the aggregation performance of each block for adjacent patch information, which learns to dynamically perform block skipping, i.e., choose which basic blocks of a VSR network to execute during inference so as to reduce total computation to the maximum extent without degrading reconstruction accuracy dramatically. However, we observe that skipping error would be amplified as the hidden states propagate through the recurrent network. To alleviate the issue, we design temporal feature distillation to guarantee the performance. In essence, our method provides an adaptive routing scheme for each patch. Extensive experiments demonstrate that our method can not only accelerate inference but also provide strong quantitative and qualitative results with the learned strategies. Built upon the BasicVSR model, our method achieves a speedup of 20% on average, going as high as 50% for some images, while even maintaining competitive performance on REDS4.
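The block-skipping idea can be sketched as a residual block guarded by a tiny confidence estimator; the fixed threshold and the estimator architecture below are illustrative (the paper learns the skipping policy), so treat this as a minimal sketch rather than the proposed method.

import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Residual block that is executed only for patches whose predicted
    confidence exceeds a threshold; other patches pass through unchanged."""
    def __init__(self, ch, threshold=0.5):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.confidence = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(ch, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, x):
        keep = self.confidence(x).squeeze(1) > self.threshold   # per-patch decision
        out = x.clone()
        if keep.any():
            out[keep] = x[keep] + self.body(x[keep])             # run block on kept patches only
        return out

block = SkippableBlock(32)
y = block(torch.randn(8, 32, 48, 48))   # 8 patches, some of which may skip the block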



Paperid:230 Poster
Authors:Lu Zhang,Ke Yan,Shouhong Ding
Abstract:
Since the release of the CLIP model by OpenAI, it has received widespread attention. However, categories in the real world often exhibit a long-tail distribution, and existing CLIP models struggle to effectively recognize rare, tail-end classes, such as an endangered African bird. An intuitive idea is to generate visual descriptions for these tail-end classes and use descriptions to create category prototypes for classification. However, experiments reveal that visual descriptions, image captions, and test prompt templates belong to three distinct domains, leading to distribution shifts. In this paper, we propose the use of caption object parsing to identify the object set contained within captions. During training, the object set is used to generate visual descriptions and test prompts, aligning these three domains and enabling the text encoder to generate category prototypes based on visual descriptions. Thanks to the acquired object sets, our approach can construct many-to-many relationships at a lower cost and derive soft labels, addressing the noise issues associated with traditional one-to-one matching. Extensive experimental results demonstrate that our method significantly surpasses the CLIP baseline and exceeds existing methods, achieving a new state-of-the-art (SOTA).



Paperid:231 Poster
Authors:Yinyin Peng,Yaofei Wang,Donghui Hu,Kejiang Chen,Xianjin Rong,Weiming Zhang
Abstract:
Generative image steganography has gained significant attention due to its ability to hide secret data during image generation. However, existing generative image steganography methods still face challenges in terms of controllability, usability, and robustness, making them difficult to apply in real-world scenarios. To ensure secure and reliable communication, we propose a practical and robust generative image steganography method based on Latent Diffusion Models, called LDStega. LDStega takes controllable condition text as input and designs an encoding strategy in the reverse process of the Latent Diffusion Models to couple latent space generation with data hiding. The encoding strategy selects a sampling interval from a candidate pool of truncated Gaussian distributions guided by secret data to generate the stego latent space. Subsequently, the stego latent space is fed into the Decoder to generate the stego image. The receiver extracts the secret data from the globally Gaussian distribution of the lossy-reconstructed latent space in the reverse process. Experimental results demonstrate that LDStega achieves high extraction accuracy while controllably generating image content and saving the stego image in the widely used PNG and JPEG formats. Additionally, LDStega outperforms state-of-the-art techniques in resisting common image attacks.
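The core trick of hiding bits by choosing which slice of a Gaussian each latent value is sampled from can be illustrated as follows; this is a minimal sketch assuming equal-probability slices and a bit string whose length is a multiple of bits_per_sample, and it omits the candidate-pool construction and lossy-reconstruction handling described in the abstract.

import numpy as np
from scipy.stats import norm

def hide_bits(bits_per_sample, secret_bits, rng=np.random.default_rng(0)):
    """Sample latent values from the Gaussian slice indexed by the secret bits."""
    k = 2 ** bits_per_sample
    edges = norm.ppf(np.linspace(0.0, 1.0, k + 1))        # slice boundaries in latent space
    samples = []
    for i in range(0, len(secret_bits), bits_per_sample):
        idx = int("".join(map(str, secret_bits[i:i + bits_per_sample])), 2)
        u = rng.uniform(norm.cdf(edges[idx]), norm.cdf(edges[idx + 1]))
        samples.append(norm.ppf(u))                        # truncated-Gaussian draw inside the slice
    return np.array(samples)

def extract_bits(samples, bits_per_sample):
    """Recover the bits by locating each value's slice."""
    k = 2 ** bits_per_sample
    edges = norm.ppf(np.linspace(0.0, 1.0, k + 1))
    idxs = np.searchsorted(edges, samples) - 1
    return [int(b) for i in idxs for b in format(int(i), f"0{bits_per_sample}b")]

z = hide_bits(2, [1, 0, 1, 1, 0, 0])
assert extract_bits(z, 2) == [1, 0, 1, 1, 0, 0]            # lossless round trip in this idealized setting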



Paperid:232 Poster
Authors:Haibo Yang,Yang Chen,Yingwei Pan,Ting Yao,Zhineng Chen,Chong-Wah Ngo,Tao Mei
Abstract:
Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present the High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). This methodology delves into the underlying temporal consistency knowledge in video diffusion model that generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistent images with highly-detailed textures.



Paperid:233 Poster
Authors:Zijian Yi,Ziming Zhao,Zhishu Shen,Tiehua Zhang
Abstract:
Multimodal emotion recognition in conversation (MERC) seeks to identify the speakers' emotions expressed in each utterance, offering significant potential across diverse fields. The challenge of MERC lies in balancing speaker modeling and context modeling, encompassing both long-distance and short-distance contexts, as well as addressing the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships effectively. Nevertheless, the majority of these methods utilize a fixed fully connected structure to link all utterances, relying on convolution to interpret complex context. This approach can inherently heighten the redundancy in contextual messages and excessive graph network smoothing, particularly in the context of long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections by variational hypergraph autoencoder (VHGAE), and employs contrastive learning to mitigate uncertainty factors during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against the state-of-the-art methods on IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work (currently it is uploaded as the "complementary material" within the review system and will be made public following the completion of the review process).



Paperid:234 Poster
Authors:Jingqiao Xiu,Mengze Li,Wei Ji,Jingyuan Chen,Hanbin Zhao,Shin'ichi Satoh,Roger Zimmermann
Abstract:
Video Tube Retrieval (VTR) has attracted wide attention in the multi-modal domain, aiming to accurately localize the spatial-temporal tube in videos based on the natural language description. Despite the remarkable progress, existing VTR models trained on a specific domain (source domain) often perform unsatisfactorily in another domain (target domain), due to the domain gap. To address this issue, we introduce the learning strategy of Unsupervised Domain Adaptation into the VTR task (UDA-VTR), which enables the knowledge transfer from the labeled source domain to the unlabeled target domain without additional manual annotations. An intuitive solution is generating the pseudo labels for the target-domain samples with the fully trained source model and fine-tuning the source model on the target domain with pseudo labels. However, the existing domain gap gives rise to two problems for this process: (1) The transfer of model parameters across domains may introduce source domain bias into target-domain features, significantly impacting the feature-based prediction for target domain samples. (2) The pseudo labels often tend to identify video tubes that are widely present in the source domain, rather than accurately localizing the correct video tubes specific to the target domain samples. To address the above issues, we propose the unsupervised domain adaptation model via Hierarchical dEbiAsing and noisy correction for cRoss-domain video Tube retrieval (HEART), which contains two characteristic modules: Layered Feature Debiasing (including the adversarial feature alignment and the graph-based alignment) and Pseudo Label Refinement. Extensive experiments prove the effectiveness of our HEART model by significantly surpassing the state-of-the-art methods. The code is available (https://anonymous.4open.science/r/HEART).



Paperid:235 Poster
Authors:Tianjiao Xu,Aoxuan Chen,Yuxi Zhao,Jinfei Gao,Tian Gan
Abstract:
Social video platforms have emerged as significant channels for information dissemination, facilitating lively public discussions that often give rise to controversies. However, existing approaches to controversy detection primarily focus on textual features, which raises three key concerns: it underutilizes the potential of visual information available on social media platforms; it is ineffective when faced with incomplete or absent textual information; and the existing datasets fail to adequately address the need for comprehensive multimodal resources on social media platforms. To address these challenges, we construct a large-scale Multimodal Controversial Dataset (MMCD) in Chinese. Additionally, we propose a novel framework named Multi-view Controversy Detection (MVCD) to effectively model controversies from multiple perspectives. Through extensive experiments using state-of-the-art models on the MMCD, we demonstrate MVCD's effectiveness and potential impact.



Paperid:236 Poster
Authors:Haojian Huang,Xiaozhennn Qiao,Zhuo Chen,Haodong Chen,Binyu Li,Zhe Sun,Mulin Chen,Xuelong Li
Abstract:
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach, CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments demonstrate our model's effectiveness and unique explainability across multiple datasets. Our code and data are available at: https://anonymous.4open.science/r/CREST-1CEC.



Paperid:237 Poster
Authors:Kyusun Cho,JoungBin Lee,Heeji Yoon,Yeobin Hong,Jaehoon Ko,Sangjun Ahn,Seungryong Kim
Abstract:
This paper proposes GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a single 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial information of the head and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This method is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Overall, GaussianTalker offers a promising approach for real-time generation of high-quality pose-controllable talking heads.



Paperid:238 Poster
Authors:Haoyu Shi,Huaiwen Zhang
Abstract:
Text-to-Motion Retrieval (TMR) is an emerging task that aims to retrieve relevant motion sequences given a natural language description. The existing dominant approach is to learn a joint embedding space to measure global-level similarities. However, simple global embeddings are insufficient to represent complicated motion and textual details, such as the movement of specific body parts and the coordination among these body parts. In addition, most of the motion variations occur subtly and locally, resulting in semantic vagueness among these motions, which further presents considerable challenges in precisely aligning motion sequences with texts. To address these challenges, we propose a novel Modal-Enhanced Semantic Modeling (MESM) method, focusing on fine-grained alignment through enhanced modal semantics. Specifically, we develop a prompt-enhanced textual module (PTM) to generate detailed descriptions of specific body part movements, which comprehensively captures the fine-grained textual semantics for precise matching. We employ a skeleton-enhanced motion module (SMM) to effectively enhance the model's capability to represent intricate motions. This module leverages a graph convolutional network to meticulously model the intricate spatial dependencies among relevant body parts. To improve the sensitivity to subtle motions, we further propose a text-driven semantics interaction module (TSIM). The TSIM first assigns motion features into a set of aggregated descriptors, then employs cross-attention to aggregate discriminative motion embeddings guided by the text, enabling precise semantic alignment between subtle motions and corresponding texts. Extensive experiments conducted on two widely used benchmark datasets, HumanML3D and KIT-ML, demonstrate the effectiveness of our proposed method. Our approach outperforms existing state-of-the-art retrieval methods, achieving significant Rsum improvements of 24.28% on HumanML3D and 25.80% on KIT-ML.



Paperid:239 Poster
Authors:Bowen Chen,Yun Sing Koh,Gillian Dobbie
Abstract:
Traditional deep learning models often struggle in few-shot learning scenarios, where limited labeled data is available. While the Contrastive Language-Image Pre-training (CLIP) model demonstrates impressive zero-shot capabilities, its performance in few-shot scenarios remains limited. Existing methods primarily aim to leverage the limited labeled dataset, but this offers limited potential for improvement. To overcome the limitations of small datasets in few-shot learning, we introduce a novel framework, SSAT-Adapter, that leverages CLIP's language understanding to generate informative auxiliary tasks and improve CLIP's performance and adaptability in few-shot settings. We utilize CLIP's language understanding to create decision-boundary-focused image latents. These latents form auxiliary tasks, including inter-class instances to bridge CLIP's pre-trained knowledge with the provided examples, and intra-class instances to subtly expand the representation of target classes. A self-paced training regime, progressing from easier to more complex tasks, further promotes robust learning. Experiments show our framework outperforms the state-of-the-art online few-shot learning method by an average of 2.2% on eleven image classification datasets. Further ablation studies on various tasks demonstrate the effectiveness of our approach to enhance CLIP's adaptability in few-shot image classification.



Paperid:240 Poster
Authors:Junzhang Liu,Zhecan Wang,Hammad Ayyubi,Haoxuan You,Chris Thomas,Rui Sun,Shih-Fu Chang,Kai-Wei Chang
Abstract:
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data fosters biased learning and hallucinations as models tend to make similar unwarranted assumptions. To address this issue, we collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. Strong improvements across multiple benchmarks demonstrate the effectiveness of our approach. Further, we develop a general-purpose Context-AwaRe Abstention (CARA) detector to identify samples lacking sufficient context and enhance model accuracy by abstaining from responding if the required context is absent. CARA exhibits generalization to new benchmarks it wasn't trained on, underscoring its utility for future VLU benchmarks in detecting or cleaning samples with inadequate context. Finally, we curate a Context Ambiguity and Sufficiency Evaluation (CASE) set to benchmark the performance of insufficient context detectors. Overall, our work represents a significant advancement in ensuring that vision-language models generate trustworthy and evidence-based outputs in complex real-world scenarios.



Paperid:241 Poster
Authors:Xiangyu Chen,Yihao Liu,Yuandong Pu,Wenlong Zhang,Jiantao Zhou,Yu Qiao,Chao Dong
Abstract:
Building a unified model for general low-level vision tasks has important research and practical value. However, existing methods still face challenges when dealing with diverse low-level vision problems. Multi-task restoration approaches can simultaneously address various degradation-to-clean restoration tasks, while their applicability to tasks with different target domains (e.g., image stylization) remains limited. Existing methods like PromptGIP that can handle tasks with multiple input-target domains mainly rely on the Masked Autoencoder (MAE) training paradigm. Unfortunately, these approaches are restricted by coupling to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, they tend to be sensitive to prompt content and often fail when handling more tasks that involve low-frequency information processing, such as color and style. In this paper, we present a Visual task Prompt-based Image Processing (VPIP) framework to address the above challenges. This framework employs the visual task prompt to process tasks with different input-target domains. Besides, it provides the flexibility to select a backbone network suitable for various low-level vision tasks. A prompt cross-attention mechanism is introduced to deal with the information interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, and it significantly outperforms existing methods both quantitatively and qualitatively.



Paperid:242 Poster
Authors:Subash Khanal,Eric Xing,Srikumar Sastry,Aayush Dhakal,Zhexiao Xiong,Adeel Ahmad,Nathan Jacobs
Abstract:
A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we additionally design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code will be made available at TBD.



Paperid:243 Poster
Authors:Bolin Jiang,Yuqiu Xie,Jiawei Li,Naiqi Li,Bin Chen,Shu-Tao Xia
Abstract:
Pose-agnostic anomaly detection refers to the situation where the pose of test samples is inconsistent with the training dataset, allowing anomalies to appear at any position in any pose. We propose a novel method IGSPAD to address this challenge. Specifically, we employ 3D Gaussian splatting to represent the normal information from the training dataset. To accurately determine the pose of the test sample, we introduce an approach termed Inverting 3D Gaussian Splatting (IGS) to address the challenge of 6D pose estimation for anomalous images. The pose derived from IGS is utilized to render a normal image well-aligned with the test sample. Subsequently, the image encoder of the Segment Anything Model is employed to identify discrepancies between the rendered image and the test sample, predicting the location of anomalies. Experimental results on the MAD dataset demonstrate that the proposed method significantly surpasses the existing state-of-the-art method in terms of precision (from 97.8% to 99.7% at pixel level and from 90.9% to 98.0% at image level) and efficiency.



Paperid:244 Poster
Authors:Jing Bi,Yunlong Tang,Luchuan Song,Ali Vosoughi,Nguyen Nguyen,Chenliang Xu
Abstract:
The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation across tasks like action recognition, procedure learning, and moment retrieval, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the first large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedural knowledge learning. Moreover, EAGLE, a strong video-based multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE's superior performance over existing models, highlighting its ability to balance task-specific understanding with comprehensive video interpretation. With EAGLE, we aim to pave the way for novel research opportunities and practical applications in real-world scenarios.



Paperid:245 Poster
Authors:Xinyu Li,Wenqing Ye,Yueyi Zhang,Xiaoyan Sun
Abstract:
Multimodal sentiment analysis (MSA) aims to predict sentiment from text, audio, and visual data of videos. Existing works focus on designing fusion strategies or decoupling mechanisms, which suffer from low data utilization and a heavy reliance on large amounts of labeled data. However, acquiring large-scale annotations for multimodal sentiment analysis is extremely labor-intensive and costly. To address this challenge, we propose GRACE, a GRadient-based Active learning method with Curriculum Enhancement, designed for MSA under a multi-task learning framework. Our approach achieves annotation reduction by strategically selecting valuable samples from the unlabeled data pool while maintaining high-performance levels. Specifically, we introduce informativeness and representativeness criteria, calculated from gradient magnitudes and sample distances, to quantify the active value of unlabeled samples. Additionally, an easiness criterion is incorporated to avoid outliers, considering the relationship between modality consistency and sample difficulty. During the learning process, we dynamically balance sample difficulty and active value, guided by the curriculum learning principle. This strategy prioritizes easier, modality-aligned samples for stable initial training, then gradually increases the difficulty by incorporating more challenging samples with modality conflicts. Extensive experiments demonstrate the effectiveness of our approach on both multimodal sentiment regression and classification benchmarks.
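To make the informativeness and representativeness criteria more concrete, the sketch below scores unlabeled samples from (i) the closed-form gradient magnitude of the cross-entropy loss under the model's own pseudo-label and (ii) feature-space distance to the labeled pool; the pseudo-label trick, the feature extractor, and the equal-weight combination are illustrative assumptions, not GRACE's exact criteria.

import torch
import torch.nn.functional as F

def active_value(model, unlabeled_x, labeled_feats, feat_fn):
    """Return an active-learning score per unlabeled sample."""
    logits = model(unlabeled_x)
    probs = logits.softmax(-1)
    pseudo = probs.argmax(-1)
    # gradient of CE w.r.t. logits has the closed form: probs - one_hot(pseudo)
    grad = probs - F.one_hot(pseudo, probs.size(-1)).float()
    informativeness = grad.norm(dim=-1)
    feats = F.normalize(feat_fn(unlabeled_x), dim=-1)
    lab = F.normalize(labeled_feats, dim=-1)
    # samples far from every labeled sample add more coverage to the pool
    representativeness = 1 - (feats @ lab.T).max(dim=-1).values
    return informativeness + representativeness

model = torch.nn.Linear(16, 3)
scores = active_value(model, torch.randn(32, 16), torch.randn(10, 16), lambda x: x)
picked = scores.topk(5).indices        # candidates to send for annotation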



Paperid:246 Poster
Authors:Haiyan Jiang,Song Leiyu,dongdong weng,Zhe Sun,Li Huiying,Xiaonuo Dongye,Zhenliang Zhang
Abstract:
Virtual reality (VR) provides an interface to access virtual environments anytime and anywhere, allowing us to experience and interact with an immersive virtual world. It has been widely used in various fields, such as entertainment, training, and education. However, the user's body cannot be separated from the physical world. When users are immersed in virtual scenes, they encounter safety and immersion issues caused by physical objects in the surrounding environment. Although virtual scene synthesis has attracted widespread attention, many popular methods are limited to generating purely virtual scenes independent of the physical environment or simply mapping all physical objects as obstacles. To this end, we propose a scene agent that synthesizes situated 3D virtual scenes as a kind of ubiquitous embodied interface in VR for users. The scene agent synthesizes scenes by perceiving the user's physical environment as well as inferring the user's demands. The synthesized scenes maintain the affordances of the physical environment, enabling immersive users to interact with the physical environment and improving the user's sense of security. Meanwhile, the synthesized scenes maintain the style described by the user, improving the user's immersion. The comparison results show that the proposed scene agent can synthesize virtual scenes with better affordance maintenance, scene diversity, style maintenance, and 3D intersection over union (3D IoU) compared to state-of-the-art baseline methods. To the best of our knowledge, this is the first work that achieves in situ scene synthesis with virtual-real affordance consistency and user demand.



Paperid:247 Poster
Authors:Changhao Peng,Wei Gao
Abstract:
Graph Fourier Transform (GFT) has demonstrated significant effectiveness in the point cloud attribute compression task. However, existing graph modeling methods are based on the geometric relationships of the points, which leads to reduced efficiency of graph transforms in cases where the correlation between attributes and geometry is weak. In this paper, we propose a novel graph modeling method based on attribute prediction values. Specifically, we utilize Gaussian priors to model prediction values, then use maximum a posteriori estimation to learn the Laplacian matrix that best fits the prediction values in order to conduct separate graph transforms on prediction values and ground truth values to derive residuals, and subsequently perform quantization and entropy coding on these residuals. Additionally, since the partitioning of point clouds directly affects the coding performance, we design an adaptive block partitioning method based on ternary search, which selects reference points using a distance threshold r and performs block partitioning and non-reference point attribute prediction based on these reference points. By conducting a ternary search on the distance threshold r, we rapidly identify the optimal block partitioning strategy. Moreover, we introduce an efficient residual encoding method based on Morton codes for the attributes of reference points, while the prediction attributes of non-reference points are modeled using the proposed graph-based modeling approach. Experimental results demonstrate that our method significantly outperforms two attribute compression methods employed by the Moving Picture Experts Group (MPEG) in lossless geometry based attribute compression tasks, with an average of 30.57% BD-rate gain compared to Predictive Lifting Transform (PLT) and an average of 33.54% BD-rate gain compared to Region-Adaptive Hierarchical Transform (RAHT). Our method also exhibits significantly improved rate-distortion performance over the current state-of-the-art method based on GFT.
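Ternary search over the distance threshold r assumes the coding cost is unimodal in r; a minimal generic sketch is shown below, where the cost callable stands in for whatever rate-distortion measure the partitioning produces (the function and its toy argument are illustrative).

def ternary_search_threshold(cost, lo, hi, iters=30):
    """Find the r in [lo, hi] minimizing a cost assumed to be unimodal in r."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2          # the minimum lies in [lo, m2]
        else:
            lo = m1          # the minimum lies in [m1, hi]
    return (lo + hi) / 2.0

# toy unimodal cost with a minimum near r = 1.7
best_r = ternary_search_threshold(lambda r: (r - 1.7) ** 2 + 3.0, 0.1, 10.0)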



Paperid:248 Poster
Authors:Choubo Ding,Guansong Pang
Abstract:
Detecting out-of-distribution (OOD) inputs is a principal task for ensuring the safety of deploying deep-neural-network classifiers in open-world scenarios. OOD samples can be drawn from arbitrary distributions and exhibit deviations from in-distribution (ID) data in various dimensions, such as foreground features (e.g., objects in CIFAR100 images vs. those in CIFAR10 images) and background features (e.g., textural images vs. objects in CIFAR10). Existing methods can confound foreground and background features in training, failing to utilize the background features for OOD detection. This paper considers the importance of feature disentanglement in open-world classification and proposes the simultaneous exploitation of both foreground and background features to support the detection of OOD inputs in open-world classification. To this end, we propose a novel framework that first disentangles foreground and background features from ID training samples via a dense prediction approach, and then learns a new classifier that can evaluate the OOD scores of test images from both foreground and background features. It is a generic framework that allows for a seamless combination with various existing OOD detection methods. Extensive experiments show that our approach 1) can substantially enhance the performance of four different state-of-the-art (SotA) OOD detection methods on multiple widely-used OOD datasets with diverse background features, and 2) achieves new SotA performance on these benchmarks.



Paperid:249 Poster
Authors:Masoumeh Zareapoor,Pourya Shamsolmoali,Huiyu Zhou,Yue Lu,Salvador Garcia
Abstract:
The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences where the algorithm fails to consider multiple detections of the same object or misses small objects. To address this, we propose the Regularized Transport Plan (RTP). This flexible matching strategy captures the cost of aligning predictions with ground-truths in order to find the most fitting correspondence between these sets. The RTP is computed using the differentiable Sinkhorn algorithm to allow for soft, fractional matching rather than strict one-to-one assignments. This approach enhances DETR's adaptability to complex object detection scenarios, providing a nuanced and precise assessment of disparities between prediction and ground-truth distributions. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR surpasses the performance of Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.
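For reference, entropy-regularized optimal transport via Sinkhorn iterations looks like the sketch below, which produces the kind of soft, fractional matching plan discussed above; the uniform mass vectors, the regularization strength eps, and the iteration count are generic choices, not the paper's configuration.

import torch

def sinkhorn_plan(cost, a, b, eps=0.1, iters=100):
    """Entropy-regularized transport plan between predictions and ground truths.

    cost: (num_preds, num_gts) matching cost matrix
    a, b: source/target mass vectors (sum to 1 each)
    """
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):
        u = a / (K @ (b / (K.t() @ u)))        # alternating row/column scaling
    v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]         # plan: rows sum to a, columns to b

n_pred, n_gt = 6, 4
cost = torch.rand(n_pred, n_gt)
plan = sinkhorn_plan(cost, torch.full((n_pred,), 1.0 / n_pred),
                     torch.full((n_gt,), 1.0 / n_gt))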



Paperid:250 Poster
Authors:Xiaobin Lu,Xiaobin Hu,Jun Luo,zhuben,paulruan,Wenqi Ren
Abstract:
Blind face restoration aims to restore a sharp face image from a degraded counterpart. Recent methods using GANs as priors have achieved notable success in this domain. However, these methods still struggle to balance realism and fidelity when facing complex degradation scenarios. In this paper, we propose a novel framework by embedding 3D facial priors into a denoising diffusion model, enabling the extraction of facial structure and identity information from 3D facial images. Specifically, the degraded image undergoes initial processing through a pre-trained restoration network to obtain an incompletely restored face image. This image is then fed into the 3D Morphable Model (3DMM) to reconstruct a 3D facial image. During the denoising process, the structural and identity information is extracted from the 3D prior image using a multi-level feature extraction module. Given that the denoising process of the diffusion model primarily involves initial structure refinement followed by texture detail enhancement, we propose a time-aware fusion block (TAFB). This module can provide more effective fusion information for denoising as the time step changes. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration.



Paperid:251 Poster
Authors:Hongyu Zhu,Sichu liang,Wentao Hu,Li Fangqi,Ju Jia,Shi-Lin Wang
Abstract:
With the rise of Machine Learning as a Service (MLaaS) platforms, safeguarding the intellectual property of deep learning models is becoming paramount. Among various protective measures, trigger set watermarking has emerged as a flexible and effective strategy for preventing unauthorized model distribution. However, this paper identifies an inherent flaw in the current paradigm of trigger set watermarking: evasion adversaries can readily exploit the shortcuts created by models memorizing watermark samples that deviate from the main task distribution, significantly impairing their generalization in adversarial settings. To counteract this, we leverage diffusion models to synthesize unrestricted adversarial examples as trigger sets. By training the model to accurately recognize them, unique watermark behaviors are promoted through knowledge injection rather than error memorization, thus avoiding exploitable shortcuts. Furthermore, we uncover that the resistance of current trigger set watermarking against removal attacks primarily relies on significantly damaging the decision boundaries during embedding, intertwining unremovability with adverse impacts. By optimizing the knowledge transfer in protected models during extraction, our approach conveys watermark behaviors without aggressive decision boundary perturbation. Experimental results on CIFAR-10/100 and Imagenette datasets demonstrate the effectiveness of our method, showing not only improved robustness against evasion adversaries but also superior resistance to watermark removal attacks compared to existing state-of-the-art solutions.



Paperid:252 Poster
Authors:Dehao Ying,Fengchang Yu,Haihua Chen,Wei Lu
Abstract:
Even though significant progress has been made in standardizing document layout analysis, complex layout documents like magazines, newspapers, and posters still present challenges. Models trained on standardized documents struggle with these complexities, and the high cost of annotating such documents limits dataset availability. To address this, we propose the Complex Layout Document Image Generation (DIG) model, which can generate diverse document images with complex layouts and authentic-looking text, aiding in layout analysis model training. Concretely, we first pretrain DIG on a large-scale document dataset with a text-sensitive loss function to address the issue of unrealistic generation of text regions. Then, we fine-tune it with a small number of documents with complex layouts to generate new images with the same layout. Additionally, we use a layout generation model to create new layouts, enhancing data diversity. Finally, we design a box-wise quality scoring function to filter out low-quality regions during layout analysis model training to enhance the effectiveness of using the generated images. Experimental results on the DSSE-200 and PRImA datasets show that when incorporating generated images from DIG, the mAP of the layout analysis model improves from 47.05 to 56.07 and from 53.80 to 62.26, respectively, corresponding to enhancements of 19.17% and 15.72% over the baseline.



Paperid:253 Poster
Authors:Shaokun Zhang,Yiran Wu,Zhonghua Zheng,Qingyun Wu,Chi Wang
Abstract:
In this work, we propose a hyperparameter optimization method named HyperTime to find hyperparameters robust to potential temporal distribution shifts in the unseen test data. Our work is motivated by an important observation that it is, in many cases, possible to achieve temporally robust predictive performance via hyperparameter optimization. Based on this observation, we leverage the ‘worst-case-oriented’ philosophy from the robust optimization literature to help find such robust hyperparameter configurations. HyperTime imposes a lexicographic priority order on average validation loss and worst-case validation loss over chronological validation sets. We perform a theoretical analysis on the upper bound of the expected test loss, which reveals the unique advantages of our approach. We also demonstrate the strong empirical performance of the proposed method on multiple machine learning tasks with temporal distribution shifts.
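One simple way to realize a lexicographic priority of average over worst-case validation loss is to compare worst-case loss only among configurations whose average loss is within a small tolerance of the best; the sketch below uses that tolerance-based tie handling and toy numbers purely for illustration, not HyperTime's actual optimizer.

def lexicographic_key(fold_losses):
    """Key for one configuration evaluated on chronological validation folds."""
    avg = sum(fold_losses) / len(fold_losses)
    worst = max(fold_losses)
    return avg, worst

def select_config(results, tolerance=1e-3):
    # results: {config_name: [loss_fold1, loss_fold2, ...]}
    keys = {c: lexicographic_key(l) for c, l in results.items()}
    best_avg = min(k[0] for k in keys.values())
    # keep only configs essentially tied on average loss, then break ties
    # by worst-case loss -> temporally robust choice
    candidates = [c for c, k in keys.items() if k[0] <= best_avg + tolerance]
    return min(candidates, key=lambda c: keys[c][1])

configs = {"A": [0.10, 0.30, 0.50], "B": [0.30, 0.30, 0.30], "C": [0.35, 0.35, 0.35]}
print(select_config(configs))   # -> "B": ties A on average but is far more robust on the worst fold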



Paperid:254 Poster
Authors:Xianghu Yue,Xueyi Zhang,Yiming Chen,Chengwei Zhang,Mingrui Lao,Huiping Zhuang,Xinyuan Qian,Haizhou Li
Abstract:
Class-incremental learning poses a significant challenge under an exemplar-free constraint, leading to catastrophic forgetting and sub-par incremental accuracy. Previous attempts have focused primarily on single-modality tasks, such as image classification or audio event classification. However, in the context of Audio-Visual Class-Incremental Learning (AVCIL), the effective integration and utilization of heterogeneous modalities, with their complementary and enhancing characteristics, remains largely unexplored. To bridge this gap, we propose the Multi-Modal Analytic Learning (MMAL) framework, an exemplar-free solution for AVCIL that employs a closed-form, linear approach. To be specific, MMAL introduces a modality fusion module that re-formulates the AVCIL problem through a Recursive Least-Squares (RLS) perspective. Complementing this, a Modality-Specific Knowledge Compensation (MSKC) module is designed to further alleviate the under-fitting limitation intrinsic to analytic learning by harnessing individual knowledge from the audio and visual modalities in tandem. Comprehensive experimental comparisons with existing methods show that our proposed MMAL demonstrates superior performance with accuracies of 76.71%, 78.98%, and 76.19% on the AVE, Kinetics-Sounds, and VGGSounds100 datasets, respectively, setting new state-of-the-art AVCIL performance. Notably, compared to those memory-based methods, our MMAL, being an exemplar-free approach, provides good data privacy and can better leverage multi-modal information for improved incremental accuracy.
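The generic RLS view of analytic (closed-form) class-incremental learning can be sketched as a ridge-regression classifier whose inverse correlation matrix is updated with the Woodbury identity, so new data are absorbed without revisiting old data; this is an illustrative sketch of the standard technique, not MMAL's fused audio-visual module, and the dimensions are arbitrary.

import torch

class AnalyticClassifier:
    """Exemplar-free ridge-regression classifier maintained via recursive least squares."""
    def __init__(self, dim, num_classes, lam=1.0):
        self.R = torch.eye(dim) / lam          # running (X^T X + lam I)^{-1}
        self.W = torch.zeros(dim, num_classes) # linear classifier weights

    def update(self, X, Y):
        # Woodbury identity: fold the new batch X (n x d) into the inverse
        n = X.size(0)
        K = torch.linalg.inv(torch.eye(n) + X @ self.R @ X.t())
        self.R = self.R - self.R @ X.t() @ K @ X @ self.R
        # standard RLS weight update using the refreshed inverse
        self.W = self.W + self.R @ X.t() @ (Y - X @ self.W)

    def predict(self, X):
        return (X @ self.W).argmax(dim=-1)

clf = AnalyticClassifier(dim=64, num_classes=10)
for _ in range(3):                              # three incremental phases, no stored exemplars
    X = torch.randn(100, 64)
    Y = torch.nn.functional.one_hot(torch.randint(0, 10, (100,)), 10).float()
    clf.update(X, Y)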



Paperid:255 Poster
Authors:qianxinhuang,Siyao Peng,Xiaobo Shen,Yun-Hao Yuan,Shirui Pan
Abstract:
As social networks grow exponentially, there is an increasing demand for video retrieval using natural language. Cross-modal hashing, which encodes multi-modal data using compact hash codes, has been widely used in large-scale image-text retrieval, primarily due to its computation and storage efficiency. When applied to video-text retrieval, existing unsupervised cross-modal hashing extracts frame- or word-level features individually, and thus ignores long-term dependencies. In addition, effectively exploiting the multi-modal structure poses a significant challenge due to the intricate nature of video and text. To address the above issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. SPTCH encodes video and text with a bidirectional Transformer encoder that exploits their long-term dependencies. SPTCH constructs a multi-modal collaborative graph to model correlations among multi-modal data, and applies semantic aggregation by employing a Graph Convolutional Network (GCN) on such a graph. SPTCH designs an unsupervised multi-modal contrastive loss and a neighborhood reconstruction loss to effectively exploit the inter- and intra-modal similarity structure among videos and texts. The empirical results on three video benchmark datasets demonstrate that the proposed SPTCH generally outperforms state-of-the-art methods in video-text retrieval.



Paperid:256 Poster
Authors:Kai Shao,Rui Wang,yixue Hao,Long Hu,Min Chen,Hans Arno Jacobsen
Abstract:
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signal representation learning framework using a Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representations associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signal datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks. We will release the code and weights after review.
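As a rough illustration of maximizing cross-modal semantic similarity, a symmetric InfoNCE-style objective between paired fNIRS and EEG embeddings could look like the following sketch. The function name, temperature, and single-scale formulation are assumptions and do not reproduce the paper's multiscale contrasting modules.

import torch
import torch.nn.functional as F

def semantic_consistency_loss(z_fnirs, z_eeg, temperature=0.1):
    # z_fnirs, z_eeg: (batch, dim) embeddings of the same stimulation segments.
    z_f = F.normalize(z_fnirs, dim=-1)
    z_e = F.normalize(z_eeg, dim=-1)
    logits = z_f @ z_e.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_f.size(0), device=z_f.device)
    # Matching (fNIRS_i, EEG_i) pairs are positives; all other pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))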



Paperid:257 Poster
Authors:Pengxu Chen,Huazhong Liu,Jihong Ding,Jiawen Luo,Peng Tan,Laurence T. Yang
Abstract:
As visual interpretation methods for convolutional neural networks (CNNs), backpropagation attribution methods have been garnering growing attention. Nevertheless, the majority of those methods concentrate merely on the final convolutional layer, leading to tiny and concentrated interpretations that fail to adequately clarify where the model centers its attention. Therefore, we propose a precise attribution method (i.e., Holistic-CAM) for high-definition visual interpretation in the holistic stage of CNNs. Specifically, we first present weighted positive gradients to guarantee the sanity of interpretations in shallow layers and leverage multi-scale fusion to improve the resolution across the holistic stage. Then, we further propose fundamental scale denoising to eliminate the faithless attribution originating from fusing larger-scale components. The proposed method is capable of simultaneously rendering fine-grained and faithful attribution for CNNs from shallow to deep layers. Extensive experimental results demonstrate that Holistic-CAM outperforms state-of-the-art methods on commonly used benchmarks on ImageNet-1k, including deletion and insertion, the energy-based pointing game, and remove and debias, and it also passes the sanity check easily.
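A hedged sketch of the two ideas named above, positive-gradient weighting and multi-scale fusion across layers, is given below. The function names, normalization, and averaging-based fusion are illustrative assumptions; the fundamental scale denoising step and the exact weighting used by Holistic-CAM are not shown.

import torch
import torch.nn.functional as F

def positive_gradient_cam(activation, gradient):
    # activation, gradient: (C, H, W) tensors captured at one convolutional layer.
    weights = torch.relu(gradient)                       # keep only positive gradients
    cam = torch.relu((weights * activation).sum(dim=0))  # weighted sum over channels
    return cam

def multiscale_fusion(cams, out_size):
    # Upsample per-layer maps to a common resolution and fuse them by averaging.
    maps = []
    for cam in cams:
        cam = cam / (cam.max() + 1e-8)                   # normalize each scale
        cam = F.interpolate(cam[None, None], size=out_size,
                            mode="bilinear", align_corners=False)[0, 0]
        maps.append(cam)
    return torch.stack(maps).mean(dim=0)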



Paperid:258 Poster
Authors:Hao Gu,Jiangyan Yi,Chenglong Wang,Yong Ren,Jianhua Tao,Xinrui Yan,Yujie Chen,Xiaohui Zhang
Abstract:
Fake audio detection is an emerging active topic. A growing body of literature has aimed to detect fake utterances, which are mostly generated by Text-to-speech (TTS) or voice conversion (VC). However, countermeasures against impersonation remain an underexplored area. Impersonation is a fake type that involves an imitator replicating specific traits and the speech style of a target speaker. Unlike TTS and VC, which often leave digital traces or signal artifacts, impersonation involves live human beings producing entirely natural speech, rendering the detection of impersonation audio a challenging task. Thus, we propose a novel method that integrates speaker profiles into the process of impersonation audio detection. Speaker profiles are inherent characteristics that are challenging for impersonators to mimic accurately, such as a speaker's age and occupation. We aim to leverage these features to extract discriminative information for detecting impersonation audio. Moreover, no large impersonated speech corpus is available for the quantitative study of impersonation impacts. To address this gap, we further design the first large-scale, diverse-speaker Chinese impersonation dataset, named ImPersonation Audio Detection (IPAD), to advance the community's research on impersonation audio detection. We evaluate several existing fake audio detection methods on our proposed IPAD dataset, demonstrating its necessity and the challenges it poses. Additionally, our findings reveal that incorporating speaker profiles can significantly enhance the model's performance in detecting impersonation audio.



Paperid:259 Poster
Authors:Fanfan Wang,Heqing Ma,Xiangqing Shen,Jianfei Yu,Rui Xia
Abstract:
Emotion cause analysis has attracted increasing attention in recent years. Extensive research has been dedicated to multimodal emotion recognition in conversations. However, the integration of multimodal information with emotion cause remains underexplored. Existing studies merely extract utterances or spans from conversations as cause evidence, which may not be concise and clear enough; in particular, the lack of explicit descriptions of other modalities makes it difficult to intuitively understand the causes. To address these limitations, we introduce a new task named Multimodal Emotion Cause Generation in Conversations (MECGC), which aims to generate an abstractive summary describing the causes that trigger the given emotion based on the multimodal context of conversations. We accordingly construct a dataset named ECGF that contains 1,374 conversations and 7,690 emotion instances from TV series. We further develop a generative framework that first generates emotion-cause-aware video captions (Observe) and then facilitates the generation of emotion causes (Generate). The captioning model is trained with examples synthesized by a Multimodal Large Language Model (MLLM). Experimental results demonstrate the effectiveness of our framework and the significance of multimodal information for emotion cause analysis.



Paperid:260 Poster
Authors:Fulin Luo,Yi Liu,Xiuwen Gong,Zhixiong Nan,Tan Guo
Abstract:
Cross-view consensus representation plays a critical role in hyperspectral image (HSI) clustering, and recent multi-view contrastive clustering methods utilize contrastive losses to extract contextual consensus representations. However, these methods have a fatal flaw: contrastive learning treats similar heterogeneous views as positive sample pairs and dissimilar homogeneous views as negative sample pairs. At the same time, the data representation obtained with a self-supervised contrastive loss is not specifically designed for clustering. Thus, to tackle this challenge, we propose a novel multi-view clustering method named Enhanced Multi-View Contrastive Clustering (EMVCC). First, the spatial multi-view is designed to learn diverse features for contrastive clustering, and the globally relevant information of the spectrum-view is extracted with a Transformer to enhance the spatial multi-view differences between neighboring samples. Then, a joint self-supervised loss is designed to constrain the consensus representation from different perspectives to efficiently avoid false negative pairs. Specifically, to preserve the diversity of multi-view information, the features are enhanced by a probabilistic contrastive loss, and the data are projected into a representation space with semantic information, such that similar samples in this space are closer in distance. Finally, we design a novel clustering loss that aligns the view feature representation with high-confidence pseudo-labels to promote the network to learn cluster-friendly features. In the training process, the joint self-supervised loss is used to optimize the interactive cross-view features. Abundant experimental studies on numerous benchmarks verify the superiority of EMVCC in comparison to some state-of-the-art clustering methods. The code will be released later.



Paperid:261 Poster
Authors:Jing Ye,Xinpei Zhao
Abstract:
Multimodal Emotion Recognition in Conversations aims to understand the human emotion of each utterance in a conversation from different types of data, such as speech and text. Previous works mainly focus on either complex unimodal feature extraction or sophisticated fusion techniques as general multimodal classification tasks do. However, they ignore the process of human perception, neglecting various levels of emotional features within each modality and disregarding the unique contributions of different modalities for emotion recognition. To address these issues, we propose a more cognitive-aligned multimodal fusion framework, namely DQ-Former. Specifically, DQ-Former utilizes a small set of learnable query tokens to collate and condense various granularities of emotion cues embedded at different layers of pre-trained unimodal models. Subsequently, it integrates these emotional features from different modalities with dynamic modality priorities at each intermediate fusion layer. This process enables explicit and effective fusion of different levels of information from diverse modalities. Extensive experiments on MELD and IEMOCAP datasets validate the effectiveness of DQ-Former. Our results show that the proposed method achieves a robust and interpretable multimodal representation for emotion recognition.
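The core collation step, a small set of learnable query tokens cross-attending to features gathered from several layers of a frozen unimodal encoder, can be sketched as below. Dimensions, the number of queries, and the module name are assumptions used for illustration only; the dynamic modality-priority fusion layers are not shown.

import torch
import torch.nn as nn

class QueryCollator(nn.Module):
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, layer_features):
        # layer_features: list of (batch, seq_len, dim) tensors taken from different
        # layers of a frozen unimodal encoder; returns (batch, num_queries, dim).
        kv = torch.cat(layer_features, dim=1)             # concatenate all layers
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        collated, _ = self.attn(q, kv, kv)                # queries attend to the features
        return collated

# Example usage:
# feats = [torch.randn(2, 50, 256) for _ in range(3)]
# tokens = QueryCollator()(feats)   # -> (2, 8, 256)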



Paperid:262 Poster
Authors:Dizhan Xue,Shengsheng Qian,Changsheng Xu
Abstract:
A key objective in eXplainable Artificial Intelligence (XAI) is to create intelligent systems capable of reasoning and explaining real-world data to facilitate reliable decision-making. Recent studies have acknowledged the importance of providing user-friendly and verifiable explanations to facilitate trustworthy Visual Question Answering (VQA) systems. This paper aims to promote explainable VQA from both data and method perspectives. First, we propose a new Standard Multimodal Explanation (SME) dataset and a new Few-Shot Multimodal Explanation for VQA (FS-MEVQA) task, which aims to generate the multimodal explanation of the underlying reasoning process for solving visual questions with few training samples. Our SME dataset includes 1,028,230 samples composed of questions, images, answers, and multimodal explanations, which can facilitate research in both traditional MEVQA and FS-MEVQA. To the best of our knowledge, this is the first large-scale dataset with joint language-vision explanations based on standard English and additional visual grounding tokens, which bridge MEVQA to a broad field in Natural Language Processing (NLP). Second, we propose a training-free Multimodal Explaining Agent (MEAgent) method based on an LLM agent with multimodal open-world tools to infer answers and generate multimodal explanations for visual questions. Our MEAgent can learn multimodal explaining from merely $N(=16)$ training samples and leverage open-world abilities to perform FS-MEVQA on test samples. Comprehensive experimental results evaluated by language quality metrics, a visual detection metric, and visual attribution metrics on our SME dataset indicate the superiority of our method for FS-MEVQA, compared to state-of-the-art MEVQA methods and the multimodal LLM GPT-4V.



Paperid:263 Poster
Authors:Kaifang Yang,Xinrong Zhao,Yanchao Gong
Abstract:
With the rapid development of multimedia applications such as online education, remote conferences, and telemedicine, an emerging type of image known as text screen content images (TSCI) has gained widespread utilization. Unlike natural images captured by cameras, TSCI is generally generated or rendered by computers and exhibits significant differences in content characteristics. Notably, TSCI primarily comprises text, a symbol system uniquely defined by humans with specific semantics. As an important carrier for transmitting semantic information, the quality of text in TSCI significantly affects the subjective perception experience of multimedia system users. Just noticeable difference (JND) is a widely studied image quality measure that is theoretically closest to human perception. However, traditional JND (T-JND) tests fail to distinguish text from other image contents, ignoring the significant impact of the semantic readability of text on image quality. This paper focuses for the first time on the impact of text semantics on the quality of TSCI, and JND experiments for TSCI compressed by the state-of-the-art versatile video coding (VVC) standard are explored and discussed. Specifically, a matching TSCI database is first established. Using the database, subjective image observation comparison experiments are further designed and carried out to construct the T-JND as well as the semantic-aware JND (S-JND). By comparing the experimental results, crucial conclusions are reached, including the fact that the S-JND provides a more precise description of the quality of TSCI compared to the T-JND. These conclusions have important guiding significance for the subsequent development of efficient JND models suitable for TSCI compressed by VVC.



Paperid:264 Poster
Authors:Luca Rossetto,Cristina Sarasua,Abraham Bernstein
Abstract:
Image descriptions provide precious information for a myriad of visual media management tasks ranging from image classification to image search. The value of such curated collections comes from their diverse content and their accompanying extensive annotations. Such annotations are typically supplied by communities, where users (often volunteers) curate labels and/or descriptions of images. Supporting users in their quest to increase (overall) description completeness where possible is, therefore, of utmost importance. In this paper, we introduce the notion of visual semantic density, which we define as the amount of information necessary to describe an image comprehensively such that the image content can be accurately inferred from the description. Together with the already existing annotations, this measure can estimate the annotation completeness, helping to identify collection content with missing annotations. We conduct user experiments to understand how humans perceive visual semantic density in different image collections to identify suitable proxy measures for our notion of visual semantic density. We find that extensive image captions can serve as a proxy to calculate an image's semantic density. Furthermore, we implement a visual semantic density estimator capable of approximating the human perception of the measure. We evaluate the performance of this estimator on several image datasets, concluding that it is feasible to sort images automatically by their visual semantic density, thereby allowing for the efficient scheduling of annotation tasks. Consequently, we believe that the visual semantic density estimation process can be used as a completeness measure to give feedback to annotating users in diverse visual content ecosystems, such as Wikimedia Commons.



Paperid:265 Poster
Authors:Haoxuan Li,Zhengmao Yang,Yunshan Ma,Yi Bin,Yang Yang,Tat-Seng Chua
Abstract:
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to using text or graph modalities, the investigation of utilizing images for temporal event forecasting has received less attention, particularly in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementing. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions, and furthermore, incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts will be released upon acceptance.



Paperid:266 Poster
Authors:Jiali Chen,Xusen Hei,Yuqi Xue,Yuancheng Wei,Jiayuan Xie,Yi Cai,Qing Li
Abstract:
Large multimodal models (LMMs) have shown remarkable performance in the visual commonsense reasoning (VCR) task, which aims to answer a multiple-choice question based on visual commonsense within an image. However, the ability of LMMs to correct potential visual commonsense errors in the distractor upon their occurrence remains under-explored. Drawing inspiration from how a human teacher crafts challenging distractors to test students' comprehension of concepts or skills and assists them in identifying and correcting errors toward the answer, we present the pioneering research that enables LMMs to simulate this error-correction learning process. To this end, we employ GPT-4 as a ``teacher'' to collect the explainable feedback dataset VCR-DF for error correction, which serves as a benchmark to evaluate the ability of LMMs to identify misconceptions and clarify the reasons behind the errors in VCR distractors toward final answers. In addition, we propose an LMM-based Pedagogical Expert Instructed Feedback Generation (PEIFG) model that incorporates learnable expert prompts and multimodal instruction as guidance for feedback generation. Experimental results show that our PEIFG significantly outperforms existing LMMs. We believe our benchmark carves out a new direction for evaluating the capabilities of LMMs.



Paperid:267 Poster
Authors:Zecheng Wang,Xinye Li,Zhanyue Qin,Chunshan Li,Zhiying Tu,Dianhui Chu,Dianbo Sui
Abstract:
Multimodal large language models (MLLMs) have been observed to exhibit biases originating from their training datasets. Unlike unimodal LLMs, biases in MLLMs may stem from interactions between multiple modalities, which increases the complexity of multimodal debiasing. Conventional approaches like fine-tuning to alleviate biases in models are costly and data-hungry. Model editing methods, which focus on post-hoc modifications of model knowledge, have recently demonstrated significant potential across diverse applications. These methods can effectively and precisely adjust the behavior of models in specific knowledge domains, while minimizing the impact on the overall performance of the model. However, there is currently no comprehensive study that drives the application of model editing methods to debiasing MLLMs and analyzes their pros and cons. To facilitate research in this field, we define the debiasing problem of MLLMs as an editing problem and propose a novel set of evaluation metrics for MLLM debias editing. Through various experiments, we demonstrate that: (1) Existing model editing methods can effectively alleviate biases in MLLMs and can generalize well to semantically equivalent image-text pairs. However, most methods tend to adversely affect the stability of the MLLM. (2) Compared to editing the visual modality of the MLLM, editing the textual modality yields better results in addressing MLLM biases. (3) Model editing-based debiasing methods can achieve generalization across different types of biases.



Paperid:268 Poster
Authors:Xinghao Wu,Xuefeng Liu,Jianwei Niu,Haolin Wang,Shaojie Tang,Guogang Zhu,Hao Su
Abstract:
To address data heterogeneity, the key strategy of personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods by up to 4.9%.
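The additive decomposition with a low-rank personalized component can be sketched for a single linear layer as follows; the class name, initialization, and rank are illustrative assumptions. In a FedDecomp-style setup, only the shared matrix would be uploaded and aggregated by the server, while the low-rank factors stay on the client and the two parts are trained alternately.

import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=4):
        super().__init__()
        self.shared = nn.Parameter(torch.empty(out_dim, in_dim))  # aggregated across clients
        nn.init.kaiming_uniform_(self.shared)
        # Personalized low-rank factors kept local to each client.
        self.A = nn.Parameter(torch.zeros(out_dim, rank))
        self.B = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        # Each effective weight is the sum of shared and personalized knowledge.
        weight = self.shared + self.A @ self.B
        return x @ weight.t() + self.bias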



Paperid:269 Poster
Authors:Shuang Wang,Pengyi Hao,Fuli Wu,Cong Bai
Abstract:
To address the limitations of current self-knowledge distillation, namely never fully utilizing the knowledge of shallow exits and neglecting the impact of the auxiliary exits' structure on network performance, a novel self-knowledge distillation framework via virtual teacher-students mutual learning, named LOTH, is proposed in this paper. A knowledgeable virtual teacher is constructed from the rich feature maps of each exit to help the learning of each exit. Meanwhile, the logit knowledge of each exit is incorporated to guide the learning of the virtual teacher. They learn mutually through the well-designed loss in LOTH. Moreover, two kinds of auxiliary building blocks are designed to balance the efficiency and effectiveness of the network. Extensive experiments with diverse backbones on CIFAR-100 and Tiny-ImageNet validate the effectiveness of LOTH, which achieves superior performance with fewer resources in comparison with state-of-the-art distillation methods. The code of LOTH is available on GitHub.



Paperid:270 Poster
Authors:Shuman Zhuang,Sujia Huang,Wei Huang,Yuhong Chen,Zhihao Wu,Ximeng Liu
Abstract:
With the growing diversity of data sources, multi-view learning methods have attracted considerable attention. Among these, by modeling the multi-view data as multi-view graphs, multi-view Graph Neural Networks (GNNs) have shown encouraging performance on various multi-view learning tasks. Message passing is the critical mechanism that empowers GNNs with a superior capacity to process complex graph data. However, most multi-view GNNs are designed upon well-established overall frameworks, overlooking the intrinsic challenges of message passing in multi-view scenarios. To clarify this, we first revisit the message passing mechanism from a graph smoothing perspective, revealing the key to designing multi-view message passing. Following the analysis, in this paper, we propose an enhanced GNN framework termed Confluent Graph Neural Networks (CGNN), with Cross-view Confluent Message Passing (CCMP) tailored for multi-view learning. Inspired by the optimization of an improved multi-view graph smoothing problem, CCMP contains three sub-modules that enable the interaction between graph structures and consistent representations, which makes it aware of consistency and complementarity information across views. Extensive experiments on four types of data, including multi-modality data, demonstrate that our proposed model exhibits superior effectiveness and robustness.
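The graph-smoothing view of message passing mentioned above can be illustrated with a single-view sketch: iterating a convex combination of the input features and their normalized neighborhood aggregation approximately solves a Laplacian smoothing objective. The function names and the single-view simplification are assumptions; the cross-view CCMP operator itself is not reproduced here.

import numpy as np

def normalized_adjacency(A):
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def smooth_propagate(H, A, alpha=0.5, steps=8):
    # H: (n, d) input features; A: (n, n) adjacency of one view.
    A_hat = normalized_adjacency(A)
    X = H.copy()
    for _ in range(steps):
        # Trade off fidelity to the input features against smoothness over the graph.
        X = (1 - alpha) * H + alpha * A_hat @ X
    return X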



Paperid:271 Poster
Authors:Xuechen Guo,Wenhao Chai,Shi-Yan Li,Gaoang Wang
Abstract:
Multimodal Large Language Model (MLLM) has recently garnered significant attention as a prominent research focus. By harnessing the capability of a powerful Large Language Model (LLM), it facilitates the transition of conversational generative AI from unimodal text to performing multimodal tasks. This blooming development has begun to significantly impact the medical field. However, visual language models in the general domain lack the sophisticated comprehension required for medical visual conversations. Even some models specifically tailored for the medical domain often produce answers that tend to be vague and weakly related to the visual contents. In this paper, we propose a fine-grained and adaptive visual language model architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to achieve enhancement of subtle medical visual semantics. Then we note the data redundancy that is common in medical scenes but ignored in most prior works. In cases of a single text paired with multiple figures, we utilize weighted scoring with knowledge distillation to adaptively screen valid images mirroring text descriptions. For execution, we leverage a large-scale Chinese ultrasound multimodal dataset obtained first-hand from the hospital database. We create instruction-following data based on text derived from doctors, which ensures professionalism and thus contributes to effective tuning. With the enhanced architecture and quality data, our Large Chinese Language and Vision Assistant for Ultrasound (LLaVA-Ultra) shows strong capability and robustness in medical scenarios. On three medical visual question answering datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.



Paperid:272 Poster
Authors:Yuanyuan Liu,Yuxuan Huang,Shuyang Liu,Yibing Zhan,Zijing Chen,Zhe Chen
Abstract:
In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.



Paperid:273 Poster
Authors:Yanbin Deng,Zheng Li,Ning Xie,Wei Zhang
Abstract:
Motion transitions, which serve as bridges between two sequences of character animation, play a crucial role in creating long variable animation for real-time 3D interactive applications. In this paper, we present a framework to produce hybrid character animation, which combines motion capture animation and physical simulation animation that seamlessly connects the front and back motion clips. In contrast to previous works using interpolation for transition, our physics-based approach inherently ensures physical validity, and both the transition moment of the source motion clip and the horizontal rotation of the target motion clip can be specified arbitrarily within a certain range, which achieves high responsiveness and wide latitude for user control. The control policy of character can be trained automatically using only the motion capture data that requires transition, and is enhanced by our proposed Self-Behavior Cloning (SBC), an approach to improve the unsupervised reinforcement learning of motion transition. We show that by making an appropriate trade-off between diversity and stability of transition, our framework can accomplish the interactive transition tasks from a fully-connected state machine constructed from nine motion clips with high accuracy and naturalness.



Paperid:274 Poster
Authors:Fengze Jiang,Shuling Wang,Xiaojin Gong
Abstract:
Multi-task dense prediction plays an important role in the field of computer vision and has an abundant array of applications. Its main purpose is to reduce the number of network training parameters by sharing network parameters while using the correlation between tasks to improve overall performance. We propose a task-conditional network that handles one task at a time and shares most network parameters to achieve these goals. Inspired by adapter tuning, we propose an adapter module that focuses on both spatial- and channel-wise information to extract features from the frozen encoder backbone. This approach not only reduces the number of training parameters, but also saves training time and memory resources by attaching a parallel adapter pathway to the encoder. We additionally use learnable task prompts to model different tasks and use these prompts to adjust some parameters of the adapters to fit the network to diverse tasks. These task-conditional adapters are also applied to the decoder, which enables the entire network to switch between various tasks, producing better task-specific features and achieving excellent performance. Extensive experiments on two challenging multi-task benchmarks, NYUD-v2 and PASCAL-Context, show that our approach achieves state-of-the-art performance with excellent parameter, time, and memory efficiency.
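A rough sketch of a parallel, task-conditional adapter attached to a frozen encoder block is shown below; the bottleneck design, the prompt-based channel rescaling, and all names (including the hypothetical frozen_block in the usage note) are assumptions used only to illustrate the general adapter-tuning pattern the abstract describes.

import torch
import torch.nn as nn

class TaskConditionalAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64, num_tasks=2):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # One learnable prompt per task, used here to rescale the bottleneck channels.
        self.task_prompts = nn.Parameter(torch.ones(num_tasks, bottleneck))

    def forward(self, x, task_id):
        # x: (batch, tokens, dim) features from a frozen encoder block.
        h = torch.relu(self.down(x)) * self.task_prompts[task_id]
        return self.up(h)

# Example usage (frozen_block is a hypothetical frozen encoder stage):
# frozen_out = frozen_block(x).detach()
# out = frozen_out + adapter(frozen_out, task_id=0)   # parallel adapter pathway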



Paperid:275 Poster
Authors:Wenbin Zou,Hongxia Gao,Weipeng Yang,Tongtong Liu
Abstract:
Ultra-High-Definition (UHD) technology has attracted widespread attention due to its exceptional visual quality, but it also poses new challenges for low-light image enhancement (LLIE) techniques. UHD images inherently possess high computational complexity, leading existing UHD LLIE methods to employ high-magnification downsampling to reduce computational costs, which in turn results in information loss. The wavelet transform not only allows downsampling without loss of information, but also separates the image content from the noise. It enables state-space models (SSMs) to avoid being affected by noise when modeling long sequences, thus making full use of the long-sequence modeling capability of SSMs. On this basis, we propose Wave-Mamba, a novel approach that combines the wavelet transform and SSMs. Our method is based on two pivotal insights derived from the wavelet domain: 1) most of the content information of an image exists in the low-frequency component, with less in the high-frequency component; 2) the high-frequency component exerts a minimal influence on the outcomes of low-light enhancement. Specifically, to efficiently model global content information on UHD images, we propose a low-frequency state space (LFSS) module by improving SSMs to focus on restoring the information of low-frequency sub-bands. Moreover, we propose a high-frequency enhancement module (HFEM) for high-frequency sub-band information, which uses the enhanced low-frequency information to correct the high-frequency information and effectively restore the correct high-frequency details. Through comprehensive evaluation, our method has demonstrated superior performance, significantly outshining current leading techniques while maintaining a more streamlined architecture.
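The wavelet-domain premise, that a discrete wavelet transform splits an image into a content-rich low-frequency sub-band and detail-only high-frequency sub-bands without losing information, can be checked with a short PyWavelets sketch. The Haar wavelet, the random stand-in image, and the single decomposition level are assumptions; the LFSS and HFEM modules themselves are not shown.

import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)   # stand-in for one UHD luma channel

# Single-level 2-D DWT: low-frequency approximation plus three detail sub-bands.
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")

# An enhancement model could process cA at half resolution with a long-sequence module
# and lightly correct the detail bands, then invert the transform losslessly.
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
assert np.allclose(reconstructed, img, atol=1e-5)    # no information was discarded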



Paperid:276 Poster
Authors:Wuliang Huang,Yiqiang Chen,Xinlong Jiang,Chenlong Gao,Qian Chen,Teng Zhang,Bingjie Yan,Yifan Wang,Jianrong Yang
Abstract:
Multi-modality physiological signal-based emotion recognition has attracted increasing attention due to its capacity to capture human affective states comprehensively. Due to multi-modality heterogeneity and cross-subject divergence, practical applications struggle with generalizing models across individuals. Effectively addressing both issues requires mitigating the gap between multi-modality signals while acquiring generalizable representations across subjects. However, existing approaches often handle these dual challenges separately, resulting in suboptimal generalization. This study introduces a novel framework, termed Correlation-Driven Multi-Modality Graph Decomposition (CMMGD). The proposed CMMGD initially captures adaptive cross-modal correlations to connect each unimodal graph to a multi-modality mixed graph. To simultaneously address the dual challenges, it incorporates a correlation-driven graph decomposition module that decomposes the mixed graph into concordant and discrepant subgraphs based on the correlations. The decomposed concordant subgraph encompasses consistently activated features across modalities and subjects during emotion elicitation, unveiling a generalizable subspace. Additionally, we design a Multi-Modality Graph Regularized Transformer (MGRT) backbone specifically tailored for multimodal physiological signals. The MGRT can alleviate the over-smoothing issue and mitigate over-reliance on any single modality. Extensive experiments demonstrate that CMMGD outperforms the state-of-the-art methods by 1.79% and 2.65% on the DEAP and MAHNOB-HCI datasets, respectively, under the leave-one-subject-out cross-validation strategy.



Paperid:277 Poster
Authors:Muxin Pu,Mei Kuan Lim,Chun Yong Chong
Abstract:
Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, where most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both training and inference phases and capture intricate relationships among different body parts collectively; 3) these methods treat all sign glosses uniformly, failing to account for differences in complexity levels regarding skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism to focus on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous state-of-the-art. For LSA64, we achieve a top-1 accuracy of 99.84%. The artefacts and code related to the study are made publicly available online (The link to the GitHub repository will be revealed upon acceptance of the paper).



Paperid:278 Poster
Authors:Junsheng Wang,Tiantian Gong,Yan Yan
Abstract:
Supervised cross-modal retrieval (CMR) achieves excellent performance thanks to the semantic information provided by its labels, which helps to establish semantic correlations between samples from different modalities. However, in real-world scenarios, there often exists a large amount of unlabeled and unpaired multimodal training data, rendering existing methods unfeasible. To address this issue, we propose a novel partially aligned cross-modal retrieval method called Optimal Transport-based Prototype Alignment Learning (OTPAL). Due to the high computational complexity involved in directly establishing matching correlations between unannotated unaligned cross-modal samples, we instead establish matching correlations between shared prototypes and samples. To be specific, we employ the optimal transport algorithm to establish cross-modal alignment information between samples and prototypes, and then minimize the distance between samples and their corresponding prototypes through a specially designed prototype alignment loss. As an extension of this paper, we also extensively investigate the influence of incomplete multimodal data on cross-modal retrieval performance under the partially aligned setting proposed above. To further address this more challenging scenario, we propose a scalable prototype-based neighbor feature completion method, which better captures the correlations between incomplete samples and neighbor samples through a cross-modal self-attention mechanism. Experimental results on four benchmark datasets show that our method can obtain satisfactory accuracy and scalability in various real-world scenarios.
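The optimal-transport assignment between samples and shared prototypes can be approximated with a standard entropic Sinkhorn iteration, sketched below. The cosine cost, uniform marginals, and hyperparameters are illustrative assumptions rather than OTPAL's exact formulation.

import numpy as np

def sinkhorn_assignment(features, prototypes, epsilon=0.05, n_iters=50):
    # features: (n, d); prototypes: (k, d); returns an (n, k) transport plan.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - f @ p.T                          # cosine distance as transport cost
    K = np.exp(-cost / epsilon)                   # Gibbs kernel
    a = np.full(f.shape[0], 1.0 / f.shape[0])     # uniform mass on samples
    b = np.full(p.shape[0], 1.0 / p.shape[0])     # uniform mass on prototypes
    u = np.ones_like(a)
    for _ in range(n_iters):                      # Sinkhorn-Knopp scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return (u[:, None] * K) * v[None, :]          # soft sample-to-prototype matches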



Paperid:279 Poster
Authors:Le Jiang,Yan Huang,Lianxin Xie,Wen Xue,Cheng Liu,Si Wu,Hau-San Wong
Abstract:
The prevalence of multimedia applications has led to increased concerns and demand for auto face retouching. Face retouching aims to enhance portrait quality by removing blemishes. However, the existing auto-retouching methods rely heavily on a large amount of paired training samples, and perform less satisfactorily when handling complex and unusual blemishes. To address this issue, we propose a Language-guided Blemish Removal Transformer for automatically retouching face images, while at the same time reducing the dependency of the model on paired training data. Our model is referred to as LangBRT, which leverages vision-language pre-training for precise facial blemish removal. Specifically, we design a text-prompted blemish detection module that indicates the regions to be edited. The priors not only enable the transformer network to handle specific blemishes in certain areas, but also reduce the reliance on retouching training data. Further, we adopt a target-aware cross attention mechanism, such that the blemish regions are edited accurately while at the same time maintaining the normal skin regions unchanged. Finally, we adopt a regularization approach to encourage the semantic consistency between the synthesized image and the text description of the desired retouching outcome. Extensive experiments are performed to demonstrate the superior performance of LangBRT over competing auto-retouching methods in terms of dependency on training data, blemish detection accuracy and synthesis quality.



Paperid:280 Poster
Authors:Junliu zhong,Li Zhiyi,Dan Xiang,Maotang Han,Changsheng Li,gan yanfen
Abstract:
Currently, information processing in the spatial domain alone has intrinsic limitations that hinder improvements in deep networks' effectiveness (performance) for single image deraining. Moreover, deraining networks' structures and learning processes are becoming increasingly intricate, leading to challenges in keeping structures lightweight and in training and testing efficiency. We propose a lightweight multi-domain multi-attention progressive network (M2PN) to handle these challenges. For performance improvement, the M2PN backbone applies a simple progressive CNN-based structure consisting of S identical recursive M2PN modules. This recursive backbone with a skip connection mechanism allows for better gradient flow and helps to effectively capture low-to-high-level/scale spatial features in the progressive structure to improve contextual information acquisition. To further complement the acquired spatial information for better deraining, we conduct spectral analysis on the frequency energy distribution of rain streaks, and theoretically present the relationship between the spectral bandwidths and the unique falling characteristics and special morphology of rain streaks. We present the frequency-channel attention (FcA) mechanism and the spatial-channel attention (ScA) mechanism to better fuse frequency-channel features and spatial features to distinguish and remove rain streaks. The simple recursive network structure and effective multi-domain multi-attention mechanism enable M2PN to achieve superior performance and fast convergence during training. Furthermore, the M2PN structure, with a small number of network components, shallow network channels, and few convolutional kernels, requires only 168K parameters, which is 1 to 2 orders of magnitude lower than existing SOTA networks. The experimental results demonstrate that even with so few network parameters, M2PN still achieves the best overall performance.



Paperid:281 Poster
Authors:Pei He,Licheng Jiao,Lingling Li,Xu Liu,Fang Liu,Wenping Ma,Shuyuan Yang,Ronghua Shang
Abstract:
Domain generalization 3D segmentation aims to learn the point clouds with unknown distributions. Feature augmentation has been proven to be effective for domain generalization. However, each point of the 3D segmentation scene contains uncertainty in the target domain, which affects model generalization. This paper proposes the Domain Generalization-Aware Uncertainty Introspective Learning (DGUIL) method, including Potential Uncertainty Modeling (PUM) and Momentum Introspective Learning (MIL), to deal with the point uncertainty in domain shift. Specifically, PUM explores the underlying uncertain point cloud features and generates the different distributions for each point. The PUM enhances the point features over an adaptive range, which provides various information for simulating the distribution of the target domain. Then, MIL is designed to learn generalized feature representation in uncertain distributions. The MIL utilizes uncertainty correlation representation to measure the predicted divergence of knowledge accumulation, which learns to carefully judge and understand divergence through uncertainty introspection loss. Finally, extensive experiments verify the advantages of the proposed method over current state-of-the-art methods. The code will be available.



Paperid:282 Poster
Authors:hangjun Che,Xinyu Pu,Deqiang Ouyang,Beibei Li
Abstract:
Incomplete Multi-View Clustering (IMVC) is a promising topic in multimedia as it breaks the data completeness assumption. Most existing methods solve IMVC from the perspective of graph learning. In contrast, self-representation learning enjoys a superior ability to explore relationships among samples. However, only a few works have explored the potential of self-representation learning in IMVC. These self-representation methods infer missing entries from the perspective of whole samples, resulting in redundant information. In addition, designing an effective strategy to retain salient features while eliminating noise is rarely considered in IMVC. To tackle these issues, we propose a novel self-representation learning method with missing sample recovery and enhanced low-rank tensor regularization. Specifically, the missing samples are inferred by leveraging the local structure of each view, which is constructed from available samples at the feature level. Then an enhanced tensor norm, referred to as the Logarithm-p norm, is devised, which can obtain an accurate cross-view description. Our proposed method achieves exact subspace representation in IMVC by leveraging high-order correlations and inferring missing information at the feature level. Extensive experiments on several widely used multi-view datasets demonstrate the effectiveness of the proposed method.



Paperid:283 Poster
Authors:Ao Li,Huijun Liu,Jinrong Sheng,Zhongming Chen,Yongxin Ge
Abstract:
Weakly-supervised Temporal Action Localization (WTAL) following a localization-by-classification paradigm has achieved significant results, yet still grapples with confounding arising from ambiguous snippets. Previous works have attempted to distinguish these ambiguous snippets from action snippets without investigating the underlying causes of their formation, thus failing to effectively eliminate the bias on both action-context and action-content. In this paper, we revisit WTAL from the perspective of structural causal model to identify the true origins of confounding, and propose an efficient dual-confounding eliminating framework to alleviate these biases. Specifically, we construct a Substituted Confounder Set (SCS) to eliminate the confounding bias on action-content by leveraging the modal disparity between RGB and FLOW. Then, a Multi-level Consistency Mining (MCM) method is designed to mitigate the confounding bias on action-content by utilizing the consistency between discriminative snippets and corresponding proposals at both the feature and label levels. Notably, SCS and MCM could be seamlessly integrated into any two-stream models without additional parameters by Expectation-Maximization (EM) algorithm. Extensive experiments on two challenging benchmarks including THUMOS14 and ActivityNet-1.2 demonstrate the superior performance of our method.



Paperid:284 Poster
Authors:Meichen Liu,Shuting He,Songnan Lin,Bihan Wen
Abstract:
Arbitrary style transfer aims to render artistic features from a style reference onto an image while retaining its original content. Previous methods either focus on learning the holistic style from a specific artist or on extracting instance features from a single artwork. However, they often fail to apply style elements uniformly across the entire image and lack adaptation to the style of different artworks. To solve these issues, our key insight is that the art genre has better generality and adaptability than the overall features of the artist. To this end, we propose a Dual-head Genre-instance Transformer (DGiT) framework to simultaneously capture the genre and instance features for arbitrary style transfer. To the best of our knowledge, this is the first work to integrate genre features and instance features to generate a high-quality stylized image. Moreover, we design two contrastive losses to enhance the capability of the network to capture the two style features. Our approach ensures the uniform distribution of the overall style across the stylized image while enhancing the details of textures and strokes in local regions. Qualitative and quantitative evaluations demonstrate that our approach exhibits superior performance in terms of visual quality and efficiency.



Paperid:285 Poster
Authors:Xiongjun Zhao,Zheng-Yu Liu,Fen Liu,Guanting Li,Yutao Dou,Shaoliang Peng
Abstract:
Despite significant advances in image-text medical visual language modeling, the high cost of fine-grained annotation of images to align radiology reports has led current approaches to focus primarily on semantic alignment between the image and the full report, neglecting the critical diagnostic information contained in the text. This is insufficient in medical scenarios demanding high explainability. To address this problem, in this paper, we introduce radiology reports as images in prompt learning. Specifically, we extract key clinical concepts, lesion locations, and positive labels from easily accessible radiology reports and combine them with an external medical knowledge base to form fine-grained self-supervised signals. Moreover, we propose a novel Report-Concept Textual-Prompt Learning (RC-TPL), which aligns radiology reports at multiple levels. In the inference phase, report-level and concept-level prompts provide rich global and local semantic understanding for X-ray images. Extensive experiments on X-ray image datasets demonstrate the superior performance of our approach with respect to various baselines, especially in the presence of scarce imaging data. Our study not only significantly improves the accuracy of data-constrained medical X-ray diagnosis, but also demonstrates how the integration of domain-specific conceptual knowledge can enhance the explainability of medical image analysis. The implementation code will be publicly available.



Paperid:286 Poster
Authors:Qijie Wang,Liu Guandu,Bin Wang
Abstract:
Recent advances in vision-language foundational models, such as CLIP, have demonstrated significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP necessitates a resource-intensive fine-tuning process. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at bolstering the efficacy of downstream tasks. While these approaches incorporate support sets to maintain data distribution consistency between knowledge cache and test sets, they often fall short in terms of generalization on the test set, particularly when faced with test data exhibiting substantial distributional variations. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter adeptly constructs support sets that closely mirror target distributions, utilizing instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method. Our contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization capabilities.
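The training-free, cache-based prediction pattern that this line of work (TIP-Adapter, SuS-X, CapS-Adapter) builds on can be sketched as follows; the blending weights and the way the support features are obtained (left abstract here) are assumptions, and the sketch does not reproduce CapS-Adapter's caption-based support-set construction.

import numpy as np

def cache_adapter_logits(test_feat, cache_keys, cache_values,
                         zero_shot_logits, alpha=1.0, beta=5.5):
    # test_feat: (n, d) L2-normalized test features; cache_keys: (m, d) L2-normalized
    # support-set features; cache_values: (m, c) one-hot labels; zero_shot_logits: (n, c).
    affinity = test_feat @ cache_keys.T                       # cosine similarity to the cache
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values
    return zero_shot_logits + alpha * cache_logits            # blend with the zero-shot prior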



Paperid:287 Poster
Authors:Shengyin Jiang,Shaoqing Xu,lifang,Li Liu,Ziying Song,Yang Bo,Zhi-Xin Yang
Abstract:
Multi-modal fusion techniques, such as radar and images, enable a complementary and cost-effective perception of the surrounding environment regardless of lighting and weather conditions. However, existing fusion methods for surround-view images and radar are challenged by the inherent noise and positional ambiguity of radar, which leads to significant performance losses. To address this limitation effectively, our paper presents a robust, end-to-end fusion framework dubbed SparseInteraction. First, we introduce the Noisy Radar Filter (NRF) module to extract foreground features by creatively using queried semantic features from the image to filter out noisy radar features. Furthermore, we implement the Sparse Cross-Attention Encoder (SCAE) to effectively blend foreground radar features and image features to address positional ambiguity issues at a sparse level. Ultimately, to facilitate model convergence and performance, the foreground prior queries containing position information of the foreground radar are concatenated with predefined queries and fed into the subsequent transformer-based decoder. The experimental results demonstrate that the proposed fusion strategies markedly enhance detection performance and achieve new state-of-the-art results on the nuScenes benchmark. Source code is available at https://github.com/GG-Bonds/SparseInteraction.



Paperid:288 Poster
Authors:LiQiu Chen,Yuqing Huang,Hengyu li,Zikun Zhou,Zhenyu He
Abstract:
Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGB-T tracking aims to leverage information from both RGB and TIR images for stable and robust tracking. However, existing RGB-T tracking methods often face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in the cross-modal interaction process. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, which is a simplified cross-modal interaction network including modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction with the Vision Transformer (ViT). Specifically, our approach first extracts modality-shared features from RGB and TIR features using a cross-attention mechanism. Furthermore, we design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens.



Paperid:289 Poster
Authors:Ling Huang,Wenqian Dong,Song xiao,Jiahui Qu,Yuanbo Yang,Yunsong Li
Abstract:
Joint classification of multi-modal remote sensing images has achieved great success thanks to complementary advantages of multi-modal images. However, modality absence is a common dilemma in real world caused by imaging conditions, which leads to a breakdown of most classification methods that rely on complete modalities. Existing approaches either learn shared representations or train specific models for each absence case so that they commonly confront the difficulty of balancing the complementary advantages of the modalities and scalability of the absence case. In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification in case of arbitrary modality absence using a unified model that simultaneously considers modality complementarity. It embeds missing modality-specific knowledge into visual prompts to guide the model in capturing complete modal information from available ones for classification. Specifically, a language-guided visual feature decoupling stage (LVFD-stage) is designed to extract shared and specific modal feature from multi-modal images, establishing a complementary representation model of complete modalities. Subsequently, an absence-aware visual prompt compensation stage (VPC-stage) is proposed to learn visual prompts containing missing modality-specific knowledge through cross-modal representation alignment, further guiding the complementary representation model to reconstruct modality-specific features for missing modalities from available ones based on the learned prompts. The proposed VPC-stage entails solely training visual prompts to perceive missing information without retraining the model, facilitating effective scalability to arbitrary modal missing scenarios. Systematic experiments conducted on three public datasets have validated the effectiveness of the proposed approach.



Paperid:290 Poster
Authors:jiade chen,Jin Wang,Yunhui Shi,Nam Ling,Baocai Yin
Abstract:
Point cloud upsampling concerns producing a dense and uniform point set from a sparse and irregular one. Current upsampling methods primarily encounter two challenges: (i) insufficient uni-modal representations of sparse point clouds, and (ii) inaccurate estimation of geometric details in dense point clouds, resulting in suboptimal upsampling results. To tackle these challenges, we propose MVP-Net, a multi-view depth image guided cross-modal detail estimation distillation network for point cloud upsampling, in which the multi-view depth images of point clouds are fully explored to guide upsampling. Firstly, we propose a cross-modal feature extraction module, consisting of two branches designed to extract point features and depth image features separately. This setup aims to produce sufficient cross-modal representations of sparse point clouds. Subsequently, we design a Multi-View Depth Image to Point Feature Fusion (MVP) block to fuse the cross-modal features in a fine-grained and hierarchical manner. The MVP block is incorporated into the feature extraction module. Finally, we introduce a paradigm for multi-view depth image-guided detail estimation and distillation. The teacher network fully utilizes paired multi-view depth images of sparse point clouds and their dense counterparts to formulate multi-hierarchical representations of geometric details, thereby achieving high-fidelity reconstruction. Meanwhile, the student network takes only sparse point clouds and their multi-view depth images as input, and it learns to predict the multi-hierarchical detail representations distilled from the teacher network. Extensive qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art point cloud upsampling methods.



Paperid:291 Poster
Authors:Zhaolin Wan,Qiushuang Yang,Zhiyang Li,Xiaopeng Fan,Wangmeng Zuo,Debin Zhao
Abstract:
The emergence of virtual reality technology has made stereoscopic omnidirectional images (SOI) easily accessible, prompting the need to evaluate their perceptual quality. At present, most stereoscopic omnidirectional image quality assessment (SOIQA) methods rely on one of the projection formats, i.e., Equirectangular Projection (ERP) or CubeMap Projection (CMP). However, while ERP provides global information and the less distorted CMP complements it by providing local structural guidance, research on leveraging both ERP and CMP in SOIQA remains limited, hindering a comprehensive understanding of both global and local visual cues. Motivated by this gap, our study introduces a novel dual-stream perception-driven network for blind quality assessment of stereoscopic omnidirectional images. By integrating both ERP and CMP, our method effectively captures both global and local information, marking the first attempt to bridge this gap in SOIQA, particularly through deep learning methodologies. We employ an inter-intra feature fusion module, which considers both the inter-complementarity between ERP and CMP and the intra-relationships within CMP images. This module dynamically and complementarily adjusts the contributions of features from both projections and effectively integrates them to achieve a more comprehensive perception. Besides, deformable convolution is employed to extract the local region of interest, simulating the orientation selectivity of the primary visual cortex. Finally, with the features of the left and right views of an SOI, a stereo cross attention module that simulates the binocular fusion mechanism is proposed to predict the quality score. Extensive experiments are conducted to evaluate our model and the state-of-the-art competitors, demonstrating that our model achieves the best performance on the LIVE 3D VR, SOLID, and NBU databases.



Paperid:292 Poster
Authors:Mengyin Liu,Chao Zhu,Shiqi Ren,Xu-Cheng Yin
Abstract:
With the proliferation of intelligent surveillance, multiple cameras have been applied to localize pedestrians more accurately. However, previous methods rely on laborious annotations of pedestrians in every frame and camera view. Therefore, we propose in this paper an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to learn an annotation-free detector via vision-language models and 2D-3D cross-modal mapping: 1) Firstly, Semantic-aware Iterative Segmentation (SIS) is proposed to extract unsupervised representations of multi-view images, which are converted into 2D masks as pseudo labels, via our proposed iterative PCA and zero-shot semantic classes from vision-language models; 2) Secondly, we propose the Geometry-aware Volume-based Detector (GVD) to end-to-end encode multi-view 2D images into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D rendering losses with SIS pseudo labels; 3) Thirdly, for better detection results, i.e., the 3D density projected onto the Bird's-Eye-View, we propose Vertical-aware BEV Regularization (VBR) to constrain pedestrians to be vertical like their natural poses. Extensive experiments on the popular multi-view pedestrian detection benchmarks Wildtrack, Terrace, and MultiviewX show that our proposed UMPD, as the first fully-unsupervised method to the best of our knowledge, performs competitively with the previous state-of-the-art supervised methods. Code is available at https://github.com/lmy98129/UMPD.
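As a rough illustration of the PCA-based masking idea (a single pass, without the iterative refinement or the zero-shot semantic classes described above), one could threshold the projection of patch features onto their first principal component:

```python
# Hypothetical single-step PCA mask; the iterative procedure and semantic filtering are omitted.
import numpy as np

def pca_foreground_mask(patch_feats: np.ndarray, h: int, w: int) -> np.ndarray:
    """patch_feats: (h*w, d) patch features of one view; returns a binary (h, w) mask."""
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # First principal component via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    # Simple threshold on the first-PC projection separates foreground from background.
    return (scores > scores.mean()).reshape(h, w)
```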



Paperid:293 Poster
Authors:Jiulin Li,Mengyu Yang,Ye Tian,Lanshan Zhang,Yongchun Lu,Jice Liu,Wendong Wang
Abstract:
Vision-Language Models (VLMs) built on contrastive learning, such as CLIP, demonstrate great transferability and excel in downstream tasks like zero-shot classification and retrieval. To further enhance the performance of VLMs, existing methods have introduced additional parameter modules or fine-tuned VLMs on downstream datasets. However, these methods often fall short in scenarios where labeled data for downstream tasks is either unavailable or insufficient for fine-tuning, and the training of additional parameter modules may considerably impair the existing transferability of VLMs. To alleviate this issue, we introduce WaveDN, a wavelet-based distribution normalization method that can boost the VLMs' performance on downstream tasks without parametric modules or labeled data. Initially, wavelet distributions are extracted from the embeddings of the sampled, unlabeled test samples. Subsequently, WaveDN conducts a hierarchical normalization across the wavelet coefficients of all embeddings, thereby incorporating the distributional characteristics of the test data. Finally, the normalized embeddings are reconstructed via inverse wavelet transformation, facilitating the computation of similarity metrics between the samples. Through extensive experiments on two downstream tasks, using a total of 14 datasets covering text-image and text-audio modal data, WaveDN has demonstrated superiority compared to state-of-the-art methods.
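A minimal sketch of the general idea, assuming a single-level Haar transform and simple per-coefficient standardization (the hierarchical normalization of WaveDN is not reproduced here):

```python
# Illustrative only: single-level wavelet-domain normalization of a batch of embeddings.
import numpy as np
import pywt

def wavelet_normalize(embeddings: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """embeddings: (batch, dim) unlabeled test embeddings from a frozen VLM."""
    # Discrete wavelet transform along the feature dimension.
    cA, cD = pywt.dwt(embeddings, 'haar', axis=-1)
    # Normalize each coefficient with statistics of the test batch, injecting its distribution.
    cA = (cA - cA.mean(axis=0, keepdims=True)) / (cA.std(axis=0, keepdims=True) + eps)
    cD = (cD - cD.mean(axis=0, keepdims=True)) / (cD.std(axis=0, keepdims=True) + eps)
    # Reconstruct the embeddings before computing image-text similarities.
    return pywt.idwt(cA, cD, 'haar', axis=-1)
```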



Paperid:294 Poster
Authors:Shijie Chen,Junbao Zhuo,Xin Li,Haizhuang Liu,Rongquan Wang,Jiansheng Chen,Huimin Ma
Abstract:
LiDAR-based 3D detection, as an essential technique in multimedia applications such as augmented reality and autonomous driving, has made great progress in recent years. However, the performance of a well-trained 3D detector degrades considerably when deployed in unseen environments due to the severe domain gap. Traditional unsupervised domain adaptation methods, including co-training and mean-teacher frameworks, do not effectively bridge the domain gap as they struggle with noisy and incomplete pseudo-labels and the inability to capture domain-invariant features. In this work, we introduce a novel Co-training Mean-Teacher (CMT) framework for unsupervised domain adaptation in 3D object detection. Our framework enhances adaptation by leveraging both source and target domain data to construct a hybrid domain that aligns domain-specific features more effectively. We employ hard instance mining to enrich the target domain feature distribution and utilize class-aware contrastive learning to refine feature representations across domains. Additionally, we develop batch adaptive normalization to fine-tune the batch normalization parameters of the teacher model dynamically, promoting more stable and reliable learning. Extensive experiments across various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our CMT over the state-of-the-art approaches in different adaptation scenarios.



Paperid:295 Poster
Authors:Wei Feng,Zhenwei Wu,Qianqian Wang,Bo Dong,Quanxue Gao
Abstract:
Federated multi-view clustering aims to provide a feasible and effective solution for handling unlabeled data owned by multiple clients. There are two main challenges: 1) The local data is always sensitive, thus preventing any inadvertent data leakage to the server or other clients. 2) Multi-view data contain both consistency and complementarity information, necessitating thorough exploration and utilization of these aspects to achieve enhanced clustering performance. Fully considering the above challenges, in this paper, we propose a novel federated multi-view method named Federated Fuzzy C-Means with Schatten p-Norm Minimization (FFCMSP), which is based on Fuzzy C-Means and the Schatten p-norm. Specifically, we utilize membership degrees to replace the conventional hard clustering assignment in K-means, enabling improved uncertainty handling and less information loss. Moreover, we introduce a Schatten p-norm-based regularizer to fully explore the inter-view complementary information and global spatial structure. We also develop a federated optimization algorithm enabling clients to collaboratively learn the clustering results. Extensive experiments on several datasets demonstrate that our proposed method exhibits superior performance in federated multi-view clustering.
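For readers unfamiliar with fuzzy C-means, the sketch below shows the standard soft membership and center updates that replace hard K-means assignment; the federated aggregation and the Schatten p-norm regularizer of FFCMSP are not shown.

```python
# Standard (non-federated) fuzzy C-means updates, for illustration only.
import numpy as np

def fcm_memberships(X: np.ndarray, centers: np.ndarray, m: float = 2.0) -> np.ndarray:
    """Return an (n_samples, n_clusters) soft membership matrix."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-9
    # u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def fcm_centers(X: np.ndarray, U: np.ndarray, m: float = 2.0) -> np.ndarray:
    """Update cluster centers with fuzzified membership weights."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]
```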



Paperid:296 Poster
Authors:Yao Luo,Ming Yang,Jinhui Tang
Abstract:
Video frame interpolation is a critical component of video streaming, a vibrant research area dealing with the requests of both service providers and users. However, most existing methods cannot handle changing video resolutions while improving user perceptual quality. We aim to unleash the multifaceted knowledge yielded by the hierarchical views at multiple scales in a pyramid network. Specifically, we build a dual-view pyramid network by introducing pyramidal dual-view correspondence matching. It compels each scale to actively seek knowledge in view of both the current scale and a coarser scale, conducting robust correspondence matching by considering neighboring scales. Meanwhile, an auxiliary multi-scale collaborative supervision is devised to enforce the exchange of knowledge between the current scale and a finer scale and thus reduce error propagation from coarse to fine scales. Based on the robust capture of video dynamics by pyramidal dual-view correspondence matching, we further develop a pyramidal refinement module that formulates frame refinement as progressive latent representation generation by developing flow-guided cross-scale attention for feature fusion among neighboring frames. The proposed method achieves favorable performance on several benchmarks of varying video resolutions with better user perceptual quality and a relatively compact model size.



Paperid:297 Poster
Authors:Yuzhen Li,Zehang Deng,Yuxin Cao,Lihua Liu
Abstract:
Previous works have shown that reducing parameter overhead and computations for transformer-based single image super-resolution (SISR) models (e.g., SwinIR) usually leads to a reduction of performance. In this paper, we present GRFormer, an efficient and lightweight method, which not only reduces the parameter overhead and computations, but also greatly improves performance. The core of GRFormer is Grouped Residual Self-Attention (GRSA), which is specifically oriented towards two fundamental components. Firstly, it introduces a novel grouped residual layer (GRL) to replace the QKV linear layer in self-attention, aimed at efficiently reducing parameter overhead, computations, and performance loss at the same time. Secondly, it integrates a compact Exponential-Space Relative Position Bias (ES-RPB) as a substitute for the original relative position bias to improve the ability to represent position information while further minimizing the parameter count. Extensive experimental results demonstrate that GRFormer outperforms state-of-the-art transformer-based methods for x2, x3 and x4 SISR tasks, notably outperforming SOTA by a maximum PSNR of 0.23dB when trained on the DIV2K dataset, while reducing the number of parameters and MACs in the self-attention module alone by about 60% and 49%, respectively. We hope that our simple and effective method, which can be easily applied to SR models based on window-division self-attention, can serve as a useful tool for further research in image super-resolution. The code is available at https://github.com/sisrformer/GRFormer.
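For context, the sketch below builds the standard learnable window relative position bias table that ES-RPB is proposed to replace; it is shown only to make the baseline component concrete and does not implement the exponential-space variant.

```python
# Standard Swin-style relative position bias (the baseline that ES-RPB substitutes).
import torch
import torch.nn as nn

window, num_heads = 8, 6
# Learnable table indexed by relative (dy, dx) offsets inside a window.
bias_table = nn.Parameter(torch.zeros((2 * window - 1) ** 2, num_heads))

coords = torch.stack(torch.meshgrid(torch.arange(window), torch.arange(window), indexing="ij"))
coords = coords.flatten(1)                            # (2, window*window)
rel = coords[:, :, None] - coords[:, None, :]         # (2, N, N) pairwise offsets
rel = rel.permute(1, 2, 0) + window - 1               # shift offsets to be non-negative
index = rel[..., 0] * (2 * window - 1) + rel[..., 1]  # flatten the 2D offset into a table index
bias = bias_table[index.view(-1)].view(window**2, window**2, num_heads)  # added to attention logits
```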



Paperid:298 Poster
Authors:Yihao Wang,Meng Yang,Rui Cao
Abstract:
Addressing the disparity in description granularity and the information gap between images and text has long been a formidable challenge in text-based person retrieval (TBPR) tasks. Recent researchers tried to solve this problem by random local alignment. However, they failed to capture the fine-grained relationships between images and text, so the information and modality gaps remain unresolved. We align image regions and text phrases at the same semantic granularity to address the semantic atomicity gap. Our idea is first to extract and then exploit the relationships between fine-grained locals. We introduce a novel Fine-grained Semantic Alignment with Transferred Person-SAM (SAP-SAM) approach. By distilling and transferring knowledge, we propose a Person-SAM model to extract fine-grained semantic concepts at the same granularity from the images and texts of TBPR, together with their relationships. With the extracted knowledge, we optimize the fine-grained matching via Explicit Local Concept Alignment and Attentive Cross-modal Decoding to discriminate fine-grained image and text features at the same granularity level and represent the important semantic concepts from both modalities, effectively alleviating the granularity and information gaps. We evaluate our proposed approach on three popular TBPR datasets, demonstrating that SAP-SAM achieves state-of-the-art results and underscores the effectiveness of end-to-end fine-grained local alignment in TBPR tasks.



Paperid:299 Poster
Authors:Minghe Gao,Shuang Chen,Liang Pang,Yuan Yao,Jisheng Dang,Wenqiao Zhang,Juncheng Li,Siliang Tang,Yueting Zhuang,Tat-Seng Chua
Abstract:
The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and prone to hallucination. Their ability to execute intricate compositional reasoning tasks is also constrained, culminating in a stagnation of learning progression for these models. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code guaranteeing faithfulness and precision. Subsequently, through a series of operations including pruning, merging, and bridging, the rationale enhances its conciseness. Furthermore, we filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Empirical evidence from experiments demonstrates the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucinations owing to its high correlation between images and text. The anonymous project is available at: https://anonymous.4open.science/r/Fact_program-216D/



Paperid:300 Poster
Authors:Anna Zhu,Ke Xiao,Bo Zhou,Runmin Wang
Abstract:
Inducing linguistic knowledge for scene text recognition (STR) is a new trend that could provide semantics for a performance boost. However, most auto-regressive STR models optimize one-step-ahead prediction (i.e., 1-gram prediction) for the character sequence, which only utilizes the previous semantic context. Most non-auto-regressive models only apply linguistic knowledge individually on the output sequence to refine the results in parallel, which does not fully utilize the visual clues concurrently. In this paper, we propose a novel language-based STR model, called ProphetSTR. It adopts an n-stream self-attention mechanism in the decoder to predict the next characters simultaneously based on the previous predictions at each time step. It can utilize the previous semantic information and the near-future clues, encouraging the model to predict more accurate results. If the prediction results for the same character at successive time steps are inconsistent, we should not trust any of them; otherwise, they are reliable predictions. Therefore, we propose a multi-modality verification module, which masks the unreliable semantic features and takes the visual and trusted semantic features as input simultaneously for masked prediction recovery in parallel. It learns to align different modalities implicitly and considers both visual context and linguistic knowledge, which can generate more reliable results. Furthermore, we propose a multi-scale weight-sharing encoder for multi-granularity image representation. Extensive experiments demonstrate that ProphetSTR achieves state-of-the-art performance on many benchmarks. Further ablative studies prove the effectiveness of our proposed components.



Paperid:301 Poster
Authors:Cunhang Fan,Jingjing Zhang,Hongyu Zhang,Wang Xiang,Jianhua Tao,Xinhui Li,Jiangyan Yi,Dianbo Sui,Zhao Lv
Abstract:
Speaker extraction aims to selectively extract the target speaker from a multi-talker environment under the guidance of an auxiliary reference. Recent studies have shown that the attended speaker's information can be decoded from the listener's brain activity via auditory attention decoding. However, how to more effectively utilize the common information about the target speaker contained in both electroencephalography (EEG) and speech is still an unresolved problem. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to make full use of the speech information, the mixed speech is encoded at multiple time scales to acquire multi-scale embeddings. In addition, to effectively extract the non-Euclidean EEG data, graph convolutional networks are used as the EEG encoder. Finally, these multi-scale embeddings are separately fused with the EEG features. To facilitate research related to auditory attention decoding and further validate the effectiveness of the proposed method, we also construct the AVED dataset, a new EEG-Audio dataset. Experimental results on both the public Cocktail Party dataset and the newly proposed AVED dataset show that our MSFNet model significantly outperforms the state-of-the-art method in certain objective evaluation metrics.



Paperid:302 Poster
Authors:Xueyang Li,Yu Song,Yunzhong Lou,Xiangdong Zhou
Abstract:
Computer-Aided Design (CAD) generative modeling is widely applicable in the fields of industrial engineering. Recently, text-to-3D generation has shown rapid progress in point clouds, meshes, and other non-parametric representations. On the contrary, text-to-3D parametric CAD generative modeling is a practical task that has not been explored well, where a shape can be defined with several editable parametric command sequences. To investigate this, we design an encoder-decoder framework, namely CAD Translator, for incorporating the awareness of parametric CAD sequences into texts appropriately with only one-stage training. We first align texts and parametric CAD sequences via a Cascading Contrastive Strategy in the latent space, and then we propose CT-Mix to conduct the random mask operation on their embeddings separately to further get a fusion embedding via linear interpolation. This can strengthen the connection between texts and parametric CAD sequences effectively. To train CAD Translator, we create a Text2CAD dataset with the help of a Large Multimodal Model (LMM) for this practical task and conduct thorough experiments to demonstrate the effectiveness of our method.



Paperid:303 Poster
Authors:Weiguang Zhang,Qiufeng Wang,Kaizhu Huang,Xiaowei Huang,Fengjun Guo,Xiaomeng Gu
Abstract:
Photographed documents are prevalent but often suffer from deformations like curves or folds, hindering readability. Consequently, document dewarping has been widely studied; however, its performance is still not satisfactory due to the lack of real training samples with pixel-level annotation. To obtain the pixel-level labels, we leverage a document registration pipeline to automatically align warped-flat documents. Unlike general image registration works, registering documents poses unique challenges due to their severe deformations and fine-grained textures. In this paper, we introduce a coarse-to-fine framework including a coarse registration network (CRN) aiming to eliminate severe deformations, followed by a fine registration network (FRN) focusing on fine-grained features. In addition, we utilize self-supervised learning to initialize our document registration model, where we propose a cross-reconstruction pre-training task on pairs of warped-flat documents. Extensive experiments show that we can achieve satisfactory document registration performance, consequently obtaining a high-quality registered document dataset with pixel-level annotation. Without bells and whistles, we re-train two popular document dewarping models on our registered document dataset WarpDoc-R, and obtain performance superior to that of models using almost 100× the scale of synthetic training data, verifying the label quality of our document registration method. The code and pixel-level labels will be released.



Paperid:304 Poster
Authors:Zhongchi Wang,Hailong Sun,Zhengyang Zhao
Abstract:
Federated learning has rapidly gained attention in the industrial sector due to its significant advantages in protecting privacy. However, ensuring the fairness of federated learning models post-deployment presents a challenge in practical applications. Given that clients typically rely on limited private datasets to assess model fairness, this constrains their ability to make accurate judgments about the fairness of the model. To address this issue, we propose an innovative evaluation framework, FedEvalFair, which integrates private data from multiple clients to comprehensively assess the fairness of models in actual deployment without compromising data privacy. Firstly, FedEvalFair draws on the concept of federated learning to achieve a comprehensive assessment while protecting privacy. Secondly, based on the statistical concept of 'estimating the population from the sample', FedEvalFair is capable of estimating the fairness performance of the model in real-world settings from a limited data sample. Thirdly, we have designed a flexible two-stage evaluation strategy based on statistical hypothesis testing. We verified the theoretical performance and sensitivity to fairness variations of FedEvalFair using Monte Carlo simulations, demonstrating the superior performance of its two-stage evaluation strategy. Additionally, we validated the effectiveness of the FedEvalFair method on real-world datasets, including UCI Adult and eICU, and demonstrated its stability in dealing with real-world data distribution changes compared to traditional evaluation methods.
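The flavor of the "estimating the population from the sample" step can be illustrated as follows: each client computes a fairness gap on its private data, and only these statistics are pooled for a hypothesis test at the server. The specific metric, threshold, and single-test design below are assumptions for illustration and are simpler than the actual two-stage FedEvalFair procedure.

```python
# Hypothetical illustration: pooling per-client fairness gaps for a one-sample test.
import numpy as np
from scipy import stats

def client_fairness_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Demographic-parity gap between two protected groups, computed on one client's private data."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def server_test(client_gaps, threshold: float = 0.1, alpha: float = 0.05):
    # H0: the population-level gap does not exceed the (assumed) tolerance threshold.
    t, p = stats.ttest_1samp(client_gaps, popmean=threshold, alternative="greater")
    return {"t_stat": t, "p_value": p, "flag_unfair": p < alpha}
```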



Paperid:305 Poster
Authors:Zhe Ji,Qiansiqi Hu,Yicheng Zheng,Liyao Xiang,Xinbing Wang
Abstract:
Recently, there has been a surge in machine-generated natural language content being misused by unauthorized parties. Watermarking is a well-recognized technique to address this issue by tracing the provenance of the text. However, we found that most existing watermarking systems for text are subject to ad hoc design and thus suffer from fundamental vulnerabilities. We propose a principled design for text watermarking based on a theoretical information-hiding framework. The watermarking party and attacker play a rate-distortion-constrained capacity game to achieve the maximum rate of reliable transmission, i.e., watermark capacity. The capacity can be expressed by the mutual information between the encoding and the attacker's corrupted text, indicating how many watermark bits are effectively conveyed under distortion constraints. The system is realized by a learning-based framework with mutual information neural estimators. In the framework, we adopt the assumption of an omniscient attacker and let the watermarking party pit against the attacker who is fully aware of the watermarking strategy. The watermarking party thus achieves higher robustness against removal attacks. We further show that the incorporation of side information substantially enhances the efficacy and robustness of the watermarking system. Experimental results have shown the superiority of our watermarking system compared to the state-of-the-art in terms of capacity, robustness, and preserving text semantics.
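As a sketch of the mutual-information estimation component mentioned above, the block below implements a MINE-style Donsker-Varadhan lower bound between two feature batches; the network size and how the encoding and corrupted-text features are produced are assumptions, not the paper's architecture.

```python
# MINE-style mutual information lower bound between feature batches x and y (illustrative).
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    def __init__(self, dim_x: int, dim_y: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x + dim_y, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Statistics network evaluated on joint samples and on shuffled (marginal) pairs.
        joint = self.net(torch.cat([x, y], dim=-1)).mean()
        y_shuffled = y[torch.randperm(y.size(0))]
        marg = self.net(torch.cat([x, y_shuffled], dim=-1))
        # Donsker-Varadhan bound: E_P[T] - log E_Q[exp T]; maximizing this estimates I(X; Y).
        return joint - (torch.logsumexp(marg, dim=0) - math.log(y.size(0))).squeeze()
```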



Paperid:306 Poster
Authors:Xinyao Liao,Wei Wei,Dangyang Chen,Yuanyuanfu
Abstract:
Scene Graph Generation (SGG) is a scene understanding task aimed at identifying object entities and elucidating their relationships within a given image. In contrast to prevailing two-stage methods, which typically involve a large object detector (e.g., Faster R-CNN) followed by a separate relation predictor, one-stage methods integrate a fixed-size set of learnable queries to jointly reason about relational triplets. This paradigm demonstrates robust performance with significantly reduced parameters and computational overhead. However, the challenge in one-stage methods stems from the issue of weak entanglement, wherein subjects or objects necessitate not only shared features within triplets but also independent visual appearances. Previous methods primarily utilize multiple decoders for separate visual feature extraction, alongside conditioned queries to model shared information. In this paper, we empirically demonstrate that task-specific queries, rather than distinct decoders, play the pivotal role in generating decoupled visual features. Building upon this premise, we introduce UniQ, a Unified decoder with task-specific Queries architecture, as a novel formulation to streamline and enhance the efficiency of SGG. Besides devising triplet set prediction losses to ensure end-to-end training, UniQ employs queries specific to each sub-task to extract visual features in parallel with shared parameters. Experimental results on the Visual Genome dataset demonstrate that UniQ has superior performance to both one-stage and two-stage methods.



Paperid:307 Poster
Authors:Liupeng Li,Yuhua Zheng,Shupeng Liu,Xiaoyin Xu,Taihao Li
Abstract:
Dynamic facial expression recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video sequences. However, the complex temporal modeling caused by noisy frames, along with the limited training data, seriously hinders the further development of DFER. Previous efforts in this domain have been constrained as they tackled these issues separately. Inspired by the recent success of pretrained language-image models, such as CLIP, in promoting the development of downstream tasks, we propose to leverage CLIP to jointly address the two limitations in DFER. Since the original CLIP model lacks the ability to model temporal relationships and determine the optimal task-related textual prompts, we utilize DFER-specific domain knowledge, including characteristics of temporal correlations and relationships between facial behavior descriptions at different levels, to guide the adaptation of CLIP to DFER. Specifically, we propose enhancements to CLIP's visual encoder through the design of a hierarchical video encoder that captures both short- and long-term temporal correlations in DFER. Meanwhile, we align facial expressions with action units through prior knowledge to construct semantically rich textual prompts, which we further enhance with visual content. Furthermore, we introduce a class-aware consistency regularization mechanism that adaptively filters out noisy frames, bolstering the model's robustness against interference. Extensive experiments on three in-the-wild dynamic facial expression datasets demonstrate that our method outperforms the state-of-the-art DFER approaches.



Paperid:308 Poster
Authors:Chengwei Zhang,Xueyi Zhang,Xianghu Yue,Mingrui Lao,Tao Jiang,Jiawei Wang,Fubo Zhang,Longyong Chen
Abstract:
Point clouds from real-world scenarios inevitably contain complex noise, significantly impairing the accuracy of downstream tasks. To tackle this challenge, the cascading encoder-decoder architecture has become a conventional technical route to iterative denoising. However, circularly feeding the output of the denoiser back as its input involves re-extraction of the underlying surface, leading to an unstable denoising process and over-smoothed geometric details. To address these issues, we propose a novel denoising paradigm dubbed PD-Refiner that employs a single encoder to model the underlying surface. Then, we leverage several lightweight hierarchical Underlying Surface Inheritance Refiners (USIRs) to inherit and strengthen it, thereby avoiding re-extraction from the intermediate point cloud. Furthermore, we design adaptive edge-aware supervision to improve the edge awareness of the USIRs, allowing for the adjustment of the denoising preferences from global structure to local details. The results demonstrate that our method not only achieves state-of-the-art performance in terms of denoising stability and efficacy, but also enhances edge clarity and point cloud uniformity.



Paperid:309 Poster
Authors:zhijun jia,Huaying Xue,Xiulian Peng,Yan Lu
Abstract:
The scarcity of parallel data is the key challenge of the accent conversion (AC) problem, in which both the pronunciation units and prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which the conversion is only operated on the semantic token level and the speech is synthesized conditioned on the converted semantic tokens with a speech generative model in the target accent domain. The decoupling design enables the "speaking" module to use a massive amount of target accent speech and relieves the parallel data required for the "conversion" module. Conversion with the bridge of semantic tokens also relieves the requirement for data with text transcriptions and unlocks the usage of language pre-training technology to further efficiently reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a single-stage AR generative model is designed to achieve good quality as well as lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data which is not constrained to the same speaker. Extensive experimentation with diverse accent types suggests that this framework possesses a high degree of adaptability, making it readily scalable to accommodate other accents with low-resource data. Audio samples are available at https://convert-and-speak.github.io/demo/



Paperid:310 Poster
Authors:Wenzhuo Xu,Kai Chen,Ziyi Gao,Zhipeng Wei,Jingjing Chen,Yu-Gang Jiang
Abstract:
Pre-trained Vision-Language Models (VLMs) have shown great ability in various Vision-Language tasks. However, these VLMs exhibit inherent vulnerabilities to transferable adversarial examples, which could potentially undermine their performance and reliability in real-world applications. Cross-modal interactions have been demonstrated to be the key point for boosting adversarial transferability, but their utilization is limited in existing multimodal transferable adversarial attacks. Stable Diffusion, which contains multiple cross-attention modules, possesses great potential in facilitating adversarial transferability by leveraging abundant cross-modal interactions. Therefore, we propose a Multimodal Diffusion-based Attack (MDA), which conducts adversarial attacks against VLMs using Stable Diffusion. Specifically, MDA initially generates adversarial text, which is subsequently utilized as guidance to optimize the adversarial image during the diffusion process. Besides leveraging the adversarial text in calculating the downstream loss to obtain gradients for optimizing the image, MDA also takes it as the guiding prompt in adversarial image generation during the denoising process, which enriches the ways of cross-modal interaction, thus strengthening the adversarial transferability. Compared with pixel-based attacks, MDA introduces perturbations in the latent space rather than pixel space to manipulate high-level semantics, which is also beneficial to improving adversarial transferability. Experimental results demonstrate that the adversarial examples generated by MDA are highly transferable across different VLMs on different downstream tasks, surpassing state-of-the-art methods by a large margin.



Paperid:311 Poster
Authors:GeunTaek Lim,Hyunwoo Kim,Joonsoo Kim,Yukyung Choi
Abstract:
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos with only video-level annotations. As many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge with vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motion. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in the probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Our code will be available after publication.



Paperid:312 Poster
Authors:liu chuang,Yichao Cao,Haogang Zhu,Xiu Su
Abstract:
In this work, we introduce a novel approach to single-source domain generalization (SDG) in medical imaging, focusing on overcoming the challenge of style variation in out-of-distribution (OOD) domains without requiring domain labels or additional generative models. We propose a \textbf{Uni}versal \textbf{Freq}uency Perturbation framework for \textbf{SDG} termed as \textit{\textbf{UniFreqSDG}}, that performs hierarchical feature-level frequency domain perturbations, facilitating the model's ability to handle diverse OOD styles. Specifically, we design a learnable spectral perturbation module that adaptively learns the frequency distribution range of samples, allowing for precise low-frequency (LF) perturbation. This adaptive approach not only generates stylistically diverse samples but also preserves domain-invariant anatomical features without the need for manual hyperparameter tuning. Then, the frequency features before and after perturbation are decoupled and recombined through the Content Preservation Reconstruction operation, effectively preventing the loss of discriminative content information. Furthermore, we introduce the Active Domain-variance Inducement Loss to encourage effective perturbation in the frequency domain while ensuring the sufficient decoupling of domain-invariant and domain-style features. Extensive experiments demonstrate that \textit{\textbf{UniFreqSDG}} increases the dice score by an average of 7.47% (from 77.98% to 85.45%) on the fundus dataset and 4.99% (from 71.42% to 76.73%) on the prostate dataset compared to the state-of-the-art approaches.
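To give a concrete sense of low-frequency (LF) perturbation, here is a generic image-level Fourier amplitude jitter restricted to the LF band; UniFreqSDG's version is learnable and operates on hierarchical feature maps, so this is only a simplified stand-in.

```python
# Generic low-frequency amplitude perturbation in the Fourier domain (simplified stand-in).
import numpy as np

def perturb_low_freq(img: np.ndarray, strength: float = 0.5, radius: int = 8) -> np.ndarray:
    """img: (H, W) array; jitter the amplitude of the centered low-frequency band only."""
    f = np.fft.fftshift(np.fft.fft2(img))
    amp, phase = np.abs(f), np.angle(f)
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    noise = 1.0 + strength * (np.random.rand(2 * radius, 2 * radius) - 0.5)
    amp[cy - radius:cy + radius, cx - radius:cx + radius] *= noise  # perturb LF amplitudes
    # Recombine with the untouched phase so the underlying structure is largely preserved.
    return np.real(np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase))))
```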



Paperid:313 Poster
Authors:Heng Fang,Sheng Huang,Wenhao Tang,Luwen Huangfu,Bo Liu
Abstract:
Multiple Instance Learning (MIL) represents the predominant framework in Whole Slide Image (WSI) classification, covering aspects such as sub-typing, diagnosis, and beyond. Current MIL models predominantly rely on instance-level features derived from pretrained models such as ResNet. These models segment each WSI into independent patches and extract features from these local patches, leading to a significant loss of global spatial context and restricting the model's focus to merely local features. To address this issue, we propose a novel MIL framework, named SAM-MIL, that emphasizes spatial contextual awareness and explicitly incorporates spatial context by extracting comprehensive, image-level information. The Segment Anything Model (SAM) represents a pioneering visual segmentation foundational model that can capture segmentation features without the need for additional fine-tuning, rendering it an outstanding tool for extracting spatial context directly from raw WSIs. Our approach includes the design of group feature extraction based on spatial context and a SAM-Guided Group Masking strategy to mitigate class imbalance issues. We implement a dynamic mask ratio for different segmentation categories and supplement these with representative group features of categories. Moreover, SAM-MIL divides instances to generate additional pseudo-bags, thereby augmenting the training set, and introduces consistency of spatial context across pseudo-bags to further enhance the model's performance. Experimental results on the CAMELYON-16 and TCGA lung cancer datasets demonstrate that our proposed SAM-MIL model outperforms existing mainstream methods in WSIs classification.



Paperid:314 Poster
Authors:Shengguang Wu,Zhenglun Chen,Qi Su
Abstract:
Ancient artifacts are an important medium for cultural preservation and restoration. However, many physical copies of artifacts are either damaged or lost, leaving a blank space in archaeological and historical studies that calls for artifact image generation techniques. Despite the significant advancements in open-domain text-to-image synthesis, existing approaches fail to capture the important domain knowledge presented in the textual description, resulting in errors in recreated images such as incorrect shapes and patterns. In this paper, we propose a novel knowledge-aware artifact image synthesis approach that brings lost historical objects accurately into their visual forms. We use a pretrained diffusion model as the backbone and introduce three key techniques to enhance the text-to-image generation framework: 1) we construct prompts with explicit archaeological knowledge elicited from large language models (LLMs); 2) we incorporate additional textual guidance to correlated historical expertise in a contrastive manner; 3) we introduce further visual-semantic constraints on edge and perceptual features that enable our model to learn more intricate visual details of the artifacts. Compared to existing approaches, our proposed model produces higher-quality artifact images that align better with the implicit details and historical knowledge contained within written literature.



Paperid:315 Poster
Authors:Kai Han,Jin Wang,Yunhui Shi,Nam Ling,Baocai Yin
Abstract:
Deep unfolding network (DUN) is a powerful technique for image compressive sensing that bridges the gap between optimization methods and deep networks. However, DUNs usually rely heavily on single-domain information, overlooking the inter-domain dependencies. Therefore, such DUNs often face the following challenges: 1) information loss due to the inefficient representation within a single domain, and 2) limited robustness due to the absence of inter-domain dependencies. To overcome these challenges, we propose a deep unfolding framework D$^3$U-Net that establishes a dual-domain collaborative optimization scheme. This framework introduces both visual representations from the image domain and multi-resolution analysis provided by the wavelet domain. Such dual-domain representations constrain the feasible region within the solution space more accurately. Specifically, we design a consistency-difference collaborative mechanism to capture inter-domain dependencies effectively. This mechanism not only enhances the fidelity of reconstruction but also enriches the depth and breadth of extracted features, improving the overall robustness and reconstruction quality. Moreover, we develop an inter-stage transmission pathway to minimize the information loss during transmission while broadcasting multi-scale features in a frequency-adaptive manner. Extensive experimental results on various benchmark datasets show the superior performance of our method.



Paperid:316 Poster
Authors:LIU BO,LU ZEXIN,Yan Wang
Abstract:
Contrastive vision-language pre-training has shown great promise in representation transfer learning and cross-modality learning in the medical field. However, without fully exploiting the intrinsic properties and correlations of multimodal medical data within patient studies, current research fails to explore all the potential of available data, leading to suboptimal performance on representation learning. In this paper, we propose a novel pre-training framework for learning better medical vision-language embedding, oriented on patients' study-level data. Based on the order-agnostic property of radiology report, we adopt a two-stage feature extraction method for more representative textual characterization. Then, by leveraging momentum encoders and memory queues, study-level semantics are explored with three contrastive objectives to provide comprehensive supervision from three perspectives, \textit{i.e.}, cross-modal, multi-modal, and uni-modal, such that the potential information neglected by previous research can be fully exploited. The superiority of the proposed framework is demonstrated by the impressive improvements on four typical downstream tasks, including zero-shot/data-efficient image classification, image segmentation, and cross-modal retrieval. In addition, comprehensive ablation studies and analysis are provided to verify the effectiveness of each component of the framework. The code and pre-trained model will be released upon acceptance.



Paperid:317 Poster
Authors:Guangyao Li,Yajun Jian,Yan Yan,Hanzi Wang
Abstract:
Open-vocabulary multi-object tracking (MOT) aims to track arbitrary objects encountered in the real world beyond the training set. However, recent methods rely solely on instance-level association and identification of novel objects. Such a design may not consider the valuable fine-grained semantic representations of the targets within key and reference frames. In this paper, we propose a Global and Local Awareness open-vocabulary MOT method (GLATrack), which learns to tackle the task of real-world MOT from both global and instance-level perspectives. Specifically, we introduce a region-aware feature enhancement module to refine global knowledge for complementing local target information, which enhances semantic representation and bridges the distribution gap between the image feature map and the pooled regional features. We propose a bidirectional semantic complementarity strategy to mitigate semantic misalignment arising from missing target information in key frames, which dynamically selects valuable information within reference frames to enrich object representation during the knowledge distillation process. Furthermore, we introduce an appearance richness measurement module to provide appropriate representations for targets with different appearances. The proposed method achieves an improvement of 6.9% in TETA and 5.6% in mAP on the challenging large-scale TAO benchmark compared with the state-of-the-art, demonstrating excellent tracking performance in open-world scenarios. The code will be available.



Paperid:318 Poster
Authors:Zejun Li,Ye Wang,Mengfei Du,Qingwen Liu,Binhao Wu,Jiwen Zhang,Chengxing Zhou,Zhihao Fan,Jie Fu,Jingjing Chen,zhongyu wei,Xuanjing Huang
Abstract:
Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from the strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges to automatically assess the free-form text output of LVLMs. To effectively leverage the annotations available and reduce the manual efforts required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Through extensive experiments and analysis in ReForm-Eval, we demonstrate the comprehensiveness and reliability of ReForm-Eval in assessing various LVLMs. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.



Paperid:319 Poster
Authors:Yan Zhuang,Yanru Zhang,Zheng Hu,Xiaoyue Zhang,Jiawen Deng,Fuji Ren
Abstract:
Multimodal Sentiment Analysis (MSA) has witnessed remarkable progress and gained increasing attention in recent decades, thanks to the advancements in deep learning. However, current MSA methodologies primarily rely on global representations extracted from different modalities, such as the mean of $all$ token representations, to construct sophisticated fusion networks. These approaches often overlook the valuable details present in local representations, which consist of fused representations of $several$ consecutive tokens. Additionally, the integration of multiple local representations and the fusion of local and global information present significant challenges. To address these limitations, we propose the Global-Local Modal (GLoMo) Fusion framework. This framework comprises two essential components: (i) modality-specific mixture of experts layers that integrate diverse local representations within each modality, and (ii) a global-guided fusion module that effectively combines global and local representations. The former component leverages specialized expert networks to automatically select and integrate crucial local representations from each modality, while the latter ensures the preservation of global information during the fusion process. We extensively evaluate GLoMo on various datasets, encompassing tasks in multimodal sentiment analysis, multimodal humor detection, and multimodal emotion recognition. Empirical results demonstrate that GLoMo outperforms existing state-of-the-art models, validating the effectiveness of our proposed framework.
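A toy version of a modality-specific mixture-of-experts layer over local representations is sketched below; the expert count, gating, and pooling are illustrative choices rather than GLoMo's actual design.

```python
# Toy mixture-of-experts over local (consecutive-token) representations of one modality.
import torch
import torch.nn as nn

class LocalMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_locals, dim) fused representations of consecutive tokens.
        weights = torch.softmax(self.gate(local_feats), dim=-1)                # (B, L, E)
        outputs = torch.stack([e(local_feats) for e in self.experts], dim=-2)  # (B, L, E, D)
        mixed = (weights.unsqueeze(-1) * outputs).sum(dim=-2)                  # (B, L, D)
        return mixed.mean(dim=1)  # pool the locals into one modality-level vector
```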



Paperid:320 Poster
Authors:Haochen Zhao,Hui Meng,Deqian Yang,Xiexiao zheng,Xiaoze Wu,Qingfeng Li,Jianwei Niu
Abstract:
Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationship between labeled and unlabeled data. To address this issue, we propose a semi-supervised multi-organ segmentation method called GuidedNet, which leverages the knowledge from labeled data to guide the training of unlabeled data. The primary goals of this study are to improve the quality of pseudo-labels for unlabeled data and to enhance the network's learning capability for both small and complex organs. A key concept is that voxel features from labeled and unlabeled data that are close to each other in the feature space are more likely to belong to the same class. On this basis, a 3D Consistent Gaussian Mixture Model (3D-CGMM) is designed to leverage the feature distributions from labeled data to rectify the generated pseudo-labels. Furthermore, we introduce a Knowledge Transfer Cross Pseudo Supervision (KT-CPS) strategy, which leverages the prior knowledge obtained from the labeled data to guide the training of the unlabeled data, thereby improving the segmentation accuracy for both small and complex organs. Extensive experiments on two public datasets, FLARE22 and AMOS, demonstrated that GuidedNet is capable of achieving state-of-the-art performance.
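A heavily simplified sketch of the rectification idea follows, fitting a diagonal Gaussian per class on labeled voxel features and re-assigning pseudo-labels by likelihood; the full 3D-CGMM and the KT-CPS strategy are not reproduced.

```python
# Simplified stand-in for Gaussian-based pseudo-label rectification (diagonal covariance).
import numpy as np

def fit_class_gaussians(labeled_feats: np.ndarray, labels: np.ndarray, num_classes: int):
    """Per-class mean/variance of labeled voxel features with shape (num_voxels, dim)."""
    stats = []
    for c in range(num_classes):
        f = labeled_feats[labels == c]
        stats.append((f.mean(axis=0), f.var(axis=0) + 1e-6))
    return stats

def rectify_pseudo_labels(unlabeled_feats: np.ndarray, stats) -> np.ndarray:
    """Assign each unlabeled voxel to the class whose Gaussian explains it best."""
    log_probs = []
    for mean, var in stats:
        ll = -0.5 * (((unlabeled_feats - mean) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=-1)
        log_probs.append(ll)
    return np.argmax(np.stack(log_probs, axis=-1), axis=-1)
```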



Paperid:321 Poster
Authors:Zhenyu Hou,Junjun Guo
Abstract:
Incorporating domain-specific visual information into text poses one of the critical challenges for domain-specific multi-modal neural machine translation (DMNMT). While most existing DMNMT methods often borrow multi-modal fusion frameworks from multi-modal neural machine translation (MNMT) in the general domain, they overlook the domain gaps between general and specific domains. Visual-to-textual interaction in a specific domain frequently exhibits multi-focus characteristics, making it difficult to consistently focus on domain-specific multi-visual details using traditional multi-modal fusion frameworks. This challenge can lead to a decrease in machine translation performance for domain-specific terms. To tackle this problem, this paper presents a virtual visual scene-guided domain-shadow multi-modal fusion mechanism to simultaneously integrate multi-grained domain visual details and text with the guidance of modality-agnostic virtual visual scene, thereby enhancing machine translation performance for DMNMT, especially for domain terms. Specifically, we first adopt a modality-mixing selection-voting strategy to generate modality-mixed domain-shadow representations through layer-by-layer intra-modality selection and inter-modality exchanging. Then, we gradually aggregate modality-mixed domain representations and text across modality boundaries with the guidance of a modality-agnostic virtual visual scene to enhance the collaboration between domain characteristics and textual semantics. The experimental results on three benchmark datasets demonstrate that our proposed approach outperforms the state-of-the-art (SOTA) methods in all machine translation tasks. The in-depth analysis further highlights the robustness and generalizability of our approach across various scenarios. Additionally, the virtual visual scene generation module showcases robust capabilities in model compression, underscoring its potential for practical applications.



Paperid:322 Poster
Authors:Yi Zhang,Zhefeng Wang,Rui Hu,Xinyu Duan,Yi ZHENG,Baoxing Huai,Jiarun Han,Jitao Sang
Abstract:
Neural networks often tend to rely on bias features that have strong but spurious correlations with the target labels for decision-making, leading to poor performance on data that does not adhere to these correlations. Early debiasing methods typically construct an unbiased optimization objective based on the labels of bias features. Recent work assumes that bias labels are unavailable and usually trains two models: a biased model to deliberately learn bias features for exposing data bias, and a target model to eliminate the bias captured by the biased model. In this paper, we first reveal that previous biased models fit the target labels, which results in their failure to expose data bias. To tackle this issue, we propose poisoner, which utilizes data poisoning to embed the biases learned by biased models into the poisoned training data, thereby encouraging the models to learn more biases. Specifically, we couple data poisoning and model training to continuously prompt the biased model to learn more bias. By utilizing the biased model, we can identify samples in the data that contradict these biased correlations. Subsequently, we amplify the influence of these samples in the training of the target model to prevent the model from learning such biased correlations. Experiments show the superior debiasing performance of our method.



Paperid:323 Poster
Authors:Sa Yan,Nuowen Kan,Chenglin Li,Wenrui Dai,Junni Zou,Hongkai Xiong
Abstract:
Image compression for machine vision exhibits varying rate-accuracy performance across different downstream tasks and content types. Efficient utilization of constrained network resources to achieve optimal overall task performance has thus recently attracted growing attention. In this paper, we propose Tombo, a task-oriented image compression and transmission framework that efficiently identifies the optimal encoding bitrate and routing scheme for multiple image bitstreams delivered simultaneously for different downstream tasks. Specifically, we study the characteristics of image rate-accuracy performance for different machine vision tasks, and formulate the task-oriented joint bitrate and routing optimization problem for multi-bitstreams as a multi-commodity network flow problem with time-expanded network modeling. To ensure consistency between the encoding bitrate and routing optimization, we also propose an augmented network that incorporates the encoding bitrate variables into the routing variables. To improve computational efficiency, we further convert the original optimization problem to a multi-marginal optimal transport problem, and adopt a Sinkhorn iteration-based algorithm to quickly obtain a near-optimal solution. Finally, we adapt Tombo to efficiently deal with the dynamic network scenario where link capacities may fluctuate over time. Empirical evaluations on three typical machine vision tasks and four real-world network topologies demonstrate that Tombo achieves performance comparable to the optimum solved by the off-the-shelf solver Gurobi, with a $5\times \sim 114\times$ speedup.
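For intuition about the solver family, here is the classic two-marginal Sinkhorn iteration for entropy-regularized optimal transport; the paper's problem is multi-marginal and network-constrained, so this only shows the basic alternating-scaling idea.

```python
# Classic two-marginal Sinkhorn iteration (illustration of the alternating scaling idea only).
import numpy as np

def sinkhorn(cost: np.ndarray, a: np.ndarray, b: np.ndarray, reg: float = 0.1, iters: int = 200):
    """Entropy-regularized OT plan between histograms a and b for the given cost matrix."""
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # scale columns to match marginal b
        u = a / (K @ v)                  # scale rows to match marginal a
    return u[:, None] * K * v[None, :]   # transport plan with the prescribed marginals
```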



Paperid:324 Poster
Authors:Jiali Chen,Yi Cai,Ruohang Xu,Jiexin Wang,Jiayuan Xie,Qing Li
Abstract:
With the increasing popularity of online social applications, stickers have become common in online chats. Teaching a model to select the appropriate sticker from a set of candidate stickers based on dialogue context is important for optimizing the user experience. Existing methods have proposed leveraging emotional information to facilitate the selection of appropriate stickers. However, considering the frequent co-occurrence among sticker images, words with emotional preference in the dialogue and emotion labels, these methods tend to over-rely on such dataset bias, inducing spurious correlations during training. As a result, these methods may select inappropriate stickers that do not match users' intended expression. In this paper, we introduce a causal graph to explicitly identify the spurious correlations in the sticker selection task. Building upon the analysis, we propose a Causal Knowledge-Enhanced Sticker Selection model to mitigate spurious correlations. Specifically, we design a knowledge-enhanced emotional utterance extractor to identify emotional information within dialogues. Then an interventional visual feature extractor is employed to obtain unbiased visual features, aligning them with the emotional utterances representation. Finally, a standard transformer encoder fuses the multimodal information for emotion recognition and sticker selection. Extensive experiments on the MOD dataset show that our CKS model significantly outperforms the baseline models.



Paperid:325 Poster
Authors:JiaQi Wang,Lu Lu,Mingmin Chi,Jian Chen
Abstract:
The effectiveness of contrastive-learning-based Knowledge Distillation (KD) has sparked renewed interest in relational distillation, but these methods typically focus on angle-wise information from the penultimate layer. We show that exploiting relational information derived from intermediate layers further improves the effectiveness of distillation. We also find that adding distance-wise relational information to contrastive-learning-based methods negatively impacts distillation quality, revealing an implicit contention between angle-wise and distance-wise attributes. Therefore, we propose a ${\bf{M}}$ulti-stage ${\bf{D}}$ecoupled ${\bf{R}}$elational (MDR) KD framework equipped with adaptive stage selection to identify the stages that maximize the efficacy of transferring relational knowledge. The MDR framework decouples angle-wise and distance-wise information to resolve their conflicts while still preserving complete relational knowledge, thereby resulting in elevated transfer efficiency and distillation quality. To evaluate the proposed method, we conduct extensive experiments on multiple image benchmarks ($\textit{i.e.}$ CIFAR-100, ImageNet and Pascal VOC), covering various tasks ($\textit{i.e.}$ classification, few-shot learning, transfer learning and object detection). Our method exhibits superior performance under diverse scenarios, surpassing the state of the art by an average improvement of 1.22% on CIFAR-100 across extensively utilized teacher-student network pairs.
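The decoupling can be made concrete with the two classic relational potentials from relational KD, computed separately so each can be weighted (or dropped) per stage; stage selection and the contrastive objective of MDR are omitted, and the exact loss forms here are assumptions.

```python
# Distance-wise and angle-wise relational losses kept separate (in the style of relational KD).
import torch
import torch.nn.functional as F

def _normalized_pairwise_dist(x: torch.Tensor) -> torch.Tensor:
    d = torch.cdist(x, x, p=2)
    return d / (d[d > 0].mean() + 1e-8)          # normalize by the mean non-zero distance

def distance_wise_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    return F.smooth_l1_loss(_normalized_pairwise_dist(student), _normalized_pairwise_dist(teacher))

def angle_wise_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    def angles(x: torch.Tensor) -> torch.Tensor:
        diff = F.normalize(x.unsqueeze(0) - x.unsqueeze(1), p=2, dim=-1)  # (N, N, D) unit difference vectors
        return torch.bmm(diff, diff.transpose(1, 2))                       # cosines of anchored angles
    return F.smooth_l1_loss(angles(student), angles(teacher))
```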



Paperid:326 Poster
Authors:Xiaze Zhang,Ziheng Ding,Qi Jing,Ying Cheng,Wenchao Ding,Rui Feng
Abstract:
Simultaneous Localization and Mapping (SLAM) plays a pivotal role in autonomous driving and robotics. Given the complexity of road environments, there is a growing research emphasis on developing robust and accurate multi-modal SLAM systems. Existing methods often rely on hand-crafted feature extraction and cross-modal fusion techniques, resulting in limited feature representation capability and reduced flexibility and robustness. To address this challenge, we introduce DeepPointMap2, a novel learning-based LiDAR-Visual SLAM architecture that leverages neural descriptors to tackle multiple SLAM subtasks in a unified manner. Our approach employs neural networks to extract multi-modal feature tokens, which are then adaptively fused by the Visual-Point Fusion Module to generate sparse neural 3D descriptors, ensuring precise localization and robust performance. As a pioneering work, our method achieves state-of-the-art localization performance among various Visual-based, LiDAR-based, and Visual-LiDAR-based methods in widely used benchmarks, as shown in the experimental results. Furthermore, the approach proves to be robust in scenarios involving camera failure and LiDAR obstruction.



Paperid:327 Poster
Authors:Zizhao Wu,Haohan Li,Gongyi Chen,Zhou Yu,Xiaoling Gu,Yigang Wang
Abstract:
3DQA has gained considerable attention due to its enhanced spatial understanding capabilities compared to image-based VQA. However, existing 3DQA methods have explicitly focused on integrating text and color-coded point cloud features, thereby overlooking the rich high-level semantic relationships among objects. In this paper, we propose a novel graph-based 3DQA method termed 3DGraphQA, which leverages scene graph reasoning to enhance the ability to handle complex reasoning tasks in 3DQA and offers stronger interpretability. Specifically, our method first adaptively constructs dynamic scene graphs for the 3DQA task. Then we inject both the situation and the question inputs into the scene graph, forming the situation-graph and the question-graph, respectively. Based on the constructed graphs, we finally perform intra- and inter-graph feature propagation for efficient graph inference: intra-graph feature propagation is performed based on a Graph Transformer in each graph to realize single-modal contextual interaction and high-order contextual interaction; inter-graph feature propagation is performed among graphs based on bilinear graph networks to realize the interaction between different contexts of situations and questions. Drawing on these intra- and inter-graph feature propagations, our approach is poised to better grasp the intricate semantic and spatial relationships among objects within the scene and their relations to the questions, thereby facilitating reasoning over complex and compositional questions. We validate the effectiveness of our approach on the ScanQA and SQA3D datasets, and expand the SQA3D dataset to SQA3D Pro with multi-view information, making it more suitable for our approach. Experimental results demonstrate that our 3DGraphQA outperforms existing methods.



Paperid:328 Poster
Authors:Jiaqi Guo,Lianli Gao,Junchen Zhu,JiaxinZhang,Siyang Li,Jingkuan Song
Abstract:
Visual effects synthesis is crucial in the film and television industry, which aims at enhancing raw footage with virtual elements for greater expressiveness. As the demand for detailed and realistic effects escalates in modern production, professionals are compelled to allocate substantial time and resources to this endeavor. Thus, there is an urgent need to explore more convenient and less resource-intensive methods, such as incorporating the burgeoning Artificial Intelligence Generated Content (AIGC) technology. However, research into this potential integration has yet to be conducted. As the first work to establish a connection between visual effects synthesis and AIGC technology, we start by carefully setting up two paradigms according to the need for pre-produced effects or not: synthesis with reference effects and synthesis without reference effects. Following this, we compile a dataset by processing a collection of effects videos and scene videos, which contains a wide variety of effect categories and scenarios, adequately covering the common effects seen in the film and television industry. Furthermore, we explore the capabilities of a pre-trained text-to-video model to synthesize visual effects within these two paradigms. The experimental results demonstrate that the pipeline we established can effectively produce impressive visual effects synthesis outcomes, thereby evidencing the significant potential of existing AIGC technology for application in visual effects synthesis tasks.



Paperid:329 Poster
Authors:Geng Tu,Feng Xiong,Bin Liang,Hui Wang,Xi Zeng,Ruifeng Xu
Abstract:
Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts focus on modeling speaker-sensitive context dependencies and multimodal fusion. Despite the progress, the reliability of MERC methods remains largely unexplored. Extensive empirical studies reveal that current methods suffer from unreliable predictive confidence. Specifically, in some cases, the confidence estimated by these models increases when a modality or specific contextual cues are corrupted; we define these as uncertain samples. This contradicts the foundational principle in informatics, namely, the elimination of uncertainty. Based on this, we propose a novel calibration framework, CMERC, to calibrate MERC models without altering the model structure. It integrates curriculum learning to guide the model in progressively learning more uncertain samples; hybrid supervised contrastive learning to refine utterance representations by pulling uncertain samples and others apart; and a confidence constraint to penalize the model on uncertain samples. Experimental results on two datasets show that CMERC significantly enhances the reliability and generalization capabilities of various MERC models, surpassing the state-of-the-art methods.



Paperid:330 Poster
Authors:Jongbhin Woo,Hyeonggon Ryu,Youngjoon Jang,Jae Won Cho,Joon Son Chung
Abstract:
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, (2) cross-modal alignment loss to learn the fine-grained correlation between query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art approaches in VTG benchmarks, indicating that holistic text understanding guides the model to focus on the semantically important parts within the video.
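A minimal sketch of the kind of frame-level gating described above is shown below: each visual frame feature is modulated by a scalar gate predicted from that frame together with a holistic sentence embedding. The module name, layer sizes, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HolisticTextGate(nn.Module):
    """Gate visual frame features with a holistic sentence embedding.

    A minimal sketch: each frame receives a scalar gate in (0, 1) predicted
    from the frame feature concatenated with the pooled query embedding,
    so frames irrelevant to the query's global meaning are suppressed.
    """
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, frames, sentence):
        # frames: (B, T, D) frame features; sentence: (B, D) holistic query embedding
        sent = sentence.unsqueeze(1).expand_as(frames)
        g = self.gate(torch.cat([frames, sent], dim=-1))   # (B, T, 1) per-frame gates
        return frames * g, g

frames, sentence = torch.randn(2, 16, 256), torch.randn(2, 256)
gated, gates = HolisticTextGate(256)(frames, sentence)
```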



Paperid:331 Poster
Authors:Ziwei Zheng,Zechuan Zhang,Yulin Wang,Shiji Song,Gao Huang,Le Yang
Abstract:
Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing and summarization. In this paper, we demonstrate that state-of-the-art GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the common design of GEBD models using image-domain backbones can contain plenty of architecture redundancy, motivating us to gradually “modernize” each component to enhance efficiency. Thirdly, we show that GEBD models using image-domain backbones and conducting spatiotemporal learning in a spatial-then-temporal greedy manner can suffer from a distraction issue, which might be the main source of inefficiency in GEBD. Using a video-domain backbone to jointly conduct spatiotemporal modeling for GEBD is an effective solution to this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods with up to 1.7% performance growth and a 280% practical speedup under the same backbone choice. Our research prompts the community to design modern GEBD methods with consideration of model complexity, particularly in resource-aware applications. The code is available at https://github.com/anonymous.



Paperid:332 Poster
Authors:Junfeng Yang,JingFu,Zhen Zhang,Limei Liu,Qin Li,wei zhang,Wenzhi Cao
Abstract:
The alignment of the image quality assessment (IQA) model with diverse human preferences remains a challenge, owing to the variability in preferences for different types of visual content, including user-generated and AI-generated content (AIGC), etc. Despite the significant success of existing IQA methods in assessing specific visual content by leveraging knowledge from pre-trained models, the intricate factors impacting final ratings and the specially designed network architecture of these methods result in gaps in their ability to accurately capture human preferences for novel visual content. To address this issue, we propose Align-IQA, a novel framework that aims to generate visual quality scores aligned with diverse human preferences for different types of visual content. Align-IQA contains two key designs: (1) A customizable quality-aware guidance injection module. By injecting specializable quality-aware prior knowledge into general-purpose pre-trained models, the proposed module guides the acquisition of quality-aware features and allows for different adjustments of features to be consistent with diverse human preferences for various types of visual content. (2) A multi-scale feature aggregation module. By simulating the multi-scale mechanism in the human visual system, the proposed module enables the extraction of a more comprehensive representation of quality-aware features from the human perception perspective. Extensive experimental results demonstrate that Align-IQA achieves comparable or better performance than SOTA methods. Notably, Align-IQA outperforms the previous best results on AIGC datasets, achieving PLCC of 0.890 (+3.73%) and 0.924 (+1.99%) on AGIQA-1K and AGIQA-3K. Additionally, Align-IQA reduces training parameters by 72.26% and inference overhead by 78.12% while maintaining SOTA performance.



Paperid:333 Poster
Authors:Shuai Li,Fan Qi,Zixin Zhang,Changsheng Xu
Abstract:
In the evolving landscape of federated learning (FL), the integration of multimodal data presents both unprecedented opportunities and significant challenges. Existing work falls short of meeting the growing demand for systems that can efficiently handle diverse tasks and modalities in rapidly changing environments. We propose a meta-learning strategy tailored for Multimodal Federated Learning (MFL) in a multitask setting, which harmonizes intra-modal and inter-modal feature spaces through the Cross-Modal Meta Consensus. This innovative approach enables seamless integration and transfer of knowledge across different data types, enhancing task personalization within modalities and facilitating effective cross-modality knowledge sharing. Additionally, we introduce Gradient Consistency-based Clustering for multimodal convergence, specifically designed to resolve conflicts at meta-initialization points arising from diverse modality distributions, supported by theoretical guarantees. Our approach, evaluated as $M^{3}Fed$ on five federated datasets, with at most four modalities and four downstream tasks, demonstrates strong performance across diverse data distributions, affirming its effectiveness in multimodal federated learning. The code is available at https://anonymous.4open.science/r/M3Fed-44DB.



Paperid:334 Poster
Authors:Zhaoyu Zhang,Yang Hua,Guanxiong Sun,Hui Wang,Seán F. McLoone
Abstract:
Recently, many studies have highlighted that training Generative Adversarial Networks (GANs) with limited data suffers from the overfitting of the discriminator ($D$). Existing studies mitigate the overfitting of $D$ by employing data augmentation, model regularization, or pre-trained models. Despite the success of existing methods in training GANs with limited data, noise injection is another plausible, complementary, yet not well-explored approach to alleviate the overfitting issue of $D$. In this paper, we propose a simple yet effective method called Dual Adaptive Noise Injection (DANI) to further improve the training of GANs with limited data. Specifically, DANI consists of two adaptive strategies: adaptive injection probability and adaptive noise strength. For the adaptive injection probability, Gaussian noise is injected into both real and fake images for the generator ($G$) and $D$ with a probability $p$, respectively, where the probability $p$ is controlled by the overfitting degree of $D$. For the adaptive noise strength, the Gaussian noise is produced by applying the adaptive forward diffusion process to both real and fake images, respectively. As a result, DANI can effectively increase the overlap between the distributions of real and fake data during training, thus alleviating the overfitting issue of $D$. Extensive experiments on several commonly-used datasets with both StyleGAN2 and FastGAN backbones demonstrate that DANI can further improve the training of GANs with limited data and achieve state-of-the-art results compared with other methods.
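The sketch below illustrates the two adaptive strategies at a high level: an injection probability updated from a discriminator-overfitting signal, and noise produced by a forward-diffusion-style mixing. The overfitting heuristic (mean sign of $D$'s logits on real images, in the style of ADA-like adaptive augmentation) and the schedule are stand-ins; the paper's exact update rules are not reproduced here.

```python
import torch

class AdaptiveNoiseInjector:
    """Sketch of dual adaptive noise injection for limited-data GAN training.

    Assumptions: the overfitting signal r_d is the mean sign of D's logits on
    real images (an ADA-style heuristic, not necessarily the paper's rule),
    and noise strength follows a simple forward-diffusion-like mixing coefficient.
    """
    def __init__(self, target=0.6, step=0.01):
        self.p, self.target, self.step = 0.0, target, step

    def update(self, d_logits_real):
        r_d = d_logits_real.sign().mean().item()      # approaches 1 when D overfits real data
        self.p = float(min(max(self.p + self.step * (r_d - self.target), 0.0), 1.0))

    def inject(self, images, t=0.1):
        # Mix images with Gaussian noise (alpha from a toy diffusion schedule),
        # applied per sample with probability p.
        alpha = torch.tensor(1.0 - t)
        noisy = alpha.sqrt() * images + (1 - alpha).sqrt() * torch.randn_like(images)
        mask = (torch.rand(images.size(0), 1, 1, 1) < self.p).float()
        return mask * noisy + (1 - mask) * images

inj = AdaptiveNoiseInjector()
inj.update(torch.randn(8))                 # discriminator logits on a real batch
x = inj.inject(torch.randn(8, 3, 64, 64))  # apply to real (or fake) images
```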



Paperid:335 Poster
Authors:Qi Zhang,Chi Huang,Qian Zhang,Nan Li,Wei Feng
Abstract:
The recent progress in novel view synthesis is attributed to the Neural Radiance Field (NeRF), which requires plenty of images with precise camera poses. However, collecting available dense input images with accurate camera poses is a formidable challenge in real-world scenarios. In this paper, we propose Learning Geometry Consistent Neural Radiance Field (GC-NeRF) to tackle this challenge by jointly optimizing a NeRF and camera poses under sparse (as low as 2) and unposed images. First, GC-NeRF establishes geometric consistencies at the image level, which produce photometric constraints from inter- and intra-views for updating the NeRF and camera poses in a fine-grained manner. Second, we adopt geometry projection with camera extrinsic parameters to present region-level consistency supervision, which constructs pseudo-pixel labels for capturing critical matching correlations. Moreover, GC-NeRF presents an adaptive high-frequency mapping function to augment the geometry and texture information of the 3D scene. We evaluate the effectiveness of GC-NeRF, which sets a new state of the art in the sparse-view jointly optimized regime on multiple challenging real-world datasets.



Paperid:336 Poster
Authors:Yang Luo,Yiheng Zhang,Zhaofan Qiu,Ting Yao,Zhineng Chen,Yu-Gang Jiang,Tao Mei
Abstract:
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, would significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images nevertheless is not trivial and necessitates delicately enriching plentiful details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that first adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to regions with higher frequency to preserve the high-frequency patterns (e.g., edge, corner) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms the state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution of Magnific AI.
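As a rough illustration of frequency-aware noising, the sketch below estimates a per-pixel high-frequency map with a Laplacian filter and shrinks the Gaussian noise scale where that response is large; the filter choice, scaling factor, and value ranges are assumptions, not FreeEnhance's actual noising schedule.

```python
import torch
import torch.nn.functional as F

def frequency_aware_noise(img, base_sigma=0.2):
    """Add lighter Gaussian noise to high-frequency regions of an image.

    img: (B, C, H, W) in [0, 1]. A Laplacian response on the grayscale image
    serves as a simple high-frequency estimate; the per-pixel noise scale
    shrinks where the response (edges, corners) is large.
    """
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    gray = img.mean(dim=1, keepdim=True)
    hf = F.conv2d(gray, lap, padding=1).abs()
    hf = hf / (hf.amax(dim=(2, 3), keepdim=True) + 1e-8)   # normalize to [0, 1]
    sigma = base_sigma * (1.0 - 0.5 * hf)                   # lighter noise on high-frequency pixels
    return img + sigma * torch.randn_like(img)

noisy = frequency_aware_noise(torch.rand(1, 3, 128, 128))
```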



Paperid:337 Poster
Authors:Zhilin He,Yawei Zhang,Jingchang Mu,Xiaoyue Gu,Tianhao Gu
Abstract:
Facing two significant challenges in monocular depth estimation under a lightweight network, namely the preservation of detail information and the reduction of artifacts in the predicted depth maps, this paper proposes a self-supervised monocular depth estimation framework, called LiteGfm. It contains a DepthNet with an Anti-Artifact Guided (AAG) module and a PoseNet. In the AAG module, a Guided Image Filtering with cross-detail masking is first designed to filter the input features of the decoder for preserving comprehensive detail information. Second, a filter kernel generator is proposed to decompose the Sobel operator along the vertical and horizontal axes for achieving cross-detail masking, which better captures structure and edge features for minimizing artifacts. Furthermore, a boundary-aware loss between the reconstructed and input images is presented to preserve high-frequency details for decreasing artifacts. Extensive experimental results demonstrate that LiteGfm, with under 1.9M parameters, achieves better performance than state-of-the-art methods.
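The Sobel decomposition mentioned above rests on the fact that each 3x3 Sobel kernel factors into an outer product of a 1D smoothing kernel and a 1D derivative kernel, so the filtering can be applied as two 1D convolutions along the vertical and horizontal axes. The sketch below demonstrates that factorization only; it is not the AAG module's filter kernel generator, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Each 3x3 Sobel operator is the outer product of a 1D smoothing kernel and
# a 1D derivative kernel, so it can be applied as two 1D convolutions.
smooth = torch.tensor([1., 2., 1.])
deriv  = torch.tensor([-1., 0., 1.])
sobel_x = torch.outer(smooth, deriv)   # horizontal-gradient kernel
sobel_y = torch.outer(deriv, smooth)   # vertical-gradient kernel

def separable_sobel(img):
    """img: (B, 1, H, W); returns horizontal and vertical gradient responses."""
    gx = F.conv2d(img, smooth.view(1, 1, 3, 1), padding=(1, 0))   # vertical smoothing
    gx = F.conv2d(gx, deriv.view(1, 1, 1, 3), padding=(0, 1))     # horizontal derivative
    gy = F.conv2d(img, deriv.view(1, 1, 3, 1), padding=(1, 0))    # vertical derivative
    gy = F.conv2d(gy, smooth.view(1, 1, 1, 3), padding=(0, 1))    # horizontal smoothing
    return gx, gy

img = torch.rand(1, 1, 64, 64)
gx, gy = separable_sobel(img)
full_gx = F.conv2d(img, sobel_x.view(1, 1, 3, 3), padding=1)
assert torch.allclose(gx, full_gx, atol=1e-5)   # separable result equals full 2D Sobel
edge_map = gx.abs() + gy.abs()                  # a simple cross-detail map
```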



Paperid:338 Poster
Authors:Xiang He,Liuxiangxi,Yang Li,Dongcheng Zhao,Guobin Shen,Qingqun Kong,Xin Yang,Yi Zeng
Abstract:
The audio-visual event localization task requires identifying concurrent visual and auditory events from unconstrained videos within a network model, locating them, and classifying their category. The efficient extraction and integration of audio and visual modal information have always been challenging in this field. In this paper, we introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information. We propose an audio-visual co-guidance attention mechanism that allows for adaptive cross-modal attention guidance between audio and visual information, reducing inconsistencies between modalities. Moreover, we consider that existing methods have difficulty distinguishing between similar backgrounds and events and lack the fine-grained features needed for event classification. Consequently, we employ background-event contrast enhancement for fused feature discrimination and fine-tuning of pre-trained models to extract more refined and discernible features from complex multimodal inputs. Specifically, we have enhanced the model's ability to discern subtle differences between events and backgrounds and improved the accuracy of event classification. Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task, proving the effectiveness of our proposed methods in handling the complexities of multimodal learning and event localization in unconstrained video content.



Paperid:339 Poster
Authors:Bin Huang,Feng He,Qi Wang,Hong Chen,Guohao Li,Zhifan Feng,Xin Wang,Wenwu Zhu
Abstract:
Sampling strategies have been widely adopted in Vision-Language Pre-training (VLP) and have achieved great success recently. However, the sampling strategies adopted by current VLP works are limited in two ways: i) they only focus on negative sampling, ignoring the importance of more informative positive samples; ii) their sampling strategies are conducted at the local in-batch level, which may lead to sub-optimal results. To tackle these problems, in this paper, we propose a Global Positive-Negative Sampling (GPN-S) framework for vision-language pre-training, which conducts both positive and negative sampling at the global level, grounded on the notion of neighborhood relationships. Specifically, our proposed GPN-S framework is capable of utilizing positive sampling to bring semantically equivalent samples closer, as well as employing negative sampling to push challenging negative samples farther away. We jointly consider them for vision-language pre-training from a global-level perspective rather than a local-level mini-batch, which provides more informative and diverse samples. We evaluate the effectiveness of the proposed GPN-S framework by conducting experiments on several common downstream tasks, and the results demonstrate significant performance improvement over the existing models.



Paperid:340 Poster
Authors:Tianjiao Wan,Kele Xu,Long Lan,Zijian Gao,Feng Dawei,Bo Ding,Huaimin Wang
Abstract:
Active learning (AL) aims to select highly informative data points from an unlabeled dataset for annotation, mitigating the need for extensive human labeling effort. However, classical AL methods heavily rely on human expertise to design the sampling strategy, inducing limited scalability and generalizability. Many efforts have sought to address this limitation by directly connecting sample selection with model performance improvement, typically through the influence function. Nevertheless, these approaches often ignore the dynamic nature of model behavior during training optimization, despite empirical evidence highlighting the importance of dynamic influence for tracking sample contributions. This oversight can lead to suboptimal selection, hindering the generalizability of the model. In this study, we explore a dynamic-influence-based data selection strategy by tracing the impact of unlabeled instances on model performance throughout the training process. Our theoretical analyses suggest that selecting samples with higher projected gradients along the accumulated optimization direction at each checkpoint leads to improved performance. Furthermore, to capture a wider range of training dynamics without incurring excessive computational or memory costs, we introduce an additional dynamic loss term designed to encapsulate more generalized training progress information. These insights are integrated into a universal and task-agnostic AL framework termed Dynamic Influence Scoring for Active Learning (DISAL). Comprehensive experiments across various tasks have demonstrated that DISAL significantly surpasses existing state-of-the-art AL methods, demonstrating its ability to facilitate more efficient and effective learning in different domains.
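A simplified version of the projected-gradient scoring idea is sketched below: each unlabeled candidate is scored by the inner product between its (pseudo-labeled) loss gradient and a flattened accumulated optimization direction, e.g. the parameter difference between two checkpoints. This illustrates the selection criterion only; the dynamic loss term and the full DISAL framework are not reproduced, and the model, pseudo-labels, and direction are placeholders.

```python
import torch
import torch.nn.functional as F

def gradient_projection_scores(model, candidates, accumulated_direction):
    """Score unlabeled samples by the projection of their loss gradient onto
    the accumulated optimization direction (a flattened parameter-delta vector).

    Pseudo-labels (the model's own argmax) stand in for the unknown labels.
    """
    scores = []
    for x in candidates:                       # per-sample loop for per-sample gradients
        logits = model(x.unsqueeze(0))
        pseudo = logits.argmax(dim=1)
        loss = F.cross_entropy(logits, pseudo)
        grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
        g = torch.cat([gi.reshape(-1) for gi in grads])
        scores.append(torch.dot(g, accumulated_direction).item())
    return scores                              # select the highest-scoring samples

# Usage sketch: `direction` could be the flattened difference between the current
# parameters and those of a previous checkpoint.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
direction = torch.randn(sum(p.numel() for p in model.parameters()))
scores = gradient_projection_scores(model, torch.randn(4, 3, 32, 32), direction)
```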



Paperid:341 Poster
Authors:Huanhuan Zhang,Zhuo Liu,Haotian Li,Anfu Zhou,Chuanming Wang,Huadong Ma
Abstract:
Optimizing user Quality of Experience (QoE) for live video streaming remains a long-standing challenge. The Bitrate Control Algorithm (BCA) plays a crucial role in shaping user QoE. Recent advancements have seen RL-based algorithms overtake traditional rule-based methods, promising enhanced QoE optimization. Nevertheless, our comprehensive study reveals a pressing issue: current RL-based BCAs are limited to fixed and formulaic reward functions, rendering them ill-equipped to adapt to dynamic network environments and varied viewer preferences. In this work, we present AraLive, an automatically adaptive reward learning method designed for seamless integration with any existing learning-based approach in live streaming contexts. To accomplish this goal, we construct a dedicated user QoE assessment dataset for live streaming and custom-design an adversarial model that skillfully aligns human feedback with actual network scenarios. We have deployed AraLive in not only live streaming but also classic VoD systems, in comparison to a series of state-of-the-art BCAs. The experimental results demonstrate that AraLive not only elevates overall QoE but also exhibits remarkable adaptability to varied user preferences.



Paperid:342 Poster
Authors:Ning Xu,Yifei Gao,Tingting Zhang,Hongshuo Tian,Anan Liu
Abstract:
News Captioning involves generating descriptions for news images based on the detailed content of related news articles. Given that these articles often contain extensive information not directly related to the image, captions may end up misaligned with the visual content. To mitigate this issue, we propose a novel cross-modal coherence-enhanced feedback prompting method to clarify the crucial elements that align closely with the visual content for news captioning. Specifically, we first adapt CLIP to develop a news-specific image-text matching module, enriched with insights from the language model MPNet using a matching-score comparative loss, which facilitates effective cross-modal knowledge distillation. This module enhances the coherence between images and each news sentence via confidence rating. Then, we design confidence-aware prompts to fine-tune the LLaVA model with the LoRA strategy, focusing on essential details in extensive articles. Lastly, we evaluate the generated news caption with the refined CLIP, constructing confidence-feedback prompts to further enhance LLaVA through feedback learning, which iteratively refines captions to improve their accuracy. Extensive experiments conducted on two public datasets, GoodNews and NYTimes800k, have validated the effectiveness of our method.



Paperid:343 Poster
Authors:Liyang He,Zhenya Huang,Chenglong Liu,Rui Li,Runze Wu,Qi Liu,Enhong Chen
Abstract:
Deep Hashing (DH) has emerged as an indispensable technique for fast image search in recent years. However, using full-precision Convolutional Neural Networks (CNN) in DH makes it challenging to deploy on devices with limited resources. To deploy DH on resource-limited devices, the Binary Neural Network (BNN) offers a solution that significantly reduces computations and parameters compared to CNN. Unfortunately, applying BNN directly to DH will lead to huge performance degradation. To tackle this problem, we first conducted extensive experiments and discovered that the center-based method provides a fundamental guarantee for BNN-DH performance. Subsequently, we delved deeper into the impact of BNNs on center-based methods and revealed two key insights. First, we find that reducing the distance between hash codes and hash centers is challenging for BNN-DH compared to CNN-based DH. This can be attributed to the limited representation capability of BNN. Second, the evolution of hash code aggregation undergoes two stages in BNN-DH, which is different from CNN-based DH. Thus, we need to take into account the changing trends in code aggregation at different stages. Based on these findings, we designed a strong and general method called One-bit Deep Hashing (ODH). First, ODH incorporates a semantic self-adaptive hash center module to address the problem of hash codes inadequately converging to their hash centers. Then, it employs a novel two-stage training method to consider the evolution of hash code aggregation. Finally, extensive experiments on two datasets demonstrate that ODH can achieve significant superiority over other BNN-DH models. The code for ODH is available at https://anonymous.4open.science/r/OSH-1730.
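For context, the sketch below shows a generic center-based deep hashing objective of the kind the abstract identifies as the fundamental guarantee: relaxed (tanh) codes are pulled toward fixed binary hash centers with a bitwise binary cross-entropy. ODH's semantic self-adaptive centers and two-stage training are not reproduced; the code length, number of classes, and random centers are illustrative.

```python
import torch
import torch.nn.functional as F

def hash_center_loss(codes, labels, centers):
    """Center-based hashing objective: push each relaxed code toward its
    class's binary hash center using a bitwise binary cross-entropy.

    codes:   (B, K) outputs of a tanh hashing head, values in (-1, 1)
    labels:  (B,)   class indices
    centers: (C, K) binary hash centers with entries in {-1, +1}
    """
    target = (centers[labels] + 1) / 2                 # map {-1, +1} -> {0, 1}
    prob = (codes + 1) / 2                             # relaxed bit probabilities
    return F.binary_cross_entropy(prob.clamp(1e-6, 1 - 1e-6), target)

K, C = 64, 10
centers = torch.sign(torch.randn(C, K))                # e.g. Hadamard rows in practice
codes = torch.tanh(torch.randn(8, K))                  # relaxed codes from a hashing head
loss = hash_center_loss(codes, torch.randint(0, C, (8,)), centers)
```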



Paperid:344 Poster
Authors:Shaodong Wang,Yunyang Ge,Liuhan Chen,Haiyang Zhou,Qian Wang,Xinhua Cheng,Li Yuan
Abstract:
As a critical component in graphic design, artistic posters are widely applied in the advertising and entertainment industry; thus, automatic poster creation from user-provided prompts has become increasingly desired. Although existing Text2Image methods create impressive images aligned with given prompts, they fail to generate ideal artistic posters, especially posters with Chinese texts. To create desired artistic Chinese posters, including an aligned background, reasonable layouts, and stylized graphical texts, from given prompts only, we propose an automatic poster creation framework named Prompt2Poster. Our framework first utilizes the capacity of the powerful Large Language Model (LLM) to extract user intention from provided prompts and generate the aligned background. For harmonious layout and graphical text generation, we propose the Controllable Layout Generator (CLG) and Graphical Text Generator (GTG) modules, which both leverage sufficient multi-modal information, leading to accurate and pleasurable visual results. Comprehensive experiments demonstrate that our Prompt2Poster achieves superior performance over existing poster creation methods, especially in text quality and visual harmony. Our codes will be released after the paper review.



Paperid:345 Poster
Authors:Yuxiang Cai,Yongheng Shang,Jianwei Yin
Abstract:
Unsupervised domain adaptation (UDA) has been a crucial approach for cross-domain semantic segmentation of remote sensing images and has achieved notable advances. However, most existing efforts focus on single-source single-target domain adaptation, which does not explicitly consider the serious domain shift between multiple source and target domains in real applications, especially the inter-domain shift between various target domains and the intra-domain shift within each target domain. In this paper, to address simultaneous inter-domain shift and intra-domain shift for multiple target domains, we propose a novel unsupervised, multistage, multisource and multitarget domain adaptation network (MultiDAN), which involves multisource and multitarget domain adaptation (MSMTDA), entropy-based clustering (EC) and multistage domain adaptation (MDA). Specifically, MSMTDA learns feature-level multiple adversarial strategies to alleviate complex domain shift between multiple target and source domains. Then, EC clusters the various target domains into multiple subdomains based on the entropy of the target predictions of MSMTDA. Besides, we propose a new pseudo label update strategy (PLUS) to dynamically produce more accurate pseudo labels for MDA. Finally, MDA aligns the clean subdomains, including pseudo labels generated by PLUS, with other noisy subdomains in the output space via the proposed multistage adaptation algorithm (MAA). The extensive experiments on benchmark remote sensing datasets highlight the superiority of our MultiDAN against recent state-of-the-art UDA methods.



Paperid:346 Poster
Authors:Yifan Wang,Wuliang Huang,Lei Li,Chun Yuan
Abstract:
The challenging task of composed image retrieval targets identifying the matched image from a multi-modal query with a reference image and a textual modifier. Most existing methods are devoted to composing unified query representations from the query images and texts, yet the distribution gaps between the hybrid-modal query representations and visual target representations are neglected. However, directly incorporating target features into the query may cause ambiguous rankings and poor robustness due to insufficient exploration of the distinctions and overfitting issues. To address the above concerns, we propose a novel framework termed SemAntic Distillation from Neighborhood (SADN) for composed image retrieval. To mitigate the distribution divergences, we construct neighborhood sampling from the target domain for each query and further aggregate neighborhood features with adaptive weights to restructure the query representations. Specifically, the adaptive weights are determined by the collaboration of two individual modules, namely correspondence-induced adaption and divergence-based correction. Correspondence-induced adaption accounts for capturing the correlation alignments from neighbor features under the guidance of the positive representations, and the divergence-based correction regulates the weights based on the embedding distances between hard negatives and the query in the latent space. Extensive experimental results and ablation studies on CIRR and FashionIQ validate that the proposed semantic distillation from neighborhood significantly outperforms baseline methods.



Paperid:347 Poster
Authors:ACMMM 2024 Conference Submission4292 Authors
Abstract:
Parameter-Efficient Fine Tuning (PEFT) has been demonstrated to be effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps alleviate substantial computational costs for retraining multi-modal models. However, existing approaches primarily focus on multi-modal fusion, while neglecting the modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient $Co$llaborative $P$rompt $L$earning ($CoPL$) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored for fine-tuning each modality on specific tasks, while the modal-interaction prompts are customized to efficiently explore inter-modality association. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while utilizing approximately 1% extra trainable parameters.



Paperid:348 Poster
Authors:Mingzhao Yang,Shangchao Su,Bin Li,Xiangyang Xue
Abstract:
In recent years, the attention towards One-Shot Federated Learning (OSFL) has been driven by its capacity to minimize communication. With the development of the diffusion model (DM), several methods employ the DM for OSFL, utilizing model parameters, image features, or textual prompts as mediums to transfer the local client knowledge to the server. However, these mediums often require public datasets or a uniform feature extractor, significantly limiting their practicality. In this paper, we propose FedDEO, a Description-Enhanced One-Shot Federated Learning Method with DMs, offering a novel exploration of utilizing the DM in OSFL. The core idea of our method involves training local descriptions on the clients, serving as the medium to transfer the knowledge of the distributed clients to the server. Firstly, we train local descriptions on the client data to capture the characteristics of client distributions, which are then uploaded to the server. On the server, the descriptions are used as conditions to guide the DM in generating synthetic datasets that comply with the distributions of various clients, enabling the training of the aggregated model. Theoretical analyses and extensive quantitative and visualization experiments on three large-scale real-world datasets demonstrate that, through the training of local descriptions, the server is capable of generating synthetic datasets with high quality and diversity. Consequently, with advantages in communication and privacy protection, the aggregated model outperforms compared FL or diffusion-based OSFL methods and, on some clients, outperforms the performance ceiling of centralized training.



Paperid:349 Poster
Authors:Hao Wu,Fan Xu,Chong Chen,Xian-Sheng Hua,Xiao Luo,Haixin Wang
Abstract:
In this paper, we investigate the challenge of spatio-temporal video prediction task, which involves generating future video frames based on historical spatio-temporal observation streams. Existing approaches typically utilize external information such as semantic maps to improve video prediction accuracy, which often neglect the inherent physical knowledge embedded within videos. Worse still, their high computational costs could impede their applications for high-resolution videos. To address these constraints, we introduce a novel framework called \underline{P}hysics-\underline{a}ssisted \underline{S}patio-\underline{t}emporal \underline{Net}work (PastNet) for high-quality video prediction. The core of PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used spatio-temporal video benchmarks demonstrate the effectiveness and efficiency of the proposed PastNet compared with a range of state-of-the-art methods, particularly in high-resolution scenarios.
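The spectral convolution idea can be sketched with a minimal Fourier-neural-operator-style block: transform the feature map with an FFT, mix channels with learnable complex weights on the lowest spatial modes, and transform back. The sketch below assumes a simple single-block form with hypothetical channel counts and mode truncation; PastNet's actual operator and its memory-bank component are not reproduced.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Fourier-domain convolution: mix channels on the lowest spatial modes.

    A minimal FNO-style sketch, not necessarily PastNet's exact operator.
    """
    def __init__(self, in_ch, out_ch, modes=12):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_ch * out_ch)
        self.w = nn.Parameter(scale * torch.randn(in_ch, out_ch, modes, modes,
                                                  dtype=torch.cfloat))

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)                # (B, C, H, W//2 + 1), complex spectrum
        out_ft = torch.zeros(B, self.w.shape[1], H, W // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        m = self.modes                           # keep only the lowest modes
        out_ft[:, :, :m, :m] = torch.einsum('bixy,ioxy->boxy', x_ft[:, :, :m, :m], self.w)
        return torch.fft.irfft2(out_ft, s=(H, W))

y = SpectralConv2d(3, 16)(torch.randn(2, 3, 64, 64))   # (2, 16, 64, 64)
```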



Paperid:350 Poster
Authors:Ye Tian,Zhe Wang,Jianguo Sun,Liguo Zhang
Abstract:
Audio super-resolution aims to improve the quality of acoustic signals and is able to reconstruct corresponding high-resolution acoustic signals from low-resolution acoustic signals. However, since acoustic signals can be divided into two forms, time-domain acoustic waves or frequency-domain spectrograms, most existing research focuses on data enhancement in a single domain, which can only obtain partial or local features of the audio signal, resulting in limitations of data analysis. Therefore, this paper proposes a time-frequency domain fusion enhanced audio super-resolution method to mine the complementarity of the two representations of acoustic signals. Specifically, we propose an end-to-end audio super-resolution network, including the variational autoencoder based Sound Wave Super-Resolution Module (SWSRM), U-Net-based Spectrogram Super-Resolution Module (SSRM), and attention-based Time-Frequency Domain Fusion Module (TFDFM). SWSRM and SSRM can generate more high-frequency and low-frequency components for audio, respectively. As a critical component of our method, TFDFM performs weighted fusion on the above two outputs to obtain a super-resolution audio signal. Compared with other methods, experimental results on the VCTK and Piano datasets in natural scenes show that the time-frequency domain fusion audio super-resolution model has a state-of-the-art bandwidth expansion effect. Furthermore, we perform super-resolution on the ShipsEar dataset containing underwater acoustic signals. The super-resolution results are used to test ship target recognition, and the accuracy is improved by 12.66%. Therefore, the proposed super-resolution method has an excellent signal enhancement effect and generalization ability.



Paperid:351 Poster
Authors:Zongqian Wu,Yujing Liu,Mengmeng Zhan,Ping Hu,Xiaofeng Zhu
Abstract:
Although current prompt learning methods have been successfully designed to effectively reuse large pre-trained models without fine-tuning their large number of parameters, they still have limitations to be addressed: they do not consider the adverse impact of meaningless patches in every image, and they do not simultaneously consider in-sample generalization and out-of-sample generalization. In this paper, we propose an adaptive multi-modality prompt learning method to address the above issues. To do this, we employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization by first masking meaningless patches and then padding them with learnable parameters and information from texts. Moreover, each of the prompts provides auxiliary information to the other, further strengthening these two kinds of generalization. Experimental results on real datasets demonstrate that our method outperforms SOTA methods on different downstream tasks.



Paperid:352 Poster
Authors:Xiuliang Duan,Dating Tan,Liangda Fang,Yuyu Zhou,Chaobo He,Ziliang Chen,Lusheng Wu,Guanliang Chen,Zhiguo Gong,Weiqi Luo,Quanlong Guan
Abstract:
MultiModal Large Language Models (MM-LLMs) have demonstrated exceptional reasoning abilities in various visual question-answering tasks. However, they encounter significant challenges when answering geometry questions. These challenges arise from the need to engage in rigorous reasoning and execute precise arithmetic. To enhance the ability of LLMs to solve multimodal geometric questions, we propose Reason-and-Execute (RaE) prompting: a new prompting method specifically designed for enhancing MM-LLMs to solve geometric questions. Specifically, we first designed a rigorous reasoning process based on domain knowledge of geometry, using a reverse thinking approach, and obtained the precise arithmetic steps required for solving the question. Secondly, based on the analysis of the reasoning process, we designed code blocks in a programming language to implement the arithmetic functions. Finally, by executing the contents of the code blocks using an interpreter, we obtained the answers to the geometric questions. We evaluated the accuracy of 9 models in answering questions on 6 datasets (including four geometry datasets and two science datasets) using different prompting templates. Specifically, in the main experimental results, our RaE showed a maximum enhancement of 12.8% compared to other prompting methods, which demonstrates the strong reasoning and arithmetic abilities of our method in solving geometric questions. Moreover, we analyzed the impact on answering from the perspective of solving geometric problems by considering multiple factors, including domain knowledge, geometry shapes, understanding of the question text, and language. This once again emphasizes that our method has passed a comprehensive test of solving geometry questions. The source code and data will be published in a GitHub repository.
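The reason-then-execute pattern can be illustrated with a small driver that prompts a model for reasoning plus a Python code block, extracts the block, and runs it with an interpreter. Everything below, including the prompt wording, the `query_llm` callable, and the toy stand-in model, is a hypothetical sketch of the general pattern rather than the paper's RaE templates.

```python
import re

FENCE = "`" * 3                      # code-fence marker, built programmatically for clarity

RAE_TEMPLATE = (
    "You are solving a geometry question.\n"
    "Step 1 (Reason): analyze the figure and question with reverse thinking, "
    "listing the arithmetic steps needed.\n"
    "Step 2 (Execute): emit a fenced Python code block that computes the answer "
    "into a variable named `answer`.\n\n"
    "Question: {question}\n"
)

def reason_and_execute(question, query_llm):
    """query_llm(prompt) -> model text; a stand-in for the actual MM-LLM call."""
    response = query_llm(RAE_TEMPLATE.format(question=question))
    match = re.search(FENCE + r"python\n(.*?)" + FENCE, response, re.DOTALL)
    if match is None:
        return None
    namespace = {}
    exec(match.group(1), namespace)      # run the emitted arithmetic code with an interpreter
    return namespace.get("answer")

# Toy stand-in "model" that always returns the same reasoning plus code.
def fake_llm(prompt):
    code = "answer = (3 ** 2 + 4 ** 2) ** 0.5\n"
    return ("Reasoning: the legs are 3 and 4, so apply the Pythagorean theorem.\n"
            + FENCE + "python\n" + code + FENCE)

print(reason_and_execute("Find the hypotenuse of a 3-4 right triangle.", fake_llm))  # 5.0
```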



Paperid:353 Poster
Authors:Fu Rong,Wenjin Peng,Meng Lan,Qian Zhang,Lefei Zhang
Abstract:
Driving scene topology reasoning aims to understand the objects present in the current road scene and model their topology relationships to provide guidance information for downstream tasks. Previous approaches fail to adequately facilitate interactions among traffic objects and neglect to incorporate scene information into topology reasoning, thus limiting the comprehensive exploration of potential correlations among objects and diminishing the practical significance of the reasoning results. Besides, the lack of constraints on lane direction may introduce erroneous guidance information and lead to a decrease in topology prediction accuracy. In this paper, we propose a novel topology reasoning framework, dubbed TSTGT, to address these issues. Specifically, we design a divide-and-conquer topology graph Transformer to respectively infer the lane-lane and lane-traffic topology relationships, which can effectively aggregate the local and global object information in the driving scene and facilitate the topology relationship learning. Additionally, a traffic scene-assisted reasoning module is devised and combined with the topology graph Transformer to enhance the practical significance of lane-traffic topology. In terms of lane detection, we develop a point-wise matching strategy to infer lane centerlines with correct directions, thereby improving the topology reasoning accuracy. Extensive experimental results on Openlane-V2 benchmark validate the superiority of our TSTGT over state-of-the-art methods and the effectiveness of our proposed modules.



Paperid:354 Poster
Authors:Yu Feng,Zhen Tian,Yifan Zhu,Zongfu Han,Haoran Luo,Guangwei Zhang,Meina Song
Abstract:
The key challenge of cross-modal domain-incremental learning (DIL) is to enable the learning model to continuously learn from novel data with different feature distributions under the same task without forgetting old ones. However, existing top-performing methods still cause high forgetting rates, due to the lack of intra-domain knowledge extraction and an inter-domain common prompting strategy. In this paper, we propose a simple yet effective framework, CP-Prompt, which trains limited parameters to instruct a pre-trained model to learn new domains and avoid forgetting existing feature distributions. CP-Prompt captures intra-domain knowledge by compositionally inserting personalized prompts on multi-head self-attention layers and then learns the inter-domain knowledge with a common prompting strategy. CP-Prompt shows superiority compared with state-of-the-art baselines among three widely evaluated DIL tasks. The source code is available at https://anonymous.4open.science/r/CP_Prompt-C126.



Paperid:355 Poster
Authors:Tao Ling,Siping SHI,Hao Wang,Chuang Hu,Dan Wang
Abstract:
Federated learning is a promising privacy-preserving learning paradigm in which multiple clients can collaboratively learn a model with their image data kept local. To protect data ownership, personalized watermarks are usually added to the image data by each client. However, the introduced watermarks can lead to a shortcut learning problem, where the learned model over-relies on simple watermark-related features for prediction and exhibits low accuracy on real-world data. Existing works assume the central server can directly access the predefined shortcut features during the training process. However, these may fail in the federated learning setting as the shortcut features of the heterogeneous watermarked data are difficult to obtain. In this paper, we propose a federated Morozov regularization technique, where the regularization parameter can be adaptively determined based on the watermark knowledge of all the clients in a privacy-preserving way, to eliminate the shortcut learning problem caused by the watermarked data. Specifically, federated Morozov regularization first performs lightweight local watermark mask estimation in each client to obtain knowledge of the locations and intensities of local watermarks. Then, it aggregates the estimated local watermark masks to generate the global watermark knowledge with a weighted averaging. Finally, federated Morozov regularization determines the regularization parameter for each client by combining the local and global watermark knowledge. With the regularization parameter determined, the model is trained as in normal federated learning. We implement and evaluate federated Morozov regularization based on a real-world deployment of federated learning on 40 Jetson devices with real-world datasets. The results show that federated Morozov regularization improves model accuracy by 11.22% compared to existing baselines.
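The aggregation step described above can be sketched as follows: local watermark masks are combined by weighted averaging into global watermark knowledge, and a per-client regularization parameter is then derived from the local and global statistics. The specific heuristic below (scaling by the ratio of mean mask intensities) is an assumption for illustration; the paper's actual Morozov-based rule is not reproduced.

```python
import numpy as np

def aggregate_watermark_masks(local_masks, weights=None):
    """Weighted-average the clients' estimated watermark masks ((H, W) arrays in [0, 1])."""
    masks = np.stack(local_masks)
    if weights is None:
        weights = np.full(len(local_masks), 1.0 / len(local_masks))
    return np.tensordot(weights, masks, axes=1)

def regularization_parameter(local_mask, global_mask, base=1.0):
    """Hypothetical per-client parameter: regularize harder when the client's
    watermark is strong relative to the global watermark knowledge."""
    local_strength = local_mask.mean()
    global_strength = global_mask.mean() + 1e-8
    return base * local_strength / global_strength

rng = np.random.default_rng(0)
local = [rng.random((32, 32)) * 0.3 for _ in range(5)]   # per-client mask estimates
global_mask = aggregate_watermark_masks(local)
lambdas = [regularization_parameter(m, global_mask) for m in local]
```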



Paperid:356 Poster
Authors:Qing Zhang,Haocheng Lv,Jie Liu,Zhiyun Chen,JianYong Duan,Hao Wang,Li He,MingYing Xu
Abstract:
With the rise of large-scale language models (LLMs), it is currently popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that current methods of multimodal multi-hop question answering still mainly face two challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance as irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, making it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that avoids relying heavily on LLMs due to their potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework with a focus on facilitating common knowledge sharing across interpretability and prediction tasks while preventing task-specific errors from interfering with each other via a mixture of experts. Afterward, we design an iterative feedback mechanism to further enhance both tasks by feeding the results of the joint training back to the LLM for regenerating entailment trees, aiming to iteratively refine the potential answer. Notably, our method has \textbf{won first place} on the official WebQA leaderboard (since April 10, 2024), and achieves competitive results on MultimodalQA.



Paperid:357 Poster
Authors:Xudong Lu,Yuqi Jiang,Haiwen Hong,Qi Sun,Cheng Zhuo
Abstract:
Multi-modality image fusion (MMIF) aims to integrate the complementary features of source images into the fused image, including target saliency and texture specifics. Recently, image fusion methods leveraging diffusion models have demonstrated commendable results. Despite their strengths, diffusion models have a reduced capability to perceive local features. Additionally, their inherent working mechanism, which introduces noise to the inputs, consequently leads to a loss of original information. To overcome this problem, we propose a novel Diffusion-CNN feature Aggregation Fusion (DCAFuse) network that can extract complementary features from the dual branches and aggregate them effectively. Specifically, we utilize the denoising diffusion probabilistic model (DDPM) in the diffusion-based branch to construct global information, and multi-scale convolutional kernels in the CNN-based branch to extract local detailed features. Afterward, we design a novel complementary feature aggregation module (CFAM). By constructing coordinate attention maps for the concatenated features, CFAM captures long-range dependencies in both horizontal and vertical directions, thereby dynamically guiding the aggregation weights of the branches. In addition, to further improve the complementarity of dual-branch features, we introduce a novel loss function based on cosine similarity and a unique denoising timestep selection strategy. Extensive experimental results show that our proposed DCAFuse outperforms other state-of-the-art methods in multiple image fusion tasks, including infrared and visible image fusion (IVF) and medical image fusion (MIF). The source code will be publicly available at https://xxx/xxx/xxx.
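One plausible instantiation of a cosine-similarity-based complementarity term is sketched below: the two branches' flattened features are penalized for being cosine-similar, encouraging them to carry complementary (global versus local) information. The exact form of DCAFuse's loss is not specified in the abstract, so the formulation, shapes, and names here are assumptions.

```python
import torch
import torch.nn.functional as F

def complementarity_loss(feat_diffusion, feat_cnn):
    """Penalize cosine similarity between the two branches' features so they
    encode complementary information.

    feat_*: (B, C, H, W) feature maps from the diffusion branch and the CNN branch.
    """
    a = F.normalize(feat_diffusion.flatten(1), dim=1)
    b = F.normalize(feat_cnn.flatten(1), dim=1)
    return (a * b).sum(dim=1).abs().mean()     # |cosine similarity|, averaged over the batch

loss = complementarity_loss(torch.randn(4, 64, 32, 32), torch.randn(4, 64, 32, 32))
```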



Paperid:358 Poster
Authors:Zhenzhong Kuang,Jianan Lu,Chenhui Hong,Haobin Huang,Suguo Zhu,Xiaowei Zhao,Jun Yu,Jianping Fan
Abstract:
The issue of face privacy protection has aroused wide social concern along with the increasing applications of face images. The latest methods focus on achieving a good privacy-utility tradeoff so that the protected results can still be used to support downstream computer vision tasks. However, they may suffer from limited flexibility in manipulating this tradeoff because practical requirements may vary under different scenarios. In this paper, we present a two-stage latent representation reorganization (LReOrg) framework for face image privacy protection relying on our conditional bidirectional network, which is optimized by using a distinct keyword-based swap training strategy with a multi-task loss. The privacy-sensitive information is anonymized in the first stage and the destroyed useful information is recovered in the second stage according to user requirements. LReOrg is advantageous in: (a) enabling users to recurrently process fine-grained attributes; (b) providing flexible control over the privacy-utility tradeoff by manipulating which attributes to anonymize or preserve using cross-modal keywords; and (c) eliminating the need for data annotations for network training. The experimental results on benchmark datasets demonstrate the superior ability of our approach to provide flexible protection of facial information.



Paperid:359 Poster
Authors:Yuanhe Tian,Fei Xia,Yan Song
Abstract:
Existing radiology report generation (RRG) studies mostly adopt autoregressive (AR) approaches to produce textual descriptions token-by-token for specific clinical radiographs, where they are susceptible to error propagation if irrelevant content is generated halfway, leading to potentially imprecise presentation of diagnoses, especially when complicated abnormalities exist in radiographs. Although the non-AR paradigm, e.g., the diffusion model, provides an alternative solution to the problem of AR by generating all content in parallel, the mechanism of using Gaussian noise in existing diffusion models still has significant room for improvement when such models are used in particular circumstances, i.e., providing proper guidance in controlling noise in the diffusive process to ensure precise report generation. In this paper, we propose to conduct RRG with diffusion networks by controlling the noise with task-specific features, which leverages irrelevant visual and textual information as noise rather than stochastic Gaussian noise, and allows the diffusion networks to filter particular information through iterative denoising, thus performing a precise and controlled report generation process. Experiments on IU X-Ray and MIMIC-CXR demonstrate the superiority of our approach compared to strong baselines and state-of-the-art solutions. Human evaluation and noise type analysis show that comprehensive noise control greatly helps diffusion networks to refine the generation of global and local report contents.



Paperid:360 Poster
Authors:Zhiwen Wang,Yuhui Wu,Zheng WANG,Jiwei Wei,Tianyu Li,Guoqing Wang,Yang Yang,Heng Tao Shen
Abstract:
When applying high-level visual algorithms to rainy scenes, it is customary to preprocess the rainy images using low-level rain removal networks, followed by visual networks to achieve the desired objectives. Such a setting has never been explored by adversarial attack methods, which are limited to attacking only one of these networks. Considering the deficiency of multi-functional attacking strategies and the significance for open-world perception scenarios, we are the first to propose a Cascaded Adversarial Attack (CAA) setting, where the adversarial example can simultaneously attack different-level tasks, such as rain removal and semantic segmentation in an integrated system. Specifically, our attack on the rain removal network aims to preserve rain streaks in the output image, while for the semantic segmentation network, we employ powerful existing adversarial attack methods to induce misclassification of the image content. Importantly, CAA innovatively utilizes binary masks to effectively concentrate the aforementioned two significantly disparate perturbation distributions on the input image, enabling attacks on both networks. Additionally, we propose two variants of CAA, which minimize the differences between the two generated perturbations by introducing a carefully designed perturbation interaction mechanism, resulting in enhanced attack performance. Extensive experiments validate the effectiveness of our methods, demonstrating their superior ability to significantly degrade the performance of the downstream task compared to methods that solely attack a single network.
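The binary-mask composition can be sketched directly: one perturbation is placed on the masked pixels, the other on the complement, and the result is clipped to the perturbation budget and valid pixel range. The mask construction, budget, and variable names below are illustrative; the generation of the two perturbations themselves (e.g. by gradient-based attacks against each network) is omitted.

```python
import torch

def combine_perturbations(image, delta_rain, delta_seg, mask, eps=8 / 255):
    """Concentrate two attack perturbations on disjoint pixel sets via a binary mask.

    image: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W) with entries in {0, 1}.
    delta_rain targets the rain removal network, delta_seg the segmentation network.
    """
    delta = mask * delta_rain + (1 - mask) * delta_seg
    delta = delta.clamp(-eps, eps)                  # keep the perturbation within budget
    return (image + delta).clamp(0.0, 1.0)          # keep pixels in the valid range

B, H, W = 2, 64, 64
image = torch.rand(B, 3, H, W)
mask = (torch.rand(B, 1, H, W) > 0.5).float()
adv = combine_perturbations(image, 0.03 * torch.randn(B, 3, H, W),
                            0.03 * torch.randn(B, 3, H, W), mask)
```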



Paperid:361 Poster
Authors:Huixiang Wen,Shan Chang,Shizong Yan,Jie Xu,Hongzi Zhu,Yanting Zhang,Bo Li
Abstract:
Adhesive adversarial patches have been commonly used in attacks against the computer vision task of monocular depth estimation (MDE). Compared to physical patches permanently attached to target objects, optical projection patches show great flexibility and have gained wide research attention. However, applying digital patches for direct projection may lead to partial blurring or omission of details in the captured patches, attributed to high information density, surface depth discrepancies, and non-uniform pixel distribution. To address these challenges, in this work we introduce DepthCloak, an adversarial optical patch designed to interfere with the MDE of vehicles. To this end, we first simplify the patch to a gray pattern because the projected ``black-and-white light'' has strong robustness to ambient light. We propose a GAN-based approach to simulate projections and deduce a projectable list. Then, we employ neighborhood averaging to fill sparse depth values, compress all depth values into a reduced dynamic range via nonlinear mapping, and use these values to adjust the Gaussian blur radius as weight parameters, thereby simulating depth variation effects. Finally, by integrating Moiré patterns and applying style transfer techniques, we customize adversarial patches featuring regularly arranged characteristics. We deploy DepthCloak in real driving scenarios, and extensive experiments demonstrate that DepthCloak can achieve depth errors of over 9 meters in both bright and night-time conditions while achieving an attack success rate of over 80% in the physical world.



Paperid:362 Poster
Authors:Yi Tu,Chong Zhang,Ya Guo,Huan Chen,Jinyang Tang,Huijia Zhu,Qi Zhang
Abstract:
The recognition of named entities in visually-rich documents (VrD-NER) plays a critical role in various real-world scenarios and applications. However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. The UNER head considers the VrD-NER task as a combination of sequence labeling and reading order prediction, effectively addressing the issues of discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on various VrD-NER datasets to enhance the document transformer backbones and exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-linguistic scenarios and exhibits zero-shot entity extraction abilities.



Paperid:363 Poster
Authors:Luoyi Sun,Xuenan Xu,Mengyue Wu,Weidi Xie
Abstract:
Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, the present datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality, audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, namely audio-language retrieval, audio captioning, and zero-shot classification. In addition, we establish a novel benchmark with environmental information, providing a testbed for audio-text tasks.



Paperid:364 Poster
Authors:Xiangrui Liu,Xinju Wu,Pingping Zhang,Shiqi Wang,Zhu Li,Sam Kwong
Abstract:
Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures the predictive relationships among primitives. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and representation efficacy. Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality. Our code will be released on GitHub for further research.
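The anchor-plus-residual idea at the heart of the hybrid primitive structure can be illustrated with a few lines of NumPy. This is only a schematic of residual coding against anchor primitives, before any quantization or entropy coding; the attribute packing and the nearest-anchor assignment are assumptions, not the paper's exact scheme.

```python
# Hedged sketch: encode Gaussian primitive attributes as residuals against anchors.
import numpy as np

def residual_encode(primitives, anchors, assign):
    """primitives: (N,D) packed Gaussian attributes; anchors: (K,D);
    assign: (N,) index of the anchor predicting each primitive."""
    return primitives - anchors[assign]          # compact residual forms

def residual_decode(residuals, anchors, assign):
    return anchors[assign] + residuals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prims = rng.normal(size=(1000, 14)).astype(np.float32)   # hypothetical attribute vectors
    anchors = prims[rng.choice(1000, size=32, replace=False)]
    # Assign every primitive to its nearest anchor (illustrative choice).
    assign = np.argmin(((prims[:, None, :] - anchors[None]) ** 2).sum(-1), axis=1)
    res = residual_encode(prims, anchors, assign)
    rec = residual_decode(res, anchors, assign)
    print("max reconstruction error:", np.abs(rec - prims).max())  # exact before quantization
```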



Paperid:365 Poster
Authors:Yifei Gao,Jiaqi Wang,Zhiyu Lin,Jitao Sang
Abstract:
The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: While AI-generated contents have played a crucial role in a wide range of AI models, the potential hidden risks they introduce have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC hallucination bias: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even though these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations on the Q-Former and linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.



Paperid:366 Poster
Authors:Pengfei Zhou,Fangxiang Feng,Guang Liu,Ruifan Li,Xiaojie Wang
Abstract:
Latent diffusion models have demonstrated impressive efficacy in image generation and editing tasks. Recently, they have also promoted the advancement of image harmonization. However, methods involving latent diffusion models all face a common challenge: the severe image distortion introduced by the VAE component, while image harmonization is a low-level image processing task that relies on pixel-level evaluation metrics. In this paper, we propose Harmony-VAE, leveraging the input of the harmonization task itself to enhance the quality of decoded images. The composite image contained in the input provides precise pixel-level information, which can complement the correct foreground appearance and color information contained in the denoised latents. Meanwhile, the inherent generative nature of diffusion models makes them naturally suited to inverse image harmonization, i.e. generating synthetic composite images based on real images and foreground masks. We train an inverse harmonization diffusion model to perform data augmentation on two subsets of iHarmony4 and construct a new human harmonization dataset with prominent foreground objects. Extensive experiments demonstrate the effectiveness of our proposed Harmony-VAE and inverse harmonization model. The code, pretrained models and the new dataset will be made publicly available.



Paperid:367 Poster
Authors:Zimo Liu,Kangjun Liu,Mingyue Guo,Shiliang Zhang,Yaowei Wang
Abstract:
Model compression and distillation techniques have become essential for deploying deep learning models efficiently. However, existing methods often encounter challenges related to model generalization and scalability for harnessing the expertise of pre-trained large models. This paper introduces CoTuning, a novel framework designed to enhance the generalization ability of neural networks by leveraging collaborative learning between large and small models. CoTuning overcomes the limitations of traditional compression and distillation techniques by introducing strategies for knowledge exchange and simultaneous optimization. Our framework comprises an adapter-based co-tuning mechanism between cloud and edge models, a scale-shift projection for feature alignment, and a novel collaborative knowledge distillation mechanism for domain-agnostic tasks. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness of CoTuning in improving model generalization while maintaining computational efficiency and scalability. The proposed framework exhibits a significant advancement in model compression and distillation, with broad implications for research in the collaborative evolution of large-small models.



Paperid:368 Poster
Authors:Jiahua Xiao,Yang Liu,Shizhou Zhang,Xing Wei
Abstract:
Remarkable progress has been made in hyperspectral image (HSI) denoising. However, the majority of existing methods are predominantly confined to the spatial-spectral domain, overlooking the untapped potential inherent in the Fourier domain. This paper presents a novel approach to address HSI denoising by bridging the information from the Fourier and spatial-spectral domains. Our method highlights key insights into the Fourier properties within spatial and spectral domains through the Fourier transform. Specifically, we note that the amplitude predominantly encodes noise and photon reflection characteristics, while the phase holds structural information. Additionally, the Fourier transform offers a receptive field that spans the entire image, enabling effective global noise distribution capture. These insights unveil new perspectives on the physical properties of HSIs, motivating us to leverage complementary information exchange between the Fourier and spatial-spectral domains. To this end, we introduce the Fourier-prior Integration Denoising Network (FIDNet), a potent yet straightforward approach that utilizes Fourier insights to synergistically interact with spatial-spectral domains for superior HSI denoising. In FIDNet, we independently extract spatial and Fourier features through dual branches and merge these representations to enhance spectral evolution modeling through the inherent structure consistency constraints and continuing reflection variation revealed by the Fourier prior. Our proposed method demonstrates robust generalization across synthetic and real-world benchmark datasets, outperforming state-of-the-art methods in both quantitative quality and visual results.
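The amplitude/phase observation above rests on a standard Fourier split, which a short PyTorch sketch can make concrete. This shows only the generic decomposition and its exact recombination, not FIDNet's dual-branch network; the (B, C, H, W) bands-as-channels layout is an assumption.

```python
# Generic Fourier amplitude/phase split of a hyperspectral cube (not FIDNet itself).
import torch

def fourier_amplitude_phase(x):
    """x: (B, C, H, W) with spectral bands as channels. Returns (amplitude, phase)."""
    spec = torch.fft.fft2(x, norm="ortho")
    return spec.abs(), spec.angle()

def recombine(amplitude, phase):
    spec = torch.polar(amplitude, phase)  # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, norm="ortho").real

if __name__ == "__main__":
    x = torch.rand(1, 31, 64, 64)
    amp, pha = fourier_amplitude_phase(x)
    # The split is lossless, so a denoiser can edit amplitude and phase separately.
    print(torch.allclose(recombine(amp, pha), x, atol=1e-5))
```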



Paperid:369 Poster
Authors:Fangjian Liao,Xingxing Zou,Waikeung Wong
Abstract:
Image-to-image (i2i) translation has achieved notable success, yet remains challenging in scenarios like real-to-illustrative style transfer of fashion. Existing methods focus on enhancing the generative model with diversity while lacking ID-preserved domain translation. This paper introduces a novel model named Uni-DlLoRA to relax this constraint. The proposed model combines the original images within a pretrained diffusion-based model using the proposed Uni-adapter extractors, while adopting the proposed Dual-LoRA module to provide distinct style guidance. This approach optimizes generative capabilities and reduces the number of additional parameters required. In addition, a new multimodal dataset featuring higher-quality images with captions built upon an existing real-to-illustration dataset is proposed. Experimentation validates the effectiveness of our proposed method.
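For readers unfamiliar with LoRA, on which the Dual-LoRA module builds, a generic low-rank adapter wrapped around a frozen linear layer looks roughly as follows. This is standard LoRA, not the paper's dual-branch style guidance; the rank and alpha values are illustrative.

```python
# Generic LoRA adapter sketch (not the paper's Dual-LoRA module).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                   # start as an identity-preserving update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), rank=4)
    print(layer(torch.randn(2, 768)).shape)
```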



Paperid:370 Poster
Authors:Qian Cao,Xu Chen,Ruihua Song,Xiting Wang,Xinting Huang,Yuchen Ren
Abstract:
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While these models demonstrate proficiency in describing the content of normal images, they may struggle to accurately describe images where certain parts are obscured or edited. Conversely, humans effortlessly excel in such cases. The weaknesses these models exhibit, including hallucinations and limited interpretability, often result in performance declines when applied to scenarios involving shifted association patterns. In this paper, we present a generic image captioning framework that leverages causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Specifically, our approach consists of two variants that utilize either the total effect or the natural direct effect. We incorporate these concepts into the training process, enabling the models to handle counterfactual scenarios and thereby become more generalizable. Extensive experiments on various datasets have demonstrated that our method can effectively reduce hallucinations and increase the model's faithfulness to the images, with high portability for both small-scale and large-scale image-to-text models.



Paperid:371 Poster
Authors:Changqing Lin,Jinhui Pang,Xiaoshuai Hao,Rong Yin,Zixuan Wang,Zhihui Zhang,Jinglin He,HUANG TAI SHENG
Abstract:
Continual graph learning (CGL) is an important and challenging task that aims to extend static GNNs to dynamic task flow scenarios. As one of the mainstream CGL methods, the experience replay (ER) method receives widespread attention due to its superior performance. However, existing ER methods focus on identifying samples by feature significance or topological relevance, which limits their utilization of comprehensive graph data. In addition, the topology-based ER methods only consider local topological information and add neighboring nodes to the buffer, which ignores the global topological information and increases memory overhead. To bridge these gaps, we propose a novel method called Feature-Topology Fusion-based Experience Replay (FTF-ER) to effectively mitigate the catastrophic forgetting issue with enhanced efficiency. Specifically, from an overall perspective to maximize the utilization of the entire graph data, we propose a highly complementary approach including both feature and global topological information, which can significantly improve the effectiveness of the sampled nodes. Moreover, to further utilize global topological information, we propose Hodge Potential Score (HPS) as a novel module to calculate the topological importance of nodes. HPS derives a global node ranking via Hodge decomposition on graphs, providing more accurate global topological information compared to neighbor sampling. By excluding neighbor sampling, HPS significantly reduces buffer storage costs for acquiring topological information and simultaneously decreases training time. Compared with state-of-the-art methods, FTF-ER achieves a significant improvement of 3.6% in AA and 7.1% in AF on the OGB-Arxiv dataset, demonstrating its superior performance in the class-incremental learning setting.
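The abstract does not spell out the Hodge Potential Score, but the standard gradient component of a Hodge decomposition (as in HodgeRank) gives one plausible reading: solve a least-squares problem over an edge-flow graph to obtain node potentials whose ordering induces a global ranking. The sketch below is that generic construction under this assumption, with a toy flow signal.

```python
# Hedged sketch: global node ranking via the gradient component of a Hodge decomposition.
import numpy as np

def hodge_potential_score(num_nodes, edges, flows):
    """edges: list of (i, j); flows: antisymmetric edge values y_ij (e.g., pairwise
    importance differences). Returns node potentials defining a global ranking."""
    B = np.zeros((len(edges), num_nodes))
    for e, (i, j) in enumerate(edges):
        B[e, i], B[e, j] = -1.0, 1.0
    # Gradient component: minimize ||B s - y||^2 over node potentials s.
    s, *_ = np.linalg.lstsq(B, np.asarray(flows, dtype=float), rcond=None)
    return s - s.mean()  # potentials are only defined up to an additive constant

if __name__ == "__main__":
    # Toy graph: flows say node 2 should rank above node 1, which ranks above node 0.
    edges = [(0, 1), (1, 2), (0, 2)]
    print(hodge_potential_score(3, edges, [1.0, 1.0, 2.0]))
```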



Paperid:372 Poster
Authors:Li Xiaochen,Jian Cheng,Ziying Xia,Zichong Chen,Junhao Shi,Zhicheng Dong,Nyima Tashi
Abstract:
Online action detection aims to identify ongoing actions within untrimmed video streams, with extensive applications in real-life scenarios. However, in practical applications, video frames are received sequentially over time and new action categories continually emerge, giving rise to the challenge of catastrophic forgetting - a problem that remains inadequately explored. Generally, in the field of video understanding, researchers address catastrophic forgetting through class-incremental learning. Nevertheless, online action detection is based solely on historical observations, thus demanding higher temporal modeling capabilities for class-incremental learning methods. In this paper, we conceptualize this task as Class-Incremental Online Action Detection (CIOAD) and propose a novel framework, TS-ILM, to address it. Specifically, TS-ILM consists of two key components: task-level temporal pattern extractor and temporal-sensitive exemplar selector. The former extracts the temporal patterns of actions in different tasks and saves them, allowing the data to be comprehensively observed on a temporal level before it is input into the backbone. The latter selects a set of frames with the highest causal relevance and minimum information redundancy for subsequent replay, enabling the model to learn the temporal information of previous tasks more effectively. We benchmark our approach against state-of-the-art class-incremental learning methods applied in the image and video domains on the THUMOS'14 and TVSeries datasets. Our method significantly outperforms the previous approaches.



Paperid:373 Poster
Authors:Xin Lu,Chuanqing Zhuang,Zhengda Lu,Yiqun Wang,Jun Xiao
Abstract:
4D facial expression synthesis is a critical problem in the fields of computer vision and graphics. Current methods lack flexibility and smoothness when simulating the inter-frame motion of expression sequences. In this paper, we propose a frequency-controlled 4D facial expression synthesis method, FC-4DFS. Specifically, we introduce a frequency-controlled LSTM network to generate 4D facial expression sequences frame by frame from a given neutral landmark with a given length. Meanwhile, we propose a temporal coherence loss to enhance the perception of temporal sequence motion and improve the accuracy of relative displacements. Furthermore, we design a Multi-level Identity-Aware Displacement Network based on a cross-attention mechanism to reconstruct the 4D facial expression sequences from landmark sequences. Finally, our FC-4DFS achieves flexible and state-of-the-art generation results for 4D facial expression sequences of different lengths on the CoMA and Florence4D datasets. The code will be available on GitHub.
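One plausible form of the temporal coherence loss mentioned above is to penalize mismatches in frame-to-frame relative displacements rather than absolute positions. The PyTorch sketch below encodes that reading; the paper's exact loss may differ, and the (B, T, N, 3) landmark layout is an assumption.

```python
# Hedged sketch of a temporal coherence loss on landmark sequences.
import torch

def temporal_coherence_loss(pred_seq, gt_seq):
    """pred_seq, gt_seq: (B, T, N, 3) landmark sequences.
    Compares frame-to-frame motion instead of absolute positions."""
    pred_motion = pred_seq[:, 1:] - pred_seq[:, :-1]
    gt_motion = gt_seq[:, 1:] - gt_seq[:, :-1]
    return torch.mean(torch.abs(pred_motion - gt_motion))

if __name__ == "__main__":
    pred = torch.randn(2, 16, 68, 3, requires_grad=True)
    gt = torch.randn(2, 16, 68, 3)
    loss = temporal_coherence_loss(pred, gt)
    loss.backward()
    print(loss.item())
```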



Paperid:374 Poster
Authors:Kin-Chung Chan,Jun Xiao,Hana Lebeta Goshu,Kin-man Lam
Abstract:
The technique of 3D Gaussian splatting (3DGS) has demonstrated its effectiveness and efficiency in rendering photo-realistic images for novel view synthesis. However, 3DGS requires a high density of camera coverage, and its performance inevitably degrades with sparse training views, which significantly restricts its applications in real-world products. In recent years, many researchers have tried to use depth information to alleviate this problem, but the performance of their methods is sensitive to the accuracy of depth estimation. To this end, we propose an efficient method to enhance the performance of 3DGS with sparse training views. Specifically, instead of applying depth maps for regularization, we propose a densification method that generates high-quality point clouds for improved initialization of 3D Gaussians. Furthermore, we propose Systematically Angle of View Sampling (SAOVS), which employs Spherical Linear Interpolation (SLERP) and linear interpolation for side view sampling, to determine unseen views outside the training data for semantic pseudo-label regularization. Experiments show that our proposed method significantly outperforms other promising 3D rendering models on the ScanNet dataset and the LLFF dataset. In particular, compared with the conventional 3DGS method, the PSNR and SSIM performance gains achieved by our method are up to 1.71dB and 0.07, respectively. In addition, the novel view synthesis obtained by our method demonstrates the highest visual quality with fewer distortions.
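SLERP, which SAOVS uses for side-view sampling, is a standard quaternion interpolation; a compact NumPy version is given below together with linear interpolation of camera positions. This illustrates only the interpolation primitives, not the full sampling strategy; the example poses are arbitrary.

```python
# Standard SLERP between unit quaternions, plus linear interpolation of positions.
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1 for t in [0, 1]."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

if __name__ == "__main__":
    q_a = np.array([1.0, 0.0, 0.0, 0.0])                               # identity rotation
    q_b = np.array([np.cos(np.pi / 8), 0.0, np.sin(np.pi / 8), 0.0])   # 45 deg about y
    pos_a, pos_b = np.zeros(3), np.array([1.0, 0.0, 0.0])
    for t in (0.0, 0.5, 1.0):
        print(t, slerp(q_a, q_b, t), (1 - t) * pos_a + t * pos_b)
```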



Paperid:375 Poster
Authors:Xian Zhang,Haokun Wen,Jianlong Wu,Pengda Qin,Hui Xue',Liqiang Nie
Abstract:
Change captioning involves describing the subtle changes between a pair of similar images. Although existing efforts have achieved compelling success, they overlook the potential of multimodal large language models (MLLMs) in tackling this challenging task. In this work, we aim to empower MLLMs with the capability to perceive subtle differences between paired images and enhance their performance in generating change captions. Specifically, we present a diFferentIal-perceptive aNd rEtRieval-augmented MLLM (FINER-MLLM) tailored for this task. In particular, FINER-MLLM leverages the LoRA fine-tuned MLLM's image encoder to extract image patch features, enabling the capture of detailed image information. Subsequently, within the MLLM's feature extraction, typically a Q-Former, FINER-MLLM incorporates dual constraints: the intra-image feature independence constraint and the inter-image feature alignment constraint. These constraints ensure that the features can comprehensively extract subtle visual information within each image and that corresponding features across images align effectively. Last, we introduce retrieval augmentation, which first retrieves a relevant corpus to facilitate the MLLM's decoder, i.e., the LLM, in generating accurate change captions. Extensive experiments on three benchmark datasets, i.e., CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the superiority of our proposed method.
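The two constraints can be read as a decorrelation term within an image and an alignment term across the image pair. The PyTorch sketch below encodes that reading with simple cosine-based losses; this is an assumption about their form, not the paper's exact formulation, and the (B, N, D) query-feature layout is illustrative.

```python
# Hedged sketch of an intra-image independence loss and an inter-image alignment loss.
import torch
import torch.nn.functional as F

def independence_loss(feats):
    """feats: (B, N, D) query features from one image.
    Encourages different queries within an image to be mutually decorrelated."""
    f = F.normalize(feats, dim=-1)
    gram = torch.bmm(f, f.transpose(1, 2))                                 # (B, N, N)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.pow(2).mean()

def alignment_loss(feats_a, feats_b):
    """Encourages corresponding query features of the paired images to stay close."""
    return 1.0 - F.cosine_similarity(feats_a, feats_b, dim=-1).mean()

if __name__ == "__main__":
    fa, fb = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
    print(independence_loss(fa).item(), alignment_loss(fa, fb).item())
```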



Paperid:376 Poster
Authors:Chaofan Huo,Ye Shi,Jingya Wang
Abstract:
Learning the prior knowledge of the 3D human-object spatial relation is crucial for reconstructing human-object interaction from images and understanding how humans interact with objects in 3D space. Previous works learn this prior from datasets collected in controlled environments, but due to the diversity of domains, they struggle to generalize to real-world scenarios. To overcome this limitation, we present a 2D-supervised method that learns the 3D human-object spatial relation prior purely from 2D images in the wild. Our method utilizes a flow-based neural network to learn the prior distribution of the 2D human-object keypoint layout and viewports for each image in the dataset. The effectiveness of the prior learned from 2D images is demonstrated on the human-object reconstruction task by applying the prior to tune the relative pose between the human and the object during the post-optimization stage. To validate and benchmark our method on in-the-wild images, we collect the WildHOI dataset from the YouTube website, which consists of various interactions with 8 objects in real-world scenarios. We conduct the experiments on the indoor BEHAVE dataset and the outdoor WildHOI dataset. The results show that our method achieves almost comparable performance with fully 3D supervised methods on the BEHAVE dataset, even if we have only utilized the 2D layout information, and outperforms previous methods in terms of generality and interaction diversity on in-the-wild images.



Paperid:377 Poster
Authors:Huanpeng Chu,Wei Wu,Chengjie Zang,Kun Yuan
Abstract:
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity. However, their widespread adoption is hindered by the intensive computation required during the iterative denoising process. Post-training quantization (PTQ) presents a solution to accelerate sampling, albeit at the expense of sample quality, especially in low-bit settings. Addressing this, our study introduces a unified Quantization Noise Correction Scheme (QNCD), aimed at minimizing quantization noise throughout the sampling process. We identify two primary quantization challenges: intra and inter quantization noise. Intra quantization noise, mainly exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. Besides, inter quantization noise stems from cumulative quantization deviations across the entire denoising process, altering data distributions step-by-step. QNCD combats these through embedding-derived feature smoothing for eliminating intra quantization noise and an effective runtime noise estimation module for dynamically filtering inter quantization noise. Extensive experiments demonstrate that our method outperforms previous quantization methods for diffusion models, achieving lossless results in W4A8 and W8A8 quantization settings on ImageNet (LDM-4).
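To make the notion of quantization noise concrete, the sketch below applies a generic uniform symmetric fake-quantizer to a tensor of activations and measures the deviation it introduces, showing how the disturbance grows at lower bit-widths. This is a standard PTQ building block, not the QNCD correction scheme; the shapes and bit-widths are illustrative.

```python
# Generic fake-quantization sketch to visualize per-step quantization noise.
import torch

def fake_quantize(x, num_bits=8):
    """Uniform symmetric fake-quantization of a tensor (a common PTQ baseline)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

if __name__ == "__main__":
    act = torch.randn(4, 320, 32, 32)          # stand-in for one denoising step's activations
    for bits in (8, 4):
        noise = fake_quantize(act, bits) - act
        print(f"{bits}-bit quantization noise std: {noise.std().item():.5f}")
```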



Paperid:378 Poster
Authors:Green Rosh,Pawan Prasad B H,LOKESH R BOREGOWDA,Kaushik Mitra
Abstract:
Single image reflection removal is a severely ill-posed problem and it is very hard to separate the desirable transmission and undesirable reflection layers. Most of the existing single image reflection removal methods try to recover the transmission layer by exploiting cues that are extracted only from the given input image. However, there is abundant unutilized information in the form of millions of reflection free images available publicly. Even though this information is easily available, utilizing the same for effectively removing reflections is non-trivial. In this paper, we propose a novel method, termed R2SFD, for improving single image reflection removal using a Semantic Feature Dictionary (SFD) constructed from a database of reflection-free images. The SFD is constructed using a novel Reflection Aware Feature Extractor (RAFENet) that extracts features invariant to the presence of reflections. The SFD and the input image are then passed to another novel network termed SFDNet. This network first extracts RAFENet features from the reflection-corrupted input image, searches for similar features in the SFD, and transfers the semantic content to generate the final output. To further improve reflection removal, we also introduce a Large Scale Reflection Removal (LSRR) dataset consisting of 2650 image pairs comprising of a variety of real world reflection scenarios. The proposed method achieves superior results both qualitatively and quantitatively compared to the state of the art single image reflection removal methods on real public datasets as well as our LSRR dataset.



Paperid:379 Poster
Authors:Xueyuan Chen,Shangzhe Li,Junran Wu
Abstract:
Graph contrastive learning has achieved great success in pre-training graph neural networks without ground-truth labels. Leading graph contrastive learning follows the classical scheme of contrastive learning, forcing the model to identify the essential information from augmented views. However, general augmented views are produced via random corruption or learning, which inevitably leads to semantics alteration. Although domain knowledge guided augmentations alleviate this issue, the generated views are domain specific and undermine the generalization. In this work, motivated by the firm representation ability of sparse models obtained by pruning, we reformulate the problem of graph contrastive learning via contrasting different model versions rather than augmented views. We first theoretically reveal the superiority of model pruning in contrast to data augmentations. In practice, we take the original graph as input and dynamically generate a perturbed graph encoder to contrast with the original encoder by pruning its transformation weights. Furthermore, considering the integrity of node embedding in our method, we are capable of developing a local contrastive loss to tackle the hard negative samples that disturb model training. We extensively validate our method on various benchmarks regarding graph classification via unsupervised and transfer learning. Compared to the state-of-the-art (SOTA) works, better performance can always be obtained by the proposed method.
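The core move, contrasting the original encoder with a weight-pruned copy of itself instead of contrasting augmented views, can be sketched as below. The MLP encoder stands in for a GNN, the magnitude-pruning rule and InfoNCE loss are common defaults rather than the paper's exact choices, and the pooled-feature input is an assumption.

```python
# Hedged sketch: contrast an encoder with a magnitude-pruned copy of itself.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def magnitude_prune(model, ratio=0.3):
    """Return a perturbed copy with the smallest-magnitude weights zeroed."""
    pruned = copy.deepcopy(model)
    for p in pruned.parameters():
        if p.dim() > 1:
            k = int(ratio * p.numel())
            if k > 0:
                thresh = p.data.abs().flatten().kthvalue(k).values
                p.data[p.data.abs() <= thresh] = 0.0
    return pruned

def info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

if __name__ == "__main__":
    encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))  # stand-in for a GNN
    x = torch.randn(32, 64)                       # stand-in for pooled graph features
    loss = info_nce(encoder(x), magnitude_prune(encoder)(x))
    loss.backward()
    print(loss.item())
```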



Paperid:380 Poster
Authors:Kexiang Feng,Chuanmin Jia,Siwei Ma,Wen Gao
Abstract:
The widespread adoption of bio-inspired cameras has catalyzed the development of spike-based intelligent applications. Although their innovative imaging principle allows for functionality in extreme scenarios, the intricate nature of spike signals poses processing challenges to achieve the desired performance. Traditional methods struggle to deliver visual perception and temporal prediction simultaneously, and they lack the flexibility needed for diverse intelligent applications. To address this problem, we analyze the spatio-temporal correlations between spike information at different temporal scales. A novel spike processing method is introduced for compact spike representations that utilizes intra-scale correlation for higher predictive accuracy. Additionally, we propose a multi-scale spatio-temporal aggregation unit (MSTAU) that further leverages inter-scale correlation to achieve efficient perception and precise prediction. Experimental results show noticeable improvements in scene reconstruction and object classification, with increases of 3.49 dB in scene reconstruction quality and 2.20% in accuracy, respectively. Besides, the proposed method accommodates different visual applications by switching analysis models, offering a novel perspective for spike processing.



Paperid:381 Poster
Authors:Xinyue Liu,Jiahui Wan,Linlin Zong,Bo Xu
Abstract:
Open-ended VideoQA presents a significant challenge due to the absence of fixed options, requiring the identification of the correct answer from a vast pool of candidate answers. Previous approaches typically apply a classifier or similarity comparison on the fused features to yield predictions directly, lacking coarse-to-fine filtering over the numerous candidates. Gradually refining the probability distribution of candidates can achieve more precise predictions. Thus, we propose the DiffAns model, which integrates the diffusion model to handle the open-ended VideoQA task, simulating the gradual process by which humans answer open-ended questions. Specifically, we first diffuse the true answer label into a random distribution (forward process). And under the guidance of an answer-aware condition generated from the video and question, the model iteratively denoises to obtain the correct probability distribution (backward process). This equips the model with the capability to progressively refine the random probability distribution of candidates, ultimately predicting the correct answer. We conduct experiments on three challenging open-ended VideoQA datasets, surpassing existing SoTA methods. Extensive experiments further explore and analyse the impact of each module, as well as the design of the diffusion model, demonstrating the effectiveness of DiffAns. Our code will be available.



Paperid:382 Poster
Authors:Jiongming Qin,Fei LUO,Tuo Cao,Wenju Xu,Chunxia Xiao
Abstract:
Prior neural radiance fields often struggle to preserve high-frequency textures in urban and aerial large-scale scenes due to insufficient model capacity on the scene surface. This is attributed to their sampling locations or grid vertices falling in empty areas. Additionally, most models do not consider the drastic changes in distances. To address these issues, we propose a novel high-frequency surface shell radiance field, which uses depth-guided information to create a shell enveloping the scene surface under the current view, and then samples conic frustums on this shell to render high-frequency textures. Specifically, our method comprises three parts. Initially, we propose a strategy that fuses voxel grids with distance-scale information to generate a coarse scene at different distance scales. Subsequently, we construct a shell based on the depth information to compensate for texture details not captured by voxels. Finally, smoothing and denoising post-processing further improves the rendering quality. Extensive scene experiments and ablation experiments demonstrate that our method achieves clear improvements in high-frequency textures at different distance scales and outperforms the state-of-the-art methods.



Paperid:383 Poster
Authors:Feng Zhu,Xinxing Yang,Longfei Li,JUN ZHOU
Abstract:
Cross-Domain Recommendation (CDR) has been proposed to improve the recommendation accuracy in the target domain (the sparser dataset) by benefiting from the auxiliary information transferred or the knowledge learned from one or many source domains (the denser datasets). However, most of the existing CDR approaches still suffer from the problem of negative transfer caused by undifferentiated knowledge transfer, and thus the recommendation accuracy in some domains, especially in the sparser domains, is still too low, which is not practical in real application scenarios. To address this problem, we propose a novel Active Masked Attention framework, i.e., AMA-CDR, for many-to-many CDR scenarios. Our AMA-CDR pursues a higher goal for CDR approaches, i.e., improving the recommendation performance in the target domain to achieve a practically usable level, which is meaningful and challenging in real CDR systems. Specifically, AMA-CDR adopts an end-to-end graph embedding to reduce the objective distortion between graph embedding and embedding combination. More importantly, we propose an active mask for the embedding combination to ease negative transfer, which leverages both the prior knowledge, i.e., data density, and the posterior knowledge, i.e., sample uncertainty. Extensive experiments conducted on two public datasets demonstrate that our proposed AMA-CDR models significantly outperform the state-of-the-art approaches and achieve the new goal.



Paperid:384 Poster
Authors:Ling Zhang,Yidong Ma,Zhi Jiang,Weilei He,Zhongyun Bao,Gang Fu,Wenju Xu,Chunxia Xiao
Abstract:
Recently, learning-based methods have made significant progress for image specular highlight removal. However, many of these approaches treat all the image pixels as spatially consistent, overlooking the negative impact of invalid pixels on feature reconstruction. This oversight often leads to undesirable outcomes, such as color distortion or residual highlights. In this paper, we propose a novel image specular highlight removal network called HighlightRNet, which utilizes valid pixels as references to reconstruct the highlight-free image. To achieve this, we introduce a context-aware fusion block (CFBlock) that aggregates information in four directions, effectively capturing global contextual information. Additionally, we introduce a location-aware feature transformation module (LFTModule) to adaptively learn the valid pixels for feature reconstruction, thereby avoiding information errors caused by invalid pixels. With these modules, our method can produce high-quality highlight-free results without color distortion and highlight residual. Furthermore, we develop a multiple light image-capturing system to construct a large-scale highlight dataset called NSH, which exhibits minimal misalignment in image pairs and minimal brightness variation in non-highlight regions. Experimental results on various datasets demonstrate the superiority of our method over state-of-the-art methods, both qualitatively and quantitatively.



Paperid:385 Poster
Authors:Yuntao Wang,Jinpu Zhang,Ruonan Wei,Wenbo Gao,Yuehuan Wang
Abstract:
Cross-area evaluation poses a significant challenge for ground-to-aerial geo-localization (G2AGL), in which the training and testing data are captured from entirely distinct areas. However, current methods struggle in cross-area evaluation due to their emphasis solely on learning global information from single-scale features. Some efforts alleviate this problem but rely on complex and specific technologies like pre-processing and hard sample mining. To this end, we propose a pure end-to-end solution, free from task-specific techniques, termed the Multi-scale Feature Representation Generalization Network (MFRGN) to improve generalization. Specifically, we introduce multi-scale features and explicitly utilize them for G2AGL. Furthermore, we devise an efficient global-local information module with two flows to bolster feature representations. In the global flow, we present a lightweight Self and Cross Attention Module (SCAM) to efficiently learn global embeddings. In the local flow, we develop a Global-Prompt Attention Block (GPAB) to capture discriminative features under the global embeddings as prompts. As a result, our approach generates robust descriptors representing multi-scale global and local information, thereby enhancing the model's invariance to scene variations. Extensive experiments on benchmarks show our MFRGN achieves competitive performance in same-area evaluation and improves cross-area generalization by a significant margin compared to SOTA methods.



Paperid:386 Poster
Authors:Yang Fang,Xuefeng Rao,Xinbo Gao,Weisheng Li,Min Zijian
Abstract:
Martian terrain segmentation plays a crucial role in autonomous navigation and safe driving of Mars rovers as well as global analysis of Martian geological landforms. However, most deep learning-based segmentation models cannot effectively handle the challenges of highly unstructured and unbalanced terrain distribution on the Martian surface, thus leading to inadequate adaptability and generalization ability. In this paper, we propose a novel multi-view Martian Terrain Segmentation framework (MTSNet) by developing an efficient Martian Terrain text-Guided Segment Anything Model (MTG-SAM) and combining it with a tailored Local Terrain Feature Enhancement Network (LTEN) to capture intricate terrain details. Specifically, the proposed MTG-SAM is equipped with a Terrain Context attention Adapter Module (TCAM) to efficiently and effectively unleash the model's adaptability and transferability on the Mars-specific terrain distribution. Then, the Local Terrain Feature Enhancement Network (LTEN) is designed to compensate for the limitations of MTG-SAM in capturing the fine-grained local terrain features of the Mars surface. Afterwards, a simple yet efficient Gated Fusion Module (GFM) is introduced to dynamically merge the global contextual features from the MTG-SAM encoder and the local refined features from the LTEN module for comprehensive terrain feature learning. Moreover, the proposed MTSNet enables terrain-specific text as prompts, resolving the efficiency issue of existing methods that require costly annotation of bounding boxes or foreground points. Experimental results on the AI4Mars and ConeQuest datasets demonstrate that our proposed MTSNet effectively learns the unique Martian terrain feature distribution and achieves state-of-the-art performance on multi-view terrain segmentation from the perspectives of both the Mars rover and satellite remote sensing.
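The Gated Fusion Module is described as a simple dynamic merge of global and local features; a generic per-pixel gating layer of that kind is sketched below in PyTorch. It is one common realization of gated fusion, not necessarily the paper's exact design; the channel counts are illustrative.

```python
# Generic gated fusion sketch (one common realization, not necessarily the paper's GFM).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses global contextual features with locally refined features via a learned gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feat, local_feat):
        g = self.gate(torch.cat([global_feat, local_feat], dim=1))  # per-pixel, per-channel gate
        return g * global_feat + (1.0 - g) * local_feat

if __name__ == "__main__":
    fuse = GatedFusion(64)
    out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
    print(out.shape)
```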



Paperid:387 Poster
Authors:Yi LIU,Xinyi LI,Shuai Wenjing
Abstract:
Neural Radiance Fields (NeRFs) demonstrate high efficiency in generating photo-realistic novel views. Recent studies have made initial attempts at 3D inpainting with NeRF. However, these works have only been validated on data collected from a narrow range of views, and their performance degrades for wide view ranges. To address this problem, we propose a novel NeRF framework to remove the obstacle and reproduce occluded areas in high quality for both wide and narrow view ranges. In this framework, we design a region coding network to carry out object segmentation. With the depth information, the segmentation component transfers a single obstacle mask to other views with high accuracy. By referring to the segmentation results, we introduce an innovative view selection mechanism to reconstruct the occluded area using supplementary information from multiple views and 2D inpainting. We also contribute to the evaluation of 3D scene de-occlusion by introducing a dataset including views captured over a wide range and in pairs with and without the obstacle object for comparison. We evaluate our framework on both narrow- and wide-range datasets through quantitative measurement and qualitative visual comparison, which confirm the competitive and superior performance of our framework.



Paperid:388 Poster
Authors:Ziyu Zhao,Pingping Cai,Canyu Zhang,Xiaoguang Li,Song Wang
Abstract:
Cross-modal 2D–3D point cloud semantic segmentation on few-shot-based learning provides a practical approach for borrowing matured 2D domain knowledge into the 3D segmentation model, which reduces the reliance on laborious 3D annotation work and improves generalization to new categories. However, previous methods use single-view point cloud generation algorithms to bridge the gap between 2D images and 3D point clouds, leaving the incomplete geometry of an object or scene due to occlusions. To address this issue, we propose a novel view synthesis cross-modal few-shot point cloud semantic segmentation network. It introduces the color and depth inpainting to generate multi-view images and masks, which compensate for the absent depth information of generated point clouds. Additionally, we propose a cross-modal embedding network to bridge the domain features between synthesized and original, collected 3D data, and a weighted prototype network is employed to balance the impact of multi-view images and enhance the segmentation performance. Extensive experiments on two benchmarks show the superiority of our method by outperforming the existing cross-modal few-shot 3D segmentation methods.



Paperid:389 Poster
Authors:Zhuoling Li,Yong Wang,Kaitong Li
Abstract:
Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only abstract information, which is insufficient to capture the visual details present in images. As a result, this vision-semantics alignment is inherently biased, leading to sub-optimal integration outcomes. In this paper, we avoid this biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the same image by both the few-shot encoder and CLIP's vision encoder. This alignment is accomplished through a linear layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods.



Paperid:390 Poster
Authors:Mingcan Xiang,Steven Jiaxun Tang,Qizheng Yang,Hui Guan,Tongping Liu
Abstract:
In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams—such as images, video, and sensor data—is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy performance across multiple tasks. To tackle the challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components' sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component in order to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task's robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.
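Learnable soft thresholds for pruning can be illustrated with a single linear layer whose weights pass through a soft-thresholding function controlled by a trainable parameter, so sparsity is optimized jointly with the weights. The sketch below shows that mechanism only; the threshold parameterization and its per-component placement in AdapMTL are assumptions.

```python
# Hedged sketch: a linear layer pruned by a learnable soft threshold.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftThresholdLinear(nn.Module):
    """Weights are soft-thresholded by a trainable scalar, so the sparsity level
    can be co-optimized with the weights during training."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.raw_threshold = nn.Parameter(torch.tensor(-4.0))  # sigmoid keeps it small and positive

    def _threshold(self):
        return torch.sigmoid(self.raw_threshold) * self.weight.abs().max().detach()

    def forward(self, x):
        t = self._threshold()
        w = torch.sign(self.weight) * torch.relu(self.weight.abs() - t)  # soft thresholding
        return F.linear(x, w, self.bias)

    @torch.no_grad()
    def sparsity(self):
        return (self.weight.abs() <= self._threshold()).float().mean().item()

if __name__ == "__main__":
    layer = SoftThresholdLinear(128, 64)
    print(layer(torch.randn(4, 128)).shape, "sparsity:", layer.sparsity())
```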



Paperid:391 Poster
Authors:Xitong Ling,Minxi Ouyang,Yizhi Wang,Xinrui Chen,Renao Yan,Hongbochu,Junru Cheng,Tian Guan,Xiaoping Liu,Sufang Tian,Yonghong He
Abstract:
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interest (ROI) level localization will assist pathologists in clinical diagnosis. With a gigapixel resolution and a scarcity of fine-grained annotations, WSIs are difficult to classify directly. In the field of weakly supervised learning, multiple instance learning (MIL) serves as a promising approach to solving WSI classification tasks. Currently, a prevailing aggregation strategy is to apply an attention mechanism as a measure of the importance of each instance for further classification. Notwithstanding, the attention mechanism fails to capture inter-instance information, and the self-attention mechanism can cause quadratic computational complexity issues. To address these challenges, we propose an agent aggregator with a mask denoise mechanism for multiple instance learning, termed AMD-MIL. The agent token represents an intermediate variable between the query and key for implicit computation of instance importance. The mask and denoising terms are also learnable matrices mapped from the agent-aggregated value, which first dynamically mask out some low-contribution instance representations and then eliminate the relative noise introduced during the masking process. AMD-MIL can indirectly achieve more reasonable attention allocation by adjusting feature representations, thereby sensitively capturing micro-metastases in cancer and achieving better interpretability. Our extensive experiments on the CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG datasets show our method's superiority over existing state-of-the-art approaches. The code will be available upon acceptance.



Paperid:392 Poster
Authors:Xiang Ma,Xuemei Li,Lexin Fang,Caiming Zhang
Abstract:
Many contrastive learning based models have achieved advanced performance in image-text matching tasks. The key of these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities are from different models or modules, and there is a significant modality gap. Directly interacting such embeddings lacks rationality and may capture inaccurate correlation. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimension to ensure the correlation calculation is based on interactions of similar information. (2) The spatial constraints of inter- and intra-modalities unmatched pairs are introduced to ensure the effectiveness of semantic alignment of the model. Besides, a sparse correlation algorithm is proposed to select strong correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlation. Extensive experiments demonstrate the superiority of DIAS, achieving 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.



Paperid:393 Poster
Authors:Yu Liao,Xinfeng Zhang,Rui Yang,Jianwei Tao,Bai Liu,Zhipeng Hu,Shuang Wang,Zeng Zhao
Abstract:
In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application in Specific Domain Image-Text Retrieval (SDITR) such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). Due to the unique data characteristics in specific scenarios, the primary challenge is to leverage discriminative fine-grained local information for improved mapping of images and text into a shared space. Current approaches interact with all multimodal local features for alignment, implicitly focusing on discriminative local information to distinguish data differences, which may bring noise and uncertainty. Furthermore, their VLP feature extractors like CLIP often focus on instance-level representations, potentially reducing the discriminability of fine-grained local features. To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation. Specifically, we introduce a Key Local information Selection and Fusion (KLSF) module that utilizes hidden knowledge from the VLP model to interpretably select and fuse key local information. Secondly, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (text), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal interaction alignment. To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR. Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets.



Paperid:394 Poster
Authors:Weiying Xie,Mei Yuan,Jitao Ma,Yunsong Li
Abstract:
Deep Convolutional Neural Networks (CNNs) have demonstrated excellent performance in various multimedia application scenarios. However, complex models often require significant computational resources and energy costs. Therefore, CNN compression is crucial for addressing the deployment challenges of multimedia applications on resource-constrained edge devices. However, existing CNN channel pruning strategies primarily focus on the "weights" or "activations" of the model, overlooking its "interpretability" information. In this paper, we explore CNN pruning strategies from the perspective of model interpretability. We model the correspondence between channel feature maps and interpretable visual perception based on class saliency maps, aiming to assess the contribution of each channel to the desired output. Additionally, we utilize the Discrete Wavelet Transform (DWT) to capture the global features and structure of class saliency maps. Based on this, we propose a Channel Spatial Dependability (CSD) metric, evaluating the importance and contribution of channels in a bidirectional manner to guide model quantization and pruning. And we dynamically adjust the pruning rate of each layer based on performance changes, in order to achieve more accurate and efficient adaptive pruning. Experimental results demonstrate that our method achieves significant results across a range of different networks and datasets. For instance, we achieved 51.3% pruning on the ResNet-56 model while maintaining an accuracy of 94.16%, outperforming feature-map- or weight-based pruning and other state-of-the-art (SOTA) methods.



Paperid:395 Poster
Authors:Zhiyuan Ma,Guoli Jia,Biqing Qi,Bowen Zhou
Abstract:
Recently, stable diffusion (SD) models have typically flourished in the field of image synthesis and personalized editing, with a range of photorealistic and unprecedented images being successfully generated. As a result, widespread interest has been ignited in developing and using various SD-based tools for visual content creation. However, the exposure of AI-created content on public platforms could raise both legal and ethical risks. In this regard, the traditional methods of adding watermarks to the already generated images (i.e. post-processing) may face a dilemma (e.g., being erased or modified) in terms of copyright protection and content monitoring, since powerful image inversion and text-to-image editing techniques have been widely explored in SD-based methods. In this work, we propose a Safe and high-traceable Stable Diffusion framework (namely Safe-SD) to adaptively implant graphical watermarks (e.g., QR codes) into the imperceptible structure-related pixels during the generative diffusion process, supporting text-driven invisible watermarking and detection. Different from previous high-cost injection-then-detection training frameworks, we design a simple and unified architecture, which makes it possible to simultaneously train watermark injection and detection in a single network, greatly improving the efficiency and convenience of use. Moreover, to further support text-driven generative watermarking and deeply explore its robustness and high traceability, we elaborately design a λ-sampling and λ-encryption algorithm to fine-tune a latent diffuser wrapped by a VAE for balancing high-fidelity image synthesis and high-traceable watermark detection. We present our quantitative and qualitative results on the representative datasets LSUN, COCO and FFHQ, demonstrating the state-of-the-art performance of Safe-SD and showing that it significantly outperforms previous approaches.



Paperid:396 Poster
Authors:Wei Feng,Dongyuan Wei,Qianqian Wang,Bo Dong,Quanxue Gao
Abstract:
Multi-view clustering (MVC) methods based on non-negative matrix factorization (NMF) have gained popularity owing to their ability to provide interpretable clustering results. However, these NMF-based MVC methods generally process each view independently and thus ignore the potential relationship between views. Besides, they are limited in their ability to capture nonlinear data structures. To overcome these weaknesses and inspired by deep learning, we propose a multi-view clustering method based on deep non-negative tensor factorization (MVC-DNTF). With deep tensor factorization, our method can well exploit the spatial structure of the original data and is capable of extracting deeper, nonlinear features embedded in different views. To further extract the complementary information of different views, we adopt a weighted tensor Schatten p-norm regularization term. An optimization algorithm is developed to effectively solve the MVC-DNTF objective. Extensive experiments are performed to demonstrate the effectiveness and superiority of our method.



Paperid:397 Poster
Authors:Jinyu Cai,Yunhe Zhang,Zhoumin Lu,Wenzhong Guo,See-Kiong Ng
Abstract:
Graph anomaly detection (GAD) aims to identify anomalous graphs that significantly deviate from other ones, which has raised growing attention due to the broad existence and complexity of graph-structured data in many real-world scenarios. However, existing GAD methods usually execute with centralized training, which may lead to privacy leakage risk in some sensitive cases, thereby impeding collaboration among organizations seeking to collectively develop robust GAD models. Although federated learning offers a promising solution, the prevalent non-IID problems and high communication costs present significant challenges, particularly pronounced in collaborations with graph data distributed among different participants. To tackle these challenges, we propose an effective federated graph anomaly detection framework (FGAD). We first introduce an anomaly generator to perturb the normal graphs to be anomalous, and train a powerful anomaly detector by distinguishing generated anomalous graphs from normal ones. Then, we leverage a student model to distill knowledge from the trained anomaly detector (teacher model), which aims to maintain the personality of local models and alleviate the adverse impact of non-IID problems. Moreover, we design an effective collaborative learning mechanism that facilitates the personalization preservation of local models and significantly reduces communication costs among clients. Empirical results of the GAD tasks on non-IID graphs (single/multi datasets) from diverse domains demonstrate the superiority and efficiency of the proposed FGAD method.



Paperid:398 Poster
Authors:Quanjiang Li,Tingjin Luo,Mingdie Jiang,Jiahui Liao,Zhangqi Jiang
Abstract:
Due to the explosive growth in data sources and label categories, multi-view multi-label learning has garnered widespread attention. However, multi-view multi-label data often exhibits incomplete features and few labeled instances alongside a huge number of unlabeled instances, due to the technical limitations of data collection and the high annotation cost of manual labeling in practice. Learning under such simultaneous missingness of view features and labels is crucial but rarely studied, particularly when the labeled samples with full observations are limited. In this paper, we tackle this problem by proposing a novel Deep Incomplete Multi-View Semi-Supervised Multi-Label Learning method (DIMvSML). Specifically, to improve high-level representations of missing features, DIMvSML firstly employs deep graph networks to recover the feature information with structural similarity relations. Meanwhile, we design structure-specific deep feature extractors to obtain discriminative information and preserve the cross-view consistency for the recovered data with an instance-level contrastive loss. Furthermore, to eliminate the bias in the risk estimate that semi-supervised multi-label methods minimise, we design a safe risk estimate framework with an unbiased loss and improve its empirical performance by using pseudo-labels of unlabeled data. Besides, we provide both the theoretical proof of better estimate variance and an intuitive explanation of our debiased framework. Finally, extensive experimental results on public datasets validate the superiority of DIMvSML compared with state-of-the-art methods.



Paperid:399 Poster
Authors:Xincheng Ju,Dong Zhang,Suyang Zhu,Junhui Li,Shoushan Li,Guodong Zhou
Abstract:
Conversation is a common form of human communication that includes extensive emotional interaction. Traditional approaches have focused on studying emotions and their underlying causes in conversations. They try to address two issues: what emotions are present in the dialogue and what causes these emotions. However, these works often overlook the bidirectional nature of emotional interaction in dialogue: utterances can evoke emotions (cause), and emotions can also lead to certain utterances (consequence). Therefore, we propose a new issue: what consequences arise from these emotions? This leads to the introduction of a new task called Emotion Consequence Forecasting in CONversations (ECFCON). In this work, we first propose a corresponding dialogue-level dataset. Specifically, we select 2,780 video dialogues for annotation, totaling 39,950 utterances. Out of these, 12,391 utterances contain emotions, and 8,810 of these have discernible consequences. Then, we benchmark this task by conducting experiments from the perspectives of traditional methods, generalized LLM prompting methods, and clue-driven hybrid methods. Both our dataset and benchmark codes are openly accessible to the public.



Paperid:400 Poster
Authors:Wei Yang,Qingchen Yang
Abstract:
Whether it is an e-commerce platform or a short video platform, the effective use of multi-modal data plays an important role in the recommendation system. More and more researchers are exploring how to effectively use multimodal signals to entice more users to buy goods or watch short videos. Some studies have added multimodal features as side information to the model and achieved certain results. In practice, the purchase behavior of users mainly depends on some subjective intentions of users. However, it is difficult for neural networks to effectively process noisy information and extract high-level intention information. To investigate the benefits of latent intentions and leverage them effectively for recommendation, we propose a Multimodal-aware Multi-intention Learning method for recommendation (MMIL). Specifically, we establish the relationship between intention and the recommendation objective based on a probability formula, and propose a multi-intention recommendation optimization objective which can avoid intention overfitting. We then construct an intent representation learner to learn accurate multiple intent representations. Further, considering the close relationship between user intent and multimodal signals, we introduce modal attention mechanisms to learn modality-aware intent representations. In addition, we design a multi-intention contrast module to assist the learning of multiple intention representations. On three real-world datasets, the proposed MMIL method outperforms other advanced methods. The effectiveness of intention modeling and the intention contrast module is verified by comprehensive experiments.



Paperid:401 Poster
Authors:Shiye Wang,Changsheng Li,Jialin Tang,Xing Gong,Ye Yuan,Guoren Wang
Abstract:
Parameter-Efficient Tuning (PET) for pre-trained deep models (e.g., transformers) holds significant potential for domain incremental learning (DIL). Recent prevailing approaches resort to prompt learning, which typically involves learning a small number of prompts for each domain to avoid the issue of catastrophic forgetting. However, previous studies have pointed out that prompt-based methods are often challenging to optimize, and their performance may vary non-monotonically with the number of trainable parameters. In contrast to previous prompt-based DIL methods, we put forward an importance-aware shared parameter subspace learning method for domain incremental learning, on the basis of low-rank adaptation (LoRA). Specifically, we propose to incrementally learn a domain-specific and a domain-shared low-rank parameter subspace for each domain, in order to effectively decouple the parameter space and capture shared information across different domains. Meanwhile, we present a momentum update strategy for learning the domain-shared subspace, allowing for the smooth accumulation of knowledge in the current domain while mitigating the risk of forgetting the knowledge acquired from previous domains. Moreover, given that domain-shared information might hold varying degrees of importance across different domains, we design an importance-aware mechanism that adaptively assigns an importance weight to the domain-shared subspace for the corresponding domain. Finally, we devise a cross-domain contrastive constraint to encourage domain-specific subspaces to capture distinctive information within each domain effectively, and enforce orthogonality between domain-shared and domain-specific subspaces to minimize interference between them. Extensive experiments on image domain incremental datasets demonstrate the effectiveness of the proposed method in comparison to related state-of-the-art methods.
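The momentum update for the domain-shared low-rank subspace is only described in words; a minimal sketch under a standard exponential-moving-average reading (shared LoRA factors drift slowly toward the current domain's factors) is shown below. The momentum value and parameter shapes are hypothetical.

```python
import torch

@torch.no_grad()
def momentum_update(shared_lora, current_lora, momentum=0.99):
    """EMA-style accumulation of domain-shared LoRA factors: a high momentum
    preserves previous-domain knowledge. The exact rule in the paper may
    differ; this is an assumed exponential moving average."""
    for p_shared, p_cur in zip(shared_lora, current_lora):
        p_shared.mul_(momentum).add_(p_cur, alpha=1.0 - momentum)

# hypothetical rank-4 LoRA factors (A, B) for a 768-dim linear layer
shared = [torch.zeros(768, 4), torch.zeros(4, 768)]
current = [torch.randn(768, 4), torch.randn(4, 768)]
momentum_update(shared, current)
print(shared[0].abs().mean())   # shared factors move slightly toward the current domain
```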



Paperid:402 Poster
Authors:chunxiao Li,Shuyang Wang,Xuejing Kang,Anlong Ming
Abstract:
Temporal Automatic White Balance (TAWB) corrects the color cast within each frame while ensuring consistent illumination across consecutive frames. Unlike conventional AWB, TAWB has received limited research attention for an extended period. However, the growing popularity of short-form videos has increased the focus on video color experiences. To further advance research on TAWB, we aim to address the bottlenecks associated with datasets, models, and benchmarks. 1) Dataset challenge: Currently, only one TAWB dataset (BCC), captured with a single camera, is available. It lacks temporal continuity due to challenges in capturing realistic illuminations and dynamic raw data. In response, we meticulously designed an acquisition strategy based on the actual distribution pattern of illuminations and created a comprehensive TAWB dataset named CTA comprising 6 cameras that offer 12K continuous illuminations. Furthermore, we employed video frame interpolation techniques, extending the captured static raw data into dynamic form and ensuring continuous illumination. 2) Model challenge: Both of the two prevailing TAWB methods rely on LSTM. However, the fixed gating mechanism of LSTM often fails to adapt to varying content or illuminations, resulting in unstable illumination estimation. In response, we propose CTANet, which integrates cross-frame attention and RepViT for self-adjustment to content and illumination variations. Additionally, the mobile-friendly design of RepViT enhances the portability of CTANet. 3) Benchmark challenge: To date, there is no benchmark of TAWB methods across illumination and camera types. To address this, we propose a benchmark by conducting a comparative analysis of 8 cutting-edge AWB and TAWB methods and CTANet across 3 typical illumination scenes and 7 cameras from two representative datasets. Our dataset and code are available in the supplementary material.



Paperid:403 Poster
Authors:Weitian Zhang,Yichao Yan,Yunhui Liu,Xingdong Sheng,Xiaokang Yang
Abstract:
This paper aims to introduce 3D Gaussians for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: 1) the unstructured nature of 3D Gaussians makes it incompatible with current generation pipelines; 2) the animation of 3D Gaussians in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we propose a novel avatar generation method named $E^{3}$Gen to effectively address these challenges. First, we propose a novel generative UV features representation that encodes unstructured 3D Gaussians onto a structured 2D UV space defined by the SMPLX parametric model. This novel representation not only preserves the representation ability of the original 3D Gaussians but also introduces a shared structure among subjects to enable generative learning of the diffusion model. To tackle the second challenge, we propose a part-aware deformation module to achieve robust and accurate full-body expressive pose control. Extensive experiments demonstrate that our method achieves superior performance in avatar generation and enables expressive full-body pose control and editing.



Paperid:404 Poster
Authors:Xin Chen,Bin Wang,jinzheng jiang,Kunkun Zhang,Yongsheng Gao
Abstract:
Fine-grained leaf image retrieval (FGLIR) is a new unsupervised pattern recognition task in content-based image retrieval (CBIR). It aims to distinguish varieties/cultivars of leaf images within a certain plant species and is more challenging than the general leaf image retrieval task due to the inherently subtle differences across different cultivars. In this study, we investigate, for the first time, how to mine spatial structure and contextual information from the activations of the convolutional layers of CNN networks for FGLIR. To achieve this goal, we design a novel geometrical structure, named Triplet Patch-Pairs Composite Structure (TPCS), consisting of three symmetric patch pairs segmented from the leaf images in different orientations. We extract a CNN feature map for each patch in the TPCS and measure the difference between the feature maps of each patch pair to construct a local deep self-similarity descriptor. By varying the size of the TPCS, we can yield multi-scale deep self-similarity descriptors. The final aggregated local deep self-similarity descriptors, named Structural Deep Patch Representation (SDePR), not only encode the spatial structure and contextual information of leaf images in the deep feature domain, but are also invariant to geometrical transformations. Extensive experiments applying our SDePR method to challenging public FGLIR tasks show that our method outperforms state-of-the-art handcrafted visual features and deep retrieval models.
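The local deep self-similarity descriptor is built from differences between the feature maps of symmetric patch pairs; the sketch below shows that single step for one pair, using a stand-in convolutional trunk and per-channel pooling of the absolute difference. The backbone and pooling choice are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                       # stand-in for a pretrained CNN trunk
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)

def pair_self_similarity(patch_a, patch_b):
    """Descriptor for one symmetric patch pair: per-channel pooled absolute
    difference of their feature maps (backbone and pooling are assumed)."""
    with torch.no_grad():
        fa, fb = backbone(patch_a), backbone(patch_b)
    return (fa - fb).abs().mean(dim=(2, 3))     # (B, C) descriptor

patch_a = torch.rand(1, 3, 32, 32)   # two symmetric patches cut from one leaf image
patch_b = torch.rand(1, 3, 32, 32)
print(pair_self_similarity(patch_a, patch_b).shape)   # torch.Size([1, 32])
```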



Paperid:405 Poster
Authors:Weibang Jiang,Yu-Ting Lan,Bao-liang Lu
Abstract:
Emotion recognition based on electroencephalogram (EEG) has garnered increasing attention in recent years due to the non-invasiveness and high reliability of EEG measurements. Despite the promising performance achieved by numerous existing methods, several challenges persist. First, there is the challenge of emotional label noise, stemming from the assumption that emotions remain consistently evoked and stable throughout the entirety of video observation. Such an assumption proves difficult to uphold in practical experimental settings, leading to discrepancies between EEG signals and anticipated emotional states. In addition, there is the need to comprehensively capture the temporal-spatial-spectral characteristics of EEG signals and to cope with low signal-to-noise ratio (SNR) issues. To tackle these challenges, we propose a comprehensive pipeline named REmoNet, which leverages novel self-supervised techniques and multi-regularized co-learning. Two self-supervised methods, including masked channel modeling via temporal-spectral transformation and emotion contrastive learning, are introduced to facilitate the comprehensive understanding and extraction of emotion-relevant EEG representations during pre-training. Additionally, fine-tuning with multi-regularized co-learning exploits feature-dependent information through intrinsic similarity, resulting in mitigating emotional label noise. Experimental evaluations on two public datasets demonstrate that our proposed approach, REmoNet, surpasses existing state-of-the-art methods, showcasing its effectiveness in simultaneously addressing raw EEG signals and noisy emotional labels.



Paperid:406 Poster
Authors:Yihao Liu,Feng Xue,Anlong Ming,Mingshuai Zhao,Huadong Ma,Nicu Sebe
Abstract:
In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches face challenges in maintaining consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of samples for training, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes without needing extensive training data and GPU clusters. First, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of conventional metric bins and enables better adaptation to large depth gaps between scenes during training. Second, we propose a "divide and conquer" solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, Campus Depth, to evaluate depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on most never-before-seen datasets, especially maintaining consistent accuracy across indoor and outdoor scenes. The code can be found in the supplementary material.



Paperid:407 Poster
Authors:Wenbin Wang,Liang Ding,Li Shen,Yong Luo,Han Hu,Dacheng Tao
Abstract:
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for understanding human sentiment. Most existing works rely on superficial information, neglecting the incorporation of contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs), thereby restricting their ability to achieve better MSA. In this paper, we propose a plug-in framework named WisdoM, to leverage the contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and corresponding texts, simultaneously generating pertinent context. Besides, to reduce the noise in the context, we design a training-free contextual fusion mechanism. We evaluate WisdoM on both the aspect-level and sentence-level MSA tasks on the Twitter2015, Twitter2017, and MSED datasets. Experiments on three MSA benchmarks with several advanced LVLMs show that our approach brings consistent and significant improvements (up to +6.3% F1 score).
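The training-free contextual fusion mechanism is not spelled out; one plausible reading, blending sentiment logits computed with and without the LVLM-generated context using a confidence-dependent weight, is sketched below. The weighting heuristic is an assumption, not WisdoM's published rule.

```python
import torch
import torch.nn.functional as F

def contextual_fusion(logits_plain, logits_context, alpha=None):
    """Training-free fusion of predictions with/without world-knowledge context.
    If alpha is None, the context branch is weighted by its own confidence
    (max softmax probability) -- an assumed heuristic."""
    if alpha is None:
        alpha = F.softmax(logits_context, dim=-1).max(dim=-1, keepdim=True).values
    return (1 - alpha) * logits_plain + alpha * logits_context

logits_plain = torch.tensor([[1.2, 0.3, -0.5]])    # image + text only
logits_context = torch.tensor([[0.4, 1.5, -0.2]])  # with LVLM-generated context
print(contextual_fusion(logits_plain, logits_context))
```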



Paperid:408 Poster
Authors:Wenlin Li,Yucheng Xu,Xiaoqing Zheng,Suoya Han,Jun Wang,Xiaobo Sun
Abstract:
Sparse and noisy images (SNIs), like those in spatial gene expression data, pose significant challenges for effective representation learning and clustering, which are essential for thorough data analysis and interpretation. In response to these challenges, we propose $\textbf{D}$ual $\textbf{A}$dvancement of $\textbf{R}$epresentation $\textbf{L}$earning and $\textbf{C}$lustering ($\textit{\textbf{DARLC}}$), an innovative framework that leverages contrastive learning to enhance the representations derived from masked image modeling. Simultaneously, $\textit{DARLC}$ integrates cluster assignments in a cohesive, end-to-end approach. This integrated clustering strategy addresses the ``class collision problem'' inherent in contrastive learning, thus improving the quality of the resulting representations. To generate more plausible positive views for contrastive learning, we employ a graph attention network-based technique that produces denoised images as augmented data. As such, our framework offers a comprehensive approach that improves the learning of representations by enhancing their local perceptibility, distinctiveness, and the understanding of relational semantics. Furthermore, we utilize a Student's t mixture model to achieve more robust and adaptable clustering of SNIs. Extensive evaluation on 12 real-world datasets of SNIs, representing spatial gene expressions, demonstrates $\textit{DARLC}$'s superiority over current state-of-the-art methods in both image clustering and generating representations that accurately reflect biosemantic content and gene interactions.
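The Student's t mixture clustering head is described only at a high level; a minimal sketch of the usual t-kernel soft assignment (as used in DEC-style deep clustering) is shown below. The degrees of freedom and the squared Euclidean distance are assumptions.

```python
import torch

def t_soft_assignment(z, centroids, nu=1.0):
    """Soft cluster assignment with a Student's t kernel.
    z: (N, D) embeddings; centroids: (K, D). nu (degrees of freedom) and the
    distance choice are illustrative assumptions."""
    d2 = torch.cdist(z, centroids).pow(2)            # (N, K) squared distances
    q = (1.0 + d2 / nu).pow(-(nu + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)            # normalize over clusters

z = torch.randn(100, 32)          # representations of sparse/noisy image spots
centroids = torch.randn(10, 32)   # 10 cluster centers
print(t_soft_assignment(z, centroids).shape)         # torch.Size([100, 10])
```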



Paperid:409 Poster
Authors:yunannan,Tao Ma,Jiqing Zhang,Yuji Zhang,Qirui Bao,Xiaopeng Wei,Xin Yang
Abstract:
Human pose estimation has made progress based on deep learning. However, it still faces challenges in overexposed, low-light, and high-speed scenarios, such as motion blur and missing human contours in low-light scenes. Moreover, due to the extensive operations required for large-scale convolutional neural network (CNN) inference, marker-free human pose estimation based on standard frame-based cameras is still slow and power-consuming for real-time feedback interaction. Event-based cameras quickly output asynchronous, sparse moving-edge information, offering low latency and low power consumption for real-time interaction with human pose estimators. To support further study, this paper proposes a high-frame-rate labeled event-based human pose estimation dataset named Event Multi Movement HPE (EventMM HPE). It consists of records from a synchronized event camera, high-frame-rate camera, and Vicon motion capture system, with each sequence recording multiple action combinations and high-frame-rate (240Hz) annotations. This paper also proposes an event-based human pose estimation model, which utilizes adaptive patches to efficiently achieve good performance on the sparse, reduced input data from the DVS. The source code, dataset, and pre-trained models will be released upon acceptance.



Paperid:410 Poster
Authors:Cong Wang,Chengjin Yu,Jie Mu,Wei Wang
Abstract:
While current CNN-based low-light image enhancement (LIE) approaches have achieved significant progress, they often fail to deliver better perceptual quality, which requires restoring finer details and more natural colors. To address these problems, we set a new path, called PercepLIE, by presenting a VQGAN with Multi-luminance Detail Compensation (MDC) and Global Color Adjustment (GCA). Specifically, observing that latent light features of low-light images are quite different from those captured in normal light, we utilize VQGAN to explore the latent light representation of normal-light images to help estimate the mapping between low-light and normal-light images. Furthermore, we employ Gamma correction with varying Gamma values on the gradient to create multi-luminance details, forming the basis for our MDC module to facilitate better detail estimation. To optimize the colors of low-light input images, we introduce a simple yet effective GCA module based on a spatially varying representation between the estimated normal-light images and the low-light inputs. By combining the VQGAN with MDC and GCA within a stage-wise training mechanism, our method generates images with finer details and natural colors and achieves favorable performance on both synthetic and real-world datasets in terms of perceptual quality metrics, including NIQE, PI, and LPIPS.
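The MDC module's multi-luminance details come from Gamma correction with varying Gamma values applied to the gradient; a minimal NumPy sketch of that step (gradient magnitude via finite differences, then several gamma curves) is below. The specific gamma values and normalization are placeholders.

```python
import numpy as np

def multi_luminance_details(img, gammas=(0.5, 1.0, 2.0)):
    """Build multi-luminance detail maps: gamma-correct the gradient magnitude
    of a [0,1] grayscale image with several gamma values (placeholder values)."""
    gy, gx = np.gradient(img)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    grad = grad / (grad.max() + 1e-8)      # normalize to [0, 1] before gamma
    return np.stack([grad ** g for g in gammas], axis=0)

img = np.random.rand(64, 64).astype(np.float32)   # toy low-light luminance channel
print(multi_luminance_details(img).shape)          # (3, 64, 64)
```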



Paperid:411 Poster
Authors:Ziyang Zhou,Pinghui Wang,Zi Liang,Ruofei Zhang,Haitao Bai
Abstract:
Deep neural networks are widely used in retrieval systems. However, they are notoriously vulnerable to attack. Among the various forms of adversarial attacks, the patch attack is one of the most threatening. This type of attack can introduce cognitive biases into the retrieval system by inserting deceptive patches into images. Despite the seriousness of this threat, there are still no well-established solutions for image retrieval systems. In this paper, we propose the Pre-denoising Augmented Image Retrieval (PAIR) model, a new approach designed to protect image retrieval systems against adversarial patch attacks. The core strategy of PAIR is to dynamically and randomly reconstruct entire images based on their semantic content. This purifies well-designed patch attacks while preserving the semantic integrity of the images. Furthermore, we present a novel training strategy that incorporates a semantic discriminator. This discriminator significantly improves PAIR's ability to capture real semantics and reconstruct images. Experiments show that PAIR significantly outperforms existing defense methods. It effectively reduces the success rate of two state-of-the-art patch attack methods to below 5%, achieving a 14% improvement over current leading methods. Moreover, in defending against other forms of attack, such as global perturbation attacks, PAIR also achieves competitive results. The codes are available at: https://anonymous.4open.science/r/PAIR-8FD2.



Paperid:412 Poster
Authors:Honglin Yuan,Shiyun Lai,Xingfeng Li,Jian Dai,Yuan Sun,Zhenwen Ren
Abstract:
In practical data collection processes, certain views may become partially unavailable due to sensor failures or equipment issues, leading to the problem of incomplete multi-view clustering (IMVC). While some IMVC methods employing prototype completion achieve satisfactory performance, almost all of them implicitly assume correct alignment of prototypes across all views. However, during prototype generation, different networks could generate different cluster centers, so the prototypes produced from different views may be misaligned, i.e., prototype noisy correspondence. To address this issue, we propose Robust Prototype Completion for Incomplete Multi-view Clustering (RPCIC), which mitigates the impact of noisy correspondence in prototypes. Specifically, RPCIC initially utilizes a cross-view contrastive learning module to obtain consistent feature representations across different views. Subsequently, we devise a robust contrastive loss for the produced prototypes, aiming to alleviate the influence of noisy correspondence within them. Finally, we employ a prototype fusion-based strategy to complete the missing data. Comprehensive experiments demonstrate that RPCIC outperforms 11 state-of-the-art methods in terms of both performance and robustness.



Paperid:413 Poster
Authors:Ziyang Yuan,Mingdeng Cao,Xintao Wang,Zhongang Qi,Chun Yuan,Ying Shan
Abstract:
Incorporating a customized object into image generation presents an attractive feature in text-to-image (T2I) generation. Some methods finetune T2I models for each object individually at test time, which tends to overfit and be time-consuming. Others train an extra encoder to extract object visual information for customization efficiently but struggle to preserve the object's identity. To address these limitations, we present CustomNet, a unified encoder-based object customization framework that explicitly incorporates 3D novel view synthesis capabilities into the customization process. This integration facilitates the adjustment of spatial positions and viewpoints, producing diverse outputs while effectively preserving the object's identity. To train our model effectively, we propose a dataset construction pipeline to better handle real-world objects and complex backgrounds. Additionally, we introduce delicate designs that enable location control and flexible background control through textual descriptions or user-defined backgrounds. Our method allows for object customization without the need for test-time optimization, providing simultaneous control over viewpoints, location, and text. Experimental results show that our method outperforms other customization methods regarding identity preservation, diversity, and harmony.



Paperid:414 Poster
Authors:Yajie Zhang,Zhi-An Huang,Zhiliang Hong,Songsong Wu,Jibin Wu,KC Tan
Abstract:
The heterogeneity of medical images poses significant challenges to accurate disease diagnosis. To tackle this issue, the impact of such heterogeneity on the causal relationship between image features and diagnostic labels should be incorporated into model design, which however remains underexplored. In this paper, we propose a mixed prototype correction for causal inference (MPCCI) method, aimed at mitigating the impact of unseen confounding factors on the causal relationships between medical images and disease labels, so as to enhance the diagnostic accuracy of deep learning models. The MPCCI comprises a causal inference component based on front-door adjustment and an adaptive training strategy. The causal inference component employs a multi-view feature extraction (MVFE) module to establish mediators, and a mixed prototype correction (MPC) module to execute causal interventions. Moreover, the adaptive training strategy incorporates both information purity and maturity metrics to maintain stable model training. Experimental evaluations on three medical image datasets, encompassing CT and ultrasound modalities, demonstrate the superior diagnostic accuracy and reliability of the proposed MPCCI. The code will be available at https://github.com/***/***.



Paperid:415 Poster
Authors:Shoubin Yu,Jacob Zhiyuan Fang,Jian Zheng,Gunnar A Sigurdsson,Vicente Ordonez,Robinson Piramuthu,Mohit Bansal
Abstract:
In this paper, we introduce a new challenging task called Zero-shot Controllable Image-to-Video Animation, where the goal is to animate an image based on motion trajectories defined by the user, without fine-tuning the base model. Primary challenges include maintaining consistency of background, consistency of object in motion, faithfulness to the user-defined trajectory, and quality of motion animation. We also introduce a novel approach for this task, leveraging diffusion models called IMG2VIDANIM-ZERO (IVA0). IVA0 tackles our controllable Image-to-Video (I2V) task by decomposing it into two subtasks: ‘out-of-place’ and ‘in-place’ motion animation. Due to this decomposition, IVA0 can leverage existing work on layout-conditioned image generation for out-of-place motion generation, and existing text-conditioned video generation methods for in-place motion animation, thus facilitating zero-shot generation. Our model also addresses key challenges for controllable animation, such as Layout Conditioning via Spatio-Temporal Masking to incorporate user guidance and Motion Afterimage Suppression (MAS) scheme to reduce object ghosting during out-of-place animation. Finally, we design a novel controllable I2V benchmark featuring diverse local- and global-level metrics. Results show IVA0 as a new state-of-the-art, establishing a new standard for the zero-shot controllable I2V task. Our method highlights the simplicity and effectiveness of task decomposition and modularization for this novel task for future studies.



Paperid:416 Poster
Authors:Dan Wang,Xinrui Cui
Abstract:
We propose Interpretable Neural Radiance Fields (InNeRF) for generalizable 3D scene representation and rendering. In contrast to previous image-based rendering, which used two independent working processes of pooling-based fusion and MLP-based rendering, our framework unifies source-view fusion and target-view rendering processes via an end-to-end interpretable Transformer-based network. InNeRF enables the investigation of deep relationships between the target-rendering view and source views that were previously neglected by pooling-based fusion and fragmented rendering procedures. As a result, InNeRF improves model interpretability by enhancing the shape and appearance consistency of a 3D scene in both the surrounding view space and the ray-cast space. For a query rendering 3D point, InNeRF integrates both its projected 2D pixels from the surrounding source views and its adjacent 3D points along the query ray and simultaneously decodes this information into the query 3D point representation. Experiments show that InNeRF outperforms state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene finetuning scenarios, especially when there is a considerable disparity between source views and rendering views. The interpretation experiment shows that InNeRF can explain a query rendering process.



Paperid:417 Poster
Authors:yiyong xiao,Kai Shu,Haoyi Zhang,BaoHuaYin,Wai Seng Cheang,Haoyang Wang,Jiechao Gao
Abstract:
Co-Speech gesture generation encounters challenges with imbalanced, long-tailed gesture distributions. Recent methods typically address this by employing a Vector Quantized Variational Autoencoder (VQ-VAE), encoding gestures into a codebook and classifying codebook indices based on audio or text cues. However, due to this imbalance, codebook classification tends to be biased towards majority gestures, neglecting semantically rich minority gestures. To address this, this paper proposes Entropy-Guided Co-Speech Gesture Generation (EGGesture). EGGesture leverages an Entropy-Guided VQ-VAE to jointly optimize the distribution of codebook indices and adjust loss weights for codebook index classification, which consists of a) a differentiable approach for entropy computation using Gumbel-Softmax and cosine similarity, facilitating online codebook distribution optimization, and b) a strategy that utilizes the computed codebook entropy to collaboratively guide the classification loss weighting. These designs enable the dynamic refinement of codebook utilization, striking a balance between the quality of the learned gesture representation and the accuracy of the classification phase. Experiments on the Trinity and BEAT datasets demonstrate EGGesture's state-of-the-art performance both qualitatively and quantitatively. The code and video are available.
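The differentiable entropy computation via Gumbel-Softmax and cosine similarity can be sketched directly; the snippet below estimates codebook-usage entropy from soft assignments of encoder outputs to codebook entries. The temperature and batch-level averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def codebook_entropy(encodings, codebook, tau=1.0):
    """Differentiable estimate of codebook-usage entropy: Gumbel-Softmax over
    cosine similarities to the codebook, averaged over the batch. The
    temperature and averaging scheme are illustrative choices."""
    sim = F.cosine_similarity(encodings.unsqueeze(1), codebook.unsqueeze(0), dim=-1)
    soft_assign = F.gumbel_softmax(sim, tau=tau, hard=False)   # (N, K), differentiable
    usage = soft_assign.mean(dim=0)                            # mean codebook usage
    return -(usage * torch.log(usage + 1e-8)).sum()

encodings = torch.randn(256, 64, requires_grad=True)   # gesture latents
codebook = torch.randn(512, 64)                         # VQ-VAE codebook
codebook_entropy(encodings, codebook).backward()        # gradients reach the encoder
```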



Paperid:418 Poster
Authors:Zhenyu Xie,Haoye Dong,Yufei Gao,Zehua Ma,Xiaodan Liang
Abstract:
Image-based 3D Virtual Try-ON aims to sculpt the 3D human according to person and clothes images, which is data-efficient (i.e., getting rid of expensive 3D data) but challenging. Recent text-to-3D methods achieve remarkable improvement in high-fidelity 3D human generation, demonstrating its potential for 3D virtual try-on. Inspired by the impressive success of personalized diffusion models (e.g., Dreambooth and LoRA) for 2D VTON, it is straightforward to achieve 3D VTON by integrating the personalization technique into the diffusion-based text-to-3D framework. However, employing the personalized module in a pre-trained diffusion model (e.g., StableDiffusion (SD)) would degrade the model's capability for multi-view or multi-domain synthesis, which is detrimental to the geometry and texture optimization guided by Score Distillation Sampling (SDS) loss. In this work, we propose a novel customizing 3D human try-on model, named \textbf{DreamVTON}, to separately optimize the geometry and texture of the 3D human. Specifically, a personalized SD with multi-concept LoRA is proposed to provide the generative prior about the specific person and clothes, while a Densepose ControlNet is exploited to guarantee consistent prior about body pose across various camera views. Besides, to avoid the inconsistent multi-view priors from the personalized SD dominating the optimization, DreamVTON introduces a template-based optimization mechanism, which employs mask templates for geometry shape learning and normal/RGB templates for geometry/texture details learning. Furthermore, for the geometry optimization phase, DreamVTON integrates a normal-style LoRA into personalized SD to enhance normal map generative prior, facilitating smooth geometry modeling. Extensive experiments show that DreamVTON can generate high-quality 3D Humans with the input person, clothes images, and text prompt, outperforming existing methods.



Paperid:419 Poster
Authors:Shuai Zhao,Yongkun Du,Zhineng Chen,Yu-Gang Jiang
Abstract:
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring the representations of real images from synthetic images, thereby limiting the performance of these methods on real scenes. We note that vision-language models like CLIP, pre-trained on extensive image-text pairs, effectively align images and texts in a unified embedding space, suggesting the potential to derive the representations of real images from texts alone. Building upon this premise, we introduce a novel method named Decoder Pretraining with only text for STR (DPTR). In the pre-training stage, DPTR leverages text embeddings produced by the CLIP text encoder as visual embeddings, directing the decoder to acquire the ability to search for potential representations of real images from these text embeddings. Furthermore, we introduce an Offline Randomized Perturbation (ORP) strategy, leveraging the CLIP image encoder to enrich text embeddings by incorporating image embeddings. In the fine-tuning stage, we introduce a Feature Merge Unit (FMU) that focuses on the character regions within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various decoders and multi-language STR underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training.



Paperid:420 Poster
Authors:Shengwei Zhao,Xu Linhai,Yuying Liu,Shaoyi Du
Abstract:
Large-scale pre-trained audio-language models excel in general multi-modal representation, facilitating their adaptation to downstream audio recognition tasks in a data-efficient manner. However, existing few-shot audio recognition methods based on audio-language models primarily focus on learning coarse-grained correlations, which are not sufficient to capture the intricate matching patterns between the multi-level information of audio and the diverse characteristics of category concepts. To address this gap, we propose multi-grained correspondence learning for bootstrapping audio-language models to improve audio recognition with few training samples. This approach leverages generative models to enrich multi-modal representation learning, mining the multi-level information of audio alongside the diverse characteristics of category concepts. Multi-grained matching patterns are then established through multi-grained key-value cache and multi-grained cross-modal contrast, enhancing the alignment between audio and category concepts. Additionally, we incorporate optimal transport to tackle temporal misalignment and semantic intersection issues in fine-grained correspondence learning, enabling flexible fine-grained matching. Our method achieves state-of-the-art results on ESC-50 and FSDkaggle18, two benchmark datasets for few-shot audio recognition, with comprehensive ablation experiments validating its effectiveness.
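The multi-grained key-value cache is reminiscent of cache-based few-shot adapters; a single-granularity sketch (keys are support audio embeddings, values are one-hot labels, logits come from an affinity-weighted lookup) is shown below. The sharpness parameter and the eventual fusion with zero-shot logits are assumptions.

```python
import torch
import torch.nn.functional as F

def cache_logits(query, keys, values, beta=5.0):
    """Key-value cache lookup for few-shot audio recognition (one granularity).
    query: (B, D) audio embeddings; keys: (NK, D) support embeddings;
    values: (NK, C) one-hot labels. beta controls affinity sharpness (assumed)."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    affinity = torch.exp(-beta * (1.0 - q @ k.t()))    # (B, NK) similarities
    return affinity @ values                           # (B, C) class evidence

# 5-way 4-shot toy setup
keys = torch.randn(20, 512)
values = F.one_hot(torch.arange(5).repeat_interleave(4), num_classes=5).float()
query = torch.randn(8, 512)
print(cache_logits(query, keys, values).shape)         # torch.Size([8, 5])
```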



Paperid:421 Poster
Authors:Xianwei Zhuang,Xuxin Cheng,Zhihong Zhu,Zhanpeng Chen,Hongxiang Li,Yuexian Zou
Abstract:
Pre-trained language models (PLMs) that rely solely on textual corpus may present limitations in multimodal semantics comprehension. Existing studies attempt to alleviate this issue by incorporating additional modal information through image retrieval or generation. However, these methods: (1) inevitably encounter modality gaps and noise; (2) treat all modalities indiscriminately; and (3) ignore visual or acoustic semantics of key entities. To tackle these challenges, we propose a novel principled iterative framework for multimodal-augmented PLMs termed MASE, which achieves efficient and balanced injection of multimodal semantics under the proposed Expectation Maximization (EM) based iterative algorithm. Initially, MASE utilizes multimodal proxies instead of explicit data to enhance PLMs, which avoids noise and modality gaps. In E-step, MASE adopts a novel information-driven self-balanced strategy to estimate allocation weights. Furthermore, MASE employs heterogeneous graph attention to capture entity-level fine-grained semantics on the proposed multimodal-semantic scene graph. In M-step, MASE injects global multimodal knowledge into PLMs through a cross-modal contrastive loss. Experimental results show that MASE consistently outperforms competitive baselines on multiple tasks across various architectures. More impressively, MASE is compatible with existing efficient parameter fine-tuning methods, such as prompt learning.



Paperid:422 Poster
Authors:Tao Jin,Weicai Yan,Ye Wang,Sihang Cai,Shuaiqifan,Zhou Zhao
Abstract:
In the field of machine learning, continual learning is a crucial concept that allows models to adapt to non-stationary data distributions. However, most of the existing works focus on uni-modal settings and ignore multi-modal data. In this paper, to enable neural networks to better understand diverse modalities in real-world scenarios, we investigate continual learning for two typical vision-language applications, i.e., retrieval and grounding. Instead of conventional exemplar-based methods, we leverage the pre-trained transformer model (e.g., CLIP/GLIP) and the prompt technique to tackle this problem. Under this scheme, we identify two critical limitations in existing methods: (1) unfamiliarity across tasks, which prevents task-specific prompts from achieving forward propagation; and (2) heterogeneity between modalities, which makes it difficult to guarantee a consistent optimization direction for prompts of different modalities. To overcome these constraints, we design Historical Prompt Calibration, which includes two objectives to calibrate prompts. First, the intra-modal relevance estimation helps encode sufficient task-specific information for prompts, with the help of a relevance estimator developed for recognizing task relevance. Second, the inter-modal consistency alignment enhances the agreement of the two modality-specific prompts in the current task by contrasting them with the prompts from previous tasks. We evaluate the superiority of our strategy over state-of-the-art methods on four vision-language applications, including two retrieval tasks (i.e., image- and video-text retrieval) and two grounding tasks (i.e., referring expression comprehension and segmentation).



Paperid:423 Poster
Authors:Xiaoda Yang,Xize Cheng,Dongjie Fu,Minghui Fang,Jialung Zuo,Shengpeng Ji,Tao Jin,Zhou Zhao
Abstract:
Talking Face Generation (TFG) reconstructs facial motions concerning lips given speech input, which aims to generate high-quality, synchronized, and lip-readable videos. Previous efforts have achieved success in generating quality and synchronization, and recently, there has been an increasing focus on the importance of intelligibility. Despite these efforts, there remains a challenge in achieving a balance among quality, synchronization, and intelligibility, often resulting in trade-offs that compromise one aspect in favor of another. In light of this, we propose SyncTalklip, a novel dual-tower framework designed to overcome the challenges of synchronization while improving lip-reading performance. To enhance the performance of SyncTalklip in both synchronization and intelligibility, we design AV-SyncNet, a pre-trained multi-task model, aiming to achieve a dual-focus on synchronization and intelligibility. Moreover, we propose a novel cross-modal contrastive learning approach, bringing audio and video closer to enhance synchronization. Experimental results demonstrate that SyncTalklip achieves state-of-the-art performance in quality, intelligibility, and synchronization. Furthermore, extensive experiments have demonstrated our model's generalizability across domains. The code and demo are available at \url{https://sync-talklip.github.io}.



Paperid:424 Poster
Authors:Zhiyu Zhu,Zhibo Jin,Jiayu Zhang,Huaming Chen
Abstract:
In the field of artificial intelligence, AI models are frequently described as ‘black boxes’ due to the obscurity of their internal mechanisms. This has ignited research interest in model interpretability, especially in attribution methods that offer precise explanations of model decisions. Current attribution algorithms typically evaluate the importance of each parameter by exploring the sample space. A large number of intermediate states are introduced during the exploration process, which may reach the model’s Out-of-Distribution (OOD) space. Such intermediate states will impact the attribution results, making it challenging to grasp the relative importance of features. In this paper, we first define the local space and its relevant properties, and we propose the Local Attribution (LA) algorithm that leverages these properties. The LA algorithm comprises both targeted and untargeted exploration phases, which are designed to effectively generate intermediate states for attribution that thoroughly encompass the local space. Compared to the state-of-the-art attribution methods, our approach achieves an average improvement of 38.21% in attribution effectiveness. Extensive ablation studies within our experiments also validate the significance of each component in our algorithm. Our code is available at: https://anonymous.4open.science/r/LA-2024



Paperid:425 Poster
Authors:Hao Yu,Xin Yang,Xin Gao,Yihui Feng,Hao Wang,Yan Kang,Tianrui Li
Abstract:
This paper delves into federated class-incremental learning (FCiL), where new classes appear continually or even privately to local clients. However, existing FCiL methods suffer from the problem of spatial-temporal catastrophic forgetting, i.e., forgetting the previously learned knowledge over time and the client-specific information owned by different clients. Additionally, private class and knowledge heterogeneity amongst local clients further exacerbate spatial-temporal forgetting, making FCiL challenging to apply. To address these issues, we propose Federated Class-specific Binary Classifier (FedCBC), an innovative approach to transferring and fusing knowledge across both temporal and spatial perspectives. FedCBC consists of two novel components: (1) continual personalization that distills previous knowledge from a global model to multiple local models, and (2) selective knowledge fusion that enhances knowledge integration of the same class from divergent clients and shares private knowledge with other clients. Extensive experiments using three newly-formulated metrics (termed GA, KRS, and KRT) demonstrate the effectiveness of the proposed approach.



Paperid:426 Poster
Authors:Chunjie Ma,Lina Du,Zan Gao,Li Zhuo,Meng Wang
Abstract:
Transformer-based methods for prohibited object detection in X-ray images continue to emerge, but they still exhibit shortcomings such as poor performance and high computational complexity when detecting heavily occluded prohibited objects. Therefore, a coarse-to-fine detection method for prohibited objects in X-ray images based on a progressive Transformer decoder is proposed in this paper. First, a coarse-to-fine framework is proposed, which includes two stages: coarse detection and fine detection. Through adaptive inference in stages, the computational efficiency of the model is effectively improved. Then, a position- and class-aware object queries method is proposed, which improves the convergence speed and detection accuracy of the model by fusing the position and class information of prohibited objects with object queries. Finally, a progressive Transformer decoder is proposed, which distinguishes high- and low-score queries by increasing confidence thresholds, so that high-score queries are not affected by low-score queries in the decoding stage, and the model can focus more on decoding low-score queries, which usually correspond to prohibited objects with severe occlusion. The experimental results on three public benchmark datasets (SIXray, OPIXray, HiXray) demonstrate that, compared with the baseline DETR, the proposed method achieves state-of-the-art detection accuracy with a 21.6% reduction in model computational complexity. In particular, prohibited objects with heavy occlusion can be detected accurately.
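The progressive decoder's split between high- and low-score queries is described only in words; a minimal sketch of that partitioning step (confidence from classification logits, confident queries set aside, the rest forwarded to further decoding) is shown below. The threshold and the handling of confident queries are assumptions.

```python
import torch

def split_queries(query_feats, class_logits, threshold=0.7):
    """Partition object queries by detection confidence.
    query_feats: (Q, D); class_logits: (Q, C). Queries whose max class
    probability exceeds `threshold` are treated as confident, the rest are
    returned for further decoding. The threshold value is illustrative."""
    conf = class_logits.softmax(dim=-1).max(dim=-1).values
    high_mask = conf > threshold
    return query_feats[high_mask], query_feats[~high_mask], high_mask

queries = torch.randn(100, 256)
logits = torch.randn(100, 9)      # e.g., 8 prohibited-object classes + background
high_q, low_q, _ = split_queries(queries, logits)
print(high_q.shape[0], low_q.shape[0])
```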



Paperid:427 Poster
Authors:Md Tanvir Islam,Nasir Rahim,Saeed Anwar,Muhammad Saqib,Sambit Bakshi,Khan Muhammad
Abstract:
Reducing atmospheric haze and enhancing image clarity is crucial for a range of applications related to computer vision. The lack of real-life hazy ground truth images necessitates synthetic datasets, which often lack sufficiently diverse haze types, impeding effective haze type classification and dehazing algorithm selection. This research introduces the HazeSpace2M dataset, a comprehensive collection of over 2 million images designed to enhance the performance of dehazing through haze-type classification. HazeSpace2M includes diverse scenes with 10 haze intensity levels, featuring Fog, Cloud, and a novel category, Environmental Haze (EH). Leveraging the dataset, we introduce a novel technique of haze-type classification followed by specialized dehazers to dehaze hazy images. Unlike conventional methods, our approach classifies haze types before applying type-specific dehazing, improving clarity and functionality across applications lacking real-life hazy images. We benchmark state-of-the-art classification models against different combinations of the hazy benchmarking datasets (HBDs) and the Real Hazy Testset (RHT) from the HazeSpace2M dataset. For instance, ResNet50 and AlexNet, on average, achieve 92.75% and 92.50% accuracy, respectively, against the existing synthetic HBDs. However, the same models achieve only 80% and 70% accuracy, respectively, against our RHT, proving the challenging nature of our dataset. Additional experiments utilizing our proposed framework verify that haze-type classification followed by specialized dehazing enhances dehazing results by 2.41% in PSNR, 17.14% in SSIM, and 10.2% in MSE over general dehazers. These results highlight the significance of HazeSpace2M and the proposed framework in addressing the pervasive challenge of atmospheric haze in multimedia processing. The codes and dataset will be available on GitHub soon.
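The classify-then-dehaze pipeline amounts to a dispatch: predict the haze type, then route the image to the matching specialized dehazer. A hedged Python sketch with placeholder components is below; the category names follow the abstract, while the model interfaces are hypothetical.

```python
from typing import Callable, Dict
import numpy as np

HAZE_TYPES = ["fog", "cloud", "environmental_haze"]   # categories from HazeSpace2M

def dehaze(image: np.ndarray,
           classifier: Callable[[np.ndarray], str],
           dehazers: Dict[str, Callable[[np.ndarray], np.ndarray]]) -> np.ndarray:
    """Route a hazy image to a type-specific dehazer (interfaces are hypothetical)."""
    haze_type = classifier(image)
    return dehazers[haze_type](image)

# placeholder components standing in for trained networks
classifier = lambda img: HAZE_TYPES[int(img.mean() * len(HAZE_TYPES)) % len(HAZE_TYPES)]
dehazers = {t: (lambda img: np.clip(img * 1.1, 0.0, 1.0)) for t in HAZE_TYPES}
print(dehaze(np.random.rand(256, 256, 3), classifier, dehazers).shape)
```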



Paperid:428 Poster
Authors:Yuxing Zhang,Siyuan Meng,Chunchun Chen,Mengyao Peng,Hongyan Gu,Xinli Huang
Abstract:
Graph neural networks (GNNs) have a wide range of applications in multimedia. Recent studies have shown that GNNs are vulnerable to link stealing attacks, which infers the existence of edges in the target GNN’s training graph. Existing methods are usually based on the assumption that links exist between two nodes that share similar posteriors; however, they fail to focus on links that do not hold under this assumption. To this end, we propose LinkThief, an improved link stealing attack that combines generalized structure knowledge with node similarity, in a scenario where the attackers' background knowledge contains partially leaked target graph and shadow graph. Specifically, to equip the attack model with insights into the link structure spanning both the shadow graph and the target graph, we introduce the idea of creating a Shadow-Target Bridge Graph and extracting edge subgraph structure features from it. Through theoretical analysis from the perspective of privacy theft, we first explore how to implement the aforementioned ideas. Building upon the findings, we design the Bridge Graph Generator to construct the Shadow-Target Bridge Graph. Then, the subgraph around the link is sampled by the Edge Subgraph Preparation Module. Finally, the Edge Structure Feature Extractor is designed to obtain generalized structure knowledge, which is combined with node similarity to form the features provided to the attack model. Extensive experiments validate the correctness of theoretical analysis and demonstrate that LinkThief still effectively steals links without extra assumptions.



Paperid:429 Poster
Authors:Xiang Gao,Jiaying Liu
Abstract:
Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing extraordinary image generation with natural-language text prompts. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation, for which attention has been focused on leveraging a reference image to control text-to-image synthesis. This paper contributes a concise and efficient approach that adapts the pre-trained text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization. To guide T2I generation with a reference image, we propose to model diverse guiding factors with different frequency bands of diffusion features in DCT spectral space, and accordingly devise a novel frequency band substitution layer that dynamically substitutes a certain DCT frequency band of diffusion features with the corresponding counterpart of the reference image along the reverse sampling process. We demonstrate that our method flexibly enables highly controllable text-driven I2I translation both in the guiding factor and guiding intensity of the reference image, simply by adjusting the type and bandwidth of the substituted frequency band, respectively. Extensive experiments verify the superiority of our approach over related methods in image translation visual quality, versatility, and efficiency.
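The frequency band substitution layer swaps one DCT band of the diffusion features for the reference image's counterpart; a simplified 2-D sketch using SciPy's DCT is given below. Operating on raw arrays rather than diffusion features, and the radial band mask, are simplifications of the paper's layer.

```python
import numpy as np
from scipy.fft import dctn, idctn

def substitute_band(gen_feat, ref_feat, low=0.0, high=0.2):
    """Replace one DCT frequency band of `gen_feat` with that of `ref_feat`.
    Both inputs are 2-D arrays; the band is selected by normalized radial
    frequency in [low, high). The band shape and the use of raw arrays are
    simplifications of the paper's frequency band substitution layer."""
    G = dctn(gen_feat, norm="ortho")
    R = dctn(ref_feat, norm="ortho")
    h, w = gen_feat.shape
    yy, xx = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    radius = np.sqrt(yy ** 2 + xx ** 2)
    band = (radius >= low) & (radius < high)
    G[band] = R[band]                      # substitute the selected frequency band
    return idctn(G, norm="ortho")

gen, ref = np.random.rand(64, 64), np.random.rand(64, 64)
print(substitute_band(gen, ref).shape)     # (64, 64)
```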



Paperid:430 Poster
Authors:Yi Liu,Chengjun Cai,Xiaoli ZHANG,Xingliang YUAN,Cong Wang
Abstract:
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, our Arondight achieves an average attack success rate of 84.5% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.



Paperid:431 Poster
Authors:Bo Xiong,Changqing Su,Zihan Lin,Yanqin Chen,You Zhou,Zhen Cheng,Zhaofei Yu,Tiejun Huang
Abstract:
Droplet-based microfluidic devices, with their high throughput and low power consumption, have found wide-ranging applications in the life sciences, such as drug discovery and cancer detection. However, the lack of real-time methods for accurately estimating droplet generation parameters has resulted in droplet microfluidic systems remaining largely offline-controlled, making it challenging to achieve efficient feedback in droplet generation. To meet the real-time requirements, it is imperative to minimize the data throughput of the collection system while employing parameter estimation algorithms that are both resource-efficient and highly effective. The spike camera, as an innovative form of neuromorphic camera, facilitates high temporal resolution scene capture with comparatively low data throughput. In this paper, we propose a real-time evaluation method for high-speed droplet parameters based on spike-based microfluidic flow-focusing, named RTDE, which integrates a spike camera into the droplet collection system to efficiently capture information using spike streams. To process the spike stream effectively, we develop a spike-based estimation algorithm for real-time droplet generation parameters. To validate the performance of our method, we collected spike-based droplet datasets (SDD), comprising synthetic and real data with varying flow velocities, frequencies, and droplet sizes. Experimental results on these datasets consistently demonstrate that our method achieves parameter estimations that closely match the ground-truth values, showcasing high precision. Furthermore, comparative experiments with image-based parameter estimation methods highlight the superior time efficiency of our method, enabling real-time calculation of parameter estimations.



Paperid:432 Poster
Authors:Xin Mei,Rui Mao,Xiaoyan Cai,Libin Yang,Erik Cambria
Abstract:
Medical report generation aims at automating the synthesis of accurate and comprehensive diagnostic reports from radiological images. The task can significantly enhance clinical decision-making and alleviate the workload on radiologists. Existing works normally generate reports from single chest radiographs, although historical examination data also serve as crucial references for radiologists in real-world clinical settings. To address this constraint, we introduce a novel framework that mimics the workflow of radiologists. This framework compares past and present patient images to monitor disease progression and incorporates prior diagnostic reports as references for generating current personalized reports. We tackle the textual diversity challenge in cross-modal tasks by promoting style-agnostic discrete report representation learning and token generation. Furthermore, we propose a novel spatio-temporal fusion method with multi-granularities to fuse textual and visual features by disentangling the differences between current and historical data. We also tackle token generation biases, which arise from long-tail frequency distributions, proposing a novel feature normalization technique. This technique ensures unbiased generation for tokens, whether they are frequent or infrequent, enabling the robustness of report generation for rare diseases. Experimental results on the two public datasets demonstrate that our proposed model outperforms state-of-the-art baselines.



Paperid:433 Poster
Authors:ChengHao Deng,haote xu,Xiaolu Chen,Haodi Xu,Xiaotong Tu,Xinghao Ding,Yue Huang
Abstract:
Recently, large pre-trained vision-language models, such as CLIP, have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment of high-level language features with fine-level vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which focuses on refining the aforementioned misalignment problem through bidirectional adaptation of both Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). In this way, our approach requires only a simple binary prompt to accomplish anomaly classification and segmentation tasks in zero-shot scenarios efficiently. Furthermore, we introduce its few-shot extension, SimCLIP+, integrating the relational information among vision embedding and skillfully merging the cross-modal synergy information between vision and language to address AD tasks. Extensive experiments on two challenging datasets prove the more remarkable generalization capacity of our method compared to the current state-of-the-art.



Paperid:434 Poster
Authors:Haining Wang,Na Li,Huijie Zhao,Yan Wen,Yi Su,Yuqiang Fang
Abstract:
Due to the limitations of infrared image acquisition conditions, many essential tasks currently rely on visible images as the main source of training data. However, single-modal data make it difficult for downstream networks to show optimal performance. Therefore, converting the more easily obtainable visible images into infrared images emerges as an effective remedy to alleviate the critical shortage of infrared data. Yet current methods typically focus solely on transferring visible images to the infrared style, while overlooking the crucial infrared thermal features during cross-modal translation. To elevate the authenticity of cross-modal translation at the feature level, this paper introduces a translation network based on frequency feature mapping and dual patch contrast, MappingFormer, which can achieve cross-modal image generation from visible to infrared. Specifically, the generator incorporates two branches: low-frequency feature mapping (LFM) and high-frequency feature refinement (HFR), both of which embed Swin Transformer blocks. The LFM branch captures fuzzy structural information from visible images, while the HFR branch focuses on mapping edge and texture features. The extracted dual-branch frequency features undergo refinement and fusion through cross-attention mechanisms. Additionally, a dual contrastive learning mechanism based on feature patches (DFPC) is designed to infer effective mappings between unaligned cross-modal data. Extensive experimental results prove the effectiveness of this method in cross-modal feature mapping and image generation from visible to infrared. This method holds significant potential for downstream tasks where infrared data are limited.



Paperid:435 Poster
Authors:Lei Han,Xuesong Zhang
Abstract:
Recent advances in continuous super-resolution (SR) have made substantial progress towards universal SR models, which are characterized by using a single deep neural network (DNN) to fulfill arbitrary-scale SR tasks. When deployed on resource-stringent platforms, however, a trained DNN model usually requires experience-demanding and laborious manual efforts to compress the model to a predetermined compute budget. This paper proposes an inference-time adaptive network width optimization method for arbitrary-scale SR modules, dubbed Scalable Super-Resolution Neural Operator (SSRNO), which is capable of efficient, performance-preserving deployment on various mobile or edge devices with only a user input parameter indicating the desired compression rate. SSRNO realizes the continuous parameterization of SRNO (CVPR 2023) by virtue of two novel contributions. First, we propose the Integral Neural Network (INN) formulation for the Galerkin-type attention, which is an indispensable component for spatial discretization-invariant SR neural networks. Second, we further propose an adaptive layer-wise compression rate estimation mechanism, which allows for flexible adaptation to varying capacity across the network layers. Extensive experiments validate its superior overall performance compared with existing continuous SR models in terms of reconstruction accuracy, model scalability, and throughput. For instance, compared with the baseline SRNO, a typical configuration of SSRNO can achieve a model size compression of up to 62% and an over 2$\times$ speedup in situations where resources are limited, while it can also expand itself to keep the PSNR degradation within 0.1 dB when the limitations are alleviated. The code will be made public soon.



Paperid:436 Poster
Authors:Zhien Dai,Zhaohui Tang,Hu Zhang,Can Tian,Mingjun Pan,Yongfang Xie
Abstract:
Stereo matching is a pivotal technique for depth estimation and has been widely applied in various computer vision tasks. Although many related methods have been reported recently, they still face challenges such as significant disparity variations at object boundaries, difficult prediction in large-disparity regions, and suboptimal generalization when label distributions vary between source and target domains. Therefore, we propose a stereo-matching model (i.e., EGLCR-Stereo) that utilizes edge structure information with adaptive fusion of multi-scale matching similarity information for disparity estimation. First, we use a lightweight network to predict the initial disparity. We apply large- and small-scale similarity feature extraction modules to extract the matching similarity information within the wide-area receptive field and the refined matching similarity information under the local receptive field. Then, we develop a scale-adaptive attention module for efficiently fusing information at different scales. Meanwhile, we propose an edge structure-aware module for exploring edge information in the scene. After that, we adopt an iterative strategy for disparity estimation using edge structure information together with the fused multi-scale matching similarity information. We conduct extensive experiments on popular stereo matching datasets including Middlebury, KITTI, ETH3D, and Scene Flow. The results show that our proposed EGLCR-Stereo achieves state-of-the-art performance in both accuracy and generalization.



Paperid:437 Poster
Authors:Xicong Wang,Huiyuan Fu,Jiaxuan Wang,Xin Wang,Heng Zhang,Huadong Ma
Abstract:
Due to sensor limitations, traditional cameras struggle to capture details within extremely dark areas of videos. The absence of such details can significantly impact the effectiveness of low-light video enhancement. In contrast, event cameras offer a visual representation with a higher dynamic range, facilitating the capture of motion information even in exceptionally dark conditions. Motivated by this advantage, we propose the Real-Event Embedded Network for low-light video enhancement. To better utilize events for enhancing extremely dark regions, we propose an Event-Image Fusion module, which can identify these dark regions and enhance them significantly. To ensure the temporal stability of the video and restore details within extremely dark areas, we design an unsupervised temporal consistency loss and a detail contrast loss. Alongside the supervised loss, these loss functions collectively contribute to the semi-supervised training of the network on unpaired real data. Experimental results on synthetic and real data demonstrate the superiority of the proposed method compared to state-of-the-art methods. Our code will be publicly available.



Paperid:438 Poster
Authors:Hanziwang,Jiamin Ren,Yifeng Ding,Lei Ren,Huixing Jiang,Chen Wei,Fangxiang Feng,Xiaojie Wang
Abstract:
Multimodal Large Language Models (MLLMs) have showcased remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector structure, which is used to bridge the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as lightweight projection and Q-former, treat visual information for diverse tasks uniformly, and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with Mixture-of-Experts (MoE) to extract task-specific information with text-driven routing. Furthermore, an optimal-path-based training strategy is also proposed to find an optimal expert combination. Extensive experiments on two popular open-source LLMs and several different vision-language tasks demonstrate the effectiveness of the Q-MoE connector. We will open our codes upon publication.



Paperid:439 Poster
Authors:Xu Zhang,Zhipeng Xie,Haiyang Yu,Qitong Wang,Peng Wang,Wei Wang
Abstract:
Handling varying computational resources is a critical issue in modern AI applications. Adaptive deep networks, featuring the dynamic employment of multiple classifier heads among different layers, have been proposed to address classification tasks under varying computing resources. Existing approaches typically utilize the last classifier supported by the available resources for inference, as they believe that the last classifier always performs better across all classes. However, our findings indicate that earlier classifier heads can outperform the last head for certain classes. Based on this observation, we introduce the Collaborative Decision Making (CDM) module, which fuses the multiple classifier heads to enhance the inference performance of adaptive deep networks. CDM incorporates an uncertainty-aware fusion method based on evidential deep learning (EDL) that utilizes the reliability (uncertainty values) from the first $c-1$ classifiers to improve the $c$-th classifier's accuracy. We also design a balance term that reduces the fusion saturation and unfairness issues caused by EDL constraints to improve the fusion quality of CDM. Finally, a regularized training strategy that uses the last classifier to guide the learning process of early classifiers is proposed to further enhance the CDM module's effect, yielding the Guided Collaborative Decision Making (GCDM) framework. The experimental evaluation demonstrates the effectiveness of our approaches. Results on ImageNet datasets show that CDM and GCDM obtain 0.4% to 2.8% accuracy improvements (under varying computing resources) on popular adaptive networks.
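The sketch below illustrates the general flavor of uncertainty-aware fusion of multiple classifier heads with evidential deep learning, where each head's Dirichlet uncertainty down-weights its vote; the evidence shapes, the simple reliability weighting, and the absence of the paper's balance term are assumptions made for illustration, not the exact CDM module.

```python
# Minimal sketch: uncertainty-weighted fusion of early-exit classifier heads
# in the spirit of evidential deep learning (EDL). Shapes and the weighting
# rule are illustrative assumptions.
import torch

def edl_uncertainty(evidence: torch.Tensor) -> torch.Tensor:
    """Dirichlet uncertainty u = K / S for non-negative evidence of shape (B, K)."""
    alpha = evidence + 1.0                       # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)   # S = sum_k alpha_k
    return evidence.size(-1) / strength          # u in (0, 1]

def fuse_heads(evidences: list[torch.Tensor]) -> torch.Tensor:
    """Fuse per-head Dirichlet opinions by down-weighting uncertain heads."""
    probs, weights = [], []
    for ev in evidences:
        alpha = ev + 1.0
        probs.append(alpha / alpha.sum(dim=-1, keepdim=True))  # expected class probs
        weights.append(1.0 - edl_uncertainty(ev))               # reliability weight
    w = torch.stack(weights, dim=0)
    w = w / w.sum(dim=0, keepdim=True)
    return (torch.stack(probs, dim=0) * w).sum(dim=0)

if __name__ == "__main__":
    heads = [torch.rand(4, 10) for _ in range(3)]  # 3 heads, batch 4, 10 classes
    fused = fuse_heads(heads)
    print(fused.sum(dim=-1))                        # each row sums to 1
```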



Paperid:440 Poster
Authors:Wen Luo,Yu Xia,Shen Tianshu,Sujian Li
Abstract:
The rise of social media and the exponential growth of multimodal communication necessitate advanced techniques for Multimodal Information Extraction (MIE). However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. In line with this paradigm, we propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs. Shap-CA initially applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts, and images towards the total semantic and modality overlaps. Following this quantitative evaluation, a contrastive learning strategy is employed to enhance the interactive contribution within context-text/image pairs, while minimizing the influence across these pairs. Furthermore, we design an adaptive fusion module for selective cross-modal fusion. Extensive experiments across four MIE datasets demonstrate that our method significantly outperforms existing state-of-the-art methods. Code will be released upon acceptance.
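For readers unfamiliar with Shapley values, the minimal sketch below computes exact Shapley values for a small player set (context, text, image) under a toy value function; the value function and player names are placeholder assumptions, not the overlap measure used by Shap-CA.

```python
# Exact Shapley values by averaging marginal contributions over all orderings.
# Feasible here because the player set is tiny; the "overlap" value is a toy.
from itertools import permutations

def shapley_values(players, value_fn):
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition, prev = [], value_fn(frozenset())
        for p in order:
            coalition.append(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev            # marginal contribution of p
            prev = cur
    return {p: v / len(orderings) for p, v in phi.items()}

if __name__ == "__main__":
    # Toy value: sum of per-player scores plus a small context-image synergy bonus.
    scores = {"context": 0.6, "text": 0.3, "image": 0.4}
    def overlap(coalition):
        bonus = 0.2 if {"context", "image"} <= coalition else 0.0
        return sum(scores[p] for p in coalition) + bonus
    print(shapley_values(list(scores), overlap))
```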



Paperid:441 Poster
Authors:Sijing Wu,Yunhao Li,Yichao Yan,Huiyu Duan,Ziwei Liu,Guangtao Zhai
Abstract:
3D facial animation has attracted considerable attention due to its extensive applications in the multimedia field. Audio-driven 3D facial animation has been widely explored with promising results. However, multi-modal 3D facial animation, especially text-guided 3D facial animation is rarely explored due to the lack of multi-modal 3D facial animation dataset. To fill this gap, we first construct a large-scale multi-modal 3D facial animation dataset, MMHead, which consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations. Each text annotation contains abstract action and emotion descriptions, fine-grained facial and head movements (i.e., expression and head pose) descriptions, and three possible scenarios that may cause such emotion. Concretely, we integrate five public 2D portrait video datasets, and propose an automatic pipeline to 1) reconstruct 3D facial motion sequences from monocular videos; and 2) obtain hierarchical text annotations with the help of AU detection and ChatGPT. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation. Moreover, a simple but efficient VQ-VAE-based method named MM2Face is proposed to unify the multi-modal information and generate diverse and plausible 3D facial motions, which achieves competitive results on both benchmarks. Extensive experiments and comprehensive analysis demonstrate the significant potential of our dataset and benchmarks in promoting the development of multi-modal 3D facial animation.



Paperid:442 Poster
Authors:Yunqiang Pei,Jialei Tang,Qihang Tang,Mingfeng Zha,Dongyu Xie,Guoqing Wang,Zhitao Liu,Ning Xie,Peng Wang,Yang Yang,Heng Tao Shen
Abstract:
Prior research on emotion recognition in extended reality (XR) has faced challenges due to the occlusion of facial expressions by Head-Mounted Displays (HMDs). This limitation hinders accurate Facial Expression Recognition (FER), which is crucial for immersive user experiences. This study aims to overcome the occlusion challenge by integrating physiological signals with partially visible facial expressions to enhance emotion recognition in XR environments. We employed a multi-task approach, utilizing feature-level fusion to combine Electroencephalography (EEG) and Galvanic Skin Response (GSR) signals with occluded facial expressions. The model predicts valence and arousal simultaneously from both macro- and micro-expressions. Our method demonstrated improved accuracy in emotion recognition under partial occlusion conditions. The integration of temporal physiological signals with other modalities significantly enhanced performance, particularly for half-face emotion recognition. The study presents a novel approach to emotion recognition in XR, addressing the limitations of facial occlusion by HMDs. The findings suggest that physiological signals are vital for interpreting emotions in occluded scenarios, offering potential for real-time applications and advancing social XR applications.



Paperid:443 Poster
Authors:Ding Wang,Wei Zhou,Songlin Hu
Abstract:
Information diffusion prediction aims to forecast the path of information spreading in social networks. Prior works generally consider the diffusion process to be driven by user correlations or preferences. Recent works focus on characterizing the dynamicity of user preferences and propose to capture users' dynamic preferences by discretizing the diffusion process into structure snapshots. Despite their effectiveness, these works summarize user preferences from partially observed structure snapshots, ignoring that users' preferences are evolving constantly. Moreover, discretizing the diffusion process makes these models overlook abundant structure information across different periods, reducing their ability to discover potential participants. To address the above issues, we propose a novel \textbf{G}raph Neural \textbf{O}rdinary \textbf{D}ifferential \textbf{E}quation \textbf{N}etwork (GODEN) for information diffusion prediction, which incorporates neural ordinary differential equations (ODE) to model the continuous dynamics of the diffusion process. Specifically, we design two coupled ODE functions on nodes and edges to describe their co-evolution dynamic and infer user dynamic preferences based on the solution of ODEs. Besides, we extract user correlations from a heterogeneous graph to complement user encoding for prediction. Finally, to predict the future user infections of the observed cascade, we represent its diffusion pattern in terms of user and temporal contexts and apply a multi-head attention module to attend to different contexts. Experimental results confirm our approach’s effectiveness on four real-world datasets, with our model outperforming the state-of-the-art diffusion prediction models.
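A minimal sketch of the underlying idea of evolving node states with a learned ODE function and a simple explicit Euler solver is shown below; the single GCN-style derivative, the solver, and the tensor shapes are illustrative assumptions rather than the paper's coupled node-edge ODEs.

```python
# Hedged sketch: continuous evolution of user (node) states, dh/dt = f(h),
# integrated with explicit Euler. Not the paper's exact coupled node/edge ODEs.
import torch
import torch.nn as nn

class NodeODEFunc(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, t, h, adj):
        # dh/dt = tanh(A h W): neighbor states drive each node's evolution.
        return torch.tanh(adj @ self.lin(h))

def euler_integrate(func, h0, adj, t0=0.0, t1=1.0, steps=20):
    """Explicit Euler solver for dh/dt = func(t, h, adj)."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(t0 + i * dt, h, adj)
    return h

if __name__ == "__main__":
    n, dim = 5, 8
    adj = torch.eye(n) + 0.1 * torch.rand(n, n)   # toy weighted adjacency
    h0 = torch.randn(n, dim)                      # initial user states
    print(euler_integrate(NodeODEFunc(dim), h0, adj).shape)  # torch.Size([5, 8])
```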



Paperid:444 Poster
Authors:Zihan Fang,Shide Du,Yuhong Chen,Shiping Wang
Abstract:
The inherent variability and unpredictability in open multi-view learning scenarios infuse considerable ambiguity into the learning and decision-making processes of predictors. This demands that predictors not only recognize familiar patterns but also adaptively interpret unknown ones out of training scope. To address this challenge, we propose an Ambiguity-Aware Multi-view Learning Framework, which integrates four synergistic modules into an end-to-end framework to achieve generalizability and reliability beyond the known. By introducing the mixed samples to broaden the learning sample space, accompanied by corresponding soft labels to encapsulate their inherent uncertainty, the proposed method adapts to the distribution of potentially unknown samples in advance. Furthermore, an instance-level sparse inference is implemented to learn sparse approximated points in the multiple view embedding space, and individual view representations are gated by view-level confidence mappings. Finally, a multi-view consistent representation is obtained by dynamically assigning weights based on the degree of cluster-level dispersion. Extensive experiments demonstrate that our approach is effective and stable compared with other state-of-the-art methods in open-world recognition situations.



Paperid:445 Poster
Authors:Zongxin Ye,Wenyu Li,Sidun Liu,Peng Qiao,Yong Dou
Abstract:
Recent advances in neural rendering have shown photo-realistic results in novel view synthesis. As one of the most promising methods, 3D Gaussian Splatting (3D-GS) couples 3D Gaussian primitives with differentiable rasterization to obtain high-fidelity 3D scene reconstruction and achieve real-time rendering. The exceptional performance of 3D-GS is attributed to its carefully designed adaptive density control strategy, which progressively populates empty areas by splitting/cloning more Gaussians throughout the optimization process. While 3D-GS offers significant advantages, it frequently suffers from an over-reconstruction issue in intricate scenes containing high-frequency details, consequently leading to blur. The underlying causes of this issue remain under-explored. In this work, we present a comprehensive analysis of the cause of the aforementioned artifacts, which we term gradient collision: it prevents large Gaussians that cover small-scale geometry from splitting. To address this issue, we further propose a novel homodirectional gradient as the guidance for densification. Our strategy efficiently identifies large Gaussians in over-reconstructed regions and recovers fine details by splitting. We evaluate our proposed method on various challenging datasets; our approach achieves the best rendering quality with reduced memory consumption and yields better distributions of 3D Gaussians in world space. Our method is also easy to implement with just a few lines of code and can be incorporated into a wide variety of other Gaussian Splatting-based methods. We will open-source our code upon formal publication.
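The toy computation below illustrates the gradient-collision intuition on made-up per-view gradients: the usual signed accumulation cancels when gradients oppose each other, while a direction-agnostic (absolute) accumulation preserves the signal; the numbers and the absolute-sum rule are assumptions for illustration only.

```python
# Toy illustration of "gradient collision": opposing per-view gradients cancel
# in the standard vector-sum densification signal but not in a
# direction-agnostic accumulation. Numbers are made up for illustration.
import torch

grads = torch.tensor([[0.8, 0.0],
                      [-0.7, 0.1],
                      [0.9, -0.1],
                      [-0.8, 0.0]])              # per-view 2D screen-space gradients

summed = grads.sum(dim=0).norm().item()          # signed accumulation: views cancel
homodir = grads.abs().sum(dim=0).norm().item()   # direction-agnostic accumulation

print(f"vector-sum magnitude: {summed:.3f}")     # small -> Gaussian never splits
print(f"homodirectional sum:  {homodir:.3f}")    # large -> flagged for splitting
```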



Paperid:446 Poster
Authors:Xiao Zhao,XUKUN ZHANG,Dingkang Yang,Mingyang Sun,Mingcheng Li,Shunli Wang,Lihua Zhang
Abstract:
Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on nuScenes dataset show that compared with previous state-of-the-art MTL methods, MaskBEV achieves 1.3 NDS improvement in 3D object detection and 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly leading inference speed.



Paperid:447 Poster
Authors:Kang Shen,Haifeng Xia,Guangxing Geng,GuangYue Geng,Siyu Xia,Zhengming Ding
Abstract:
Speech-driven 3D facial animation aims to synthesize 3D talking head animations with precise lip movements and rich stylistic expressions. However, existing methods exhibit two limitations: 1) they mostly focused on emotionless facial animation modeling, neglecting the importance of emotional expression, due to the lack of high-quality 3D emotional talking head datasets, and 2) several latest works treated emotional intensity as a global controllable parameter, akin to emotional or speaker style, leading to over-smoothed emotional expressions in their outcomes. To address these challenges, we first collect a 3D talking head dataset comprising five emotional styles with a set of coefficients based on the MetaHuman character model and then propose an end-to-end deep neural network, DEITalk, which conditions on speech and emotional style labels to generate realistic facial animation with dynamic expressions. To model emotional saliency variations in long-term audio contexts, we design a dynamic emotional intensity (DEI) modeling module and a dynamic positional encoding (DPE) strategy. The former extracts implicit representations of emotional intensity from speech features and utilizes them as local (high temporal frequency) emotional supervision, whereas the latter offers abilities to generalize to longer speech sequences. Moreover, we introduce an emotion-guided feature fusion decoder and a four-way loss function to generate emotion-enhanced 3D facial animation with controllable emotional styles. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods. We recommend watching the video demo provided in our supplementary material for detailed results.



Paperid:448 Poster
Authors:Zequn Zeng,Jianqiao Sun,Hao Zhang,Tiansheng Wen,Yudi Su,Yan Xie,Zhengjue Wang,Bo Chen
Abstract:
Image captioning evaluation metrics can be divided into two categories: reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their solution of global image-text compatibility, often have a deficiency in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, breaking through the barriers of the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics like CLIP-S and PAC-S, and reference-based metrics like METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed texts closely resembles interpretable human judgments. The code is available in the supplementary.



Paperid:449 Poster
Authors:Zeyu Xiao,Dachun Kai,Yueyi Zhang,Xiaoyan Sun,Zhiwei Xiong
Abstract:
Event cameras are novel bio-inspired cameras that record asynchronous events with high temporal resolution and dynamic range. Leveraging the auxiliary temporal information recorded by event cameras holds great promise for the task of video super-resolution (VSR). However, existing event-guided VSR methods assume that the event and RGB cameras are strictly calibrated (e.g., pixel-level sensor designs in DAVIS 240/346). This assumption proves limiting in emerging high-resolution devices, such as dual-lens smartphones and unmanned aerial vehicles, where such precise calibration is typically unavailable. To unlock more event-guided application scenarios, we propose to perform the task of asymmetric event-guided VSR for the first time, and we propose an Asymmetric Event-guided VSR Network (AsEVSRN) for this new task. AsEVSRN incorporates two specialized designs for leveraging the asymmetric event stream in VSR. Firstly, the content hallucination module dynamically enhances event and RGB information by exploiting their complementary nature, thereby adaptively boosting representational capacity. Secondly, the event-enhanced bidirectional recurrent cells align and propagate temporal features fused with features from content-hallucinated frames. Within the bidirectional recurrent cells, event-enhanced flow is employed for simultaneous utilization and fusion of temporal information at both the feature and pixel levels. Comprehensive experimental results affirm that our method consistently produces superior results both quantitatively and qualitatively. Code will be released.



Paperid:450 Poster
Authors:Jiaxing Li,Hongbo Zhao,Yijun Wang,Jianxin Lin
Abstract:
Video colorization poses challenging tasks, necessitating structural stability, continuity, and fine control over the colors produced. In this paper, based on a pretrained text-to-image model, we introduce the $\textbf{Gated Color Guidance}$ module ($\textbf{GCG}$), enabling the model to adaptively perform color propagation or generation according to the structural differences between reference and grayscale frames. Based on this multifunctionality, we propose a novel two-stage coloring strategy. In the first stage, under the reference-mask condition, the model autonomously and jointly colors input keyframes in a one-to-many color domain mapping, while temporal coherence constraints are emphasized by modifying the attention mechanism. In the second stage, under the reference-guided condition, the model effectively captures the colors of matching structures in the reference, and we further introduce the $\textbf{Sliding Reference Grid}$ strategy ($\textbf{SRG}$) to merge and extract the color features from multiple frames, providing more stable coloring for the grayscale frames. Through this pipeline, we can achieve high-quality and stable video coloring while maintaining the accuracy of detailed colors. Additionally, the two-stage strategy is flexible and detachable, allowing users to adjust the number of selected reference frames to balance coloring quality and efficiency. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art models in both qualitative comparison and quantitative measurement.



Paperid:451 Poster
Authors:Jing Zhou,Ziqi Yu,Zhongyun Bao,Gang Fu,Weilei He,Chao Liang,Chunxia Xiao
Abstract:
We propose a method for light and shadow editing of outdoor disharmonious composite images, including foreground harmonization and cast shadow generation. Most existing work can only perform foreground appearance editing tasks or only focus on shadow generation. In fact, lighting not only affects the brightness and color of objects, but also produces corresponding cast shadows. In recent years, diffusion models have demonstrated strong generative capabilities, and due to their iterative denoising properties, they have a significant advantage in image restoration tasks. However, they often fail to preserve the content structure of the image. To this end, we propose an effective model to tackle the problem of foreground light and shadow editing. Specifically, we use a coarse shadow prediction module (SP) to generate coarse shadows for foreground objects. Then, we use the predicted results as prior knowledge to guide the generation of the harmonization diffusion model. In this process, the primary task is to learn lighting variation to harmonize foreground regions. The secondary task is to generate high-quality cast shadows containing more details. Considering that existing datasets do not support the dual tasks of image harmonization and shadow generation, we construct a real outdoor dataset, IH-SG, covering various lighting conditions. Extensive experiments conducted on existing benchmark datasets and the IH-SG dataset demonstrate the superiority of our method.



Paperid:452 Poster
Authors:Minjing Yu,Lingzhi Zeng,Xinxin Du,Jenny Sheng,Qiantian Liao,Yong-jin Liu
Abstract:
Hanfu is the representative traditional costume of the Han nationality in China, which carries the outstanding craftsmanship of dyeing, weaving, and embroidery, and is of great significance to the inheritance of traditional culture. However, existing approaches to Hanfu promotion still have shortcomings that hinder the inheritance of Hanfu culture. In this work, we developed the VisHanfu virtual reality system by focusing on the "Cross-Shaped Flat Structure", which is an integral feature of Hanfu. We have digitally restored five representative Hanfu historical artifacts and provided an interactive making experience. Combined with highly realistic cloth simulation techniques, it allows users to interactively observe the movement effects of the Hanfu. The results of user experiments show that our system can provide a favorable experience for users and bring better learning outcomes, which helps users enhance their interest in learning and thus contributes to the inheritance of Hanfu culture.



Paperid:453 Poster
Authors:Hao Yang,Min Wang,zhengfei Yu,Zhi Zeng,Mingrui Lao,Yun Zhou
Abstract:
The security of Deep Neural Networks (DNNs) has proven to be critical for their applicability in real-world scenarios. However, DNNs are well known to be vulnerable to adversarial attacks, such as adding artificially designed, imperceptible perturbations to benign inputs. Therefore, adversarial robustness is essential for DNNs to defend against malicious attacks. Stochastic Neural Networks (SNNs) have recently shown effective performance in enhancing adversarial robustness by injecting uncertainty into models. Nevertheless, existing SNNs are still limited for adversarial defense, as their representation capability is insufficient due to fixed uncertainty. In this paper, to elevate the feature representation capability of SNNs, we propose a novel yet practical stochastic neural network that maximizes feature distribution variance (MFDV-SNN). In addition, we provide theoretical insights to support the adversarial resistance of MFDV, which is primarily derived from the stochastic noise we inject into DNNs. Our research demonstrates that by gradually increasing the level of stochastic noise in a DNN, the model naturally becomes more resistant to input perturbations. Since adversarial training is not required, MFDV-SNN does not compromise clean-data accuracy and saves up to 7.5 times the computation time. Extensive experiments on various attacks demonstrate that MFDV-SNN improves adversarial robustness significantly compared to other methods.
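A minimal sketch of a stochastic feature layer with a variance-encouraging training term, in the spirit of maximizing feature distribution variance, is given below; the layer design, the noise model, and the loss weighting are assumptions, not the exact MFDV-SNN objective.

```python
# Hedged sketch: reparameterized Gaussian feature layer plus a loss that rewards
# larger injected variance (clamped for stability). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_sigma = nn.Linear(dim, dim)

    def forward(self, x):
        mu, sigma = self.mu(x), self.log_sigma(x).exp()
        feat = mu + sigma * torch.randn_like(sigma)   # reparameterized noise
        return feat, sigma

def variance_regularized_loss(logits, target, sigma, lam=0.1):
    ce = F.cross_entropy(logits, target)              # task loss
    var_bonus = sigma.pow(2).mean().clamp(max=10.0)   # encourage larger variance
    return ce - lam * var_bonus

if __name__ == "__main__":
    layer, head = StochasticLayer(16), nn.Linear(16, 10)
    x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))
    feat, sigma = layer(x)
    print(variance_regularized_loss(head(feat), y, sigma))
```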



Paperid:454 Poster
Authors:Pinhan Fu,Xinyan Liang,Yuhua Qian,Qian Guo,Zhifang Wei,Wen Li
Abstract:
Most existing NAS-based multi-modal classification (MMC-NAS) methods are optimized using classification accuracy alone. They cannot simultaneously provide multiple models with diverse preferences, such as model complexity and classification performance, to meet different users' demands. Combining NAS-MMC with multi-objective optimization is a natural way to address this issue. However, the main challenge of this solution is its high computation cost. For multi-objective optimization, the computational bottleneck is the Pareto front search. Some higher-quality MMC models (namely core structures, CSs) consisting of high-quality features and fusion operators are easier to identify. We find that CSs have a close relation with the Pareto front (PF), i.e., the individuals lying on the PF contain the CSs. Based on this finding, we propose an efficient multi-objective neural architecture search for multi-modal classification that applies CSs to guide the PF search (CoMO-NAS). Experimental results thoroughly demonstrate the effectiveness of our CoMO-NAS. Compared to state-of-the-art competitors on benchmark multi-modal tasks, we achieve comparable performance with lower model complexity in a shorter search time.



Paperid:455 Poster
Authors:Xiao-Qian Liu,Ming-Hui Liu,Zhen-Duo Chen,Xin Luo,Xin-Shun Xu
Abstract:
Multilingual text recognition (MLTR) is increasingly essential for facilitating cultural communication. However, existing methods often struggle with retaining previous language knowledge when learning new languages. A straightforward solution is performing incremental learning (IL) on MLTR tasks. However, it ignores the shared words and characters across incremental languages, which we first term as an incremental sharing problem. Motivated by this observation, we propose a HierArchical Multi-label learning framework for Multilingual tExt Recognition, termed HAMMER. An online knowledge analysis is designed to identify shared knowledge and provide corresponding multi-label language supervision. Specifically, only words and characters appearing simultaneously in multiple languages are considered shared knowledge. Additionally, to further capture language dependencies, we introduce a hierarchical language evaluation mechanism to predict language scores at word and character levels. These scores, supervised by the knowledge analysis, guide the specific recognizers to effectively utilize both old and new language knowledge, thereby mitigating catastrophic forgetting caused by imbalanced rehearsal sets. Extensive experiments conducted on benchmark datasets, MLT17 and MLT19, show that HAMMER exhibits remarkable results and outperforms other state-of-the-art approaches.



Paperid:456 Poster
Authors:Yunshan Ma,Yingzhi He,WENJUN ZHONG,Xiang Wang,Roger Zimmermann,Tat-Seng Chua
Abstract:
Product bundling has been a prevailing marketing strategy that is beneficial in the online shopping scenario. Effective product bundling methods depend on high-quality item representations capturing both the individual items' semantics and cross-item relations. However, previous item representation learning methods, either feature fusion or graph learning, suffer from inadequate cross-modal alignment and struggle to capture the cross-item relations for cold-start items. Multimodal pre-train models could be the potential solutions given their promising performance on various multimodal downstream tasks. However, the cross-item relations have been under-explored in the current multimodal pre-train models. To bridge this gap, we propose a novel and simple framework Cross-Item Relational Pre-training (CIRP) for item representation learning in product bundling. Specifically, we employ a multimodal encoder to generate image and text representations. Then we leverage both the cross-item contrastive loss (CIC) and individual item's image-text contrastive loss (ITC) as the pre-train objectives. Our method seeks to integrate cross-item relation modeling capability into the multimodal encoder. Therefore, even for cold-start items that have no relations, their representations are still relation-aware. Furthermore, to eliminate the potential noise and reduce the computational cost, we harness a relation pruning module to remove the noisy and redundant relations. We apply the item representations extracted by CIRP to the product bundling model ItemKNN, and experiments on three e-commerce datasets demonstrate that CIRP outperforms various leading representation learning methods.
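The sketch below shows one hedged way to combine an item-level image-text contrastive loss with a cross-item contrastive loss over related item pairs, which is the high-level recipe described for CIRP; the symmetric InfoNCE form, the equal loss weighting, and the embedding shapes are illustrative assumptions rather than the paper's exact objectives.

```python
# Hedged sketch: item-level image-text contrast (ITC) plus a cross-item
# contrast (CIC) over embeddings of related item pairs. Weighting and
# pairing scheme are assumptions.
import torch
import torch.nn.functional as F

def contrastive(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    target = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

def cirp_style_loss(img_emb, txt_emb, rel_img_emb, rel_txt_emb):
    itc = contrastive(img_emb, txt_emb)                    # image <-> its own text
    cic = contrastive(img_emb, rel_txt_emb) + contrastive(rel_img_emb, txt_emb)  # related items
    return itc + cic

if __name__ == "__main__":
    b, d = 8, 32
    embs = [torch.randn(b, d) for _ in range(4)]
    print(cirp_style_loss(*embs))
```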



Paperid:457 Poster
Authors:Yujia Xiao,Xi Wang,Xu Tan,Lei He,Xinfa Zhu,sheng zhao,Tan Lee
Abstract:
The latest Text-to-Speech (TTS) systems can produce speech with voice quality and naturalness comparable to human speech. Yet the demand for large amounts of high-quality data from target speakers remains a significant challenge. Particularly for long-form expressive reading, training speech from the target speaker that covers rich contextual information is needed. In this paper, a novel context-aware speech pre-trained model is developed for expressive TTS based on contrastive learning. The model can be trained with abundant speech data without explicitly labelled speaker identities. It captures the intricate relationship between the speech expression of a spoken sentence and the contextual text information. By incorporating cross-modal text and speech features into the TTS model, it enables the generation of coherent and expressive speech, which is especially beneficial when there is a scarcity of target speaker data. The pre-trained model is evaluated first on the task of Context-Speech retrieval and then as an integral part of a zero-shot TTS system. Experimental results demonstrate that the pretraining framework effectively learns Context-Speech representations and significantly enhances the expressiveness of synthesized speech. Audio demos are available at: https://ccsp2024.github.io/demo/.



Paperid:458 Poster
Authors:Liang Du,Yukai Shi,Yan Chen,Peng Zhou,Yuhua Qian
Abstract:
Incomplete Multi-View Clustering (IMVC) is crucial for multimedia data analysis. While graph learning-based IMVC methods have shown promise, they still have limitations. The prevalent first-order affinity graph often misclassifies out-of-neighborhood intra-cluster and in-neighborhood inter-cluster samples, a problem worsened by data incompleteness. These inaccuracies, combined with high computational demands, restrict their suitability for large-scale IMVC tasks. To address these issues, we propose a novel Fast and Scalable IMVC with duality Optimal graph Filtering (FSIMVC-OF). Rather than relying on predefined sample-side graph filters for higher-order interactions, we refine the clustering-friendly structure of the bipartite graph by learning an optimal filter within a consensus clustering framework. Instead of learning a sample-side filter, we optimize an anchor-side graph filter and apply it to the anchor side, ensuring computational efficiency with linear complexity, supported by the provable equivalence between these two types of graph filters. We present an alternating optimization algorithm with linear complexity, making it exceptionally well suited for large-scale tasks. Extensive experimental analysis demonstrates the superior performance of FSIMVC-OF over current IMVC methods.
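As a rough numerical sketch of why filtering on the anchor side keeps the cost low, the code below builds an anchor-anchor affinity from a bipartite graph and applies a generic low-pass filter to the anchor features only; the filter form $(I + \alpha L)^{-1}$ and the value of $\alpha$ are assumptions, not the learned optimal filter of FSIMVC-OF.

```python
# Hedged sketch: low-pass graph filtering applied on the (small) anchor side
# of a sample-anchor bipartite graph. Filter form and alpha are assumptions.
import numpy as np

def anchor_side_filter(B: np.ndarray, anchors: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    W = B.T @ B                                              # anchor-anchor affinity (m x m)
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(W.shape[0]) - d_inv_sqrt @ W @ d_inv_sqrt     # normalized Laplacian
    return np.linalg.solve(np.eye(W.shape[0]) + alpha * L, anchors)  # filtered anchors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = np.abs(rng.standard_normal((1000, 32)))   # 1000 samples, 32 anchors
    anchors = rng.standard_normal((32, 16))       # anchor features
    print(anchor_side_filter(B, anchors).shape)   # (32, 16): cost scales with anchors
```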



Paperid:459 Poster
Authors:Qishan Zhang,Shuangbing Wen,Tao Hu
Abstract:
Generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), frequently produce synthetic speech that is indistinguishable from genuine samples, posing challenges for individuals in discerning between real and synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voice signals presents significant challenges to privacy and security. In the field of deepfake audio detection, the majority of models achieving higher detection accuracy currently employ self-supervised pre-trained models. However, with the ongoing development of deepfake audio generation algorithms, maintaining high discrimination accuracy against new algorithms grows more challenging. To enhance the sensitivity of deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. Specifically, utilizing the pre-trained XLS-R enables our model to extract diverse audio features from its various layers, each providing distinct discriminative information. Utilizing the SLS classifier, our model captures sensitive contextual information across different layer levels of audio features, effectively employing this information for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with an Equal Error Rate (EER) of 1.92% on the ASVspoof 2021 DF dataset and 7.46% on the In-the-Wild dataset. Code will be publicly released in the near future.
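The sketch below shows a generic learnable weighting over per-layer hidden states from a pretrained speech encoder (XLS-R-style models expose one hidden state per transformer layer) followed by a small classifier; the softmax layer weighting, pooling, and head sizes are assumptions for illustration, not the paper's SLS classifier.

```python
# Hedged sketch: learnable per-layer weighting of encoder hidden states before
# a binary real/fake head. Hidden states are simulated with random tensors.
import torch
import torch.nn as nn

class LayerWeightedClassifier(nn.Module):
    def __init__(self, num_layers: int, dim: int, num_classes: int = 2):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # learnable layer weights
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.layer_logits, dim=0).view(-1, 1, 1, 1)
        fused = (w * hidden_states).sum(dim=0)   # weighted sum over layers
        pooled = fused.mean(dim=1)               # average over time
        return self.head(pooled)

if __name__ == "__main__":
    # Stand-in for XLS-R hidden states: 24 layers, batch 2, 50 frames, 1024 dims.
    states = torch.randn(24, 2, 50, 1024)
    print(LayerWeightedClassifier(24, 1024)(states).shape)  # torch.Size([2, 2])
```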



Paperid:460 Poster
Authors:Minghui Li,Jiangxiong Wang,Hao Zhang,Ziqi Zhou,Shengshan Hu,pei Xiaobing
Abstract:
The success of deep face recognition (FR) systems has raised serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Previous studies proposed introducing imperceptible adversarial noise into face images to deceive those face recognition models, thus achieving the goal of enhancing facial privacy protection. Nevertheless, they heavily rely on user-chosen references to guide the generation of adversarial noise, and cannot simultaneously construct natural and highly transferable adversarial face images in black-box scenarios. In light of this, we present a novel face privacy protection scheme with improved transferability while maintaining high visual quality. We propose shaping the entire face space directly instead of exploiting one kind of facial characteristic, such as makeup information, to integrate adversarial noise. To achieve this goal, we first exploit a global adversarial latent search to traverse the latent space of the generative model, thereby creating natural adversarial face images with high transferability. We then introduce a key landmark regularization module to preserve the visual identity information. Finally, we investigate the impact of various kinds of latent spaces and find that the $\mathcal{F}$ latent space benefits the trade-off between visual naturalness and adversarial transferability. Extensive experiments on two datasets demonstrate that our approach significantly enhances attack transferability while maintaining high visual quality, outperforming state-of-the-art methods by an average of 25% on deep FR models and 10% on commercial FR APIs, including Face++, Aliyun, and Tencent.



Paperid:461 Poster
Authors:Guofan Fan,Zekun Qi,Wenkai Shi,Kaisheng Ma
Abstract:
Geometry and color information provided by point clouds are both crucial for 3D scene understanding. The two types of information characterize different aspects of point clouds, but existing methods lack an elaborate design for their discrimination and relevance. Hence we explore a 3D self-supervised paradigm that can better utilize the relations of point cloud information. Specifically, we propose a universal 3D scene pre-training framework via Geometry-Color Contrast (Point-GCC), which aligns geometry and color information using a Siamese network. To accommodate practical application tasks, we design (i) hierarchical supervision with point-level contrast and reconstruction, and object-level contrast based on a novel deep clustering module to close the gap between pre-training and downstream tasks; and (ii) an architecture-agnostic backbone to adapt to various downstream models. Benefiting from the object-level representation associated with downstream tasks, Point-GCC can directly evaluate model performance, and the results demonstrate the effectiveness of our method. Transfer learning results on a wide range of tasks also show consistent improvements across all datasets, e.g., new state-of-the-art object detection results on the SUN RGB-D and S3DIS datasets. Code will be released on GitHub.



Paperid:462 Poster
Authors:Wu Ran,Peirong Ma,Zhiquan He,Hong Lu
Abstract:
We address image deraining under complex backgrounds, diverse rain scenarios, and varying illumination conditions, representing a highly practical and challenging problem. Our approach utilizes synthetic, real-world, and nighttime datasets, wherein rich backgrounds, multiple degradation types and diverse illumination conditions coexist. The primary challenge in training models on these datasets arises from the discrepancies among them, potentially leading to conflicts or competition during the training period. To address this issue, we first align the distribution of synthetic, real-world and nighttime datasets. Then we propose a novel contrastive learning strategy to extract multi-view representations that effectively capture image details, degradations, and illuminations, thereby facilitating training across all datasets. Regarding multi-view representations as profitable prompts for deraining, we devise a prompting strategy to integrate them into the decoding process. This contributes to the development of Rainmer, a potent U-Net-based deraining model. Additionally, a spatial-channel interaction module is introduced to fully exploit cues when extracting multi-view representations. Extensive experiments on synthetic, real-world, and nighttime datasets demonstrate that Rainmer outperforms current representative methods. Moreover, Rainmer achieves superior performance on the All-in-One image restoration dataset, underscoring its effectiveness. Furthermore, quantitative results reveal that Rainmer significantly improves object detection performance on both daytime and nighttime rainy datasets. These observations substantiate the potential of Rainmer for practical applications.



Paperid:463 Poster
Authors:Ziming Wang,Boxiang Zhang,Ming Ma,Yue Wang,Taoli Du,Wenhui Li
Abstract:
Point cloud segmentation forms the foundation of 3D scene understanding. Boundaries, the intersections of regions, are prone to mis-segmentation. Current point cloud segmentation models exhibit unsatisfactory performance on boundaries. There is limited focus on explicitly addressing semantic segmentation of point cloud boundaries. We introduce a method called Multi-fineness Boundary Constraint (MBC) to tackle this challenge. By querying boundaries at various degrees of fineness and imposing feature constraints within these boundary areas, we enhance the discrimination between boundaries and non-boundaries, improving point cloud boundary segmentation. However, solely emphasizing boundaries may compromise the segmentation accuracy in broader non-boundary regions. To mitigate this, we introduce a new concept of point cloud space termed ensemble and a Shifted Ensemble-aware Perception (SEP) module. This module establishes information interactions between points with minimal computational cost, effectively capturing direct point-to-point long-range correlations within ensembles. It enhances segmentation performance for both boundaries and non-boundaries. We conduct experiments on multiple benchmarks. The experimental results demonstrate that our method achieves performance surpassing or comparable to state-of-the-art methods, validating the effectiveness and superiority of our approach.



Paperid:464 Poster
Authors:Yuan Xie,Yichen Zhang,Yifang Yin,SHENG ZHANG,Ying Zhang,Rajiv Ratn Shah,Roger Zimmermann,Guoqing Xiao
Abstract:
The wide use of mobile devices has led to a proliferated creation of extensive trajectory data, rendering trajectory classification increasingly vital and challenging for downstream applications such as trip time prediction and trip recommendations. Existing deep learning methods offer powerful feature extraction capabilities to detect nuanced variances in trajectory classification tasks. However, their effectiveness remains compromised by the following two unsolved challenges. First, identifying the distribution of nearby trajectories based on noisy and sparse GPS coordinates poses a significant challenge. It can provide critical contextual features to the classification, which has not been fully explored in previous work. Second, though efforts have been made to incorporate a shape feature by rendering trajectories into images, they fail to model the local correspondence between GPS points and image pixels. Such information loss results in a substantial decline in performance. To address these issues, we propose a novel model termed Traj2Former to spotlight the spatial distribution of the adjacent trajectory points (i.e., contextual snapshot) and enhance the snapshot fusion between the trajectory data and the corresponding spatial contexts. We propose a new GPS rendering method to generate the contextual snapshots, but it is worth noting that our Traj2Former method is agnostic to the context source, which can vary from trajectory database, digital map, to satellite imagery. Moreover, to capture diverse temporal patterns, we conduct a multi-scale sequential fusion by compressing the trajectory data with differing rates. Extensive experiments have been conducted to verify the superiority of our Traj2Former model, which achieves state-of-the-art classification accuracy on two real-world datasets.



Paperid:465 Poster
Authors:Feihong Lu,Weiqi Wang,Yangyifei Luo,Ziqin Zhu,Qingyun Sun,Baixuan Xu,Haochen Shi,Shiqi Gao,Qian Li,Yangqiu Song,Jianxin Li
Abstract:
Social media has become ubiquitous for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging due to the implicit and commonsense nature of these intentions, the need for cross-modality understanding of both text and images, and the presence of noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, our approach uses an MLLM to interpret the image, an LLM to extract key information from the text, and another LLM to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. Moreover, we conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation, and further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.



Paperid:466 Poster
Authors:Guangchen Shi,Wei Zhu,Yirui Wu,Danhuai Zhao,Kang Zheng,Tong Lu
Abstract:
Few-shot semantic segmentation (FSS) aims to locate pixels of unseen classes with clues from a few labeled samples. Recently, thanks to profound prior knowledge, diffusion models have been expanded to achieve FSS tasks. However, due to probabilistic noising and denoising processes, it is difficult for them to maintain spatial relationships between inputs and outputs, leading to inaccurate segmentation masks. To address this issue, we propose a Diffusion-based Segmentation network (DiffSeg), which decouples probabilistic denoising and segmentation processes. Specifically, DiffSeg leverages attention maps extracted from a pretrained diffusion model as support-query interaction information to guide segmentation, which mitigates the impact of probabilistic processes while benefiting from rich prior knowledge of diffusion models. In the segmentation stage, we present a Perceptual Attention Module (PAM), where two cross-attention mechanisms capture semantic information of support-query interaction and spatial information produced by the pretrained diffusion model. Furthermore, a self-attention mechanism within PAM ensures a balanced dependence for segmentation, thus preventing inconsistencies between the aforementioned semantic and spatial information. Additionally, considering the uncertainty inherent in the generation process of diffusion models, we equip DiffSeg with a Spatial Control Module (SCM), which models spatial structural information of query images to control boundaries of attention maps, thus aligning the spatial location between knowledge representation and query images. Experiments on PASCAL-5$^i$ and COCO datasets show that DiffSeg achieves new state-of-the-art performance with remarkable advantages.



Paperid:467 Poster
Authors:Haowei Kuang,Yiyang Ma,Wenhan Yang,Zongming Guo,Jiaying Liu
Abstract:
Diffusion models show impressive performances in image generation with excellent perceptual quality. However, its tendency to introduce additional distortion prevents its direct application in image compression. To address the issue, this paper introduces a Consistency Guided Diffusion Model (CGDM) tailored for perceptual image compression, which integrates an end-to-end image compression model with a diffusion-based post-processing network, aiming to learn richer detail representations with less fidelity loss. In detail, the compression and post-processing networks are cascaded and a branch of consistency guided features is added to constrain the deviation in the diffusion process for better reconstruction quality. Furthermore, a Syntax driven Feature Fusion (SFF) module is constructed to take an extra ultra-low bitstream from the encoding end as input, guiding the adaptive fusion of information from the two branches. In addition, we design a globally uniform boundary control strategy with overlapped patches and adopt a continuous online optimization mode to improve both coding efficiency and global consistency. Extensive experiments validate the superiority of our method to existing perceptual compression techniques and the effectiveness of each component in our method.



Paperid:468 Poster
Authors:Shudong Huang,Hecheng Cai,Hao Dai,Wentao Feng,Jiancheng Lv
Abstract:
Multi-view clustering has garnered attention for its effectiveness in addressing heterogeneous data by unsupervisedly revealing underlying correlations between different views. As a mainstream method, multi-view graph clustering has attracted increasing attention in recent years. Despite its success, it still has some limitations. Notably, many methods construct the similarity graph without considering the local geometric structure and exploit coarse-grained complementary and consensus information from different views at the view level. To solve the shortcomings, we focus on local structure consistency and fine-grained representations across multiple views. Specifically, each view's local consistency similarity graph is obtained through the adaptive neighbor. Subsequently, the multi-view similarity tensor is rotated and sliced into fine-grained instance-wise slices. Finally, these slices are fused into the final similarity matrix. Consequently, cross-view consistency can be captured by exploring the intersections of multiple views in an instance-wise manner. We design a collaborative framework with the augmented Lagrangian method to refine all subtasks towards optimal solutions iteratively. Extensive experiments on several multi-view datasets confirm the significant enhancement in clustering accuracy achieved by our method.
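A small numerical sketch of the rotate-and-slice idea is given below: per-view similarity matrices are stacked into an n x n x V tensor and fused slice-by-slice at the instance level; the simple averaging fusion is a placeholder assumption standing in for the paper's collaborative optimization.

```python
# Hedged sketch: stack per-view similarity matrices, take instance-wise slices,
# and fuse each slice across views. Averaging is a placeholder fusion rule.
import numpy as np

def instance_wise_fusion(similarities: list[np.ndarray]) -> np.ndarray:
    tensor = np.stack(similarities, axis=-1)   # shape (n, n, V)
    n = tensor.shape[0]
    fused = np.zeros((n, n))
    for i in range(n):
        slice_i = tensor[i]                    # instance-wise slice, shape (n, V)
        fused[i] = slice_i.mean(axis=-1)       # fuse across views for instance i
    return 0.5 * (fused + fused.T)             # symmetrize the final similarity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = [rng.random((6, 6)) for _ in range(3)]
    print(instance_wise_fusion(views).shape)   # (6, 6)
```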



Paperid:469 Poster
Authors:Yinghui Sun,Xingfeng Li,Sun Quansen,Min-Ling Zhang,Zhenwen Ren
Abstract:
Recently, the tensor Schatten $p$-norm has achieved impressive performance for fast multi-view clustering \cite{xia2023tensorized}. This is primarily ascribed to the superiority of the tensor Schatten $p$-norm in exploring high-order structural information among views. However, 1) the tensor Schatten $p$-norm treats different singular values equally, so that the larger singular values corresponding to significant feature information (i.e., prior information) are not fully utilized; 2) it also ignores ranking the entries of the core tensor, which may contain noise; and 3) existing methods select fixed anchors or update anchors by averaging to construct the neighbor bipartite graphs, greatly limiting the flexibility and expressiveness of anchors. To break these limitations, we propose a novel \textbf{Improved Weighted Tensor Schatten $p$-Norm for Fast Multi-view Graph Clustering (IWTSN-FMGC)}. Specifically, to eliminate the interference of the first two limitations, we propose an improved weighted tensor Schatten $p$-norm to dynamically rank the core tensor and automatically shrink singular values. To this end, the improved weighted tensor Schatten $p$-norm has the potential to more effectively leverage low-rank structures and prior information, thereby enhancing robustness compared to current tensor Schatten $p$-norm methods. Further, the designed adaptive neighbor bipartite graph learning can encode the local manifold structure information more flexibly and expressively than existing anchor selection and averaged anchor updating. Extensive experiments validate the effectiveness and superiority of our method across multiple benchmark datasets.
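For intuition, the sketch below evaluates a weighted Schatten p-norm on a single matrix, where weights derived from the singular values penalize large (informative) singular values less; the specific weighting rule is a toy assumption, not the paper's improved weighted norm.

```python
# Hedged sketch: weighted Schatten p-norm ||X||_{w,Sp}^p = sum_i w_i * sigma_i^p,
# with smaller weights on leading singular values so they are shrunk less.
# The weighting rule below is a toy choice for illustration.
import numpy as np

def weighted_schatten_p(x: np.ndarray, p: float = 0.5, eps: float = 1e-3) -> float:
    sigma = np.linalg.svd(x, compute_uv=False)       # singular values, descending
    weights = 1.0 / (sigma + eps)                    # larger sigma -> smaller weight
    weights = weights / weights.sum() * len(sigma)   # normalize to mean 1
    return float(np.sum(weights * sigma ** p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_rank = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))
    noisy = low_rank + 0.1 * rng.standard_normal((50, 50))
    print(weighted_schatten_p(low_rank), weighted_schatten_p(noisy))
```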



Paperid:470 Poster
Authors:Yuyan Chen,Songzhou Yan,Zhihong Zhu,Zhixu Li,Yanghua Xiao
Abstract:
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, manifest a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.



Paperid:471 Poster
Authors:Changhao He,Hongyuan Zhu,Peng Hu,Xi Peng
Abstract:
Although multi-view learning has achieved remarkable progress over the past decades, most existing methods implicitly assume that all views (or modalities) are well-aligned. In practice, however, collecting fully aligned views is challenging due to complexities and discordances in time and space, resulting in the Partially View-unaligned Problem (PVP), such as audio-video asynchrony caused by network congestion. While some methods are proposed to align the unaligned views by learning view-invariant representations, almost all of them overlook specific information across different views for complementarity, limiting performance improvement. To address these problems, we propose a robust framework, dubbed VariatIonal ConTrAstive Learning (VITAL), designed to learn both common and specific information simultaneously. To be specific, each data sample is first modeled as a Gaussian distribution in the latent space, where the mean estimates the most probable common information, and the variance indicates view-specific information. Second, by using variational inference, VITAL conducts intra- and inter-view contrastive learning to preserve common and specific semantics in the distribution representations, thereby achieving comprehensive perception. As a result, the common representation (mean) could be used to guide category-level realignment, while the specific representation (variance) complements sample semantic information, thereby boosting overall performance. Finally, considering the abundance of False Negative Pairs (FNPs) generated by unsupervised contrastive learning, we propose a robust loss function that seamlessly incorporates FNP rectification into the contrastive learning paradigm. Empirical evaluations on eight benchmark datasets reveal that VITAL outperforms ten state-of-the-art deep clustering baselines, demonstrating its efficacy in both partially and fully aligned scenarios.
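The sketch below captures the general recipe described above: encode each view as a Gaussian, sample with the reparameterization trick, and align views with an InfoNCE-style contrastive loss on the means; the encoder sizes, the temperature, and the use of means only are illustrative assumptions, not the full VITAL objective with FNP rectification.

```python
# Hedged sketch: per-view Gaussian encoders (mean ~ common semantics,
# variance ~ view-specific spread) plus an InfoNCE loss aligning the means.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

def info_nce(a, b, tau=0.2):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                  # pairwise similarities
    target = torch.arange(a.size(0))          # i-th row matches i-th column
    return F.cross_entropy(logits, target)

if __name__ == "__main__":
    enc1, enc2 = GaussianEncoder(20, 8), GaussianEncoder(30, 8)
    x1, x2 = torch.randn(16, 20), torch.randn(16, 30)
    (_, mu1, _), (_, mu2, _) = enc1(x1), enc2(x2)
    print(info_nce(mu1, mu2))
```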



Paperid:472 Poster
Authors:Weixuan Tang,Haoyu Yang,Yuan Rao,Zhili Zhou,Fei Peng
Abstract:
Deep steganography is a technique that imperceptibly hides secret information in images via neural networks. Existing networks consist of two main components, including a hiding component for information hiding and an adversary component for countering steganalyzers. However, these two components are two ends of the seesaw, and it is difficult to balance the tradeoff between message extraction accuracy and security performance by joint optimization. To address the issues, this paper proposes a steganographic method called AHDeS (Adversary-Hiding-Decoupled Steganography) under the Dig-and-Fill paradigm, wherein the adversary and hiding components can be decoupled into an optimization-based adversary module in the digging process and an INN-based hiding network in the filling process. Specifically, the INN is first trained to implement message embedding and extraction. Given the well-trained and fixed INN, the cover image is iteratively optimized under the frequency compensation mechanism for enhancing the security performance against steganalyzers. Owing to the reversibility of the INN, security performance can be enhanced without sacrificing message extraction accuracy. Experimental results show that the proposed AHDeS achieves state-of-the-art security performance and visual quality while maintaining satisfactory message extraction accuracy.



Paperid:473 Poster
Authors:Tianshan Liu,Kin-man Lam,Bingkun BAO
Abstract:
Panoramic activity recognition is a comprehensive yet challenging task in crowd scene understanding, which aims to concurrently identify multi-grained human behaviors, including individual actions, social group activities, and global activities. Previous studies tend to capture cross-granularity activity-semantics relations from solely the video input, thus ignoring the intrinsic semantic hierarchy in label-text space. To this end, we propose a label text-aided hierarchical semantics mining (THSM) framework, which explores multi-level cross-modal associations by learning hierarchical semantic alignment between visual content and label texts. Specifically, a hierarchical encoder is first constructed to encode the visual and text inputs into semantics-aligned representations at different granularities. To fully exploit the cross-modal semantic correspondence learned by the encoder, a hierarchical decoder is further developed, which progressively integrates the lower-level representations with the higher-level contextual knowledge for coarse-to-fine action/activity recognition. Extensive experimental results on the public JRDB-PAR benchmark validate the superiority of the proposed THSM framework over state-of-the-art methods.



Paperid:474 Poster
Authors:Minjing Yu,Delong Pang,Ziwen Kang,Zhiyao Sun,Tian Lv,Jenny Sheng,Ran Yi,Yu-Hui Wen,Yong-jin Liu
Abstract:
Speech-driven 3D facial animation has attracted considerable attention due to its extensive applicability across diverse domains. The majority of existing 3D facial animation methods ignore the avatar's expression, while emotion-controllable methods struggle with specifying the avatar's identity and portraying various emotional intensities, resulting in a lack of naturalness and realism in the animation. To address this issue, we first present the Emolib dataset, containing 10,736 expression images across eight emotion categories, i.e., neutral, happy, angry, sad, fear, surprise, disgust, and contempt, where each image is accompanied by a corresponding emotion label and a 3D model with expression. Additionally, we present a novel 3D facial animation framework that operates with unpaired training data. This framework produces emotional facial animations aligned with the input face image, effectively conveying diverse emotional expressions and intensities. Our framework first generates lip-synchronized models and expression models separately. These models are then combined using a fusion network to generate face models that effectively synchronize with speech while conveying emotions. Moreover, the mouth structure is incorporated to create a comprehensive face model. This model is then fed into our skin-realistic renderer, resulting in a highly realistic animation. Experimental results demonstrate that our approach outperforms state-of-the-art 3D facial animation methods in terms of realism and emotional expressiveness while also maintaining precise lip synchronization.



Paperid:475 Poster
Authors:Zhixiang Shen,Haolan He,zhao kang
Abstract:
Multi-relational graph clustering has demonstrated remarkable success in uncovering underlying patterns in complex networks. Representative methods manage to align different views motivated by advances in contrastive learning. Our empirical study finds the pervasive presence of imbalance in real-world graphs, which is in principle contradictory to the motivation of alignment. In this paper, we first propose a novel metric, the Aggregation Class Distance, to empirically quantify structural disparities among different graphs. To address the challenge of view imbalance, we propose Balanced Multi-Relational Graph Clustering (BMGC), comprising unsupervised dominant view mining and dual signals guided representation learning. It dynamically mines the dominant view throughout the training process, synergistically improving clustering performance with representation learning. Theoretical analysis ensures the effectiveness of dominant view mining. Extensive experiments and in-depth analysis on real-world and synthetic datasets showcase that BMGC achieves state-of-the-art performance, underscoring its superiority in addressing the view imbalance inherent in multi-relational graphs.



Paperid:476 Poster
Authors:Zerui Zhang,Jun Yu,Liangxian Cui,Qiang Ling,TianyuLiu
Abstract:
Self-supervised category-level 6D pose estimation stands as a fundamental task in computer vision. Nonetheless, existing methods encounter the following challenges: 1) They are impacted by the many-to-one ambiguity in the correspondences between pixels and point clouds. 2) Existing networks struggle to reconstruct precise object models due to the significant part-level shape variations among specific categories. To address these issues, we propose a novel method based on a Coarse-to-Fine Correspondence Optimization (\textbf{CFCO}) module and a Part-level Shape Reconstruction (\textbf{PSR}) module. In the \textbf{CFCO} module, we employ Hungarian matching to generate one-to-one pseudo labels at both region and pixel levels, providing explicit supervision for the corresponding similarity matrices. In the \textbf{PSR} module, we introduce a part-level discrete shape memory to capture more fine-grained shape variations of different objects and utilize it to perform precise reconstruction. We evaluate our method on the REAL275 and WILD6D datasets. Extensive experiments demonstrate that our method outperforms existing methods, achieving new state-of-the-art results.
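
As a toy illustration of the one-to-one matching step (built on SciPy's assignment solver; the shapes and similarity matrix below are made up, and this is not the authors' CFCO module):

import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_pseudo_labels(similarity):
    # similarity: (num_rows, num_cols) predicted similarity matrix. Hungarian
    # matching converts soft, possibly many-to-one scores into a one-to-one
    # assignment that can explicitly supervise the similarity matrix.
    row, col = linear_sum_assignment(-similarity)   # maximise total similarity
    labels = np.zeros_like(similarity)
    labels[row, col] = 1.0
    return labels

sim = np.random.rand(32, 32)
pseudo = one_to_one_pseudo_labels(sim)
print(pseudo.sum(axis=0).max(), pseudo.sum(axis=1).max())  # both 1.0: one-to-one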



Paperid:477 Poster
Authors:Chen Hui,Haiqi Zhu,Shuya Yan,Shaohui Liu,Feng Jiang,Debin Zhao
Abstract:
Deep network-based image Compressive Sensing (CS) has attracted much attention in recent years. However, there still exist the following two issues: 1) Existing methods typically use fixed-scale sampling, which leads to limited insights into the image content. 2) Most pre-trained models can only handle fixed sampling rates and fixed block scales, which restricts the scalability of the model. In this paper, we propose a novel scale-aware scalable CS network (dubbed S$^2$-CSNet), which achieves scale-aware adaptive sampling, fine granular scalability and high-quality reconstruction with one single model. Specifically, to enhance the scalability of the model, a structural sampling matrix with a predefined order is first designed, which is a universal sampling matrix that can sample multi-scale image blocks with arbitrary sampling rates. Then, based on the universal sampling matrix, a distortion-guided scale-aware scheme is presented to achieve scale-variable adaptive sampling, which predicts the reconstruction distortion at different sampling scales from the measurements and selects the optimal division scale for sampling. Furthermore, a multi-scale hierarchical sub-network under a well-defined compact framework is put forward to reconstruct the image. In the multi-scale feature domain of the sub-network, a dual spatial attention is developed to explore the local and global affinities between dense feature representations for deep fusion. Extensive experiments demonstrate that the proposed S$^2$-CSNet outperforms existing state-of-the-art CS methods.
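
To make the "one matrix, arbitrary rate" idea concrete, here is a generic sketch (random orthonormal rows in a fixed, predefined order; not the paper's learned structural matrix): sampling at any rate simply takes a prefix of the rows.

import numpy as np

def structural_sampling_matrix(block_size, seed=0):
    # One full orthonormal matrix whose rows are fixed in a predefined order;
    # every sampling rate reuses a prefix of the same matrix.
    n = block_size * block_size
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def sample_block(block, phi_full, rate):
    m = max(1, int(round(rate * phi_full.shape[0])))
    return phi_full[:m] @ block.reshape(-1)   # measurements at the chosen rate

phi = structural_sampling_matrix(16)
block = np.random.rand(16, 16)
print(sample_block(block, phi, 0.1).shape, sample_block(block, phi, 0.5).shape)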



Paperid:478 Poster
Authors:Bing Wang,Shengsheng Wang,Changchun Li,Renchu Guan,Ximing Li
Abstract:
Nowadays, misinformation is widely spreading over various social media platforms and causes extremely negative impacts on society. To combat this issue, automatically identifying misinformation, especially those containing multimodal content, has attracted growing attention from the academic and industrial communities, and induced an active research topic named Multimodal Misinformation Detection (MMD). Typically, existing MMD methods capture the semantic correlation and inconsistency between multiple modalities, but neglect some potential clues in multimodal content. Recent studies suggest that manipulated traces of the images in articles are non-trivial clues for detecting misinformation. Meanwhile, we find that the underlying intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Accordingly, in this work, we propose to detect misinformation by learning manipulation features that indicate whether the image has been manipulated, as well as intention features regarding the harmful and harmless intentions of the manipulation. Unfortunately, the manipulation and intention labels that make these features discriminative are unknown. To overcome the problem, we propose two weakly supervised signals as alternatives by introducing additional datasets on image manipulation detection and formulating two classification tasks as positive and unlabeled learning problems. Based on these ideas, we propose a novel MMD method, namely Harmfully Manipulated Images Matter in MMD (MANI-M$^3$D). Extensive experiments across three benchmark datasets demonstrate that MANI-M$^3$D can consistently improve the performance of any MMD baseline.



Paperid:479 Poster
Authors:Gongli Xi,Ye Tian,Mengyu Yang,Lanshan Zhang,Xirong Que,Wendong Wang
Abstract:
Masked image modeling (MIM), as a self-supervised learning paradigm in computer vision, has gained widespread attention among researchers. MIM operates by training the model to predict masked patches of the image. Given the sparse nature of image semantics, it is imperative to devise a masking strategy that steers the model towards reconstructing high-semantic regions. However, conventional mask strategies often miss these high-semantic regions or lack alignment between masks and semantics. To solve this, we propose the Global Patch-wise Attention (GPA) framework, a transferable and efficient framework for MIM pre-training. We observe that the attention between patches can be the metric for identifying high-semantic regions, which can guide the model to learn more effective representations. Therefore, we first define the global patch-wise attention via vision transformer blocks. Then we design soft-to-hard mask generation to guide the model to gradually focus on high-semantic regions identified by GPA (GPA as a teacher). Finally, we design an extra task to predict GPA (GPA as a feature). Experiments conducted under various settings demonstrate that our proposed GPA framework enables MIM to learn better representations, which benefit the model across a wide range of downstream tasks. Furthermore, our GPA framework can be easily and effectively transferred to various MIM architectures.
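
The sketch below illustrates the general idea of attention-guided, soft-to-hard masking (patch scores from a ViT attention map plus a linear schedule mixing random and attention-driven masking); it is a simplified stand-in, not the GPA implementation:

import torch

def global_patch_attention(attn):
    # attn: (num_heads, 1+N, 1+N) attention from one ViT block, CLS token first.
    # Score each patch by the attention it receives, averaged over heads and
    # over all query tokens, then normalise to [0, 1].
    a = attn.mean(dim=0)
    scores = a[:, 1:].mean(dim=0)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def soft_to_hard_mask(scores, mask_ratio, progress):
    # progress in [0, 1]: early on the mask is mostly random (soft); later it
    # concentrates on the high-attention patches (hard).
    n = scores.numel()
    k = int(mask_ratio * n)
    mixed = progress * scores + (1.0 - progress) * torch.rand_like(scores)
    mask = torch.zeros(n, dtype=torch.bool)
    mask[mixed.topk(k).indices] = True
    return mask

attn = torch.rand(12, 197, 197)   # e.g. ViT-B/16: 14x14 patches plus CLS
mask = soft_to_hard_mask(global_patch_attention(attn), mask_ratio=0.75, progress=0.3)
print(mask.sum().item())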



Paperid:480 Poster
Authors:Jingtao Wang,Zechao Li
Abstract:
Accurately identifying correct correspondences (inliers) among the initial ones is pivotal for robust feature-based point cloud registration. Current methods typically rely on one-shot 3D correspondence classification with a single coherence constraint to obtain inliers. These approaches are either insufficiently accurate or inefficient, often requiring more network parameters. To address this issue, we propose a lightweight network, 3DPCP-Net, for fast and robust registration. Its core design lies in progressive correspondence pruning through mining deep spatial geometric coherence, which can effectively learn pairwise 3D spatial distance and angular features to progressively remove outliers (mismatched correspondences) for accurate pose estimation. Moreover, we also propose an efficient feature-based hypothesis proposer that leverages the geometric consistency features to generate reliable model hypotheses for each reliable correspondence explicitly. Extensive experiments on 3DMatch, 3DLoMatch, KITTI and Augmented ICL-NUIM demonstrate the accuracy and efficiency of our method on outlier removal and pose estimation tasks. Furthermore, our method is highly versatile and can be easily integrated into both learning-based and geometry-based frameworks, enabling them to achieve state-of-the-art results. The code is provided in the supplementary materials.
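
A minimal example of the kind of pairwise spatial coherence such features build on (distance preservation under rigid motion; the kernel width and data are invented, and this is not 3DPCP-Net itself):

import numpy as np

def pairwise_distance_coherence(src, dst, sigma_d=0.1):
    # src, dst: (N, 3) putative correspondences. A rigid motion preserves
    # pairwise distances, so comparing the two distance matrices yields a
    # per-correspondence coherence score; outliers tend to score low.
    d_src = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    compat = np.exp(-((d_src - d_dst) ** 2) / (2 * sigma_d ** 2))
    return compat.mean(axis=1)

src = np.random.rand(100, 3)
R = np.linalg.qr(np.random.randn(3, 3))[0]      # random orthogonal transform
dst = src @ R.T + 0.05
dst[:10] = np.random.rand(10, 3)                # inject 10 outliers
scores = pairwise_distance_coherence(src, dst)
print(scores[:10].mean(), scores[10:].mean())   # outliers score lower on average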



Paperid:481 Poster
Authors:Ting Fu,Yu-Wei Zhan,Chong-Yu Zhang,Xin Luo,Zhen-Duo Chen,Yongxin Wang,Xun Yang,Xin-Shun Xu
Abstract:
Deep Cross-Modal Hashing (CMH) has become one of the most popular solutions for cross-modal retrieval. Existing methods need to first collect data and then be trained with these accumulated data. However, in the real world, data may be generated and possessed by different owners. Considering concerns about privacy, data may not be shared or transmitted, preventing sufficient training of CMH. To solve the problem, we propose a new framework called Federated Cross-modal Hashing with Adaptive Feature Enhancement (FedCAFE). FedCAFE is a federated method that can use distributed data to train existing CMH methods under privacy protection. To overcome the data heterogeneity challenge of distributed data and improve the generalization ability of the global model, FedCAFE is endowed with a novel adaptive feature enhancement module and a new weighted aggregation strategy. Besides, it can fully utilize the rich global information carried in the global model to constrain the model during the local training process. We have conducted extensive experiments on four widely-used datasets in the CMH domain with both IID and non-IID settings. The reported results demonstrate that the proposed FedCAFE achieves better performance than several state-of-the-art baselines. As training deep CMH in the federated scenario is a topic in its infancy, we plan to release the code and data to boost the development of the field. However, considering the restrictions of anonymous submission and the size limit, we can only upload the source code of FedCAFE as supplementary materials for peer review at the present stage.
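
For orientation only, a plain FedAvg-style weighted aggregation (weights taken as local sample counts) is sketched below; the paper's adaptive feature enhancement and its specific weighting scheme are not reproduced here:

import copy
import torch

def weighted_aggregate(client_states, client_weights):
    # client_states: list of local model state_dicts; client_weights: e.g.
    # local sample counts. Returns the weighted average as the global model.
    total = float(sum(client_weights))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (w / total) * state[key].float()
            for state, w in zip(client_states, client_weights)
        )
    return global_state

clients = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]
agg = weighted_aggregate(clients, client_weights=[100, 50, 25])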



Paperid:482 Poster
Authors:Yuwen Pan,Rui Sun,Yuan Wang,Tianzhu Zhang,Yongdong Zhang
Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task that aims to identify specific regions in aerial images that are relevant to given textual conditions. Existing methods tend to adopt the paradigm of implicit optimization, utilizing a framework consisting of early cross-modal feature fusion and a fixed convolutional kernel-based predictor, neglecting the inherent inter-domain gap and conducting class-agnostic predictions. In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective. Specifically, we present a dedicated Dual Alignment Network (DANet), comprising an explicit alignment strategy and a reliable agent alignment module. The explicit alignment strategy effectively reduces domain discrepancies by narrowing the inter-domain affinity distribution. Meanwhile, the reliable agent alignment module aims to enhance the predictor's multi-modality awareness and alleviate the impact of deceptive noise interference. Extensive experiments on two remote sensing datasets demonstrate the effectiveness of our proposed DANet in achieving superior segmentation performance without introducing additional learnable parameters compared to state-of-the-art methods.



Paperid:483 Poster
Authors:Yu Tong,Weihai Lu,Zhe Zhao,Song Lai,Tong Shi
Abstract:
Recently, automatic multi-domain fake news detection has attracted widespread attention. Many methods achieve domain adaptation by modeling domain category gate networks and domain-invariant features. However, existing multi-domain fake news detection faces three main challenges: (1) Inter-domain modal semantic deviation, where similar texts and images carry different meanings across various domains. (2) Inter-domain modal dependency deviation, where the dependence on different modalities varies across domains. (3) Inter-domain knowledge dependency deviation, where the reliance on cross-domain knowledge and domain-specific knowledge differs across domains. To address these issues, we propose a Multi-modal Multi-Domain Fake News Detection Model (MMDFND). MMDFND incorporates domain embeddings and attention mechanisms into a progressive hierarchical extraction network to achieve domain-adaptive domain-related knowledge extraction. Furthermore, MMDFND utilizes Stepwise Pivot Transformer networks and adaptive instance normalization to effectively utilize information from different modalities and domains. We validate the effectiveness of MMDFND through comprehensive comparative experiments on two real-world datasets and conduct ablation experiments to verify the effectiveness of each module, achieving state-of-the-art results on both datasets. The source code is available at https://github.com/yutchina/MMDFND.



Paperid:484 Poster
Authors:Zhidong Yu,Zhenbo Shi,Xiaoman Liu,Wei Yang
Abstract:
Recent research has confirmed the possibility of adversarial attacks on deep models. However, these methods typically assume that the surrogate model has access to the target domain, which is difficult to achieve in practical scenarios. To address this limitation, this paper introduces a novel cross-domain attack method tailored for semantic segmentation, named Prototype-based Feature and Frequency Alteration Attack (PFFAA). This approach empowers a surrogate model to efficiently deceive the black-box victim model without requiring access to the target data. Specifically, through limited queries on the victim model, bidirectional relationships are established between the target classes of the victim model and the source classes of the surrogate model, enabling the extraction of prototypes for these classes. During the attack process, the features of each source class are perturbed to move these features away from their respective prototypes, thereby manipulating the feature space. Moreover, we propose substituting frequency information from images used to train the surrogate model into the frequency domain of the test images to modify texture and structure, thus further enhancing the attack efficacy. Experimental results across multiple datasets and victim models validate that PFFAA achieves state-of-the-art attack performance.
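
The frequency-substitution idea is reminiscent of amplitude swapping in the Fourier domain; a generic sketch is given below (the band size beta is an assumed hyperparameter, and this is not the exact PFFAA procedure):

import numpy as np

def swap_low_frequency_amplitude(test_img, source_img, beta=0.1):
    # Replace the low-frequency amplitude of the test image with that of a
    # surrogate-training image while keeping the test image's phase, altering
    # texture/style but preserving structure. Inputs: float (H, W, C) in [0, 1].
    fft_t = np.fft.fftshift(np.fft.fft2(test_img, axes=(0, 1)), axes=(0, 1))
    fft_s = np.fft.fftshift(np.fft.fft2(source_img, axes=(0, 1)), axes=(0, 1))
    amp_t, pha_t = np.abs(fft_t), np.angle(fft_t)
    amp_s = np.abs(fft_s)
    h, w = test_img.shape[:2]
    bh, bw, ch, cw = int(h * beta), int(w * beta), h // 2, w // 2
    amp_t[ch - bh:ch + bh, cw - bw:cw + bw] = amp_s[ch - bh:ch + bh, cw - bw:cw + bw]
    out = np.fft.ifft2(np.fft.ifftshift(amp_t * np.exp(1j * pha_t), axes=(0, 1)), axes=(0, 1))
    return np.clip(out.real, 0.0, 1.0)

test, source = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
print(swap_low_frequency_amplitude(test, source).shape)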



Paperid:485 Poster
Authors:Yi Liu,Jiachen Li,Yanchun Ma,Qing Xie,Yongjian Liu
Abstract:
In the task of image dehazing, it has been proven that high-quality codebook priors can be used to compensate for the distribution differences between real-world hazy images and synthetic hazy images, thereby helping the model improve its performance. However, because the concentration and distribution of haze in the image are irregular, simply replacing or blending the codebook priors with the original image features is inconsistent with this irregularity, which leads to non-ideal dehazing performance. To this end, we propose a haze concentration aware network (HcaNet), whose haze-concentration-aware module (HcaM) can reduce the information loss in the vector quantization stage and achieve adaptive domain transfer for regions with different degrees of degradation. To further capture detailed texture information, we develop a frequency selective fusion module (FSFM) to facilitate the transmission of shallow information retained in haze areas to deeper layers, thereby enhancing the fusion with high-quality feature priors. Extensive evaluations demonstrate that the proposed model can be trained merely on synthetic hazy-clean pairs and effectively generalize to real-world data. Several experimental results confirm that the proposed dehazing model outperforms state-of-the-art methods significantly on real-world images.



Paperid:486 Poster
Authors:Wenxuan Yang,Weimin Tan,Yuqi Sun,Bo Yan
Abstract:
Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, aiming to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring the data used for training has high informational value. Data-effective learning plays a profound role in accelerating foundation model training, reducing computational costs, and saving data storage, which is very important as the volume of medical data in recent years has grown beyond many people's expectations. However, due to the lack of standards and a comprehensive benchmark, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show the baseline MedDEL can achieve performance comparable to the original large dataset with only 5% of the data. Establishing such an open data-effective learning benchmark is crucial for the medical AI research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions.



Paperid:487 Poster
Authors:Lingfei Ren,Ruimin Hu,Zheng Wang,Yilin Xiao,Dengshi Li,Junhang Wu,Jinzhang Hu,Yilong Zang,Zijun Huang
Abstract:
Graph-based fraud detection (GFD) methods have garnered increasing attention due to their effectiveness in identifying fraudsters within multimedia data such as online transactions, product reviews, or telephone voices. However, the prevalent in-distribution (ID) assumption significantly impedes the generalization of GFD approaches to out-of-distribution (OOD) scenarios, which is a pervasive challenge considering the dynamic nature of fraudulent activities. In this paper, we introduce the Heterophilic Graph Invariant Learning Framework (HGIF), a novel approach aimed at bolstering the OOD generalization of GFD. HGIF addresses two pivotal challenges: creating diverse virtual training environments and adapting to varying target distributions. Leveraging edge-aware augmentation, HGIF efficiently generates multiple virtual training environments characterized by generalized heterophily distributions, thereby facilitating robust generalization against fraud graphs with diverse heterophily degrees. Moreover, HGIF employs a shared dual-channel encoder with heterophilic graph contrastive learning, enabling the model to acquire stable high-pass and low-pass node representations during training. During the Test-time Training phase, the shared dual-channel encoder is flexibly fine-tuned to adapt to the test data distribution through graph contrastive learning. Extensive experiments showcase HGIF's superior performance over existing methods in OOD generalization, thus setting a new benchmark for GFD in OOD scenarios.
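
As a rough illustration of the dual-channel (low-pass/high-pass) view of a graph that such encoders build on (standard GCN-style normalisation; not HGIF's encoder itself):

import numpy as np

def dual_channel_filters(adj):
    # The symmetrically normalised adjacency acts as a low-pass filter, and
    # I minus it (the normalised Laplacian) acts as a high-pass filter; a
    # dual-channel encoder can aggregate with both on heterophilic graphs.
    a = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = (a * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return a_hat, np.eye(adj.shape[0]) - a_hat     # (low-pass, high-pass)

adj = (np.random.rand(6, 6) > 0.6).astype(float)
adj = np.maximum(adj, adj.T)                       # make it undirected
low, high = dual_channel_filters(adj)
x = np.random.rand(6, 4)                           # node features
print((low @ x).shape, (high @ x).shape)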



Paperid:488 Poster
Authors:Ye Miaoxin,Zhou Saixing,Weiqi Luo,Shunquan Tan,Jiwu Huang
Abstract:
Designing embedding costs is pivotal in modern image steganography. Many studies have shown that adjusting symmetric embedding costs to asymmetric ones can enhance steganographic security. However, most existing methods heavily depend on manually defined parameters or rules, limiting security performance improvements. To overcome this limitation, we introduce an advanced GAN-based framework that transitions symmetric costs to asymmetric ones without the need for the manual intervention seen in existing approaches, such as the detailed specification of cost modulation directions and magnitudes. In our framework, we first obtain symmetric costs for a cover image, which is randomly split into two sub-images, with part of the secret information embedded into one. Subsequently, we design a GAN model to adjust the embedding costs of the second sub-image to asymmetric, facilitating the secure embedding of the remaining secret information. To support our phased embedding approach, our GAN's discriminator incorporates two steganalyzers with different tasks: distinguishing the generator's final output, i.e., the stego image, from both the input cover image and the partially embedded stego image, providing diverse guidance to the generator. In addition, we introduce a simple yet effective update strategy to ensure a stable training process. Comprehensive experiments demonstrate that our method significantly enhances security over existing symmetric steganography techniques, achieving state-of-the-art levels compared to other methods focused on embedding cost adjustments. Additionally, detailed ablation studies validate our approach's effectiveness.



Paperid:489 Poster
Authors:Guanchen Ding,Lingbo Liu,Zhenzhong Chen,Chang Wen Chen
Abstract:
Domain shift poses a significant barrier to the performance of crowd counting algorithms in unseen domains. While domain adaptation methods address this challenge by utilizing images from the target domain, they become impractical when target-domain image acquisition is problematic. Additionally, these methods require extra training time due to the need for fine-tuning on target domain images. To tackle this problem, we propose an Uncertainty-Guided Style Diversity Augmentation (UGSDA) method, enabling crowd counting models to be trained solely on the source domain and directly generalized to different unseen target domains. It is achieved by generating sufficiently diverse and realistic samples during the training process. Specifically, our UGSDA method incorporates three tailor-designed components: the Global Styling Elements Extraction (GSEE) module, the Local Uncertainty Perturbations (LUP) module, and the Density Distribution Consistency (DDC) loss. The GSEE extracts global style elements from the feature space of the whole source domain. The LUP aims to obtain uncertainty perturbations from the batch-level input to form style distributions beyond the source domain, which are used, together with the global style elements, to generate diversified stylized samples. To regulate the extent of perturbations, the DDC loss imposes constraints between the source samples and the stylized samples, ensuring the stylized samples maintain a higher degree of realism and reliability. Comprehensive experiments validate the superiority of our approach, demonstrating its strong generalization capabilities across various datasets and models. Our code will be made publicly available.
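
A generic uncertainty-driven style perturbation on feature statistics (in the spirit of AdaIN-style augmentation; the noise model is an assumption and this is not the paper's GSEE/LUP design) can be sketched as:

import torch

def uncertainty_style_perturbation(feat, eps=1e-6):
    # feat: (B, C, H, W). Perturb each sample's channel-wise mean/std with
    # Gaussian noise scaled by the batch-level variance of those statistics,
    # yielding styles beyond the source distribution.
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    sig_mu = mu.var(dim=0, keepdim=True).sqrt()        # uncertainty of the mean
    sig_sigma = sigma.var(dim=0, keepdim=True).sqrt()  # uncertainty of the std
    new_mu = mu + torch.randn_like(mu) * sig_mu
    new_sigma = sigma + torch.randn_like(sigma) * sig_sigma
    return new_sigma * (feat - mu) / sigma + new_mu

feat = torch.randn(8, 64, 32, 32)
print(uncertainty_style_perturbation(feat).shape)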



Paperid:490 Poster
Authors:Jiankang Chen,Ling Deng,Zhiyong Gan,Wei-Shi Zheng,Ruixuan Wang
Abstract:
Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model's overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework FodFoM that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2's image captioning capability, CLIP's vision-language knowledge, and Stable Diffusion's image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The source code will be publicly released.



Paperid:491 Poster
Authors:Panjun Duan,Yang Zhao,Yuan Chen,Wei Jia,Zhao Zhang,Ronggang Wang
Abstract:
With the rapid development of high-bit-depth display devices, bit-depth expansion (BDE) algorithms that extend low-bit-depth images to high-bit-depth images have received increasing attention. Due to the sensitivity of bit-depth distortions to tiny numerical changes in the least significant bits, the nuanced degradation differences in the training process may lead to varying degradation data distributions, causing the trained models to overfit specific types of degradations. This paper focuses on the problem of blind video BDE, proposing a degradation prediction and embedding framework, and designing a video BDE network based on a recurrent structure and dual-frame alignment fusion. Experimental results demonstrate that the proposed model can outperform some state-of-the-art (SOTA) models in terms of banding artifact removal and color correction, avoiding overfitting to specific degradations and obtaining better generalization ability across multiple datasets. The proposed degradation model and source codes will be open-sourced.



Paperid:492 Poster
Authors:Wei Yang,Tengfei Huo,Zhiqiang Liu
Abstract:
The task of semantic text matching focuses on measuring the semantic similarity between two distinct texts and is widely applied in search and ranking scenarios. In recent years, pre-trained models based on the Transformer architecture have demonstrated powerful semantic representation capabilities and have become the mainstream method for text representation. The pipeline of fine-tuning pre-trained language models on downstream semantic matching tasks has achieved promising results and widespread adoption. However, practical downstream scenarios often face severe challenges in terms of data quality and quantity. Ensuring high-quality and large quantities of samples is often difficult. Current research on enhancing pre-trained models for few-shot semantic text matching tasks is still not advanced enough. Therefore, this paper focuses on providing a general enhancement scheme for few-shot semantic text matching tasks. Specifically, we propose an Enhanced Transformer-based Semantic Matching method for few-shot learning through weakly contrastive pre-training, named EBSIM. Firstly, considering the characteristics of semantic text matching tasks, we design a simple and cost-effective data augmentation method for constructing weakly supervised samples. Then, we design a contrastive learning objective based on alignment aspects to achieve effective semantic matching by optimizing the bidirectional semantic perception between constructed texts. We conduct comprehensive experiments on five Chinese and English semantic text matching datasets using various Transformer-based pre-trained models. The experimental results confirm that our proposed method significantly improves the model's performance on semantic text matching tasks. Further ablation experiments and case studies validate the effectiveness of our approach. Our code and data will be made publicly available at a later stage.



Paperid:493 Poster
Authors:Fengbo Lan,Chang Wen Chen
Abstract:
The rise of mobile devices has spurred advancements in camera technology and image quality. However, mobile photography still faces issues like scattering and reflective flares. While previous research has acknowledged the negative impact of the mobile devices' internal image signal processing pipeline (ISP) on image quality, the specific ISP operations that hinder flare removal have not been fully identified. In addition, current solutions only partially address ISP-related deterioration due to a lack of comprehensive raw image datasets for flare study. To bridge these research gaps, we introduce a new raw image dataset tailored for mobile camera systems, focusing on eliminating flare. This dataset encompasses over 2,000 high-quality, full-resolution raw image pairs for scattering flare, and 1,200 for reflective flare, captured across various real-world scenarios, mobile devices, and camera settings. It is designed to enhance the generalizability of flare removal algorithms across a wide spectrum of conditions. Through detailed experiments, we have identified that ISP operations, such as denoising, compression, and sharpening, may either improve or obstruct flare removal, offering critical insights into optimizing ISP configurations for better flare mitigation. Our dataset is poised to advance the understanding of flare-related challenges, enabling more precise incorporation of flare removal steps into the ISP. Ultimately, this work paves the way for significant improvements in mobile image quality, benefiting both enthusiasts and professional mobile photographers alike.



Paperid:494 Poster
Authors:BeizhangGuo,Juntao Bao,Baili Chai,Di Wu,Miao Hu
Abstract:
As VR devices become increasingly prevalent, live 360-degree video has surged in popularity. However, current live 360-degree video systems heavily rely on uplink bandwidth to deliver high-quality live videos. Recent advancements in neural-enhanced streaming offer a promising solution to this limitation by leveraging server-side computation to conserve bandwidth. Nevertheless, these methods have primarily concentrated on neural enhancement within a single domain (either spatial or temporal), which may not adeptly adapt to diverse video scenarios and fluctuating bandwidth conditions. In this paper, we propose Lumos, a novel spatial-temporal integrated neural-enhanced live 360-degree video streaming system. To accommodate varied video scenarios, we devise a real-time Neural-enhanced Quality Prediction (NQP) model to predict the neural-enhanced quality for different video contents. To cope with varying bandwidth conditions, we design a Content-aware Bitrate Allocator, which dynamically allocates bitrates and selects an appropriate neural enhancement configuration based on the current bandwidth. Moreover, Lumos employs online learning to improve prediction performance and adjust resource utilization to optimize user quality of experience (QoE). Experimental results demonstrate that Lumos surpasses state-of-the-art neural-enhanced systems with an improvement of up to 0.022 in terms of SSIM, translating to an 8.2%-8.5% enhancement in QoE for live stream viewers.



Paperid:495 Poster
Authors:Sunoh Kim,Daeho Um,Hyunjun Choi,Jin young Choi
Abstract:
Most existing methods for weakly supervised video moment localization use rule-based negative proposals. However, the rule-based ones have a limitation in capturing various confusing locations throughout the entire video. To alleviate the limitation, we propose learning-based negative proposals which are trained using a dual-signed cross-entropy loss. The dual-signed cross-entropy loss is controlled by a weight that changes gradually from a minus value to a plus one. The minus value makes the negative proposals be trained to capture query-irrelevant temporal boundaries (easy negative) in the earlier training stages, whereas the plus one makes them capture somewhat query-relevant temporal boundaries (hard negative) in the later training stages. To evaluate the quality of negative proposals, we introduce a new evaluation metric to measure how well a negative proposal captures a poorly-generated positive proposal. We verify that our negative proposals can be applied with negligible additional parameters and inference costs, achieving state-of-the-art performance on three public datasets.
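
The sign schedule can be pictured with a small sketch (linear schedule and toy shapes assumed; not the paper's full training objective):

import torch
import torch.nn.functional as F

def dual_signed_ce(neg_logits, target, step, total_steps):
    # neg_logits: scores of the learned negative proposal; target: label of
    # the positive proposal. The weight sweeps from -1 to +1 over training:
    # early on the negative proposal is pushed away from the target (easy
    # negatives), later it is pulled toward it (hard negatives).
    w = -1.0 + 2.0 * (step / total_steps)
    return w * F.cross_entropy(neg_logits, target)

logits = torch.randn(4, 10)                 # 4 proposals over 10 temporal bins
target = torch.randint(0, 10, (4,))
print(dual_signed_ce(logits, target, step=0, total_steps=100).item())
print(dual_signed_ce(logits, target, step=100, total_steps=100).item())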



Paperid:496 Poster
Authors:Yusen Wang,Kaixuan Zhou,Wenxiao Zhang,Chunxia Xiao
Abstract:
We present MegaSurf, a Neural Surface Reconstruction (NSR) framework designed to reconstruct 3D models of large scenes from aerial images. Many methods utilize geometry cues to overcome the shape-radiance ambiguity, which would otherwise produce large geometric errors. However, directly using inevitably imprecise geometric cues would lead to degradation in the reconstruction results, especially on large-scale scenes. To address this phenomenon, we propose a Learnable Geometric Guider (LG Guider) to learn a sampling field from reliable geometric cues. The LG Guider decides which position should fit the input radiance and can be continuously refined by rendering loss. Our MegaSurf uses a Divide-and-Conquer training strategy to address the synchronization issue between the Guider and the lagging NSR's radiance field. This strategy enables the Guider to transmit the information it carries to the radiance field without being disrupted by the gradients back-propagated from the lagging rendering loss at the early stage of training. Furthermore, we propose a Fast PatchMatch MVS module to derive the geometric cues in planar regions that help overcome ambiguity. Experiments on several aerial datasets show that MegaSurf can overcome the ambiguity while preserving high-fidelity details. Compared to SOTA methods, MegaSurf achieves superior reconstruction accuracy of large scenes and boosts the acquisition of geometric cues by more than four times.



Paperid:497 Poster
Authors:Xiaorui Jiang,Zhongyi Ma,Yulin Fu,Yong Liao,Pengyuan Zhou
Abstract:
Multi-view clustering has proven to be highly effective in exploring consistency information across multiple views/modalities when dealing with large-scale unlabeled data. However, in the real world, multi-view data is often distributed across multiple entities, and due to privacy concerns, federated multi-view clustering solutions have emerged. Existing federated multi-view clustering algorithms often result in misalignment in feature representations among clients, difficulty in integrating information across multiple views, and poor performance in heterogeneous scenarios. To address these challenges, we propose HFMVC, a heterogeneity-aware federated deep multi-view clustering method. Specifically, HFMVC adaptively perceives the degree of heterogeneity in the environment and employs contrastive learning to explore consistency and complementarity information across clients' multi-view data. Besides, we seek consensus among clients where local data originates from the same view, incorporating a contrastive loss between local models and the global model during local training to adjust consistency among local models. Furthermore, we elucidate the sample representation logic for local clustering in different heterogeneous environments, identifying the degree of heterogeneity by computing the within-cluster sum of squares (WCSS) and the average inter-cluster distance (AICD). Extensive experiments verify the superior performance of HFMVC across both IID and Non-IID settings. For instance, on the MNIST-USPS dataset, HFMVC outperforms the state-of-the-art (SOTA) method by 36.83% to 64.91% in ACC, 41.39% to 64.39% in NMI, and 50.28% to 79.06% in ARI.
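
For reference, WCSS and an average inter-cluster distance can be computed along these lines (scikit-learn k-means on toy features; how HFMVC combines the two quantities is not reproduced here):

import numpy as np
from sklearn.cluster import KMeans

def heterogeneity_stats(features, n_clusters=5):
    # WCSS measures cluster compactness and AICD measures cluster separation;
    # together they can indicate how heterogeneous a client's local data is.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    wcss = km.inertia_                         # sum of squared distances to centroids
    centers = km.cluster_centers_
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    aicd = dists[np.triu_indices(n_clusters, k=1)].mean()
    return wcss, aicd

x = np.random.rand(500, 32)
print(heterogeneity_stats(x))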



Paperid:498 Poster
Authors:liang Xie,Wei Gao,Huiming Zheng,Ge Li
Abstract:
Point cloud data is pivotal in applications like autonomous driving, virtual reality, and robotics. However, its substantial volume poses significant challenges in storage and transmission. When pursuing a high compression ratio, crucial semantic details are usually severely damaged, making it difficult to guarantee the accuracy of downstream tasks. To tackle this problem, we are the first to introduce a novel Region of Interest (ROI)-guided Point Cloud Geometry Compression (RPCGC) method for human and machine vision. Our framework employs a dual-branch parallel structure, where the base layer encodes and decodes a simplified version of the point cloud, and the enhancement layer refines this by focusing on geometry details. Furthermore, the residual information of the enhancement layer undergoes refinement through an ROI prediction network. This network generates mask information, which is then incorporated into the residuals, serving as a strong supervision signal. Additionally, we intricately apply these mask details in the Rate-Distortion (RD) optimization process, with each point weighted in the distortion calculation. Our loss function includes RD loss and detection loss to better guide point cloud encoding for the machine. Experiment results demonstrate that RPCGC achieves exceptional compression performance and better detection accuracy (10% gain) than some learning-based compression methods at high bitrates on the ScanNet and SUN RGB-D datasets.



Paperid:499 Poster
Authors:Yu Chen,Yanan Wu,Na Han,Xiaozhao Fang,Bingzhi Chen,Jie Wen
Abstract:
Partial multi-label learning (PML) deals with the problem of accurately predicting the correct multi-label class for each instance in multi-label data containing noise. Compared with traditional multi-label learning, partial multi-label learning requires learning and completing multi-label classification tasks in an imperfect environment. The existing PML methods have the following problems: (1) the correlation between samples and labels is not fully utilized; (2) the nonlinear nature of the model is not taken into account. To solve these problems, we propose a new method of PML based on label enhancement of near and far neighbor information and nonlinear guidance (PML-LENFN). Specifically, the original binary label information is reconstructed by using the information of sample near neighbors and far neighbors to eliminate the influence of noise. Then we construct a linear multi-label classifier that can explore label correlation. In order to learn the nonlinear relationship between features and labels, we use nonlinear mapping to constrain this classifier, so as to obtain the prediction results that are more consistent with the realistic label distribution.



Paperid:500 Poster
Authors:Zhongwei Xuan,Zunjie Zhu,Shuai Wang,Haibing YIN,Hongkui Wang,Ming Lu
Abstract:
In recent years, novel view synthesis methods using neural implicit fields have gained popularity due to their exceptional rendering quality and rapid training speed. However, the computational cost of volumetric rendering has increased significantly with the advancement of camera technology and the consequent rise in average camera resolution. Despite extensive efforts to accelerate the training process, the training duration remains unacceptable for high-resolution inputs. Therefore, the development of efficient sampling methods is crucial for optimizing the learning process of neural fields from a large volume of inputs. In this paper, we introduce a novel method named Superpixel Efficient Sampling (SES), aimed at enhancing the learning efficiency of neural implicit fields. Our approach optimizes pixel-level ray sampling by segmenting the error map into multiple superpixels using the SLIC algorithm and dynamically updating their errors during training to increase ray sampling in areas with higher rendering errors. Compared to other methods, our approach leverages the flexibility of superpixels, effectively reducing redundant sampling while considering local information. Our method not only accelerates the learning process but also improves the rendering quality obtained from a vast array of inputs. We conduct extensive experiments to evaluate the effectiveness of our method across several baselines and datasets. The code will be released.
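
A bare-bones version of error-guided superpixel sampling (scikit-image SLIC on toy data; for simplicity the RGB render is segmented here rather than the error map, and the parameters and sampling rule are assumptions, not the SES method itself) might look like:

import numpy as np
from skimage.segmentation import slic

def superpixel_ray_sampling(image, error_map, n_rays, n_segments=200):
    # Segment the image into superpixels, weight each by its mean rendering
    # error, and draw rays with probability proportional to that weight, so
    # high-error regions receive more samples while local coherence is kept.
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    seg_ids = np.unique(labels)
    seg_err = np.array([error_map[labels == s].mean() for s in seg_ids])
    prob = seg_err / seg_err.sum()
    chosen = np.random.choice(seg_ids, size=n_rays, p=prob)
    rays = []
    for s in chosen:
        ys, xs = np.nonzero(labels == s)
        j = np.random.randint(len(ys))
        rays.append((ys[j], xs[j]))
    return np.array(rays)                      # (n_rays, 2) pixel coordinates

img = np.random.rand(120, 160, 3)
err = np.random.rand(120, 160)
print(superpixel_ray_sampling(img, err, n_rays=64).shape)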



Paperid:501 Poster
Authors:Zhenyu Bao,Guibiao Liao,Zhongyuan Zhao,KANGLIN LIU,Qing Li,Guoping Qiu
Abstract:
Simultaneously achieving 3D reconstruction and novel view synthesis for indoor environments has widespread applications but is technically very challenging. State-of-the-art methods based on implicit neural functions can achieve excellent 3D reconstruction results, but their performances on new view synthesis can be unsatisfactory. The exciting development of neural radiance fields (NeRF) has revolutionized novel view synthesis; however, NeRF-based models can fail to reconstruct clean geometric surfaces. We have developed a dual neural radiance field (Du-NeRF) to simultaneously achieve high-quality geometry reconstruction and view rendering. Du-NeRF contains two geometric fields, one derived from the SDF field to facilitate geometric reconstruction and the other derived from the density field to boost new view synthesis. One of the innovative features of Du-NeRF is that it decouples a view-independent component from the density field and uses it as a label to supervise the learning process of the SDF field. This reduces shape-radiance ambiguity and enables geometry and color to benefit from each other during the learning process. Extensive experiments demonstrate that Du-NeRF can significantly improve the performance of novel view synthesis and 3D reconstruction for indoor environments and it is particularly effective in constructing areas containing fine geometries that do not obey multi-view color consistency.



Paperid:502 Poster
Authors:Anwen Hu,Yaya Shi,Haiyang Xu,Jiabo Ye,Qinghao Ye,Ming Yan,Chenliang Li,Qi Qian,Ji Zhang,Fei Huang
Abstract:
Recently, the strong text creation ability of Large Language Models(LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the `outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model will be publicly available.



Paperid:503 Poster
Authors:Shuo Zhang,Yupeng Zhai,Jilin Mei,Yu Hu
Abstract:
3D occupancy prediction (OCC) aims to estimate and predict the semantic occupancy state of the surrounding environment, which is crucial for scene understanding and reconstruction in the real world. However, existing methods for 3D OCC mainly rely on surround-view camera images, whose performance is still insufficient in some challenging scenarios, such as low-light conditions. To this end, we propose a new multi-modal fusion network for 3D occupancy prediction by fusing features of LiDAR point clouds and surround-view images, called FusionOcc. Our model fuses features of these two modals in 2D and 3D space, respectively. By integrating the depth information from point clouds, a cross-modal fusion module is designed to predict a 2D dense depth map, enabling an accurate depth estimation and a better transition of 2D image features into 3D space. In addition, features of voxelized point clouds are aligned and merged with image features converted by a view-transformer in 3D space. Experiments show that FusionOcc establishes the new state of the art on Occ3D-nuScenes dataset, achieving a mIoU score of 35.94% (without visibility mask) and 56.62% (with visibility mask), showing an average improvement of 3.42% compared to the best previous method. Our work provides a new baseline for further research in multi-modal fusion for 3D occupancy prediction.



Paperid:504 Poster
Authors:Yayun Wei,Lei Cao,Hao Li,Yilin Dong
Abstract:
Decoding human visual representations from brain activity data is a challenging but arguably essential task for understanding the real world and the human visual system. However, decoding semantically similar visual representations from brain recordings is difficult, especially for electroencephalography (EEG), which has excellent temporal resolution but suffers from limited spatial precision. Prevailing methods mainly focus on matching brain activity data with corresponding stimuli-responses using contrastive learning. They rely on massive and high-quality paired data and overlook semantically aligned modalities distributed in distinct regions of the latent space. This paper proposes a novel Multimodal Bidirectional Cycle Consistency (MB2C) framework for learning robust visual neural representations. Specifically, we utilize dual-GAN to generate modality-related features and inversely translate them back to the corresponding semantic latent space to close the modality gap and guarantee that embeddings from different modalities with similar semantics lie in the same region of representation space. We perform zero-shot tasks on the ThingsEEG dataset and EEG classification and image reconstruction tasks on the EEGCVPR40 dataset, achieving state-of-the-art performance compared to other baselines.



Paperid:505 Poster
Authors:Xia Du,Jiajie Zhu,Jizhe Zhou,Chi-Man Pun,Qizhen Xu,Xiaoyuan Liu
Abstract:
In digital security, Reversible Adversarial Examples (RAE) blend adversarial attacks with Reversible Data Hiding (RDH) within images to thwart unauthorized access. Traditional RAE methods, however, compromise attack efficiency for the sake of perturbation concealment, diminishing the protective capacity of valuable perturbations and limiting applications to white-box scenarios. This paper proposes a novel Dual-Phase merging Reversible Adversarial Example (DP-RAE) generation framework, combining a heuristic black-box attack and RDH with Grayscale Invariance (RDH-GI) technology. This dual strategy not only evaluates and harnesses the adversarial potential of past perturbations more effectively but also guarantees flawless embedding of perturbation information and complete recovery of the original image. Experimental validation reveals our method's superiority, securing an impressive 96.9% attack success rate and a 100% recovery rate when compromising black-box models. In particular, it achieved a 90% misdirection rate against commercial models under a constrained number of queries. This marks the first successful attempt at targeted black-box reversible adversarial attacks on commercial recognition models. This achievement highlights our framework's capability to enhance security measures without sacrificing attack performance. Moreover, our attack framework is flexible, allowing the interchangeable use of different attack and RDH modules to meet advanced technological requirements.



Paperid:506 Poster
Authors:Xianqiang Lyu,Hui LIU,Junhui Hou
Abstract:
We propose RainyScape, an unsupervised framework for reconstructing clean scenes from a collection of multi-view rainy images. RainyScape consists of two main modules: a neural rendering module and a rain-prediction module that incorporates a predictor network and a learnable latent embedding that captures the rain characteristics of the scene. Specifically, based on the spectral bias property of neural networks, we first optimize the neural rendering pipeline to obtain a low-frequency scene representation. Subsequently, we jointly optimize the two modules, driven by the proposed adaptive direction-sensitive gradient-based reconstruction loss, which encourages the network to distinguish between scene details and rain streaks, facilitating the propagation of gradients to the relevant components. Extensive experiments on both the classic neural radiance field and the recently proposed 3D Gaussian splatting demonstrate the superiority of our method in effectively eliminating rain streaks and rendering clean images, achieving state-of-the-art performance. The constructed high-quality dataset and source code will be publicly available.



Paperid:507 Poster
Authors:Peng Zhou,Dunbo Cai,Yujian Du,Runqing Zhang,Bingbing Ni,Jie Qin,Ling Qian
Abstract:
With the rise of new 3D representations like NeRF and 3D Gaussian splatting, creating realistic 3D scenes is easier than ever before. However, the incompatibility of these 3D representations with existing editing software has also introduced unprecedented challenges to 3D editing tasks. Although recent advances in text-to-image generative models have made some progress in 3D editing, these methods either lack precision or require users to manually specify the editing areas in 3D space, complicating the editing process. To overcome these issues, we propose Edit3D, an innovative 3D editing method designed to enhance editing quality. Specifically, we propose a multi-turn editing framework and introduce an attention-driven open-set segmentation (ADSS) technique within this framework. ADSS allows for more precise segmentation of parts, which enhances the editing precision and minimizes interference with pixels in areas that are not being edited. Additionally, we propose a fine-tuning phase, intended to further improve the overall editing quality without compromising the training efficiency. Experiments demonstrate that Edit3D effectively adjusts 3D scenes based on textual instructions. Through continuous and multiple turns of editing, it achieves more intricate combinations, enhancing the diversity of 3D editing effects.



Paperid:508 Poster
Authors:Ruyu Liu,Zhengzhe Liu,ZHANG HAOYU,Guodao Zhang,Jianhua Zhang,Bo Sun,Weiguo Sheng,Xiufeng Liu,Yaochu Jin
Abstract:
Locating lesions is the primary goal of colonoscopy examinations. 3D perception techniques can enhance the accuracy of lesion localization by restoring 3D spatial information of the colon. However, existing methods focus on the local depth estimation of a single frame and neglect the precise global positioning of the colonoscope, thus failing to provide the accurate 3D location of lesions. The root causes of this shortfall are twofold: Firstly, existing methods treat colon depth and colonoscope pose estimation as independent tasks or design them as parallel sub-task branches. Secondly, the light source in the colon environment moves with the colonoscope, leading to brightness fluctuations among continuous frame images. To address these two issues, we propose ColVO, a novel deep learning-based Visual Odometry framework, which can continuously estimate colon depth and colonoscopic pose using two key components: a deep couple strategy for depth and pose estimation (DCDP) and a light consistent calibration mechanism (LCC). DCDP utilizes multimodal fusion and loss-function constraints to couple depth and pose estimation, ensuring seamless alignment of geometric projections between consecutive frames. Meanwhile, LCC accounts for brightness variations by recalibrating the luminosity values of adjacent frames, enhancing ColVO's robustness. A comprehensive evaluation of ColVO on colon odometry benchmarks reveals its superiority over state-of-the-art methods in depth and pose estimation. We also demonstrate two valuable applications: immediate polyp localization and complete 3D reconstruction of the intestine. The code for ColVO is available at https://github.com/xxx/xxx.



Paperid:509 Poster
Authors:Jiangtong Zhu,YangZhao,Yinan Shi,Jianwu Fang,Jianru Xue
Abstract:
Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose \emph{IC-Mapper}, an instance-centric online mapping framework, which comprises two primary components: 1) \textbf{Instance-centric temporal association module:} For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) \textbf{Instance-centric spatial fusion module:} We perform point sampling on the historical global map from a spatial dimension and integrate it with the detection results of instances corresponding to the current frame to achieve real-time expansion and update of the map. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods.
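
To make the instance-centric temporal association concrete, here is a minimal sketch that combines a feature-space cosine distance with a geometric centre distance and solves a linear assignment between adjacent frames. The weights, the distance squashing, and the `max_cost` gate are illustrative assumptions rather than IC-Mapper's actual module.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_instances(prev_feats, prev_centers, cur_feats, cur_centers,
                        w_feat=0.5, w_geo=0.5, max_cost=1.0):
    """Match previous-frame instances to current-frame instances.
    feats: (N, D) L2-normalised descriptors; centers: (N, 2) BEV positions in metres."""
    feat_cost = 1.0 - prev_feats @ cur_feats.T                     # cosine distance
    geo = np.linalg.norm(prev_centers[:, None, :] - cur_centers[None, :, :], axis=-1)
    geo_cost = geo / (geo + 1.0)                                   # squash to [0, 1)
    cost = w_feat * feat_cost + w_geo * geo_cost
    rows, cols = linear_sum_assignment(cost)
    # Keep only sufficiently cheap matches; unmatched detections would start new tracks.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```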



Paperid:510 Poster
Authors:Yaqi Li,Han Fang,Zerun Feng,Kaijing Ma,Chao Ban,Xianghao Zang,LanXiang Zhou,Zhongjiang He,Jingyan Chen,Jiani Hu,Hao Sun,Huayu Zhang
Abstract:
Recent text-to-image (T2I) synthesis models have demonstrated intriguing abilities to produce high-quality images based on text prompts. However, current models still face the Text-Image Misalignment problem (e.g., attribute errors and relation mistakes) for compositional generation. Existing models attempted to condition T2I models on grounding inputs to improve controllability while ignoring the explicit supervision from the layout conditions. To tackle this issue, we propose Grounded jOint lAyout aLignment (GOAL), an effective framework for T2I synthesis. Two novel modules, Discriminative Semantic Alignment (DSAlign) and Masked Attention Alignment (MAAlign), are proposed and incorporated in this framework to improve the text-image alignment. DSAlign leverages discriminative tasks at the region-wise level to ensure low-level semantic alignment. MAAlign provides high-level attention alignment by guiding the model to focus on the target object. We also build a dataset GOAL2K for model fine-tuning, which comprises 2,000 semantically accurate image-text pairs and their layout annotations. Comprehensive evaluations on T2I-Compbench, NSR-1K, and Drawbench demonstrate the superior generation performance of our method. In particular, there are improvements of 19%, 13%, and 12% in the color, shape, and texture metrics of T2I-Compbench. Additionally, Q-Align metrics demonstrate that our method can generate images of higher quality.



Paperid:511 Poster
Authors:Ziqi Yu,Jing Zhou,Zhongyun Bao,Gang Fu,Weilei He,Chao Liang,Chunxia Xiao
Abstract:
Inserting foreground objects into specific background scenes and eliminating the gap between them is an important and challenging task. It typically involves multiple processing tasks, such as image harmonization and shadow generation, which find numerous applications across various fields including computer vision and augmented reality. In these two domains, there are already many mature solutions, but they often only focus on one of the tasks. Some image composition methods can address both of these issues simultaneously but cannot guarantee complete reconstruction of foreground content. In this work, we propose CFDiffusion, which can handle both image harmonization and shadow generation simultaneously. Additionally, we introduce a foreground content enhancement module based on the diffusion model to ensure the complete preservation of foreground content at the insertion location. The experimental results on the iHarmony4 dataset and our self-created IH-SG dataset demonstrate the superiority of our CFDiffusion approach.



Paperid:512 Poster
Authors:Xi Wu,Chuang Huang,Xinliu Liu,Fei Zhou,Zhenwen Ren
Abstract:
Multiple kernel clustering (MKC) has garnered considerable attention owing to its efficacy in handling nonlinear data in high-dimensional space. However, current MKC methods have three primary issues: (1) they focus solely on clustering information while neglecting energy information and potential noise interference within the kernel; (2) the inherent manifold structure in the high-dimensional space is complex, and they explore the topological structure insufficiently; (3) most incur cubic computational complexity, posing a formidable resource-consumption challenge. To tackle the above issues, we propose a novel MKC method with shifted Laplacian on Grassmann manifold (sLGm). Firstly, sLGm constructs an $r$-rank shifted Laplacian and subsequently reconstructs it, retaining the clustering-related and energy-related information while reducing the influence of noise. Additionally, sLGm introduces a Grassmann manifold for partition fusion, which can preserve topological information in the high-dimensional space. Notably, an optimal consensus partition can be concurrently learnt from the above two procedures, thereby yielding the clustering assignments, and the computational complexity of the whole procedure drops to quadratic. Finally, a comprehensive suite of experiments demonstrates the effectiveness of sLGm.
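
For intuition, the sketch below builds the standard shifted normalized Laplacian $L_s = I + D^{-1/2} K D^{-1/2}$ from a kernel matrix and keeps its top-$r$ eigenpairs as a rank-$r$ reconstruction. This is the textbook construction only; how sLGm fuses partitions on the Grassmann manifold is not reproduced.

```python
import numpy as np

def r_rank_shifted_laplacian(K: np.ndarray, r: int):
    """Given a symmetric kernel/affinity matrix K (n x n), build the shifted
    Laplacian L_s = I + D^{-1/2} K D^{-1/2} and return its rank-r reconstruction."""
    d = K.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_shift = np.eye(K.shape[0]) + d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    # Eigenvalues of L_s lie in [0, 2]; the largest ones carry the clustering signal.
    vals, vecs = np.linalg.eigh(L_shift)
    top = np.argsort(vals)[-r:]
    U, S = vecs[:, top], vals[top]
    return U @ np.diag(S) @ U.T, U   # rank-r reconstruction and its r-dim embedding
```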



Paperid:513 Poster
Authors:Junyuan Guo,Hao Tang,Teng Wang,Chao Wang
Abstract:
The view synthesis and decoupling of dynamic objects from the static environment in monocular video are both long-standing challenges in CV and CG. Most of the previous NeRF-based methods rely on implicit representation, which requires additional supervision and training time. Later, various explicit representations have been applied to the task of novel view synthesis for dynamic scenes, such as multi-planes or 3D Gaussian splatting. They usually encode the dynamics by introducing an additional time dimension or a deformation field. These methods greatly reduce the time consumption, but still fail to achieve high rendering quality in some scenes, especially for some real scenes. For the latter decoupling problem, previous neural radiance field methods require frequent tuning of the relevant parameters for different scenes, which is very inconvenient for practical use. We consider the above problems and propose a new representation of dynamic scenes based on tensor decomposition, which we call R4D-planes. The key to our method is remapping, which compensates for the shortcomings of the plane structure by fusing space-time information and remapping to new indexes. Furthermore, we implement a new decoupling structure, which can efficiently decouple dynamic and static scenes in a self-supervised manner. Experimental results show our method achieves better rendering quality and training efficiency in both view synthesis and decoupling tasks for monocular scenes.



Paperid:514 Poster
Authors:Rui Yang,Shuang Wang,Jianwei Tao,Yingping Han,Qiaoling Lin,Yanhe Guo,Biao Hou,Licheng Jiao
Abstract:
Recent advances in vision-language pre-trained models like CLIP have greatly enhanced general domain image-text retrieval performance. This success has led scholars to develop methods for applying CLIP to Specific Domain Image-Text Retrieval (SDITR) tasks such as Remote Sensing Image-Text Retrieval (RSITR) and Text-Image Person Re-identification (TIReID). However, these methods for SDITR often neglect two critical aspects: the enhancement of modal-level distribution consistency within the retrieval space and the reduction of CLIP's computational cost during inference, resulting in suboptimal retrieval spaces and unnecessarily high inference computational loads. To address these issues, this paper presents a novel framework, Accurate and lightweight learning for specific domain Image-text Retrieval (AIR), based on the CLIP architecture. AIR incorporates a Modal-Level distribution Consistency Enhancement regularization (MLCE) loss and a Self-Pruning Distillation Strategy (SPDS) to improve retrieval precision and computational efficiency. The MLCE loss harmonizes the sample distance distributions within image and text modalities, fostering a retrieval space closer to the ideal state. Meanwhile, SPDS employs a strategic knowledge distillation process to transfer deep multimodal insights from CLIP to a shallower level, maintaining only the essential layers for inference, thus achieving model light-weighting. Comprehensive experiments across various datasets in RSITR and TIReID demonstrate the effectiveness of both MLCE loss and SPDS. The study also explores the limits of SPDS's performance and compares it with conventional teacher-student distillation methods. The findings reveal that MLCE loss secures optimal retrieval on several datasets, while SPDS achieves a favorable balance between accuracy and computational demand during testing.



Paperid:515 Poster
Authors:Hengyi Wang,Weiying Xie,Jitao Ma,DaixunLi,Yunsong Li
Abstract:
Federated Learning (FL) is an emerging direction in distributed machine learning that enables jointly training a global model without sharing data with the server. However, data heterogeneity biases the parameter aggregation at the server, leading to slower convergence and poorer accuracy of the global model. To cope with this, most of the existing works involve enforcing regularization in local optimization or improving the model aggregation scheme at the server. Though effective, they lack a deep understanding of cross-client features. In this paper, we propose a saliency latent space feature aggregation method (FedSLS) across federated clients. By Guided BackPropagation (GBP), we transform deep models into powerful and flexible visual fidelity encoders, applicable to general state inputs across different image domains, and achieve powerful aggregation in the form of saliency latent features. Notably, since GBP is label-insensitive, it is sufficient to capture saliency features only once on each client. Experimental results demonstrate that FedSLS leads to significant improvements over state-of-the-art methods in terms of accuracy, especially in highly heterogeneous settings. For example, on the CIFAR-10 dataset, FedSLS achieves 63.43% accuracy in the strongly heterogeneous environment α=0.05, which is 6% to 23% higher than the other baselines.
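
Guided BackPropagation itself is a standard technique; a minimal PyTorch sketch is shown below, using backward hooks that zero out negative gradients at ReLU layers. Taking the top logit as the backward signal (consistent with GBP being label-insensitive) and assuming non-inplace ReLUs are my choices here; how FedSLS aggregates the resulting saliency latent features at the server is not shown.

```python
import torch
import torch.nn as nn

class GuidedBackprop:
    """Standard Guided Backpropagation: ReLU layers pass only positive gradients
    on the backward pass (assumes the model uses ReLU(inplace=False))."""
    def __init__(self, model: nn.Module):
        self.model = model.eval()
        self.handles = [m.register_full_backward_hook(self._clamp)
                        for m in model.modules() if isinstance(m, nn.ReLU)]

    @staticmethod
    def _clamp(module, grad_input, grad_output):
        # Replace the incoming gradient with its positive part (guided rule).
        return (torch.clamp(grad_input[0], min=0.0),)

    def saliency(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clone().requires_grad_(True)
        score = self.model(x).max(dim=1).values.sum()  # top logit, no label needed
        score.backward()
        return x.grad.detach()

    def close(self):
        for h in self.handles:
            h.remove()
```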



Paperid:516 Poster
Authors:Yanshan Zhou,Pingrui Lai,Jiaqi Yu,Yingjie Xiong,Hua Yang
Abstract:
With global occurrences of crowd crushes and stampedes, dense crowd simulation has been drawing great attention. In this research, our goal is to simulate dense crowd motions under six classic motion patterns, more specifically, to generate subsequent motions of dense crowds from the given initial states. Since dense crowds share similarities with fluids, such as continuity and fluidity, one common approach for dense crowd simulation is to construct hydrodynamics-based models, which consider dense crowds as fluids, guide crowd motions with Navier-Stokes equations, and conduct dense crowd simulation by solving governing equations. Despite the proposal of these models, dense crowd simulation faces multiple challenges, including the difficulty of directly solving Navier-Stokes equations due to their nonlinear nature, the neglect of distinctive crowd characteristics that fluids lack, and the gaps in the evaluation and validation of crowd simulation models. To address the above challenges, we build a hydrodynamic model, which captures the crowd physical properties (continuity, fluidity, etc.) with Navier-Stokes equations and reflects the crowd social properties (sociality, personality, etc.) with operators that describe crowd interactions and crowd-environment interactions. To tackle the computational problem, we propose to solve the governing equation based on Navier-Stokes equations using neural networks, and introduce the Hydrodynamics-Informed Neural Network (HINN) which preserves the structure of the governing equation in its network architecture. To facilitate the evaluation, we construct a new dense crowd motion video dataset called Dense Crowd Flow Dataset (DCFD), containing six classic motion patterns (line, curve, circle, cross, cluster and scatter) and 457 video clips, which can serve as the ground truths for various objective metrics. Numerous experiments are conducted using HINN to simulate dense crowd motions under six motion patterns with video clips from DCFD. Objective evaluation metrics concerning authenticity, fidelity, and diversity demonstrate the superior performance of our model in dense crowd simulation compared to other simulation models.
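
For readers unfamiliar with physics-informed training, the sketch below computes the residuals of the 2D incompressible Navier-Stokes equations with autograd for a network mapping (x, y, t) to (u, v, p). This is the generic textbook residual only; HINN's crowd-interaction and crowd-environment operators are not reproduced, and the viscosity value is a placeholder.

```python
import torch

def navier_stokes_residual(model, xyt: torch.Tensor, nu: float = 0.01):
    """Mean squared residual of 2D incompressible Navier-Stokes for a network
    mapping (x, y, t) -> (u, v, p); xyt has shape (N, 3)."""
    xyt = xyt.clone().requires_grad_(True)
    u, v, p = model(xyt).unbind(dim=-1)

    def grad(f):
        return torch.autograd.grad(f.sum(), xyt, create_graph=True)[0]

    du, dv, dp = grad(u), grad(v), grad(p)
    u_x, u_y, u_t = du[:, 0], du[:, 1], du[:, 2]
    v_x, v_y, v_t = dv[:, 0], dv[:, 1], dv[:, 2]
    u_xx, u_yy = grad(u_x)[:, 0], grad(u_y)[:, 1]
    v_xx, v_yy = grad(v_x)[:, 0], grad(v_y)[:, 1]

    mom_u = u_t + u * u_x + v * u_y + dp[:, 0] - nu * (u_xx + u_yy)   # x-momentum
    mom_v = v_t + u * v_x + v * v_y + dp[:, 1] - nu * (v_xx + v_yy)   # y-momentum
    cont = u_x + v_y                                                   # continuity
    return (mom_u ** 2 + mom_v ** 2 + cont ** 2).mean()
```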



Paperid:517 Poster
Authors:Shuo Huang,Shikun Sun,Zixuan Wang,Xiaoyu Qin,xiongyanmin,zhangyuan,Pengfei Wan,Di ZHANG,Jia Jia
Abstract:
Recently, text-to-3D generation has attracted significant attention, resulting in notable performance enhancements. Previous methods utilize end-to-end 3D generation models for initializing 3D Gaussians, and multi-view diffusion models to enforce multi-view consistency. Moreover, they employ text-to-image diffusion models to refine details with score distillation algorithms. However, these methods exhibit two limitations. Firstly, they encounter conflicts in generation directions since different models aim to produce diverse 3D assets. Secondly, the issue of over-saturation in score distillation has not been thoroughly investigated and solved. To address these limitations, we propose PlacidDreamer, a text-to-3D framework that harmonizes initialization, multi-view generation, and text-conditioned generation with a single multi-view diffusion model, while simultaneously employing a novel score distillation algorithm to achieve balanced saturation. To unify the generation direction, we introduce the Latent-Plane module, a training-friendly plug-in extension that enables multi-view diffusion models to provide fast geometry reconstruction for initialization and enhanced multi-view images to personalize the text-to-image diffusion model. To address the over-saturation problem, we propose to view score distillation as a multi-objective optimization problem and introduce the Balanced Score Distillation algorithm, which offers a Pareto Optimal solution that achieves both rich details and balanced saturation. Extensive experiments validate the outstanding capabilities of our PlacidDreamer. The code will be available on GitHub.
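
To illustrate the multi-objective framing, the sketch below shows the classical closed-form min-norm combination of two gradient directions (the two-task MGDA solution), which yields a Pareto-style common descent direction. This is offered only as background for the balancing idea; it is not the Balanced Score Distillation update itself.

```python
import torch

def min_norm_combination(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Closed-form min-norm convex combination of two flattened gradients:
    returns alpha * g1 + (1 - alpha) * g2 with alpha minimising the combined norm."""
    diff = g1 - g2
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = ((g2 - g1).dot(g2) / denom).clamp(0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2
```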



Paperid:518 Poster
Authors:Guilin Li,Mengdan Zhang,Xiawu Zheng,Peixian Chen,Zihan Wang,Yunhang Shen,Mingchen Zhuge,Chenglin Wu,Fei Chao,Ke Li,Xing Sun,Rongrong Ji
Abstract:
The integration of large language models into open-world detection frameworks significantly improves versatility in new environments. Prompt representations derived from these models help establish classification boundaries for both base and novel categories within open-world detectors. However, we are the first to discover that directly fine-tuning language models in detection systems results in redundant attention patterns and leads to suboptimal prompt representations. In order to fully leverage the capabilities of large language models and augment prompt encoding for detection, this study introduces a redundancy assessment metric to identify uniform attention patterns. Furthermore, in areas with high redundancy, we incorporate multimodal inplace prompt tuning (MIPT) to enrich the text prompt with visual clues. Experimental results validate the efficacy of our MIPT framework, achieving a notable increase across benchmarks, e.g., elevating GLIP-L from 22.6% to 25.0% on ODinW-35 and yielding a 9.0% improvement on LVIS.



Paperid:519 Poster
Authors:Nan Wang,Zonglin Di,Houlin He,Qingchao Jiang,Xiaoxiao Li
Abstract:
Deep learning for medical image classification needs large amounts of carefully labeled data with the aid of domain experts. However, data labeling is vulnerable to noise, which may degrade the accuracy of classifiers. Given the cost of medical data collection and annotation, methods that can effectively utilize noisy labeled data are highly desirable. In addition, efficiency and universality are essential for noisy label training, which requires further research. To address the lack of high-quality labeled medical data and meet algorithm efficiency requirements for clinical application, we propose a simple yet effective approach for multi-field medical images to utilize noisy data, named Pseudo-T correction. Specifically, we design a noisy label filter to divide the training data into clean and noisy samples. Then, we estimate a transition matrix that corrects model predictions based on the partitions of clean and noisy data samples. However, if the model overfits noisy data, it makes noisy samples more difficult to detect in the filtering step, resulting in inaccurate transition matrix estimation. Therefore, we employ gradient disparity as an effective criterion to decide whether or not to refine the transition matrix in the model's further training steps. The novel design enables us to build more accurate machine-learning models by leveraging noisy labels. We demonstrate that our method outperforms the state-of-the-art methods on three public medical datasets (dermoscopic images, histopathology slide images, and X-ray) and achieves superior computational efficiency over the alternatives.
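
For context, correcting predictions with an estimated transition matrix is commonly implemented as the standard forward-correction loss sketched below, where clean-label predictions are mapped into noisy-label space before the negative log-likelihood. The filtering step and the gradient-disparity criterion from the abstract are not shown.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits: torch.Tensor, noisy_labels: torch.Tensor,
                           T: torch.Tensor) -> torch.Tensor:
    """logits: (B, C); noisy_labels: (B,); T: (C, C) with T[i, j] ~ P(noisy=j | clean=i).
    Maps clean-label predictions into noisy-label space, then applies NLL."""
    clean_probs = F.softmax(logits, dim=1)
    noisy_probs = clean_probs @ T                     # predicted noisy-label distribution
    log_noisy = torch.log(noisy_probs.clamp_min(1e-8))
    return F.nll_loss(log_noisy, noisy_labels)
```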



Paperid:520 Poster
Authors:Huimin Ma,Siwei Wang,Shengju Yu,Suyuan Liu,Jun-Jie Huang,Huijun Wu,Xinwang Liu,En Zhu
Abstract:
Multi-view Clustering (MVC) generally utilizes the anchor technique to decrease the computational complexity so as to tackle large-scale scenarios. Existing methods generally select anchors in advance to complete the subsequent clustering task. Nevertheless, the number of anchors cannot be predetermined and must be selected as a parameter, which introduces additional time consumption for parameter search. Moreover, maintaining an identical number of anchors across each view is not reasonable, as it restricts the representational capacity of anchors in individual views. To address the above issues, we propose a view-adaptive anchor multi-view clustering method called Multi-view Clustering with Automatic and Aligned Anchor (3AMVC). We introduce the Hierarchical Bipartite Neighbor Clustering (HBNC) strategy to adaptively select a suitable number of representative anchors from the original samples of each view. Specifically, when the representative difference of anchors lies within an acceptable range, the HBNC process halts and picks out the final anchors. In addition, in response to the varying quantities of anchors across different views, we propose an innovative anchor alignment strategy. This approach initially evaluates the quality of anchors on each view based on the intra-cluster distance criterion and then proceeds to align based on the view with the highest-quality anchors. Carefully organized experiments validate the effectiveness and strengths of 3AMVC.



Paperid:521 Poster
Authors:Humen Zhong,Zhibo Yang,Zhaohai Li,Peng Wang,Jun Tang,Wenqing Cheng,Cong Yao
Abstract:
Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distribution; (2) a decoder that supervises the alignment between vision and semantics; and (3) consistency in the framework during pre-training and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of VL-Reader lies in that the interplay between vision and language is pervasive throughout the entire process, not only in the encoding stage but also the decoding stage, which has been previously overlooked. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage bi-modal feature interaction. The architecture of VL-Reader maintains consistency from training to inference. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement is even more significant on challenging datasets. The results demonstrate that a vision-and-language reconstructor can serve as an effective scene text recognizer.



Paperid:522 Poster
Authors:Zhong Ji,Changxu Meng,Yan Zhang,Haoran Wang,Yanwei Pang,Jungong Han
Abstract:
A large body of research centers on Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets for a given query. Among them, the transfer of Foundation Models (FMs), such as CLIP, to the remote sensing domain shows promising results. However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. Specifically, we devise an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs to mitigate their deviations from the optimal embedding space during alignment. Moreover, we introduce a Keyword Explicit Reasoning (KER) module to facilitate the positive role of subtle key concept differences. Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the necessity for extra pretraining on remote sensing data. Extensive experiments on three popular benchmark datasets validate that our proposed EBAKER method outperforms the state-of-the-art methods with less training data. Our source code will be released soon.



Paperid:523 Poster
Authors:Xiangxiang DAI,Zeyu Zhang,Peng Yang,Yuedong Xu,Xutong Liu,John C.S. Lui
Abstract:
The rapid evolution of multimedia and computer vision technologies requires adaptive visual model deployment strategies to effectively handle diverse tasks and varying environments. This work introduces \textit{AxiomVision}, a novel framework that can guarantee accuracy by leveraging edge computing to dynamically select the most efficient visual models for video analytics under diverse scenarios. Utilizing a tiered edge-cloud architecture, \textit{AxiomVision} enables the deployment of a broad spectrum of visual models, from lightweight to complex DNNs, that can be tailored to specific scenarios while considering camera source impacts. In addition, \textit{AxiomVision} provides three core innovations: (1) a dynamic visual model selection mechanism utilizing continual online learning, (2) an efficient online method that takes into account the influence of the camera's perspective, and (3) a topology-driven grouping approach that accelerates the model selection process. With rigorous theoretical guarantees, these advancements provide a scalable and effective solution for visual tasks inherent to multimedia systems, such as object detection, classification, and counting. Empirically, \textit{AxiomVision} achieves a 25.7% improvement in accuracy.



Paperid:524 Poster
Authors:Baoqi Gao,Daoxu Sheng,Lei Zhang,Qi Qi,Bo He,Zirui Zhuang,Jingyu Wang
Abstract:
Accurate long-term viewport prediction in tile-based 360° video adaptive streaming helps pre-download tiles for a further future, thus establishing a longer buffer to cope with network fluctuations. Long-term viewport motion is mainly influenced by Historical viewpoint Trajectory (HT) and Video Content information (VC). However, HT and VC are difficult to align in space due to their different modalities, and their relative importance in viewport prediction varies across prediction time steps. In this paper, we propose STAR-VP, a model that fuses HT and VC in a Space-aligned and Time-vARying manner for Viewport Prediction. Specifically, we first propose a novel saliency representation $salxyz$ and a Spatial Attention Module to solve the spatial alignment of HT and VC. Then, we propose a two-stage fusion approach based on Transformer and gating mechanisms to capture their time-varying importance. Visualization of attention scores intuitively demonstrates STAR-VP's capability in space-aligned and time-varying fusion. Evaluation on three public datasets shows that STAR-VP achieves state-of-the-art accuracy for long-term (2-5s) viewport prediction without sacrificing short-term ($<$1s) prediction performance.



Paperid:525 Poster
Authors:Yuanyuan Shi,Yunan Li,Siyu Liang,Huizhou Chen,Qiguang Miao
Abstract:
Gesture recognition plays a crucial role in natural human-computer interaction and sign language recognition. Despite considerable progress in normal daylight, research dedicated to gesture recognition in dark environments is scarce. This is partly due to the lack of sufficient datasets for such a task. We bridge this data gap by collecting a new dataset: a large-scale multimodal video dataset for gesture recognition in darkness (MGR-Dark). MGR-Dark is distinguished from existing gesture datasets by its gesture collection in darkness, multimodal videos (RGB, Depth, and Infrared), and high video quality. To the best of our knowledge, this is the first high-quality multimodal dataset dedicated to human gesture actions in dark videos. Building upon this, we propose a Modality Translation and Cross-modal Distillation (MTCD) RGB-IR benchmark framework. Specifically, the modality translator is first utilized to transfer RGB data to pseudo-Infrared data; a progressive cross-modal feature distillation module is then designed to exploit the underlying relations between RGB, pseudo-Infrared and Infrared modalities to guide RGB feature learning. The experiments demonstrate that the dataset and benchmark proposed in this paper are expected to advance research in gesture recognition in dark videos. The dataset and code will be available upon acceptance.



Paperid:526 Poster
Authors:Yuxiang Zhou,Zhe Sun,Rui Liu,Yong Chen,Dell Zhang
Abstract:
Video hashing is a technique of encoding videos into binary vectors, facilitating efficient video storage and high-speed computation. Current approaches to video hashing predominantly utilize sequential frame images to produce semantic binary codes. However, videos encompass not only visual but also audio signals. Therefore, we propose a tri-level Transformer-based audio-visual hashing technique for video retrieval, named AVHash. It first processes audio and visual signals separately using pre-trained AST and ViT large models, and then projects temporal audio and keyframes into a shared latent semantic space using a Transformer encoder. Subsequently, a gated attention mechanism is designed to fuse the paired audio-visual signals in the video, followed by another Transformer encoder leading to the final video representation. The training of this AVHash model is directed by a video-based contrastive loss as well as a semantic alignment regularization term for audio-visual signals. Experimental results show that AVHash significantly outperforms existing video hashing methods in video retrieval tasks. Furthermore, ablation studies reveal that while video hashing based solely on visual signals achieves commendable mAP scores, the incorporation of audio signals can further boost its performance for video retrieval.
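
As a concrete illustration of gated audio-visual fusion followed by hashing, the sketch below mixes paired pooled audio and visual embeddings with a learned gate and applies a tanh relaxation whose sign gives binary codes at retrieval time. The dimensions, the gate design, and the single-vector pooling are assumptions, not AVHash's released architecture.

```python
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    """Fuse paired audio/visual embeddings with a learned gate, then hash."""
    def __init__(self, dim: int = 768, code_bits: int = 64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.hash_head = nn.Linear(dim, code_bits)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (B, dim) pooled representations in a shared latent space.
        g = self.gate(torch.cat([audio, visual], dim=-1))
        fused = g * visual + (1.0 - g) * audio
        return torch.tanh(self.hash_head(fused))     # relaxed codes during training

codes = GatedAVFusion()(torch.randn(4, 768), torch.randn(4, 768))
binary = codes.sign()                                 # (4, 64) binary hash codes
```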



Paperid:527 Poster
Authors:Xinyi Zhang,Qinpeng Cui,Qiqi Bao,Wenming Yang,Qingmin Liao
Abstract:
Recent research on Diffusion Models and Transformers has brought significant advancements to 3D Human Pose Estimation (HPE). Nonetheless, existing methods often fail to concurrently address the issues of accuracy and generalization. In this paper, we propose a Geometry-guided Diffusion Model with Masked Transformer (Masked Gifformer) for robust multi-view 3D HPE. Within the framework of the diffusion model, a hierarchical multi-view transformer-based denoiser is exploited to fit the 3D pose distribution by systematically integrating joint and view information. To address the long-standing problem of poor generalization, we introduce a fully random mask mechanism without any additional learnable modules or parameters. Furthermore, we incorporate geometric guidance into the diffusion model to enhance the accuracy of the model. This is achieved by optimizing the sampling process to minimize reprojection errors through modeling a conditional guidance distribution. Extensive experiments on two benchmarks demonstrate that Masked Gifformer effectively achieves a trade-off between accuracy and generalization. Specifically, our method outperforms other probabilistic methods by $\textgreater 40\%$ and achieves comparable results with state-of-the-art deterministic methods. In addition, our method exhibits robustness to varying camera numbers, spatial arrangements, and datasets.



Paperid:528 Poster
Authors:Weicai Yan,Ye Wang,Wang Lin,Zirun Guo,Zhou Zhao,Tao Jin
Abstract:
Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interaction. In this paper, we innovatively propose the Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal interaction and cross-task interaction. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale with the number of layers and tasks, we propose Low-rank Interaction-augmented Decomposition to avoid memory explosion, while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ visual analysis and identify that different tasks have clear distinctions in terms of proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distance. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method.



Paperid:529 Poster
Authors:Jiawei Wang,Da Cao,Shaofei Lu,Zhanchang Ma,Junbin Xiao,Tat-Seng Chua
Abstract:
In Large Language Models (LLMs), text generation that involves knowledge representation is often fraught with the risk of ''hallucinations'', where models confidently produce erroneous or fabricated content. These inaccuracies often stem from intrinsic biases in the pre-training stage or from the incorporation of human preference biases during the fine-tuning process. To mitigate these issues, we take inspiration from Goldman's causal theory of knowledge, which asserts that knowledge is not merely about having a true belief but also involves a causal connection between the belief and the truth of the proposition. We instantiate this theory within the context of Knowledge Question Answering (KQA) by constructing a causal graph that delineates the pathways between the candidate knowledge and belief. Through the application of the do-calculus rules from structural causal models, we devise an unbiased estimation framework based on this causal graph, thereby establishing a methodology for knowledge modeling grounded in causal inference. The resulting CORE framework (short for ``Causal knOwledge REasoning'') is comprised of four essential components: question answering, causal reasoning, belief scoring, and refinement. Together, they synergistically improve the KQA system by fostering faithful reasoning and introspection. Extensive experiments are conducted on ScienceQA and HotpotQA datasets, which demonstrate the effectiveness and rationality of the CORE framework.



Paperid:530 Poster
Authors:Delong Zhang,Yi-Xing Peng,Xiao-Ming Wu,Ancong Wu,Wei-Shi Zheng
Abstract:
Online person re-identification services face privacy breaches from potential data leaks and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propose an iterative method (PixelFade) to optimize pedestrian images into noise-like images to resist recovery attacks. We first give an in-depth study of protected images from previous privacy methods, which reveals that the \textbf{chaos} of protected images can disrupt the learning of recovery networks, leading to a decrease in the power of recovery attacks. Accordingly, we propose a Noise-guided Objective Function with the feature constraints of a specific authorization model, optimizing pedestrian images into normal-distributed noise images while preserving their original identity information as per the authorization model. To solve the above non-convex optimization problem, we propose a heuristic optimization algorithm that alternately performs the Constraint Operation and the Partial Replacement operation. This strategy not only ensures that original pixels are replaced with noise to protect privacy, but also guides the images toward an improved optimization direction that effectively preserves discriminative features. Extensive experiments demonstrate that our PixelFade outperforms previous methods in both resisting recovery attacks and Re-ID performance. The code will be released.
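
The alternating scheme can be pictured with the sketch below: a gradient step that keeps the frozen authorization model's features close to the original (constraint operation), followed by overwriting a random subset of pixels with Gaussian noise (partial replacement). Step counts, learning rate, and the replacement fraction are illustrative assumptions rather than PixelFade's actual schedule.

```python
import torch

@torch.enable_grad()
def pixelfade_like(image, feat_extractor, steps=200, lr=0.05, replace_frac=0.02):
    """image: (1, 3, H, W) in [0, 1]; feat_extractor: frozen authorization model."""
    with torch.no_grad():
        target_feat = feat_extractor(image)
    protected = image.clone()
    for _ in range(steps):
        # Constraint operation: pull the protected image's features back to the original's.
        protected = protected.detach().requires_grad_(True)
        loss = torch.nn.functional.mse_loss(feat_extractor(protected), target_feat)
        loss.backward()
        protected = (protected - lr * protected.grad).clamp(0, 1).detach()
        # Partial replacement: overwrite a random subset of pixels with Gaussian noise.
        mask = (torch.rand_like(protected[:, :1]) < replace_frac).float()
        noise = torch.randn_like(protected).mul(0.25).add(0.5).clamp(0, 1)
        protected = mask * noise + (1 - mask) * protected
    return protected
```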



Paperid:531 Poster
Authors:Tianrui Pan,Jie Liu,Bohan Wang,Jie Tang,Gangshan Wu
Abstract:
While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers within a singular process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers, respectively. Additionally, our model can utilize speakers with complete audio-visual information to compensate for other speakers with deficient visual information, thereby enhancing its resilience to missing visual cues. We also conduct experiments where visual information for specific speakers is entirely absent or visual frames are partially missing. The results demonstrate that our model consistently outperforms others, exhibiting the smallest performance drop across all settings involving 2, 3, 4, and 5 speakers.



Paperid:532 Poster
Authors:Guogang Zhu,Xuefeng Liu,Jianwei Niu,Shaojie Tang,Xinghao Wu,Jiayuan Zhang
Abstract:
In personalized federated learning (PFL), it is widely recognized that achieving both high model generalization and effective personalization poses a significant challenge due to their conflicting nature. As a result, existing PFL methods can only manage a trade-off between these two objectives. This raises an interesting question: Is it feasible to develop a model capable of achieving both objectives simultaneously? Our paper presents an affirmative answer, and the key lies in the observation that deep models inherently exhibit hierarchical architectures, which produce representations with various levels of generalization and personalization at different stages. A straightforward approach stemming from this observation is to select multiple representations from these layers and combine them to concurrently achieve generalization and personalization. However, the number of candidate representations is commonly huge, which makes this method infeasible due to high computational costs. To address this problem, we propose DualFed, a new method that can directly yield dual representations corresponding to generalization and personalization respectively, thereby simplifying the optimization task. Specifically, DualFed inserts a personalized projection network between the encoder and classifier. The pre-projection representations are able to capture generalized information shareable across clients, and the post-projection representations effectively capture task-specific information on local clients. This design minimizes the mutual interference between generalization and personalization, thereby achieving a win-win situation. Extensive experiments show that DualFed can outperform other FL methods.
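
A minimal sketch of the encoder, personalized projection, classifier layout is given below, exposing both the pre-projection (generalized) and post-projection (personalized) features. The dimensions and which parts are aggregated versus kept local are placeholders consistent with the abstract's description, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DualFedHead(nn.Module):
    """Exposes pre- and post-projection features: the shared encoder output serves
    generalization, while the client-local projection serves personalization."""
    def __init__(self, encoder: nn.Module, feat_dim=512, proj_dim=256, num_classes=10):
        super().__init__()
        self.encoder = encoder                              # assumed aggregated at the server
        self.projector = nn.Sequential(                     # assumed kept local / personalized
            nn.Linear(feat_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))
        self.classifier = nn.Linear(proj_dim, num_classes)

    def forward(self, x):
        pre = self.encoder(x)          # generalized representation (shareable)
        post = self.projector(pre)     # personalized representation (task-specific)
        return self.classifier(post), pre, post
```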



Paperid:533 Poster
Authors:Dongxiao He,Jinghan Zhang,Xiaobao Wang,Meng Ge,Zhiyong Feng,Longbiao Wang,Xiaoke Ma
Abstract:
The Conversational Recommendation System (CRS) aims to capture user dynamic preferences and provide item recommendations based on multi-turn conversations. However, effectively modeling these dynamic preferences faces challenges due to conversational limitations, which mainly manifest as limited turns in a conversation (quantity aspect) and low compliance with queries (quality aspect). Previous studies often address these challenges in isolation, overlooking their interconnected nature. The fundamental issue underlying both problems lies in the potential abrupt changes in user preferences, to which CRS may not respond promptly. We acknowledge that user preferences are influenced by temporal factors, serving as a bridge between conversation quantity and quality. Therefore, we propose a more comprehensive CRS framework called Time-aware User-preference Tracking for Conversational Recommendation System (TUT4CRS), leveraging time dynamics to tackle both issues simultaneously. Specifically, we construct a global time interaction graph to incorporate rich external information and establish a local time-aware weight graph based on this information to adeptly select queries and effectively model user dynamic preferences. Extensive experiments on two real-world datasets validate that TUT4CRS can significantly improve recommendation performance while reducing the number of conversation turns.



Paperid:534 Poster
Authors:Yuan Tang,Xu Han,Xianzhi Li,Qiao Yu,yixue Hao,Long Hu,Min Chen
Abstract:
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. We will release the code and weights after review.



Paperid:535 Poster
Authors:Kai Yin,Jie Shen
Abstract:
Advanced mobile computing has led to a surge in the need for practical super-resolution (SR) techniques. The look-up table (LUT) based SR-LUT has pioneered a new avenue of research without needing hardware acceleration. Nevertheless, all preceding methods that drew inspiration from the SR-LUT framework invariably resort to interpolation and rotation techniques for diminishing the LUT size, thereby prolonging the inference time and contradicting the original objective of efficient SR. Recently, a study named EC-LUT proposed an expanded convolution method to avoid interpolation operations. However, the performance of EC-LUT regarding SR quality and LUT volume is unsatisfactory. To address these limitations, this paper proposes a novel expanded convolutional neural network (ECNN). Specifically, we further extend feature fusion to the feature channel dimension to enhance mapping ability. In addition, our approach reduces the number of pixels indexed at a time to just one, eliminating the need for rotation tricks and dramatically reducing the LUT size from the MB level to the KB level, thus improving cache hit rates. By leveraging these improvements, we can stack expanded convolutional layers to form an ECNN, with each layer convertible to LUTs during inference. Experiments show that our method raises the performance upper limit of LUT-based methods. For example, under comparable SR quality conditions, our model achieves state-of-the-art performance in speed and LUT volume.



Paperid:536 Poster
Authors:Zhen Zhang,Jing Xiao,Liang Liao,Mi Wang
Abstract:
With the continuous development of imaging technology and the gradual expansion of the amount of image data, achieving high compression efficiency for high-resolution images is a challenging problem for storage and transmission. Image rescaling aims to reduce the original data amount through downscaling to facilitate data transmission and storage before encoding, and to reconstruct the quality through upscaling after decoding, which is a key technology to assist in high-ratio image compression. However, existing rescaling approaches are more focused on reconstruction quality rather than image compressibility. In repetitive observation scenarios, multi-temporal images brought by periodic observations provide an opportunity to alleviate the conflict between reconstruction quality and compressibility; that is, historical images used as references indicate what information can be dropped at downscaling to reduce the information content in the downscaled image, and they provide the dropped information to improve image restoration quality at upscaling. Based on this consideration, we propose a novel multi-temporal assisted reference-based image rescaling framework (RefScale). Specifically, a referencing network is proposed to calculate the similarity map to provide the referencing condition, which is then injected into the conditional invertible neural network to guide the information drop at the downscaling stage and information fusion at the upscaling stage. Additionally, a low-resolution guidance loss is proposed to further constrain the data amount of the downscaled LR image. Experiments conducted on both satellite imaging and autonomous driving show the superior performance of our approach over the state-of-the-art methods.



Paperid:537 Poster
Authors:Zhiwei Hao,Zhongyu Xiao,Yong Luo,Jianyuan Guo,Jing Wang,Li Shen,Han Hu
Abstract:
The recent advancements in cross-modal transformers have demonstrated their superior performance in RGB-D segmentation tasks by effectively integrating information from both RGB and depth modalities. However, existing methods often overlook the varying levels of informative content present in each modality, treating them equally and using models of the same architecture. This oversight can potentially hinder segmentation performance, especially considering that RGB images typically contain significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation based approach that focuses on guided multimodal fusion, with an emphasis on leveraging the primary RGB modality. In our approach, we utilize a model trained exclusively on the RGB modality as the teacher, guiding the learning process of a student model that fuses both RGB and depth modalities. To prioritize information from the primary RGB modality while leveraging the depth modality, we incorporate primary focused feature reconstruction and a selective alignment scheme. This integration enhances the overall feature fusion, resulting in improved segmentation results. We evaluate our proposed method on the NYU Depth V2 and SUN-RGBD datasets, and the experimental results demonstrate the effectiveness of PrimKD. Specifically, our approach achieves mIoU scores of 57.8 and 52.5 on these two datasets, respectively, surpassing existing counterparts by 1.5 and 0.4 mIoU.
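
To illustrate distillation from an RGB-only teacher, the sketch below aligns a student's fused RGB-D feature map with the frozen teacher's RGB feature map via a 1x1 projection and an MSE term. The projection, resizing, and loss form are generic assumptions; PrimKD's primary focused feature reconstruction and selective alignment scheme are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBGuidedDistill(nn.Module):
    """Aligns student fused features with a frozen RGB teacher's features."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)

    def forward(self, student_fused_feat, teacher_rgb_feat):
        # Both inputs are (B, C, H, W) feature maps from corresponding stages.
        s = self.proj(student_fused_feat)
        if s.shape[-2:] != teacher_rgb_feat.shape[-2:]:
            s = F.interpolate(s, size=teacher_rgb_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
        return F.mse_loss(s, teacher_rgb_feat.detach())
```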



Paperid:538 Poster
Authors:Zhe Luo,Weina Fu,Shuai Liu,Saeed Anwar,Muhammad Saqib,Sambit Bakshi,Khan Muhammad
Abstract:
Action detection and understanding provide the foundation for the generation and interaction of multimedia content. However, existing methods mainly focus on constructing complex relational inference networks, overlooking the judgment of detection effectiveness. Moreover, these methods frequently generate detection results with cognitive abnormalities. To solve the above problems, this study proposes a cognitive effectiveness network based on fuzzy inference (Cefdet), which introduces the concept of “cognition-based detection” to simulate human cognition. First, a fuzzy-driven cognitive effectiveness evaluation module (FCM) is established to introduce fuzzy inference into action detection. FCM is combined with human action features to simulate the cognition-based detection process, which clearly locates the position of frames with cognitive abnormalities. Then, a fuzzy cognitive update strategy (FCS) is proposed based on the FCM, which utilizes fuzzy logic to re-detect the cognition-based detection results and effectively update the results with cognitive abnormalities. Experimental results demonstrate that Cefdet exhibits superior performance against several mainstream algorithms on public datasets, validating its effectiveness and superiority.



Paperid:539 Poster
Authors:Huiming Zheng,Wei Gao,Zhuozhen Yu,Tiesong Zhao,Ge Li
Abstract:
With the rise of immersive media applications such as digital museums, virtual reality, and interactive exhibitions, point clouds, as a three-dimensional data storage format, have gained increasingly widespread attention. The massive data volume of point clouds imposes extremely high requirements on transmission bandwidth in the above applications, gradually becoming a bottleneck for immersive media applications. Although existing learning-based point cloud compression methods have achieved specific successes in compression efficiency by mining the spatial redundancy of their local structural features, these methods often overlook the intrinsic connections between point cloud data and other modality data (such as image modality), thereby limiting further improvements in compression efficiency. To address the limitation, we innovatively propose a view-guided learned point cloud geometry compression scheme, namely ViewPCGC. We adopt a novel self-attention mechanism and cross-modality attention mechanism based on sparse convolution to align the modality features of the point cloud and the view image, removing view redundancy through Modality Redundancy Removal Module (MRRM). Simultaneously, side information of the view image is introduced into the Conditional Checkerboard Entropy Model (CCEM), significantly enhancing the accuracy of the probability density function estimation for point cloud geometry. In addition, we design a View-Guided Quality Enhancement Module (VG-QEM) in the decoder, utilizing the contour information of the point cloud in the view image to supplement reconstruction details. The superior experimental performance demonstrates the effectiveness of our method. Compared to the state-of-the-art point cloud geometry compression methods, ViewPCGC exhibits an average performance gain exceeding 10% on the D1-PSNR metric.



Paperid:540 Poster
Authors:Fengmao Lv,Changru Nie,Jianyang Zhang,Guowu Yang,Guosheng Lin,Xiao Wu,Tianrui Li
Abstract:
Large pre-trained vision-language models like CLIP have shown amazing zero-shot recognition performance. To adapt pre-trained vision-language models to downstream tasks, recent studies have focused on the "learnable context + class name" paradigm, which learns continuous prompt contexts on downstream datasets. In practice, the learned prompt context tends to overfit the base categories and cannot generalize well to novel categories out of the training data. Recent works have also noticed this problem and have proposed several improvements. In this work, we draw a new insight based on empirical analysis, that is, uninformative class names lead to degraded base-to-novel generalization performance in prompt learning, which is usually overlooked by existing works. Under this motivation, we advocate to improve the base-to-novel generalization performance of prompt learning by enhancing the semantic richness of class names. We coin our approach the Information Disengagement based Associative Prompt Learning (IDAPL) mechanism, which considers the associative yet decoupled learning of prompt context and class name embedding. IDAPL can effectively alleviate the phenomenon of the learnable context overfitting to base classes while learning a more informative semantic representation of base classes by fine-tuning the class name embedding, leading to improved performance on both base and novel classes. Experimental results on eleven widely used few-shot learning benchmarks clearly validate the effectiveness of our approach.



Paperid:541 Poster
Authors:Chenglong Zhang,Xinyan Liang,Peng Zhou,Zhaolong Ling,Yingwei Zhang,Xingyu Wu,Weiguo Sheng,Bingbing Jiang
Abstract:
To tackle the high-dimensional data with multiple representations, multi-view unsupervised feature selection has emerged as a significant learning paradigm. However, previous methods suffer from the following dilemmas: (i) The emphasis is on selecting features to preserve the similarity structure of data, while neglecting the discriminative information in the cluster structure; (ii) They often impose the orthogonal constraint on the pseudo cluster labels, disrupting the locality in the cluster label space; (iii) Learning the similarity or cluster structure from all samples is likewise time-consuming. To this end, a Scalable Multi-view Unsupervised Feature Selection with structure learning and fusion (SMUFS) is proposed to jointly exploit the cluster structure and the similarity relations of data. Specifically, SMUFS introduces the sample-view weights to adaptively fuse the membership matrices that indicate cluster structures and serve as the pseudo cluster labels, such that a unified membership matrix across views can be effectively obtained to guide feature selection. Meanwhile, SMUFS performs graph learning from the membership matrix, preserving the locality of cluster labels and improving their discriminative capability. Further, an acceleration strategy has been developed to make SMUFS scalable for relatively large-scale data. A convergent solution is devised to optimize the formulated problem, and extensive experiments demonstrate the effectiveness and superiority of SMUFS.



Paperid:542 Poster
Authors:Mingyang Sun,Qipeng Yan,Zhuoer Liang,Dongliang Kou,Dingkang Yang,Ruisheng Yuan,Xiao Zhao,Mingcheng Li,Lihua Zhang
Abstract:
Reconstructing garments from monocular videos has attracted considerable attention as it provides a convenient and low-cost solution for clothing digitization. In reality, people wear clothing with countless variations and multiple layers. Existing studies attempt to extract garments from a single video. They either generalize poorly due to reliance on limited clothing templates or struggle to handle the intersections of multi-layered clothing, leading to a lack of physical plausibility. Besides, a single video contains inevitable and undetectable overlaps that hinder researchers from modeling complete and intersection-free multi-layered clothing. To address the above limitations, in this paper, we propose a novel method to reconstruct multi-layered clothing from multiple monocular videos sequentially, which surpasses existing work in generalization and robustness against penetration. For each video, neural fields are employed to implicitly represent the clothed body, from which the meshes with frame-consistent structures are explicitly extracted. Next, we implement a template-free method for extracting a single garment by back-projecting the image segmentation labels of different frames onto these meshes. In this way, multiple garments can be obtained from these monocular videos and then aligned to form the whole outfit. However, intersection always occurs due to overlapping deformation in the real world and perceptual errors for monocular videos. To this end, we innovatively introduce a physics-aware module that combines neural fields with a position-based simulation framework to fine-tune the penetrating vertices of garments, ensuring the results remain robustly intersection-free. Additionally, we collect a mini dataset with fashionable garments to evaluate the quality of clothing reconstruction comprehensively. The code and data will be open-sourced if this work is accepted.



Paperid:543 Poster
Authors:Minsu Kim,Jeonghun Yeo,Se Jin Park,Hyeongseop Rha,Yong Man Ro
Abstract:
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speech unit that can be obtained by discretizing the visual speech features extracted from the self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we propose to pre-train a VSR model to predict corresponding text outputs on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, we can greatly improve the training efficiency compared to the standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. In order to complement the insufficient visual information in speech recognition, we apply curriculum learning where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models, with a single trained model.
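
Discretizing self-supervised features into units is typically done with k-means quantization; the generic sketch below learns a codebook and maps an utterance's frame features to a sequence of unit ids. The cluster count and the feature source are placeholders, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_codebook(features: np.ndarray, num_units: int = 200) -> KMeans:
    """features: (num_frames, feat_dim) visual speech features pooled across a corpus.
    Returns a fitted quantizer whose cluster ids act as the discrete speech units."""
    return KMeans(n_clusters=num_units, n_init=10, random_state=0).fit(features)

def to_units(quantizer: KMeans, utterance_feats: np.ndarray) -> np.ndarray:
    """Map a (T, feat_dim) utterance to a length-T sequence of discrete unit ids."""
    return quantizer.predict(utterance_feats)
```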



Paperid:544 Poster
Authors:Han Fang,Kejiang Chen,Yupeng Qiu,Zehua Ma,Weiming Zhang,Ee-Chien Chang
Abstract:
The powerful denoising capability of the latent diffusion model creates new demands on the robustness of image watermarking algorithms, as attackers can erase the watermark by performing a forward diffusion, followed by backward denoising. While such denoising might introduce large distortion in the pixel domain, the image semantics remain similar. Unfortunately, most existing robust watermarking methods fail to tackle such an erasure attack since they are primarily designed for traditional channel distortions. To address such issues, this paper proposes DERO, a diffusion-model-erasure robust watermarking framework. Based on the frequency domain analysis of the diffusion model's denoising process, we designed a destruction and compensation noise layer (DCNL) to approximate the distortion effects caused by latent diffusion model erasure (LDE). In detail, DCNL consists of multi-scale low-pass filtering and a white-noise compensation process, where the high-frequency components of the image are first obliterated, and then full-frequency components are enriched with white noise. Such a process broadly simulates the LDE distortions. Besides, on the extraction side, we cascade a pre-trained variational autoencoder before the decoder to extract the watermark in the latent domain, which closely adapts to the operation domain of the LDE process. Meanwhile, to improve the robustness of the decoder, we also design a latent feature augmentation (LFA) operation on the latent feature. Through end-to-end training with DCNL and LFA, DERO can successfully achieve robustness against LDE. Our experimental results demonstrate the effectiveness and the generalizability of the proposed framework. The LDE robustness is significantly improved from 75% with SOTA methods to an impressive 96% with DERO.
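
A noise layer in this spirit can be sketched as below: a randomly chosen Gaussian low-pass filter destroys high frequencies, then full-band white noise is added as compensation. The kernel sizes, sigma set, and noise level are illustrative assumptions rather than DERO's trained DCNL parameters.

```python
import random
import torch
import torchvision.transforms.functional as TF

def dcnl_like(images: torch.Tensor, sigmas=(1.0, 2.0, 3.0), noise_std: float = 0.05):
    """images: (B, C, H, W) in [0, 1]. Approximates latent-diffusion erasure by
    removing high frequencies and compensating with full-band white noise."""
    sigma = random.choice(sigmas)                 # multi-scale low-pass selection
    k = int(2 * round(3 * sigma) + 1)             # odd kernel covering about 3 sigma
    low = TF.gaussian_blur(images, kernel_size=[k, k], sigma=[sigma, sigma])
    return (low + noise_std * torch.randn_like(low)).clamp(0.0, 1.0)
```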



Paperid:545 Poster
Authors:zefan zhang,Weiqi Zhang,yanhui li,bai tian
Abstract:
Multimodal Relation Extraction (MRE) has achieved great improvements. However, modern MRE models are easily affected by irrelevant objects during multimodal alignment, a problem known as the error sensitivity issue. The main reason is that visual features are not fully aligned with textual features, and the reasoning process may suppress redundant and noisy information at the risk of losing critical information. In light of this, we propose a Caption-Aware Multimodal Relation Extraction Network with Mutual Information Maximization (CAMIM). Specifically, we first generate detailed image captions through a Large Language Model (LLM). Then, the Caption-Aware Module (CAM) hierarchically aligns the fine-grained visual entities and textual entities for reasoning. In addition, to preserve crucial information within different modalities, we leverage a Mutual Information Maximization method to regulate the multimodal reasoning module. Experiments show that our model outperforms the state-of-the-art MRE models on the benchmark dataset MNRE. Further ablation studies demonstrate the pluggability and effectiveness of our Caption-Aware Module and mutual information maximization method. Our code will be public soon.



Paperid:546 Poster
Authors:Qiang Wang,Ke Yan,Shouhong Ding
Abstract:
In the realm of CLIP adaptation through prompt learning, it is important to emphasize the pivotal role that the proper alignment of visual and textual representations plays when adapting the CLIP to downstream tasks. We propose that the proper alignment for downstream tasks is determined by the $\textbf{flexibility}$ of the interaction between cross-modal information, which compensates for the absence of contrastive loss during the adaptation process. However, the current prompt learning methods, such as isolated modifications to the visual or language branches of CLIP or the employment of uni-directional cross-modal fusion, are not sufficient to explore the full potential of the mutual interaction between visual and textual modalities. To overcome this limitation, we propose a new paradigm for the CLIP prompt learning community, named $\textbf{B}$i$\textbf{l}$ateral Adaptive Cr$\textbf{o}$ss-Modal Fusi$\textbf{o}$n Pro$\textbf{m}$pt Learning~($\textit{Bloom}$), which includes two enhancements. First, we propose using projection functions for bi-directional modality transformation and fusion functions to encourage the mutual interaction between corresponding layers within both the image and text encoders. Second, we propose an adaptive manner that automatically searches the optimal combination of cross-modal information at each layer. These two improvements ensure a more efficient and flexible integration of the two modalities, thereby achieving proper alignment for specific downstream tasks. We put our method to the test in terms of base-to-novel, cross-dataset, and cross-domain evaluations on 15 image classification datasets. The results demonstrate a significant performance enhancement achieved by $\textit{Bloom}$.



Paperid:547 Poster
Authors:Haoyang Su,Wenzhe Du,Nguyen Cam-Tu,Wang Xiaoliang
Abstract:
With the increasing prevalence of virtual assistants, multimodal conversational recommendation systems (multimodal CRS) become essential for boosting customer engagement, improving conversion rates, and enhancing user satisfaction. Yet conversational samples, as training data for such a system, are difficult to obtain in large quantities, particularly on new platforms. Motivated by this challenge, we aim to design innovative methods for training multimodal CRS effectively even in a small data setting. Specifically, assuming the availability of a small number of samples with dialogue states, we devise an effective dialogue state encoder to bridge the semantic gap between conversation and product representations for recommendation. To reduce the cost of dialogue state annotation, a semi-supervised learning method is developed to effectively train the dialogue state encoder with a small set of labeled conversations. In addition, we design a correlation regularisation that leverages knowledge in the multimodal product database to better align textual and visual modalities. Experiments on the dataset MMD demonstrate the effectiveness of our method. Particularly, with only 5% of the MMD training set, our method (namely SeMANTIC) obtains better NDCG scores than those of baseline models trained on the full MMD training set.



Paperid:548 Poster
Authors:Xiuli Bi,Yang Hu,Bo Liu,Weisheng Li,Pamela Cosman,Bin Xiao
Abstract:
As machine learning advances, machine learning as a service (MLaaS) in the cloud brings convenience to human lives but also privacy risks, as powerful neural networks used for generation, classification or other tasks can also become privacy snoopers. This motivates privacy preservation in the inference phase. Many approaches for preserving privacy in the inference phase introduce multi-objective functions, training models to remove specific private information from users' uploaded data. Although effective, these adversarial learning-based approaches suffer not only from convergence difficulties, but also from limited generalization beyond the specific privacy for which they are trained. To address these issues, we propose a method for privacy preservation in the inference phase by removing task-irrelevant information, which requires no knowledge of the privacy attacks nor introduction of adversarial learning. Specifically, we introduce a metric to distinguish task-irrelevant information from task-relevant information, and achieve more efficient metric estimation to remove task-irrelevant features. The experiments demonstrate the potential of our method in several tasks.



Paperid:549 Poster
Authors:Weixiang Han,Chengjun Cai,Guo Yu,Jialiang Peng
Abstract:
Multi-modal learning leverages data from diverse perceptual media to obtain enriched representations, thereby empowering machine learning models to complete more complex tasks. However, recent research results indicate that multi-modal learning still suffers from “modality imbalance”: Certain modalities' contributions are suppressed by dominant ones, consequently constraining the overall performance enhancement of multimodal learning. To tackle this issue, current approaches attempt to mitigate modality competition in various ways, but their effectiveness is still limited. To this end, we propose an Euler Representation Learning-based Modality Rebalance (ERL-MR) strategy, which reshapes the underlying competitive relationships between modalities into mutually reinforcing win-win situations while maintaining stable feature optimization directions. Specifically, ERL-MR employs Euler's formula to map original features to complex space, constructing cooperatively enhanced non-redundant features for each modality, which helps reverse the situation of modality competition. Moreover, to counteract the performance degradation resulting from optimization drift among modalities, we propose a Multi-Modal Constrained (MMC) loss based on cosine similarity of complex feature phase and cross-entropy loss of individual modalities, guiding the optimization direction of the fusion network. Extensive experiments conducted on four multi-modal multimedia datasets and two task-specific multi-modal multimedia datasets demonstrate the superiority of our ERL-MR strategy over state-of-the-art baselines, achieving modality rebalancing and further performance improvements.
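As a loose illustration of the Euler-based mapping, the sketch below treats real features as phases via e^{ix} = cos x + i sin x and adds a phase-consistency term to per-modality cross-entropy; the exact ERL-MR feature construction and the MMC loss weighting are not reproduced here, so the names, scaling, and loss combination are assumptions.

```python
# Hedged sketch: Euler mapping of features to complex space plus a phase-alignment
# regularizer between two modalities. Illustrative only, not the ERL-MR/MMC definitions.
import torch
import torch.nn.functional as F

def euler_map(x):
    """Treat features as phases: e^{ix} = cos(x) + i sin(x). Returns (real, imag) parts."""
    return torch.cos(x), torch.sin(x)

def phase_alignment_loss(xa: torch.Tensor, xb: torch.Tensor) -> torch.Tensor:
    """Encourage the complex features of two modalities to point in the same direction."""
    ra, ia = euler_map(xa)
    rb, ib = euler_map(xb)
    za = torch.cat([ra, ia], dim=-1)
    zb = torch.cat([rb, ib], dim=-1)
    return 1.0 - F.cosine_similarity(za, zb, dim=-1).mean()

def multimodal_loss(logits_a, logits_b, labels, feat_a, feat_b, lam: float = 0.1):
    """Per-modality cross-entropy plus the phase-consistency regularizer (assumed weighting)."""
    ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
    return ce + lam * phase_alignment_loss(feat_a, feat_b)
```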



Paperid:550 Poster
Authors:Ran Wang,Hua Zuo,Zhen Fang,Jie Lu
Abstract:
In the field of Vision-Language Models (VLM), the Contrastive Language-Image Pretraining (CLIP) model has yielded outstanding performance on many downstream tasks through prompt tuning. By integrating image and text representations, CLIP exhibits zero-shot generalization capabilities on unseen data. However, when new categories and distribution shifts occur, the pretrained text embeddings in CLIP may not align well with unseen images, potentially leading to a decrease in CLIP's zero-shot generalization performance. To address this issue, many existing methods use test samples to update the CLIP model during testing through a process known as Test-Time Adaptation (TTA). Previous TTA techniques, such as image augmentation, can lead to overfitting given outlying samples, while methods based on teacher-student distillation can increase memory use. Further, these methods significantly increase inference time, which is a crucial factor in the testing phase. To improve robustness, mitigate overfitting, and reduce bias toward outlying samples, we propose a novel method: Self-Text Distillation with Conjugate Pseudo-labels (SCP), designed to enhance CLIP's zero-shot generalization. SCP uses gradient information from conjugate pseudo-labels to enhance the model’s robustness toward distribution shifts. It also innovates by using a fixed prompt list to distil learnable prompts from within the same model, acting as a self-regulation mechanism that minimizes overfitting. Additionally, SCP is a fully test-time adaptation method that does not require retraining. It directly improves CLIP's zero-shot generalization at test time without increasing either memory overheads or inference time. In fact, in evaluations across three zero-shot generalization scenarios, SCP surpasses existing state-of-the-art methods in performance and significantly reduces inference time.



Paperid:551 Poster
Authors:Xudong Wang,Weihong Ren,Xi'ai Chen,Huijie Fan,Yandong Tang,Zhi Han
Abstract:
Universal object detectors aim to detect any object in any scene without human annotation, exhibiting superior generalization. However, the current universal object detectors show degraded performance in harsh weather, and their insufficient real-time capabilities limit their application. In this paper, we present Uni-YOLO, a universal detector designed for complex scenes with real-time performance. Uni-YOLO is a one-stage object detector that uses general object confidence to distinguish between objects and backgrounds, and employs a grid cell regression method for real-time detection. To improve its robustness in harsh weather conditions, the input of Uni-YOLO is adaptively enhanced with a physical model-based enhancement module. During training and inference, Uni-YOLO is guided by the extensive knowledge of the vision-language model CLIP. An object augmentation method is proposed to improve generalization in training by utilizing multiple source datasets with heterogeneous annotations. Furthermore, an online self-enhancement method is proposed to allow Uni-YOLO to further focus on specific objects through self-supervised fine-tuning in a given scene. Extensive experiments on public benchmarks and a UAV deployment are conducted to validate its superiority and practical value.



Paperid:552 Poster
Authors:Rui Li,Yishu Liu,Huafeng Li,Jinxing Li,Guangming Lu
Abstract:
Video Individual Counting (VIC), which focuses on accurately tallying the total number of individuals in a video without duplication, is crucial for urban public space management and the planning of densely populated areas. Existing methods are limited by expensive manual annotation and by the efficiency of their localization or detection algorithms. In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both the similarity and the difference of adjacent frames into account to achieve accurate counting in an end-to-end regression manner. Specifically, we first design a multi-receptive field feature fusion module to acquire initial comprehensive representations. Subsequently, the dynamic prototype generation module memorizes consistent representations of similar information to generate prototypes. Additionally, to further extract the shared and private features from different frames, a prototype cross-guided decoder and a privacy-decoupling module are designed. Extensive experiments conducted on two existing VIC datasets consistently demonstrate the superiority of PDTR over state-of-the-art baselines.



Paperid:553 Poster
Authors:Xingyu Zhang,Siyu Zhao,Zeen Song,Huijie Guo,Jianqi Zhang,Changwen Zheng,Wenwen Qiang
Abstract:
Long-term time series forecasting is a long-standing challenge in various applications. A central issue in time series forecasting is that methods should expressively capture long-term dependency. Furthermore, time series forecasting methods should be flexible when applied to different scenarios. Although Fourier analysis offers an alternative to effectively capture reusable and periodic patterns to achieve long-term forecasting in different scenarios, existing methods often assume high-frequency components represent noise and should be discarded in time series forecasting. However, we conduct a series of motivation experiments and discover that the role of certain frequencies varies depending on the scenarios. In some scenarios, removing high-frequency components from the original time series can improve the forecasting performance, while in other scenarios, removing them is harmful to forecasting performance. Therefore, it is necessary to treat the frequencies differently according to specific scenarios. To achieve this, we first reformulate the time series forecasting problem as learning a transfer function of each frequency in the Fourier domain. Further, we design Frequency Dynamic Fusion (FreDF), which individually predicts each Fourier component, and dynamically fuses the output of different frequencies. Moreover, we provide a novel insight into the generalization ability of time series forecasting and propose a generalization bound for time series forecasting. We then prove that FreDF has a lower generalization bound, indicating that FreDF has better generalization ability. Experiment results and ablation studies demonstrate the effectiveness of FreDF.
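A minimal sketch of the frequency-wise view of forecasting (assuming, for simplicity, that the forecast has the same length as the input window): the input is transformed with an rFFT, each frequency bin is scaled by a learnable complex transfer coefficient, the bins are fused with learned weights, and the result is mapped back to the time domain. This is an assumption-laden illustration, not the FreDF architecture.

```python
# Hedged sketch of per-frequency transfer-function forecasting with dynamic fusion weights.
import torch
import torch.nn as nn

class FreqTransferForecaster(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        self.gain_real = nn.Parameter(torch.ones(n_freq))    # per-frequency transfer fn (real)
        self.gain_imag = nn.Parameter(torch.zeros(n_freq))   # per-frequency transfer fn (imag)
        self.fusion = nn.Parameter(torch.ones(n_freq))       # dynamic per-frequency weights
        self.seq_len = seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) -> forecast of the same length (simplifying assumption)
        spec = torch.fft.rfft(x, dim=-1)                      # (batch, n_freq), complex
        gain = torch.complex(self.gain_real, self.gain_imag)
        weights = torch.softmax(self.fusion, dim=0)
        return torch.fft.irfft(spec * gain * weights, n=self.seq_len, dim=-1)

if __name__ == "__main__":
    model = FreqTransferForecaster(seq_len=96)
    print(model(torch.randn(4, 96)).shape)  # torch.Size([4, 96])
```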



Paperid:554 Poster
Authors:Xudong Cai,Yongcai Wang,Lun Luo,Minhang Wang,Deying Li,Jintao Xu,Weihao Gu,Rui Ai
Abstract:
Image matching aims at identifying corresponding points between a pair of images. Currently, detector-free methods have shown impressive performance in challenging scenarios, thanks to their capability of generating dense matches and global receptive field. However, performing feature interaction and proposing matches across the entire image is unnecessary, as not all image regions contribute beneficially to the matching process. Interacting and matching in unmatchable areas can introduce errors, reducing matching accuracy and efficiency. Furthermore, the scale discrepancy issue still troubles existing methods. To address the above issues, we propose PRogressive dependency maxImization for Scale-invariant image Matching (PRISM), which jointly prunes irrelevant patch features and tackles the scale discrepancy. To do this, we first present a Multi-scale Pruning Module (MPM) to adaptively prune irrelevant features by maximizing the dependency between the two feature sets. Moreover, we design the Scale-Aware Dynamic Pruning Attention (SADPA) to aggregate information from different scales via a hierarchical design. Our method's superior matching performance and generalization capability are confirmed by leading accuracy across various evaluation benchmarks and downstream tasks. The code will be publicly available.



Paperid:555 Poster
Authors:Soo-ho Kim,Soyeon Hong,Kyungsoo Park,Hyunsouk Cho,Kyung-Ah Sohn
Abstract:
Omnidirectional vision systems provide a 360-degree panoramic view, enabling full environmental awareness in various fields, such as Advanced Driver Assistance Systems (ADAS) and Virtual Reality (VR). Existing omnidirectional stitching methods rely on a single specialized 360-degree camera. However, due to hardware limitations such as high mounting heights and blind spots, adapting these methods to vehicles of varying sizes and geometries is challenging. These challenges include limited generalizability due to the reliance on predefined stitching regions for fixed camera arrays, performance degradation from distance parallax leading to large depth differences, and the absence of suitable datasets with ground truth for multi-camera omnidirectional systems. To overcome these challenges, we propose a novel omnidirectional stitching framework and a publicly available dataset tailored for varying distance scenarios with multiple cameras. The framework, referred to as OmniStitch, consists of a Stitching Region Maximisation (SRM) module for automatic adaptation to different vehicles with multiple cameras and a Depth-Aware Stitching (DAS) module to handle depth differences caused by distance parallax between cameras. In addition, we create and release an omnidirectional stitching dataset, called GV360, which provides ground truth images that maintain the perspective of the 360-degree FOV, specifically designed for vehicle-agnostic systems. Extensive evaluations on this dataset demonstrate that our framework outperforms state-of-the-art stitching models, especially in handling varying distance parallax. The proposed dataset and code are publicly available at URL.



Paperid:556 Poster
Authors:Hongzu Su,Jingjing Li,Fengling Li,Ke Lu,Lei Zhu
Abstract:
Mainstream multimodal recommender systems are designed to learn user interest by analyzing user-item interaction graphs. However, what they learn about user interest is incomplete, because historical interactions only record items that best match user interest (i.e., the first-order interest), while suboptimal items are absent. To fully exploit user interest, we propose a Second-Order Interest Learning (SOIL) framework to retrieve second-order interest from unrecorded suboptimal items. In this framework, we build a user-item interaction graph augmented by second-order interest, an interest-aware item-item graph for the visual modality, and a similar graph for the textual modality. In our work, all three graphs are constructed from user-item interaction records and multimodal feature similarity. Similarly to other graph-based approaches, we apply graph convolutional networks to each of the three graphs to learn representations of users and items. To improve the exploitation of both first-order and second-order interest, we optimize the model by implementing contrastive learning modules for user and item representations at both the user-item and item-item levels. The proposed framework is evaluated on three real-world public datasets in online shopping scenarios. Experimental results verify that our method is able to significantly improve prediction performance. For instance, our method outperforms the previous state-of-the-art method MGCN by an average of 8.1% in terms of Recall@10.



Paperid:557 Poster
Authors:Weijia Zhang,Dongnan Liu,Weidong Cai,Chao Ma
Abstract:
Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a more lightweight and efficient one. In recent years, logit-based KD methods are quickly catching up in performance with their feature-based counterparts. However, existing research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely overconfident teacher and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat the above cruxes. We also perform confidence-based soft label selection to improve the quality of distilling signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to on-going logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins. Our code and models will be released.
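To make the confidence-based soft-label selection concrete, here is a generic sketch of a logit-distillation loss in which only teacher predictions above a confidence threshold contribute to the KL term; the threshold, temperature, and weighting are illustrative, and the within-view/cross-view regularisations of CRLD are not reproduced.

```python
# Hedged sketch of logit distillation with confidence-based soft-label selection.
import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, labels,
                      tau: float = 4.0, conf_thresh: float = 0.75, alpha: float = 1.0):
    teacher_prob = F.softmax(teacher_logits / tau, dim=-1)
    conf, _ = teacher_prob.max(dim=-1)
    mask = (conf >= conf_thresh).float()                    # keep only confident teacher signals
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    kl = F.kl_div(log_student, teacher_prob, reduction="none").sum(dim=-1)
    kd = (kl * mask).sum() / mask.sum().clamp(min=1.0) * (tau ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return ce + alpha * kd

if __name__ == "__main__":
    s, t = torch.randn(8, 100), torch.randn(8, 100)
    print(selective_kd_loss(s, t, torch.randint(0, 100, (8,))))
```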



Paperid:558 Poster
Authors:Wei Shen,Mang Ye,Wenke Huang
Abstract:
Graph Neural Networks (GNNs) are widely employed to derive meaningful node representations from graphs. Despite their success, deep GNNs frequently grapple with the oversmoothing issue, where node representations become highly indistinguishable due to repeated aggregations. In this work, we consider the oversmoothing issue from two aspects of the node embedding space: dimension and instance. Specifically, while existing methods primarily concentrate on instance-level node relations to mitigate oversmoothing, we propose to mitigate oversmoothing at dimension level. We reveal the heightened information redundancy between dimensions which diminishes information diversity and impairs node differentiation in GNNs. Motivated by this insight, we propose Dimension-Level Decoupling (DLD) to reduce dimension redundancy, enhancing dimensional-level node differentiation. Besides, at the instance level, the neglect of class differences leads to vague classification boundaries. Hence, we introduce Instance-Level Class-Difference Decoupling (ICDD) that repels inter-class nodes and attracts intra-class nodes, improving the instance-level node discrimination with clear classification boundaries. Additionally, we introduce a novel evaluation metric that considers the impact of class differences on node distances, facilitating precise oversmoothing measurement. Extensive experiments demonstrate the effectiveness of our method Dual-Dimensional Class-Difference Decoupling (DDCD) across diverse scenarios.
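One generic way to reduce redundancy between embedding dimensions, sketched below, is to penalize the off-diagonal entries of the feature correlation matrix; this is only an illustration of dimension-level decorrelation and is not claimed to be the paper's DLD definition.

```python
# Hedged sketch: suppress inter-dimension redundancy via an off-diagonal correlation penalty.
import torch

def dimension_decorrelation_loss(z: torch.Tensor) -> torch.Tensor:
    """z: (N, D) node embeddings. Returns the mean squared off-diagonal correlation."""
    z = (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-6)      # standardize each dimension
    corr = (z.t() @ z) / z.size(0)                        # (D, D) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))        # zero out the diagonal
    return (off_diag ** 2).mean()

if __name__ == "__main__":
    print(dimension_decorrelation_loss(torch.randn(512, 64)))
```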



Paperid:559 Poster
Authors:Honghao Li,Lei Sang,Yi Zhang,Yiwen Zhang
Abstract:
Click-through rate (CTR) prediction is an essential component of industrial multimedia recommendation, and the key to enhancing the accuracy of CTR prediction lies in the effective modeling of feature interactions using rich user profiles, item attributes, and contextual information. Most current deep CTR models resort to parallel or stacked structures to break through the performance bottleneck of the Multi-Layer Perceptron (MLP). However, we identify two limitations in these models: (1) parallel or stacked structures often treat explicit and implicit components as isolated entities, leading to a loss of mutual information; (2) traditional CTR models, whether in terms of supervision signals or interaction methods, lack the ability to filter out noise information, thereby limiting the effectiveness of the models. In response to this gap, this paper introduces a novel model, named the Simple Contrast-enhanced Network (SimCEN), which integrates an alternating structure and contrastive learning into a single simple MLP, discarding the design of multiple MLPs responsible for different semantic spaces. SimCEN employs a contrastive product to build second-order feature interactions that share the same semantics but lie in different representation spaces. Additionally, it employs an external-gated mechanism between linear layers to facilitate explicit learning of feature interactions and to filter out noise. At the final representation layer of the MLP, a contrastive loss is incorporated to help the MLP obtain self-supervised signals for higher-quality representations. Experiments conducted on six real-world datasets demonstrate the effectiveness and compatibility of this simple framework, which can serve as a substitute for MLP to enhance various representative baselines. Our source code and detailed running logs will be made available at https://anonymous.4open.science/r/SimCEN-8E21.
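The flavour of an externally gated interaction block can be sketched as follows: two linear views of the same input are multiplied element-wise to form an explicit second-order interaction, and a sigmoid gate computed from the input filters the result. Dimensions and the exact gating form are assumptions, not SimCEN's implementation.

```python
# Hedged sketch of a gated second-order interaction block.
import torch
import torch.nn as nn

class GatedInteractionBlock(nn.Module):
    def __init__(self, dim_in: int, dim_hidden: int):
        super().__init__()
        self.view_a = nn.Linear(dim_in, dim_hidden)
        self.view_b = nn.Linear(dim_in, dim_hidden)
        self.gate = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        interaction = self.view_a(x) * self.view_b(x)   # explicit second-order interaction
        return self.gate(x) * interaction               # external gate filters noise

if __name__ == "__main__":
    block = GatedInteractionBlock(dim_in=64, dim_hidden=128)
    print(block(torch.randn(8, 64)).shape)  # torch.Size([8, 128])
```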



Paperid:560 Poster
Authors:Xiao Liang,Yanlei Zhang,Di Wang,Haodi Zhong,Ronghan Li,Quan Wang
Abstract:
Radiology report generation aims to automatically generate clinical descriptions for radiology images, reducing the workload of radiologists. Compared to general image captioning tasks, the subtle differences in medical images and the specialized, complex nature of medical terminology limit the performance of data-driven radiology report generation. Previous research has attempted to leverage prior knowledge, such as organ-disease graphs, to enhance models' abilities to identify specific diseases and generate corresponding medical terminology. However, these methods cover only a limited number of disease types, focusing solely on disease terms mentioned in reports but ignoring their normal or abnormal attributes, which are critical to generating accurate reports. To address this issue, we propose a Divide-and-Conquer approach, named DCG, which separately constructs disease-free and disease-specific nodes within the knowledge graphs. Specifically, we extracted more comprehensive organ-disease entities from reports than previous methods and constructed disease-free and disease-specific nodes by rigorously distinguishing between normal conditions and specific diseases. This enables our model to consciously focus on abnormal information and mitigate the impact of excessively common diseases on report generation. Subsequently, the constructed graph is utilized to enhance the correlation between visual representations and disease terminology, thereby guiding the decoder in report generation. Extensive experiments conducted on benchmark datasets IU-Xray and MIMIC-CXR demonstrate the superiority of our proposed method. Code is available at the anonymous repository {https://anonymous.4open.science/r/DCG_Enhanced_distilGPT2-37D2}.



Paperid:561 Poster
Authors:Shuo Zheng,Yuanjie Dang,Peng Chen,Ruohong Huan,Dongdong Zhao,Ronghua Liang
Abstract:
Temporal relation modeling is one of the core aspects of few-shot action recognition. Most previous works mainly focus on temporal relation modeling based on coarse-level actions, without considering the atomic action details and fine-grained temporal information. This oversight represents a significant limitation in this task. Specifically, coarse-level temporal relation modeling can make the few-shot models overfit in high-discrepancy temporal context, and ignore the low-discrepancy but high-semantic relevance action details in the video. To address these issues, we propose a saliency-guided fine-grained temporal mask learning method that models the temporal atomic action relation for few-shot action recognition in a finer manner. First, to model the comprehensive temporal relations of video instances, we design a temporal mask learning architecture to automatically search for the best matching of each atomic action snippet. Next, to exploit the low-discrepancy atomic action features, we introduce a saliency-guided temporal mask module to adaptively locate and excavate the atomic action information. After that, the few-shot predictions can be obtained by feeding the embedded rich temporal-relation features to a common feature matcher. Extensive experimental results on standard datasets demonstrate our method’s superior performance compared to existing state-of-the-art methods.



Paperid:562 Poster
Authors:Goirik Chakrabarty,Aditya Chandrasekar,Ramya Hebbalaguppe,Prathosh AP
Abstract:
Recent developments in diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing $\textbf{many}$ objects in a complex scene $\textbf{in one pass}$. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the state-of-the-art (SOTA). We also curate and release a dataset dedicated to multi-object editing, named $\texttt{LoMOE}$-Bench. Our experiments against existing SOTA demonstrate the improved effectiveness of our approach in terms of both image editing quality, and inference speed.



Paperid:563 Poster
Authors:Tao Liu,Feilong.chen,Shuai Fan,Chenpeng Du,Qi Chen,Xie Chen,Kai Yu
Abstract:
The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first involves reconstructing target video frames from source frames within the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, the integration of a diffusion model with a variance adapter allows for the generation of diverse and controllable facial animations. This method not only demonstrates AniTalker’s capability to create detailed and realistic facial movements but also underscores its potential in crafting dynamic avatars for real-world applications. Synthetic results can be viewed at https://anitalker.github.io.



Paperid:564 Poster
Authors:Han Jiang,Haoyu Tang,Ming Yan,Ji Zhang,Mingzhu Xu,Yupeng Hu,Jihua Zhu,Liqiang Nie
Abstract:
Recently, temporal action localization (TAL) methods, especially the weakly-supervised and unsupervised ones, have become a hot research topic. Existing unsupervised methods follow an iterative "clustering and training" strategy with diverse model designs during the training stage, while they often overlook maintaining consistency between these stages, which is crucial: more accurate clustering results can reduce the noise of pseudolabels and thus enhance model training, while more robust training can in turn enrich clustering feature representation. We identify two critical challenges in unsupervised scenarios: 1. What features should the model generate for clustering? 2. Which pseudolabeled instances from clustering should be chosen for model training? After extensive explorations, we propose a novel yet simple framework called Consistency-Oriented Progressive high actionness Learning to address these issues. For feature generation, our framework adopts a High Actionness snippet Selection (HAS) module to generate more discriminative global video features for clustering from the enhanced actionness features obtained from a designed Inner-Outer Consistency Network (IOCNet). For pseudolabel selection, we introduce a Progressive Learning With Representative Instances (PLRI) strategy to identify the most reliable and informative instances within each cluster for model training. These three modules, HAS, IOCNet, and PLRI, synergistically improve consistency in model training and clustering performance. Extensive experiments on the THUMOS’14 and ActivityNet v1.2 datasets under both unsupervised and weakly-supervised settings demonstrate that our framework achieves state-of-the-art results.



Paperid:565 Poster
Authors:Yang Chen,Jingcai Guo,Tian He,Xiaocheng Lu,Ling Wang
Abstract:
Skeleton-based zero-shot action recognition aims to recognize unknown human actions based on the learned priors of the known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works focus on establishing the bridges between the known skeleton representations space and semantic descriptions space at the coarse-grained level for recognizing unknown action categories, ignoring the fine-grained alignment of these two spaces, resulting in suboptimal performance in distinguishing high-similarity action categories. To address these challenges, we propose a novel method via Side information and dual-prompTs learning for skeleton-based zero-shot Action Recognition (STAR) at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology structure and introduce the side information concerning multi-part descriptions of human body movements for alignment between the skeleton and the semantic space at the fine-grained level; 2) we design the visual-attribute and semantic-part prompts to improve the intra-class compactness within the skeleton space and inter-class separability within the semantic space, respectively, to distinguish the high-similarity actions. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets. The code will be available in the future.



Paperid:566 Poster
Authors:Ge Luo,Yuchen Ma,Manman Zhang,Junqiang Huang,Sheng Li,Zhenxing Qian,Xinpeng Zhang
Abstract:
Automatic live commenting is increasingly acknowledged as a crucial strategy for improving viewer interaction. However, current methods overlook the significance of creating engaging comments. Engaging comments can not only attract viewers' widespread attention, earning numerous "likes", but also further promote subsequent social comment interactions. In this paper, we introduce a novel framework for generating engaging live video comments, aiming to resonate with viewers and enhance the viewing experience. Then, we design a Competitive Context Selection Strategy to accelerate differential learning by constructing sample pairs with different levels of attractiveness. This approach addresses the sample imbalance problem between highly-liked and low-liked comments, as well as the relative attractiveness issue of comments within video scenes. Moreover, we develop a Semantic Gap Contrastive Loss to minimize the distance between generated comments and higher-liked comments within the segment, while also widening the gap with lower-liked or unliked comments. This loss function helps the model to generate more engaging comments. To support our proposed generation task, we construct a video comment dataset with "like" information, containing 180,000 comments and their "like" counts. Extensive experiments indicate that the comments generated by our method are more engaging, fluent, natural, and diverse than those of the baselines.
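A margin-style version of such a "semantic gap" objective can be sketched in a few lines: the generated comment's embedding is pulled toward highly-liked comments and pushed away from low-liked ones. The margin value and the embedding source are assumptions for illustration, not the paper's loss definition.

```python
# Hedged sketch of a hinge loss on the similarity gap between liked and unliked comments.
import torch
import torch.nn.functional as F

def semantic_gap_loss(generated: torch.Tensor, high_liked: torch.Tensor,
                      low_liked: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """All inputs: (B, D) sentence embeddings for the same video segments."""
    sim_pos = F.cosine_similarity(generated, high_liked, dim=-1)
    sim_neg = F.cosine_similarity(generated, low_liked, dim=-1)
    return F.relu(margin - sim_pos + sim_neg).mean()   # hinge on the similarity gap

if __name__ == "__main__":
    g, hi, lo = (torch.randn(4, 128) for _ in range(3))
    print(semantic_gap_loss(g, hi, lo))
```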



Paperid:567 Poster
Authors:Xiaotong Yu,Chang Wen Chen
Abstract:
Efficient visual perception using mobile systems is crucial, particularly in unknown environments such as search and rescue operations, where swift and comprehensive perception of objects of interest is essential. In such real-world applications, objects of interest are often situated in complex environments, making the selection of the 'Next Best' view based solely on maximizing visibility gain suboptimal. Semantics, providing a higher-level interpretation of perception, should significantly contribute to the selection of the next viewpoint for various perception tasks. In this study, we formulate a novel information gain that integrates both visibility gain and semantic gain in a unified form to select the semantic-aware Next-Best-View. Additionally, we design an adaptive strategy with a termination criterion to support a two-stage search-and-acquisition manoeuvre on multiple objects of interest aided by a multi-degree-of-freedom (Multi-DoFs) mobile system. Several semantically relevant reconstruction metrics, including perspective directivity and the region of interest (ROI)-to-full reconstruction volume ratio, are introduced to evaluate the performance of the proposed approach. Simulation experiments demonstrate the advantages of the proposed approach over existing methods, achieving improvements of up to 27.13% for the ROI-to-full reconstruction volume ratio and a 0.88234 average perspective directivity. Furthermore, the planned motion trajectory exhibits better perception coverage of the target.
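The unified gain itself can be illustrated with a trivial weighted combination of the two terms, as in the sketch below; the weighting scheme and the gain definitions are placeholders rather than the paper's formulation.

```python
# Hedged sketch: score candidate viewpoints by a combined visibility + semantic gain.
from dataclasses import dataclass

@dataclass
class Viewpoint:
    visibility_gain: float   # e.g., newly observable volume
    semantic_gain: float     # e.g., expected coverage of objects of interest

def unified_gain(v: Viewpoint, weight: float = 0.5) -> float:
    return (1.0 - weight) * v.visibility_gain + weight * v.semantic_gain

def next_best_view(candidates: list[Viewpoint], weight: float = 0.5) -> Viewpoint:
    return max(candidates, key=lambda v: unified_gain(v, weight))

if __name__ == "__main__":
    views = [Viewpoint(0.8, 0.1), Viewpoint(0.5, 0.9), Viewpoint(0.3, 0.4)]
    print(next_best_view(views))  # the second viewpoint wins under equal weighting
```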



Paperid:568 Poster
Authors:Junyu Lin,Yan Zheng,Xinyue Chen,Yazhou Ren,Xiaorong Pu,Jing He
Abstract:
Multi-view molecular property prediction has received wide attention in recent years for its potential in downstream drug discovery tasks. However, the consistency of different molecular view representations and the full utilization of complementary information among them in existing multi-view molecular property prediction methods remain to be further explored. Furthermore, most current methods focus on generating global representations at the graph level from different molecular views (e.g., 2D and 3D views), assuming that the information from these views can be corresponded to each other. In fact, it is not unusual for conformational changes or computational errors, for example, to lead to discrepancies between views. To address these issues, we propose a new Cross-View contrastive unification guided Generative Molecular pre-trained model, called MolCVG. We first focus on extracting common and private information from 2D graph views and 3D geometric views of molecules, minimizing the impact of noise in the private information on subsequent strategies. To exploit both types of information in a more refined way, we propose a cross-view contrastive unification strategy to learn cross-view global information and guide the reconstruction of masked nodes, thus effectively optimizing global features and local descriptions. Extensive experiments on real-world molecular datasets demonstrate the effectiveness of our approach for the molecular property prediction task.



Paperid:569 Poster
Authors:Hongqiu Wang,Wei Wang,Haipeng Zhou,Huihui Xu,Shaozhi Wu,Lei Zhu
Abstract:
Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. We shall release our code and dataset for future research.



Paperid:570 Poster
Authors:Richen Liu,Hansheng Wang,Hailong Wang,Siru Chen,Chufan Lai,Ayush Kumar,Siming Chen
Abstract:
We design ScaleTraversal, an interactive tool for creating multi-scale 3D demonstration animations with limited resources, aimed at users who cannot access high-performance machines such as clusters or supercomputers. Creating 3D demonstration animations for multi-modal and multi-scale data is challenging. First, it is challenging to strike a balance between flexibility and user-friendliness when designing the user interface for customizing demonstration animations. Second, multi-scale biomedical data are often so large that a commonly used desktop PC cannot easily handle them. We design an interactive bi-functional user interface to create multi-scale biomedical demonstration animations intuitively. It fully utilizes the user-friendliness of graphical interfaces and the flexibility of textual interfaces, which enables users to customize demonstration animations from macro-scales to meso- and micro-scales. Furthermore, we design four scale-based memory management strategies to address the challenging issues presented by multi-scale data: a streaming data processing strategy, a scale-level data prefetching strategy, a memory utilization strategy, and a GPU acceleration strategy for rendering. Finally, we conduct both quantitative and qualitative evaluations to demonstrate the efficiency, expressiveness and usability of ScaleTraversal.



Paperid:571 Poster
Authors:Zhanpeng Chen,Zhihong Zhu,Wanshi Xu,Yunyan Zhang,Xian Wu,Yefeng Zheng
Abstract:
Given coupled sentence image pairs, Multimodal Aspect-based Sentiment Analysis (MABSA) aims to detect aspect terms and predict their sentiment polarity. While existing methods have made great efforts in aligning images and text for improved MABSA performance, they still struggle to effectively mitigate the challenge of the noisy correspondence problem (NCP): the text description is often not well-aligned with the visual content. To alleviate NCP, in this paper, we introduce Aspect-driven Alignment and Refinement (ADAR), which is a two-stage coarse-to-fine alignment framework. In the first stage, ADAR devises a novel Coarse-to-fine Aspect-driven Alignment Module, which introduces Optimal Transport (OT) to learn the coarse-grained alignment between visual and textual features. Then the adaptive filter bin is applied to remove the irrelevant image regions at a fine-grained level; In the second stage, ADAR introduces an Aspect-driven Refinement Module to further refine the cross-modality feature representation. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over state-of-the-art performance in the MABSA task.
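Coarse OT-based alignment of this kind is often computed with entropic regularization and Sinkhorn iterations; the sketch below (with an assumed cosine-distance cost, epsilon, and iteration count) returns a transport plan between region and token features and is not ADAR's implementation.

```python
# Hedged sketch of entropic optimal transport (Sinkhorn) between visual regions and text tokens.
import torch

def sinkhorn_alignment(vis: torch.Tensor, txt: torch.Tensor,
                       eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """vis: (R, D) region features; txt: (T, D) token features -> (R, T) transport plan."""
    vis = torch.nn.functional.normalize(vis, dim=-1)
    txt = torch.nn.functional.normalize(txt, dim=-1)
    cost = 1.0 - vis @ txt.t()                    # cosine distance as the OT cost
    K = torch.exp(-cost / eps)
    a = torch.full((vis.size(0),), 1.0 / vis.size(0))   # uniform marginals
    b = torch.full((txt.size(0),), 1.0 / txt.size(0))
    u = torch.ones_like(a)
    for _ in range(iters):                        # alternating scaling updates
        u = a / (K @ (b / (K.t() @ u)))
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)    # plan with the given marginals

if __name__ == "__main__":
    print(sinkhorn_alignment(torch.randn(6, 32), torch.randn(9, 32)).shape)  # (6, 9)
```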



Paperid:572 Poster
Authors:Jiansong Qi,Yaping Huang,Ying Zhang,Sihui Zhang,Mei Tian,Yi Tian,Fanchao Meng,Lin Guan,Tianyi Chang
Abstract:
As a non-contact method, eye-tracking data can be used to diagnose people with Autism Spectrum Disorder (ASD) by comparing the differences in eye movements between ASD and healthy people. However, existing works mainly employ a simple free-viewing paradigm or a visual search paradigm with restricted or unnatural stimuli to collect the gaze patterns of adults or children with an average age of 6-to-8 years, hindering the early diagnosis and intervention of preschool children with ASD. In this paper, we propose a novel method for identifying children with ASD that has three unique features: First, we design a novel eye-tracking paradigm that records Visual Question Answering (VQA) driven gaze patterns in complex natural scenes as a powerful guide for differentiating children with ASD. Second, we contribute a carefully designed dataset, named VQA4ASD, for collecting VQA-driven eye-tracking data from 2-to-6-year-old ASD and healthy children. To the best of our knowledge, this is the first dataset focusing on the early diagnosis of preschool children, which could facilitate the community to understand and explore the visual behaviors of ASD children. Third, we further develop a VQA-guided cooperative ASD screening network (VQA-CASN), in which both task-agnostic and task-specific visual scanpaths are explored simultaneously for ASD screening. Extensive experiments demonstrate that the proposed VQA-CASN achieves competitive performance with the proposed VQA-driven eye-tracking paradigm. The code and dataset will be publicly available.



Paperid:573 Poster
Authors:Guan-Yuan Chen,Von-Wun Soo
Abstract:
Text-to-music generation models have shown great promise in their ability to generate high-quality music aligned with users' textual descriptions. These models effectively capture abstract/global musical features such as style and mood. However, they often inadequately produce the precise rendering of critical music loop attributes, including melody, rhythms, and instrumentation, which are essential for modern music loop production. To overcome this limitation, this paper proposes a Loops Transformer and a Multi-Stage Cross Attention mechanism that enable a cohesive integration of textual and MIDI input specifications. Additionally, a novel Instrument-Aware Reinforcement Learning technique is introduced to ensure the correct adoption of instrumentation. We demonstrate that the proposed model can generate music loops that simultaneously satisfy the conditions specified by both natural language texts and MIDI input, ensuring coherence between the two modalities. We also show that our model outperforms the state-of-the-art baseline model, MusicGen, in both objective metrics (by lowering the FAD score by 1.3, indicating superior quality with lower scores, and by improving the Normalized Dynamic Time Warping Distance with given melodies by 12%) and subjective metrics (by +2.56% in OVL, +5.42% in REL, and +7.74% in Loop Consistency). These improvements highlight our model's capability to produce musically coherent loops that satisfy the complex requirements of contemporary music production, representing a notable advancement in the field. Generated music loop samples can be explored at: https://loopstransformer.netlify.app/.



Paperid:574 Poster
Authors:Wenjie Wei,Yu Liang,Ammar Belatreche,Yichen Xiao,Honglin Cao,Zhenbang Ren,Guoqing Wang,Malu Zhang,Yang Yang
Abstract:
Brain-inspired Spiking Neural Networks (SNNs) leverage sparse spikes to represent information and process them in an asynchronous event-driven manner, offering an energy-efficient paradigm for the next generation of machine intelligence. However, the current focus within the SNN community prioritizes accuracy optimization through the development of large-scale models, limiting their viability in resource-constrained and low-power edge devices. To address this challenge, we introduce a lightweight and hardware-friendly Quantized SNN (Q-SNN) that applies quantization to both synaptic weights and membrane potentials. By significantly compressing these two key elements, the proposed Q-SNNs substantially reduce both memory usage and computational complexity. Moreover, to prevent the performance degradation caused by this compression, we present a new Weight-Spike Dual Regulation (WS-DR) method inspired by information entropy theory. Experimental evaluations on various datasets, including static and neuromorphic, demonstrate that our Q-SNNs outperform existing methods in terms of both model size and accuracy. These state-of-the-art results in efficiency and efficacy suggest that the proposed method can significantly improve edge intelligent computing.
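Low-bit quantization of weights and membrane potentials is typically trained with a straight-through estimator; the sketch below shows a generic uniform quantizer of that kind, with an assumed bit-width and clipping range rather than the Q-SNN configuration.

```python
# Hedged sketch of uniform low-bit quantization with a straight-through estimator (STE).
import torch

class QuantizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, bits: int = 2) -> torch.Tensor:
        # uniform quantization of values clipped to [-1, 1]
        levels = 2 ** bits - 1
        x_clipped = x.clamp(-1.0, 1.0)
        return torch.round((x_clipped + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through: pass gradients unchanged through the rounding step
        return grad_output, None

def quantize(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    return QuantizeSTE.apply(x, bits)

if __name__ == "__main__":
    w = torch.randn(4, requires_grad=True)   # could stand for weights or membrane potentials
    q = quantize(w, bits=2)
    q.sum().backward()
    print(q, w.grad)
```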



Paperid:575 Poster
Authors:Changshuo Wang,Mingzhe Yu,Lei Wu,Lei Meng,Xiang Li,Xiangxu Meng
Abstract:
In recent years, diffusion models have dominated the field of image generation with their outstanding generation quality. However, pre-trained large-scale diffusion models are generally trained using fixed-size images and fail to maintain their performance at different aspect ratios. Existing methods for generating arbitrary-size images based on diffusion models face several issues, including the requirement for extensive finetuning or training, sluggish sampling speed, and noticeable edge artifacts. This paper presents the InstantAS method for arbitrary-size image generation. This method performs non-overlapping minimum coverage segmentation on the target image, minimizing the generation of redundant information and significantly improving sampling speed. To maintain the consistency of the generated image, we also propose the Inter-Domain Distribution Bridging method to integrate the distribution of the entire image and suppress the separation of diffusion paths in different regions of the image. Furthermore, we propose the dynamic semantic-guided cross-attention method, allowing for the control of different regions using different semantics. InstantAS can be applied to nearly any existing pre-trained Text-to-Image diffusion model. Experimental results show that InstantAS has better fusion capabilities than previous arbitrary-size image generation methods and achieves far faster sampling.



Paperid:576 Poster
Authors:Shalayiding Sirejiding,Bayram Bayramli,Yuxiang Lu,Yuwen Yang,Tamam Alsarhan,Hongtao Lu,Yue Ding
Abstract:
Traditional multi-task learning often relies on explicit task interaction mechanisms to enhance multi-task performance. However, these approaches encounter challenges such as negative transfer when jointly learning multiple weakly correlated tasks. Additionally, these methods handle encoded features at a large scale, which escalates computational complexity to ensure dense prediction task performance. In this study, we introduce a Task-Interaction-Free Network (TIF) for multi-task learning, which diverges from explicitly designed task interaction mechanisms. Firstly, we present a Scale Attentive-Feature Fusion Module (SAFF) to enhance each scale in the shared encoder to have rich task-agnostic encoded features. Subsequently, our proposed task and scale-specific decoders efficiently decode the enhanced features shared across tasks without necessitating task-interaction modules. Concretely, we utilize a Self-Feature Distillation Module (SFD) to explore task-specific features at lower scales and the Low-To-High Scale Feature Diffusion Module (LTHD) to diffuse global pixel relationships from low-level to high-level scales. Experiments on publicly available multi-task learning datasets validate that our TIF attains state-of-the-art performance.



Paperid:577 Poster
Authors:Miao Liu,Jing Wang,Xinyuan Qian,Haizhou Li
Abstract:
As one of the crucial elements in human-robot interaction, responsive listening head generation has attracted considerable attention from researchers. It aims to generate a listening head video based on the speaker's audio and video as well as a reference listener image. However, existing methods exhibit two limitations: 1) the generation capability of their models is limited, resulting in generated videos that are far from real ones, and 2) they mostly employ autoregressive generative models, unable to mitigate the risk of error accumulation. To tackle these issues, we propose Listenformer, which leverages the powerful temporal modeling capability of transformers for generation. It can perform non-autoregressive prediction with the proposed two-stage training method, simultaneously achieving temporal continuity and overall consistency in the outputs. To fully utilize the information from the speaker inputs, we design an audio-motion attention fusion module, which improves the correlation of audio and motion features for accurate response. Additionally, a novel decoding method called sliding window with a large shift is proposed for Listenformer, demonstrating both excellent computational efficiency and effectiveness. Extensive experiments show that Listenformer outperforms the existing state-of-the-art methods on the ViCo and L2L datasets. A perceptual user study further demonstrates the comprehensive performance of our method in terms of generation diversity, identity preservation, speaker-listener synchronization, and attitude matching.



Paperid:578 Poster
Authors:Keke Tang,Zhensu Wang,Weilong Peng,Lujie Huang,Le Wang,Peican Zhu,Wenping Wang,Zhihong Tian
Abstract:
Adversarial attacks on point clouds are crucial for assessing and improving the adversarial robustness of 3D deep learning models. Despite leveraging various geometric constraints, current adversarial attack strategies often suffer from inadequate imperceptibility. Given that adversarial perturbations tend to disrupt the inherent symmetry in objects, we recognize this disruption as the primary cause of the lack of imperceptibility in these attacks. In this paper, we introduce a novel framework, symmetry-aware imperceptible adversarial attacks on 3D point clouds (SymAttack), to address this issue. Our approach starts by identifying part- and patch-level symmetry elements, and grouping points based on semantic and Euclidean distances, respectively. During the adversarial attack iterations, we intentionally adjust the perturbation vectors on symmetric points relative to their symmetry plane. By preserving symmetry within the attack process, SymAttack significantly enhances imperceptibility. Extensive experiments validate the effectiveness of SymAttack in generating imperceptible adversarial point clouds, demonstrating its superiority over the state-of-the-art methods. Codes will be made public upon paper acceptance.
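The symmetry-preserving adjustment can be illustrated as follows: for each pair of points mirrored about a plane, the second point's perturbation is replaced by the reflection of its partner's perturbation across that plane. The pair indices and the symmetry plane are assumed to be given; this is not the authors' SymAttack code.

```python
# Hedged sketch of keeping point-cloud perturbations symmetric about a plane.
import torch

def reflect(v: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
    """Reflect vectors v (N, 3) across a plane through the origin with normal (3,)."""
    normal = normal / normal.norm()
    return v - 2.0 * (v @ normal).unsqueeze(-1) * normal

def symmetrize_perturbation(delta: torch.Tensor, pairs: torch.Tensor,
                            normal: torch.Tensor) -> torch.Tensor:
    """delta: (N, 3) per-point perturbations; pairs: (M, 2) indices of symmetric points."""
    delta = delta.clone()
    src, dst = pairs[:, 0], pairs[:, 1]
    delta[dst] = reflect(delta[src], normal)   # mirror the partner's perturbation
    return delta

if __name__ == "__main__":
    delta = torch.randn(6, 3)
    pairs = torch.tensor([[0, 1], [2, 3]])
    print(symmetrize_perturbation(delta, pairs, torch.tensor([1.0, 0.0, 0.0])))
```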



Paperid:579 Poster
Authors:Yi Lu,Shenghao Ren,Qiu Shen,Xun Cao
Abstract:
Whole-body motion imitation has gained wide attention in recent years as it can enhance the locomotive capabilities of humanoid robots. In this task, non-intrusive human motion capture with RGB cameras is commonly used for its low cost, efficiency, portability and user-friendliness. However, RGB-based methods always face the problem of depth ambiguity, leading to inaccurate and unstable imitation. Accordingly, we propose to introduce pressure sensors into the non-intrusive humanoid motion imitation system for two reasons: first, pressure can be used to estimate the contact relationship and interaction force between the human and the ground, which play a key role in balancing and stabilizing motion; second, pressure can be measured in an almost non-intrusive manner, preserving the experience of the human demonstrator. In this paper, we establish an RGB-Pressure (RGB-P) based humanoid imitation system, achieving accurate and stable end-to-end mapping from human body models to robot control parameters. Specifically, we use an RGB camera to capture human posture and pressure insoles to measure the underfoot pressure during the movements of the human demonstrator. Then, a constraint relationship between pressure and pose is studied to refine the estimated pose according to the support modes and balance mechanism, thereby enhancing consistency between human and robot motions. Experimental results demonstrate that fusing RGB and pressure can enhance overall robot motion execution performance by improving stability while maintaining imitation similarity.



Paperid:580 Poster
Authors:Sifan Wu,Haipeng Chen,Yifang Yin,Sihao Hu,Runyang Feng,Yingying Jiao,Ziqi Yang,Zhenguang Liu
Abstract:
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision, and it remains difficult because of complex video scenes, such as video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone generation. Comparatively, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, the performance of existing methods for pose estimation falls short due to the lack of ability to leverage both local joint (heatmap) information and global motion (feature) dynamics. To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation, which effectively concentrates on both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flows to retrieve robust local joint features. Given that local joint features and global motion flows are complementary, we further propose a progressive joint-motion mutual learning scheme that synergistically exchanges information and interactively learns between joint features and motion flows to improve the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective to avoid learning redundant information from multiple cues. Empirical experiments show our method outperforms prior art on three challenging benchmarks.



Paperid:581 Poster
Authors:Benhui Zhang,Junyu Gao,Yuan Yuan
Abstract:
The emergence of video captioning makes it possible to automatically generate natural language descriptions for a given video. However, generating detailed video descriptions that incorporate domain-specific information remains an unsolved challenge, holding significant research and application value, particularly in domains such as sports commentary generation. Moreover, sports event commentary goes beyond being a mere game report; it involves entertaining, metaphorical, and emotional descriptions. To promote the field of automatic sports commentary generation, in this paper, we introduce a novel dataset, the Basketball Highlight Commentary (BH-Commentary), comprising approximately 4K basketball highlight videos with ground-truth commentaries from professional commentators. In addition, we propose an end-to-end framework as a benchmark for the basketball highlight commentary generation task, in which a lightweight and effective prompt strategy is designed to enhance alignment fusion among visual and textual features. Extensive experiments on the BH-Commentary dataset demonstrate the validity of the dataset and the effectiveness of the proposed benchmark for sports highlight commentary generation. (The dataset is available at https://anonymous.4open.science/r/dataset-DC8E)



Paperid:582 Poster
Authors:Binbin Xu,Jun Yin,Nan Zhang
Abstract:
Multi-View Clustering (MVC) aims to mine complementary information across different views to partition multi-view data more effectively and has attracted considerable interest. However, existing deep multi-view clustering methods frequently neglect the exploration of structural information within individual views and lack the learning of structural consistency among views, which results in limitations in clustering performance. In this paper, we introduce a novel multi-view clustering framework based on graph consistency learning to address this issue. Specifically, we design intra-view graph contrastive learning to uncover structural information within each view and achieve structural consistency objectives through cross-view graph consistency learning. Additionally, to address the conflict between different learning objectives when trained in the same space, we introduce two new feature spaces, one for cluster-level contrastive learning and the other for instance-level contrastive learning. Subsequently, to make the most of the discriminative information from all views, we concatenate high-level features from all views to form global features and employ self-supervision to promote clustering consistency across different views. Experimental results on several challenging datasets demonstrate the outstanding performance of our proposed method.



Paperid:583 Poster
Authors:Yingjie Gao,Yanan Zhang,Ziyue Huang,Nanqing Liu,Di Huang
Abstract:
In recent years, Few-Shot Object Detection (FSOD) has gained widespread attention and made significant progress due to its ability to learn models with strong generalization power using extremely limited annotated data. Although the fine-tuning-based paradigm for FSOD has become mainstream, where detectors are initially pretrained on base classes with sufficient samples and then fine-tuned on novel classes with few annotated samples, the scarcity of samples in novel classes hampers the precise capture of their data distribution. To address this issue, we propose a novel framework for few-shot object detection, namely Prototype-based Soft-labels and Test-Time Learning (PS-TTL). Specifically, we design a Test-Time Learning (TTL) module that employs a mean-teacher network for self-training to discover novel instances on test data, effectively alleviating the problem of overfitting to the distribution of the base classes. Furthermore, we develop a Prototype-based Soft-labels (PS) strategy by assessing similarities between pseudo-labels and category prototypes to unleash the potential of low-quality pseudo-labels, thereby significantly mitigating the constraints posed by few-shot samples. Extensive experiments on both the VOC and COCO benchmarks show that PS-TTL achieves a new state-of-the-art, highlighting its effectiveness.



Paperid:584 Poster
Authors:Hancheng Zhu,Ju Shi,Zhiwen Shao,Rui Yao,Yong Zhou,Jiaqi Zhao,Leida Li
Abstract:
Image Aesthetic Quality Assessment (IAQA) aims to simulate users' visual perception to judge the aesthetic quality of images. In social media, users' aesthetic experiences are often reflected in their textual comments regarding the aesthetic attributes of images. To fully explore the attribute information perceived by users for evaluating image aesthetic quality, this paper proposes an image aesthetic quality assessment method based on attribute-driven multimodal hierarchical prompts. Unlike existing IAQA methods that utilize multimodal pre-training or straightforward prompts for model learning, the proposed method leverages attribute comments and quality-level text templates to hierarchically learn the aesthetic attributes and quality of images. Specifically, we first leverage users' aesthetic attribute comments to perform prompt learning on images. The learned attribute-driven multimodal features can comprehensively capture the semantic information of image aesthetic attributes perceived by users. Then, we construct text templates for different aesthetic quality levels to further facilitate prompt learning through semantic information related to the aesthetic quality of images. The proposed method can explicitly simulate users' aesthetic judgment of images to obtain more precise aesthetic quality. Experimental results demonstrate that the proposed IAQA method based on hierarchical prompts outperforms existing methods significantly on multiple IAQA databases. Our source code is provided in the supplementary material, and we will release all source code along with this paper.



Paperid:585 Poster
Authors:Xu Han,Yuan Tang,Zhaoxuan Wang,Xianzhi Li
Abstract:
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. We shall release the code and model upon publication of this work.



Paperid:586 Poster
Authors:Rui Xie,Anlong Ming,Shuai He,Yi Xiao,Huadong Ma
Abstract:
Image aesthetics assessment (IAA) primarily examines image quality from a user-centric perspective and can be applied to guide various applications, including image capture, recommendation, and enhancement. The fundamental issue in IAA revolves around the quantification of image aesthetics. Existing methodologies rely on assigning a scalar (or a distribution) to represent aesthetic value based on conventional practices, which confines this scalar within a specific range and artificially labels it. However, conventional methods rarely incorporate research on interpretability, particularly lacking systematic responses to the following three fundamental questions: Can aesthetic qualities be quantified? What is the nature of quantifying aesthetics? How can aesthetics be accurately quantified? In this paper, we present a law called "Special Relativity" of IAA (SR-IAA) that addresses the aforementioned core questions. We have developed a Multi-Attribute IAA Framework (MAINet), which serves as a preliminary validation for SR-IAA within the existing datasets and achieves state-of-the-art (SOTA) performance. Specifically, our metrics on multi-attribute assessment outperform the second-best performance by 8.06% (AADB), 1.67% (PARA), and 2.44% (SPAQ) in terms of SRCC. We anticipate that our research will offer innovative theoretical guidance to the IAA research community. Codes are available in the supplementary material.



Paperid:587 Poster
Authors:Jiexuan Yan,Sheng Huang,Nankun Mu,Luwen Huangfu,Bo Liu
Abstract:
Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring long-tailed multi-label image classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines.
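The optimization objective named in the abstract is the published Asymmetric Loss for multi-label classification. A minimal PyTorch sketch of that loss follows; the hyperparameter values are common defaults, not necessarily the paper's settings:

import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
    # logits, targets: (B, C); targets are {0, 1} multi-hot labels.
    # Easy negatives are down-weighted (gamma_neg) and probability-shifted (clip),
    # which suppresses the negatives that dominate long-tailed multi-label data.
    p = torch.sigmoid(logits)
    p_m = (p - clip).clamp(min=0)
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_m.pow(gamma_neg) * torch.log((1 - p_m).clamp(min=eps))
    return -(loss_pos + loss_neg).sum(dim=1).mean()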



Paperid:588 Poster
Authors:Qiao Li,Xiaomeng Fu,Xi Wang,Jin Liu,Xingyu Gao,Jiao Dai,Jizhong Han
Abstract:
With the rapid advancements of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse the unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a specific image is utilized as a member of a model's training set, Membership Inference Attack (MIA) is proposed to serve as a tool for privacy protection. Current MIA methods predominantly utilize pixel-wise comparisons as distinguishing clues, considering the pixel-level memorization characteristic of diffusion models. However, it is practically impossible for text-to-image models to memorize all the pixel-level information in massive training sets. Therefore, we move to the more advanced structure-level memorization. Observations on the diffusion process show that the structures of members are better preserved compared to those of nonmembers, indicating that diffusion models possess the capability to remember the structures of member images from training sets. Drawing on these insights, we propose a simple yet effective MIA method tailored for text-to-image diffusion models. Extensive experimental results validate the efficacy of our approach. Compared to current pixel-level baselines, our approach not only achieves state-of-the-art performance but also demonstrates remarkable robustness against various distortions.



Paperid:589 Poster
Authors:Chenxi Ma,Weimin Tan,Shili Zhou,Bo Yan
Abstract:
With the rising interest in multi-camera cross-spectral systems, cross-spectral images have been widely used in computer vision and image processing. Therefore, an effective super-resolution (SR) method is significant in providing high-resolution (HR) cross-spectral images for different research and applications. However, existing SR methods rarely consider utilizing cross-spectral information to assist the SR of visible images and cannot handle the complex degradation (noise, high brightness, low light) and misalignment problem in low-resolution (LR) cross-spectral images. Here, we first explore the potential of using near-infrared (NIR) image guidance for better SR, based on the observation that NIR images can preserve valuable information for recovering adequate image details. To take full advantage of the cross-spectral prior, we propose a novel $\textbf{C}$ross-$\textbf{S}$pectral $\textbf{P}$rior guided image $\textbf{SR}$ approach ($\textbf{CSPSR}$). Concretely, we design a cross-view matching (CVM) module and a dynamic multi-modal fusion (DMF) module to enhance the spatial correlation between cross-spectral images and to bridge the multi-modal feature gap, respectively. The DMF module facilitates adaptive feature adaptation and effective information transmission through a dynamic convolution and a cross-spectral feature transfer (CSFT) unit. Extensive experiments demonstrate the effectiveness of our CSPSR, which can exploit the prominent cross-spectral information to produce state-of-the-art results.



Paperid:590 Poster
Authors:JiYuan Wang,Chunyu Lin,Lang Nie,Kang Liao,Shuwei Shao,Yao Zhao
Abstract:
Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rain, snow, etc. In this paper, we propose a novel robust depth estimation method with a customized contrastive learning mode for diffusion models, named D4RD, to resist performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the `trinity' contrastive scheme. It takes the sampled noise of the forward diffusion process as a natural reference and guides the predicted noise in different scenes to gather towards a more stable and precise optimum. Meanwhile, we further extend the noise-level trinity to more generic feature and image levels, building a multi-level contrast to distribute the burden of robust perception across the overall network. Moreover, before handling complex scenarios, we enhance the stability of the baseline diffusion model with three simple but effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments show that D4RD achieves superior performance over existing state-of-the-art (SoTA) solutions on both synthetic corruption datasets and real-world weather conditions. The code will be available.



Paperid:591 Poster
Authors:Shuanglin Yan,Jun Liu,Neng Dong,Liyan Zhang,Jinhui Tang
Abstract:
In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to find images of the same identity described by a text sentence from a pool of candidate images. Benefiting from Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), the TIReID techniques have achieved remarkable progress recently. However, most existing methods only focus on instance-level matching and ignore identity-level matching, which involves associating multiple images and texts belonging to the same person. To this end, we propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Our Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by ‘initialize, adapt, enrich, then aggregate’. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update prototypes conditioned on intra-modal and inter-modal instances to ensure prototype diversity. Finally, we design an adaptive prototype aggregation module to aggregate these prototypes, generating final identity-enriched prototypes. With identity-enriched prototypes, we diffuse their rich identity information to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot compared to existing TIReID methods.
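The exact form of the prototype-to-instance contrastive loss is not given in the abstract; a common InfoNCE-style sketch, assuming one identity-enriched prototype per identity (all tensor names and the temperature value are hypothetical), is:

import torch
import torch.nn.functional as F

def prototype_instance_contrastive(features, identity_ids, prototypes, temperature=0.07):
    # features:     (B, D) image or text instance embeddings.
    # identity_ids: (B,) integer identity label of each instance.
    # prototypes:   (N_id, D) one prototype per identity.
    # The matching prototype is the positive; all other prototypes act as negatives.
    f = F.normalize(features, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = f @ p.t() / temperature   # (B, N_id)
    return F.cross_entropy(logits, identity_ids)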



Paperid:592 Poster
Authors:Wenju Sun,Qingyong Li,Siyu Zhang,Wen Wang,Yangliao Geng
Abstract:
The posterior estimation of parameters based on Bayesian theory is a crucial technique in Incremental Learning (IL). The estimated posterior is typically utilized to impose loss regularization, which aligns the current training model parameters with the previously learned posterior to mitigate catastrophic forgetting, a major challenge in IL. However, this additional loss regularization can also be detrimental to model learning, preventing it from reaching the true global optimum. To overcome this limitation, this paper introduces a novel Bayesian IL framework, Robust Parameter Posterior Fusion (RP$^2$F). Unlike traditional methods, RP$^2$F directly estimates the parameter posterior for new data without introducing extra loss regularization, which allows the model to accommodate new knowledge more sufficiently. It then fuses this new posterior with the existing ones based on the Maximum A Posteriori (MAP) principle, ensuring effective knowledge sharing across tasks. Furthermore, RP$^2$F incorporates a common parameter-robustness prior to facilitate a seamless integration during posterior fusion. Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets show that RP$^2$F not only effectively mitigates catastrophic forgetting but also achieves backward knowledge transfer.
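The abstract does not state the parametric form of the posteriors. Under a common diagonal-Gaussian (Laplace-style) approximation, MAP fusion of an old and a new posterior reduces to a precision-weighted combination, sketched below with hypothetical names:

import torch

def fuse_gaussian_posteriors(mu_old, prec_old, mu_new, prec_new):
    # mu_*:   per-parameter posterior means (tensors of equal shape).
    # prec_*: per-parameter precisions (inverse variances), e.g. from a
    #         diagonal Fisher-information estimate.
    # The product of two Gaussians is Gaussian again, so the fused MAP
    # estimate is the precision-weighted mean.
    prec_fused = prec_old + prec_new
    mu_fused = (prec_old * mu_old + prec_new * mu_new) / prec_fused.clamp(min=1e-12)
    return mu_fused, prec_fused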



Paperid:593 Poster
Authors:Aoqiang Zhu,Min Hu,Xiaohua Wang,Jiaoyun Yang,Yiming Tang,Fuji Ren
Abstract:
Multimodal sentiment analysis (MSA) aims to integrate multiple modalities of information to better understand human sentiment. The current research mainly focuses on conducting multimodal fusion and representation learning, which neglects the under-optimized modal representations generated by the imbalance of unimodal performances in joint learning. Moreover, the size of labeled datasets limits the generalization ability of existing supervised models used in MSA. To address the above issues, this paper proposes a knowledge-enhanced self-supervised balanced representation approach (KEBR) to capture common sentimental knowledge in unlabeled videos and explore the optimization issue of information imbalance between modalities. First, a text-based cross-modal fusion method (TCMF) is constructed, which injects the non-verbal information from the videos into the semantic representation of text to enhance the multimodal representation of text. Then, a multimodal cosine constrained loss (MCC) is designed to constrain the fusion of non-verbal information in joint learning to balance the representation of multimodal information. Finally, with the help of sentiment knowledge and non-verbal information, KEBR conducts sentiment word masking and sentiment intensity prediction, so that the sentiment knowledge in the videos is embedded into the pre-trained multimodal representation in a balanced manner. Experimental results on two publicly available datasets MOSI and MOSEI show that KEBR significantly outperforms the baseline, achieving new state-of-the-art results.



Paperid:594 Poster
Authors:Rongwen Li,Haiyang Hu,Liang Du,Jiarong Chen,Bingbing Jiang,Peng Zhou
Abstract:
Multi-view clustering is an important task in multimedia and machine learning. In multi-view clustering, multi-view spectral clustering is one kind of the most popular and effective methods. However, existing multi-view spectral clustering ignores the fairness in the clustering result, which may cause discrimination. To tackle this problem, in this paper, we propose an innovative Fair Multi-view Spectral Clustering (FMSC) method. Firstly, we provide a new perspective of fairness from the graph theory viewpoint, which constructs a relation between fairness and the average degree in graph theory. Secondly, based on this relation, we design a novel fairness-aware regularized term, which has the same form as the ratio cut in spectral clustering. Thirdly, we seamlessly plug this fairness-aware regularized term into the multi-view spectral clustering, leading to our one-stage FMSC, which can directly obtain the final clustering result without any post-processing. We also conduct extensive experiments compared with state-of-the-art fair clustering and multi-view clustering methods, which shows that our method can achieve better fairness.
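For reference, the standard ratio cut whose form the fairness-aware regularizer is said to share is, for a $k$-way partition $A_1,\dots,A_k$ of a weighted graph with edge weights $w_{uv}$:

$$
\mathrm{RatioCut}(A_1,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k}\frac{\mathrm{cut}(A_i,\bar{A}_i)}{|A_i|},
\qquad
\mathrm{cut}(A,\bar{A})=\sum_{u\in A,\; v\in\bar{A}} w_{uv}.
$$

How FMSC instantiates this form with its average-degree-based notion of fairness is not detailed in the abstract.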



Paperid:595 Poster
Authors:Junbo Hu,Zhixin Li
Abstract:
Transformer-based encoders that encode both region and grid features are the preferred choice for the image captioning task due to their multi-head self-attention mechanism. This mechanism ensures superior capture of relationships and contextual information between various regions in an image. However, because Transformer blocks are stacked, self-attention processes the visual features multiple times, increasing computational cost and producing a great deal of redundant feature computation. In this paper, we propose a novel Distilled Cross-Combination Transformer (DCCT) network. Specifically, we first design a distillation cascade fusion encoder (DCFE) to filter out redundant features in visual features that affect attentional focus, obtaining refined features. Additionally, we introduce a parallel cross-fusion attention module (PCFA) that fully utilizes the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT strategy outperforms many state-of-the-art techniques and attains exceptional performance.



Paperid:596 Poster
Authors:Cheng Shen,Liquan Shen,Mengyao Li,Meng Yu
Abstract:
Sonar imaging is widely utilized in submarine and underwater detection missions. However, due to the complex underwater environment, sonar images suffer from complex distortions and noise, making it hard for detection models to extract clean high-level features for detection. Existing works introduce denoised images as pseudo labels to assist the network in extracting clean features, but do not fully consider the rationality of the pseudo labels. To this end, we propose an Efficient Pseudo Labels-Driven Underwater Forward-looking Sonar Images Object Detection algorithm (EPL-UFLSID). Specifically, we first design a Gaussian Mixture Model based Deep Image Prior (GMMDIP) network to generate denoised sonar images by setting the GMM distribution as its input. After that, to select the most detection-friendly of the denoised images generated by GMMDIP as efficient pseudo labels, a Detection-Friendly Image Quality Assessment network (DFIQA) is designed, which is also able to help EPL-UFLSID further distill cleaner features from the pseudo labels to improve detection performance. Extensive experimental results show that our EPL-UFLSID reaches average precision (AP) of 67.8%/39.8% and average recall (AR) of 73.7%/49.6% on two real sonar datasets, which outperforms SOTA underwater forward-looking sonar images object detection algorithms.



Paperid:597 Poster
Authors:Zidu Wang,Xiangyu Zhu,Jiang Yu,Tianshuo Zhang,Zhen Lei
Abstract:
3D textured face reconstruction from sketches applicable in many scenarios such as animation, 3D avatars, artistic design, missing people search, etc., is a highly promising but underdeveloped research topic. On the one hand, the stylistic diversity of sketches leads to existing sketch-to-3D-face methods only being able to handle pose-limited and realistically shaded sketches. On the other hand, texture plays a vital role in representing facial appearance, yet sketches lack this information, necessitating additional texture control in the reconstruction process. This paper proposes a novel method for reconstructing controllable textured and detailed 3D faces from sketches, named S2TD-Face. S2TD-Face introduces a two-stage geometry reconstruction framework that directly reconstructs detailed geometry from the input sketch. To keep geometry consistent with the delicate strokes of the sketch, we propose a novel sketch-to-geometry loss that ensures the reconstruction accurately fits the input features like dimples and wrinkles. Our training strategies do not rely on hard-to-obtain 3D face scanning data or labor-intensive hand-drawn sketches. Furthermore, S2TD-Face introduces a texture control module utilizing text prompts to select the most suitable textures from a library and seamlessly integrate them into the geometry, resulting in a 3D detailed face with controllable texture. S2TD-Face surpasses existing state-of-the-art methods in extensive quantitative and qualitative experiments. The code will be publicly available.



Paperid:598 Poster
Authors:Ruofan Jia,Weiying Xie,Jie Lei,Yunsong Li
Abstract:
In practical object detection scenarios, distributed data and stringent privacy protections significantly limit the feasibility of traditional centralized training methods. Federated learning (FL) emerges as a promising solution to this dilemma. Nonetheless, the issue of data heterogeneity introduces distinct challenges to federated object detection, evident in diminished object perception, classification, and localization abilities. In response, we introduce a task-driven federated learning methodology, dubbed Adaptive Hierarchical Aggregation (FedAHA), tailored to overcome these obstacles. Our algorithm unfolds in two strategic phases from shallow-to-deep layers: (1) Structure-aware Aggregation (SAA) aligns feature extractors during the aggregation phase, thus bolstering the global model's object perception capabilities; (2) Convex Semantic Calibration (CSC) leverages convex function theory to average semantic features instead of model parameters, enhancing the global model's classification and localization precision. We demonstrate the effectiveness of the two proposed modules both experimentally and theoretically. Our method consistently outperforms state-of-the-art methods across multiple valuable application scenarios. Moreover, we build a real FL system using Raspberry Pis to demonstrate that our approach achieves a good trade-off between performance and efficiency.



Paperid:599 Poster
Authors:Zehang LIN,Jiayuan Xie,Zhenguo Yang,Yi Yu,Qing Li
Abstract:
News event discovery refers to the identification and detection of news events using multimodal data on social media. Currently, most works assume that the test set consists of known events. However, in real life, the emergence of new events is more frequent, which invalidates this assumption. In this paper, we propose a Dynamic Augmentation and Entropy Optimization (DAEO) model to address the scenario of generalized news event discovery, which requires the model to not only identify known events but also distinguish various new events. Specifically, we first introduce a multimodal augmentation module, which utilizes adversarial learning to enhance the multimodal representation capability. Secondly, we design an adaptive entropy optimization strategy combined with a self-distillation method, which uses multi-view pseudo-label consistency to improve the model's performance on both known and new events. In addition, we collect a multimodal news event discovery (MNED) dataset of 161,350 samples annotated with 66 real-world events. Extensive experimental results on the MNED dataset demonstrate the effectiveness of our proposed method. Our dataset is available at https://anonymous.4open.science/r/2FF5.



Paperid:600 Poster
Authors:Chengshun Wang,Na Zhao
Abstract:
The remarkable success of neural radiance fields in low-level vision tasks such as novel view synthesis has motivated its extension to high-level semantic understanding, giving rise to the concept of the neural semantic field (NeSF). NeSF aims to simultaneously synthesize novel view images and associated semantic segmentation maps. Generalizable NeSF, in particular, is an appealing direction as it can generalize to unseen scenes for synthesizing images and semantic maps for novel views, thereby avoiding the need for tedious per-scene optimization. However, existing approaches to generalizable NeSF fall short in fully exploiting the geometric and semantic features as well as their mutual interactions, resulting in suboptimal performance in both novel-view image synthesis and semantic segmentation. To address this limitation, we propose Geometry-Semantics Synergy for Generalized Neural Semantic Fields (GS$^2$-GNeSF), a novel approach aimed at improving the performance of generalizable NeSF through the comprehensive construction and synergistic interaction of geometric and semantic features. In GS$^2$-GNeSF, we introduce a robust geometric prior generator to generate the cost volumes and depth prior, which aid in constructing geometric features and facilitating geometry-aware sampling. Leveraging the depth prior, we additionally construct a global semantic context for the target view. This context provides two types of compensation information to enhance geometry and semantic features, achieved through boundary detection and semantic segmentation, respectively. Lastly, we present an efficient dual-directional interactive attention mechanism to foster deep interactions between the enhanced geometric and semantic features. Experiments conducted on both synthetic and real datasets demonstrate that our GS$^2$-GNeSF outperforms existing methods in both novel view and semantic map synthesis, highlighting its effectiveness in generalizing neural semantic fields for unseen scenes.



Paperid:601 Poster
Authors:Zhiwen Yang,Liang Li,Jiehua Zhang,Tingyu Wang,Yaoqi Sun,Chenggang Yan
Abstract:
Incremental monocular depth estimation aims to continuously learn from new domains while maintaining its performance on old domains. The catastrophic forgetting problem is the key challenge when the model adapts to dynamic scene variations. Previous methods usually address this forgetting problem by storing raw samples from the old domain, allowing the model to review the knowledge of the old domain. However, due to concerns of data privacy and security, our objective is to tackle the incremental monocular depth estimation problem in more stringent scenarios without the need for replaying samples. In this paper, we attribute the cross-domain catastrophic forgetting to the domain distribution shifts and continuous variations of the depth space. To this end, we propose Domain Shared and Specific Prompt Learning (DSSP) for incremental monocular depth estimation. In detail, to alleviate the domain distribution shift, complementary domain prompts are designed to learn domain-shared and domain-specific knowledge, which are optimized by the inter-domain alignment and intra-domain orthogonal losses. To mitigate the depth space variations, we first introduce a pre-trained model to generate the domain-shared depth space. Then, we design an $S^2$-Adapter that quantizes depth space variations with scale-and-shift matrices and converts the domain-shared depth space to a domain-specific depth space. Our method achieves state-of-the-art performance under various scenarios such as different depth ranges, virtual and real, different weather conditions, and the few-shot incremental learning setting on 12 datasets. We will release the source codes and pre-trained models.
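As a rough illustration of the scale-and-shift idea, a per-domain adapter on shared depth features could look like the sketch below; this is an assumption for illustration, not the paper's $S^2$-Adapter (the class name, per-channel parameterization, and domain indexing are hypothetical):

import torch
import torch.nn as nn

class ScaleShiftAdapter(nn.Module):
    # Maps a domain-shared depth feature map (B, C, H, W) to a domain-specific
    # one using one learnable per-channel scale/shift pair per domain.
    def __init__(self, num_domains: int, channels: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_domains, channels))
        self.shift = nn.Parameter(torch.zeros(num_domains, channels))

    def forward(self, shared_feat: torch.Tensor, domain_id: int) -> torch.Tensor:
        s = self.scale[domain_id].view(1, -1, 1, 1)
        b = self.shift[domain_id].view(1, -1, 1, 1)
        return shared_feat * s + b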



Paperid:602 Poster
Authors:Lixing Tan,shuang Song,Kangneng Zhou,Chengbo Duan,Lanying Wang,Huayang Ren,Linlin Liu,Wei Zhang,Ruoxiu Xiao
Abstract:
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples anatomical structure information from CT scans and style information from unpaired real X-ray images / digital reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity between real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower-dimensional 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset, achieving 97.8350, 0.0842 and 3.0938 in terms of FID, KID and the defined user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism with respect to real X-ray images.



Paperid:603 Poster
Authors:Lihao Liu,Yanqi Cheng,Zhongying Deng,Shujun Wang,Dongdong Chen,Xiaowei Hu,Pietro Lio,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
Abstract:
Multi-object tracking in traffic videos is a crucial research area, offering immense potential for enhancing traffic monitoring accuracy and promoting road safety measures through the utilisation of advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes, which cannot well simulate the challenges encountered in complex traffic scenarios. To address this gap, we introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios. To validate the complexity and challenges presented by TrafficMOT, we conducted comprehensive empirical studies using three different settings: fully-supervised, semi-supervised, and a recent powerful zero-shot foundation model Tracking Anything Model (TAM). The experimental results highlight the inherent complexity of this dataset, emphasising its value in driving advancements in the field of traffic monitoring and multi-object tracking.



Paperid:604 Poster
Authors:Yue Zhang,Parisa Kordjamshidi
Abstract:
The existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves navigation performance significantly compared to direct training using low-level actions. However, the VLN-CE agents are still far from real robots since there are gaps between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within the low-level action movements. Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passability, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. Moreover, we enhance the current waypoint predictor by utilizing visual representations containing rich semantic information and explicitly masking obstacles based on humans' prior knowledge about the feasibility of actions. Empirically, our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.



Paperid:605 Poster
Authors:Mao Xueying,Xiaoxiao Hu,Wanli Peng,Zhenliang Gan,Zhenxing Qian,Xinpeng Zhang,Sheng Li
Abstract:
Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds secret messages within semantic features for steganography during the video editing process. Although existing traditional video steganography methods excel in balancing security and capacity, they lack adequate robustness against common distortions in online social networks (OSNs). In this paper, we propose an end-to-end robust generative video steganography network (RoGVSN), which achieves visual editing by modifying semantic features of videos to embed secret messages. We take the face-swapping scenario as an illustration to demonstrate the visual editing effects. Specifically, we devise an adaptive scheme to seamlessly embed secret messages into the semantic features of videos through fusion blocks. Extensive experiments demonstrate the superiority of our method in terms of robustness, extraction accuracy, visual quality, and capacity.



Paperid:606 Poster
Authors:Zhenyang Li,Fan Liu,Yinwei Wei,Zhiyong Cheng,Liqiang Nie,Mohan Kankanhalli
Abstract:
Recommendation algorithms forecast user preferences by correlating user and item representations derived from historical interaction patterns. In pursuit of enhanced performance, many methods focus on learning robust and independent representations by disentangling the intricate factors within interaction data across various modalities in an unsupervised manner. However, such an approach obfuscates the discernment of how specific factors (e.g., category or brand) influence the outcomes, making it challenging to regulate their effects. In response to this challenge, we introduce a novel method called Attribute-Driven Disentangled Representation Learning (short for AD-DRL), which explicitly incorporates attributes from different modalities into the disentangled representation learning process. By assigning a specific attribute to each factor in multimodal features, AD-DRL can disentangle the factors at both attribute and attribute-value levels. To obtain robust and independent representations for each factor associated with a specific attribute, we first disentangle the representations of features both within and across different modalities. Moreover, we further enhance the robustness of the representations by fusing the multimodal features of the same factor. Empirical evaluations conducted on three public real-world datasets substantiate the effectiveness of AD-DRL, as well as its interpretability and controllability.



Paperid:607 Poster
Authors:Jiaming Shen,Kun Hu,Wei Bao,Chang Wen Chen,Zhiyong Wang
Abstract:
The hand-drawn 2D animation workflow is typically initiated with the creation of sketch keyframes. Subsequent manual inbetweens are crafted for smoothness, which is a labor-intensive process, so the prospect of automatic animation sketch interpolation has become highly appealing. Yet, common frame interpolation methods are generally hindered by two key issues: 1) limited texture and colour details in sketches, and 2) exaggerated alterations between two sketch keyframes. To overcome these issues, we propose a novel deep learning method, the Sketch-Aware Interpolation Network (SAIN). This approach incorporates multi-level guidance that formulates region-level correspondence, stroke-level correspondence and pixel-level dynamics. A multi-stream U-Transformer is then devised to characterize sketch inbetweening patterns using these multi-level guides through the integration of self- and cross-attention mechanisms. Additionally, to facilitate future research on animation sketch inbetweening, we constructed a large-scale dataset, STD-12K, comprising 30 sketch animation series in diverse artistic styles. Comprehensive experiments on this dataset convincingly show that our proposed SAIN surpasses state-of-the-art interpolation methods. Our code and dataset will be publicly available.



Paperid:608 Poster
Authors:Zeyu Xiao,Zhihe Lu,Xinchao Wang
Abstract:
People nowadays use smartphones to capture photos from multimedia platforms. The presence of moiré patterns resulting from spectral aliasing can significantly degrade the visual quality of images, particularly in ultra-high-definition (UHD) images. However, existing demoiréing methods have mostly been designed for low-definition images, making them unsuitable for handling moiré patterns in UHD images due to their substantial memory requirements. In this paper, we propose a novel patch bilateral compensation network (P-BiC) for moiré pattern removal in UHD images, which is memory-efficient and prior-knowledge-based. Specifically, we divide the UHD images into small patches and perform patch-level demoiréing to maintain a low memory cost even for ultra-large image sizes. Moreover, a pivotal insight, namely that the green channel of an image remains relatively less affected by moiré patterns, while the tone information in moiré images is still well retained despite color shifts, is directly harnessed for the purpose of bilateral compensation. The bilateral compensation is achieved by two key components in our P-BiC, i.e., a green-guided detail transfer (G$^2$DT) module that complements distorted features with the intact content, and a style-aware tone adjustment (STA) module for color adjustment. We quantitatively and qualitatively evaluate the effectiveness of P-BiC with extensive experiments.



Paperid:609 Poster
Authors:Xingtao Wang,Xianqi Zhang,Wenxue Cui,Ruiqin Xiong,Xiaopeng Fan,Debin Zhao
Abstract:
Mesh denoising is a fundamental task in geometry processing, and recent studies have demonstrated the remarkable superiority of deep learning-based methods in this field. However, existing works commonly rely on neural networks without explicit designs for noise and geometry which are actually fundamental factors in mesh denoising. In this paper, by jointly considering noise intensity and geometric characteristics, a novel Filtering Coefficient Learner (FCL for short) for mesh denoising is developed, which delicately generates coefficients to filter face normals. Specifically, FCL produces filtering coefficients consisting of a noise-aware component and a geometry-aware component. The first component is inversely proportional to the noise intensity of each face, resulting in smaller coefficients for faces with stronger noise. For the effective assessment of the noise intensity, a noise intensity estimation module is designed, which predicts the angle between paired noisy-clean normals based on a mean filtering angle. The second component is derived based on two types of geometric features, namely the category feature and face-wise features. The category feature provides a global description of the input patch, while the face-wise features complement the perception of local textures. Extensive experiments have validated the superior performance of FCL over state-of-the-art works in both noise removal and feature preservation.



Paperid:610 Poster
Authors:Tingting Li,Ziming Zhao,Jianwei Yin
Abstract:
Quantum networks have the potential to transmit multimedia data with high security and efficiency. However, ensuring high-fidelity transmission links remains a significant challenge. This study proposes a novel framework to enhance quantum network performance via link selection and a transport strategy. Specifically, we formalize quantum fidelity estimation and link selection as a best-arm identification problem, leveraging median elimination to estimate fidelity and select the quantum link for each multimedia chunk transmission. To optimize the transmission of multimedia chunks in a quantum network, we employ a scheduling strategy to maximize the cumulative benefit of chunk transmissions while considering the fidelity of the links and the overall network utilization. Through extensive experiments, our proposal demonstrates significant advantages. Compared to the randomized method, Minerva reduces bounce number and execution time by 12% ∼ 28% and 8% ∼ 32%, respectively, while improving average fidelity by 15%. Compared with the uniformly distributed method, our approach decreases bounce number by 24% ∼ 30% and execution time by 8% ∼ 32% and enhances average fidelity by 11% ∼ 21%.
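Median elimination is a standard (epsilon, delta)-PAC best-arm identification routine (Even-Dar et al.). A self-contained sketch applied to link-fidelity estimation is given below; the sample_fidelity callback and the toy fidelities are hypothetical, and the chunk-scheduling step is omitted:

import math
import random

def median_elimination(sample_fidelity, links, epsilon=0.1, delta=0.05):
    # Returns a link whose true mean fidelity is within epsilon of the best,
    # with probability at least 1 - delta.
    survivors = list(links)
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(survivors) > 1:
        n_samples = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {link: sum(sample_fidelity(link) for _ in range(n_samples)) / n_samples
                 for link in survivors}
        # keep the arms at or above the median empirical fidelity
        survivors.sort(key=lambda link: means[link], reverse=True)
        survivors = survivors[:max(1, math.ceil(len(survivors) / 2))]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]

# toy usage with hypothetical link fidelities
true_fidelity = {"link_a": 0.92, "link_b": 0.85, "link_c": 0.88}
best = median_elimination(
    lambda l: min(1.0, max(0.0, random.gauss(true_fidelity[l], 0.05))),
    list(true_fidelity))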



Paperid:611 Poster
Authors:ZUYU ZHANG,YAN LI,Byeong-Seok Shin
Abstract:
Single domain generalization (SDG) aims to learn, from only one available source domain, a model that generalizes to unseen target domains. Existing SDG techniques rely on data or feature augmentation to generate distributions that complement the source domain. However, these approaches fail to address the challenge where gradient conflicts from synthesized domains impede the learning of domain-invariant representations. Inspired by the concept of mechanical equilibrium in physics, we propose a novel conflict-aware approach named domain gradient equilibrium for SDG. Unlike prior conflict-aware SDG methods that alleviate the gradient conflicts by setting them to zero or random values, the proposed domain gradient equilibrium method first decouples gradients into domain-invariant and domain-specific components. The domain-specific gradients are then adjusted and reweighted to achieve equilibrium, steering the model optimization toward a domain-invariant direction to enhance generalization capability. We conduct comprehensive experiments on four image recognition benchmarks, and our method achieves an accuracy improvement of 2.94% on the PACS dataset over existing state-of-the-art approaches, demonstrating the effectiveness of our proposed approach.



Paperid:612 Poster
Authors:qiuhui chen,Yi Hong
Abstract:
Multimodal medical data, such as brain scans and non-imaging clinical records like demographics and neuropsychology examinations, play an important role in diagnosing neurodegenerative disorders, e.g., Alzheimer's disease (AD) and Parkinson's disease (PD). However, the disease-relevant information is overwhelmed by the high-dimensional image scans and the massive non-imaging data, making it a challenging task to fuse multimodal medical inputs efficiently. Recent multimodal learning methods adopt deep encoders to extract features and simple concatenation or alignment techniques for feature fusion, which suffer from the representation degeneration issue due to the vast irrelevant information. To address this challenge, we propose a deep self-weighted multimodal relevance weighting approach, which leverages clustering-based contrastive learning and eliminates the intra- and inter-modal irrelevancy. The learned relevance score is integrated as a gate with a multimodal attention transformer to provide an improved fusion for the final diagnosis. Our proposed model, called SMART (Self-weighted Multimodal Attention-and-Relevance gated Transformer), is extensively evaluated on three public AD/PD datasets and achieves state-of-the-art (SOTA) performance in the diagnostics of neurodegenerative disorders. Our source code will be available.



Paperid:613 Poster
Authors:Luanyuan Dai,Xiaoyu Du,Jinhui Tang
Abstract:
Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from initial ones. Graph Neural Networks (GNNs) combined with Multilayer Perceptrons (MLPs) are regarded as a powerful means of handling sparse and unevenly distributed data. However, the expression capability of correspondence features obtained by MLPs is limited by their inherent lack of context information. In addition, previous works directly utilize the outputs of off-the-shelf GNNs, thus leading to confusion between sparse correspondence attribute features and their global structural information. To alleviate these issues, we propose a two-view correspondence pruning network, TrGa. Specifically, we first use complete Transformer structures instead of context-agnostic MLPs to capture correspondence features with global context information and stronger expression capability. After that, we introduce the Concatenation Graph Node and Global Structure (CGNS) block to separately capture the interaction patterns among sparse correspondence attribute features and the global structural information among them, which can prevent their confusion. Finally, the proposed Feature Dimension Transformation and Enhancement (FDTE) block is applied for dimension transformation and feature augmentation. Additionally, we propose an efficient variant C-TrGa, in which the similarity matrix of the proposed C-Transformer is computed along the channel dimension. Extensive experiments demonstrate that the proposed TrGa and C-TrGa outperform state-of-the-art methods in different computer vision tasks. The code is provided in the supplementary materials.



Paperid:614 Poster
Authors:Yuchen Wang,Xingyu Zhu,Guanhui Ye,Shiyao Zhang,Xuetao Wei
Abstract:
DNN-based watermarking methods are rapidly developing and delivering impressive performance. Recent advances achieve resolution-agnostic image watermarking by reducing the variant-resolution watermarking problem to a fixed-resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework, which watermarks the implicit neural representation (INR) of the image. Unlike previous methods, our method does not rely on the previous reduction process; it directly watermarks the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking the INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: bit accuracy is improved by 7%$\sim$29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours is robust against all watermarking attacks.
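The abstract leaves the INR architecture unspecified. A minimal coordinate-MLP sketch that fits one image and can then be sampled at any resolution is shown below; the positional encoding, layer sizes, and training schedule are illustrative assumptions, and the watermark fine-tuning and decoder are omitted:

import torch
import torch.nn as nn

class ImageINR(nn.Module):
    # Small coordinate MLP: (x, y) in [-1, 1]^2 -> RGB.
    def __init__(self, hidden=256, n_freqs=10):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 2 + 2 * 2 * n_freqs            # raw xy plus sin/cos Fourier features
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def encode(self, xy):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xy.device) * torch.pi
        ang = xy.unsqueeze(-1) * freqs          # (N, 2, n_freqs)
        return torch.cat([xy, torch.sin(ang).flatten(1), torch.cos(ang).flatten(1)], dim=1)

    def forward(self, xy):
        return self.net(self.encode(xy))

def fit_inr(image, steps=2000, lr=1e-3):
    # image: (H, W, 3) tensor in [0, 1]. Sampling the fitted INR on any
    # coordinate grid yields the image at an arbitrary resolution, which is
    # what makes the representation resolution-agnostic.
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    model = ImageINR()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model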



Paperid:615 Poster
Authors:Yao Li,Jiajun Deng,Yuxuan Xiao,Yingjie Wang,Xiaomeng Chu,Jianmin Ji,Yanyong Zhang
Abstract:
Fusing the data of millimeter-wave Radar sensors and high-definition cameras has emerged as a viable approach to achieving precise 3D object detection for roadside traffic surveillance. For roadside perception systems, earlier studies have pointed out that it is better to perform the fusion on the 2D image plane than on the BEV plane (which is popular for on-car perception systems), especially when the perception range is large (e.g., > 150 m). Image-plane fusion requires critical transformations, like perspective projection from the Radar’s BEV to the camera’s 2D plane and reverse IPM. However, real-world issues like uneven terrain and sensor movement degrade these transformations’ precision, impacting fusion effectiveness. To alleviate these issues, we propose a geometry-based Radar-camera fusion method on the ground, namely FARFusion V2. Specifically, we extend the ground-plane assumption in FARFusion [20] to support arbitrary shapes by formulating the ground height as an implicit representation based on geometric transformations. By incorporating the ground information, we can enhance Radar data with target height measurements. Consequently, we can project the enhanced Radar data onto the 2D plane to obtain more accurate depth information, thereby assisting the IPM process. A real-time module for estimating the parameterized transformation is further introduced to refine the view transformation process. Moreover, considering various measurement noises across these two sensors, we introduce an uncertainty-based depth fusion strategy into the 2D fusion process to maximize the probability of obtaining the optimal depth value. Extensive experiments are conducted on our collected roadside OWL benchmark, demonstrating the excellent localization capacity of FARFusion V2 in far-range scenarios. Our method achieves an average location accuracy of 0.771 m when we extend the detection range up to 500 m.
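As a concrete reference for the view-transformation step, back-projecting a pixel onto a locally known ground height under a pinhole model can be written as below. This is a simplified geometric sketch with hypothetical names; the paper's learned implicit ground representation would supply ground_height per location rather than a constant:

import numpy as np

def pixel_to_ground(u, v, K, R, t, ground_height=0.0):
    # K: (3, 3) camera intrinsics.
    # R, t: world-to-camera extrinsics, i.e. x_cam = R @ x_world + t, with t of shape (3,).
    # Intersects the viewing ray of pixel (u, v) with the plane z = ground_height.
    cam_center = -R.T @ t                        # camera center in world coordinates
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = R.T @ d_cam                        # viewing ray in world coordinates
    s = (ground_height - cam_center[2]) / d_world[2]
    return cam_center + s * d_world              # 3D point on the ground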



Paperid:616 Poster
Authors:Chao Wang,Yang Zhou,Liangtian He,Lin Fenglai,Hongming Chen,Liang-Jian Deng
Abstract:
In this paper, we propose a simple but effective illumination distribution prior (IDP) for images to illuminate the darkness. The illumination distribution prior is the product of a statistical approach to low-light images. It is based on a key observation: the mean value and standard deviation of images are positively correlated with the illumination. Using IDP in combination with the dual-domain feature fusion network (DFFN), we can obtain images that are more consistent with the ground-truth distribution. DFFN inserts the discrete wavelet transform (DWT) into the transformer architecture, aiming to recover the detailed texture of the image through local high-frequency information and global spatial information. We have conducted extensive experiments on five widely used low-light image enhancement datasets, and the experimental results show the superior performance of our proposed network (IDP-Net) compared to other state-of-the-art methods.
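As a sketch of the statistic behind the prior, the per-image mean and standard deviation can be computed as follows; how IDP-Net consumes these values downstream is not specified in the abstract, so the luminance weighting and the conditioning step are assumptions:

import torch

def illumination_statistics(img: torch.Tensor):
    # img: (B, 3, H, W) RGB tensor in [0, 1].
    # Both statistics rise with illumination (the abstract's key observation),
    # so they can serve as a compact illumination descriptor for the network.
    luminance = 0.299 * img[:, 0] + 0.587 * img[:, 1] + 0.114 * img[:, 2]   # (B, H, W)
    mean = luminance.flatten(1).mean(dim=1)
    std = luminance.flatten(1).std(dim=1)
    return mean, std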



Paperid:617 Poster
Authors:Hengxing Liu,Mingjia Li,Xiaojie Guo
Abstract:
Shadow, as a natural consequence of light interacting with objects, plays a crucial role in shaping the aesthetics of an image, which however also impairs the content visibility and overall visual quality. Recent shadow removal approaches employ the mechanism of attention, due to its effectiveness, as a key component. However, they often suffer from two issues including large model size and high computational complexity for practical use. To address these shortcomings, this work devises a lightweight yet accurate shadow removal framework. First, we analyze the characteristics of the shadow removal task to seek the key information required for reconstructing shadow regions and designing a novel regional attention mechanism to effectively capture such information. Then, we customize a Regional Attention Shadow Removal Model (RASM, in short), which leverages non-shadow areas to assist in restoring shadow ones. Unlike existing attention-based models, our regional attention strategy allows each shadow region to interact more rationally with its surrounding non-shadow areas, for seeking the regional contextual correlation between shadow and non-shadow areas. Extensive experiments are conducted to demonstrate that our proposed method delivers superior performance over other state-of-the-art models in terms of accuracy and efficiency, making it appealing for practical applications. Our code will be made publicly available.



Paperid:618 Poster
Authors:Yijia Wang,Qianqian Xu,Yangbangyan Jiang,Siran Dai,Qingming Huang
Abstract:
In recent years, multi-view outlier detection (MVOD) methods have advanced significantly, aiming to identify outliers within multi-view datasets. A key point is to better detect class outliers and class-attribute outliers, which only exist in multi-view data. However, existing methods either are unable to reduce the impact of outliers when learning view-consistent information or struggle in cases with varying neighborhood structures. Moreover, most of them do not apply to partial multi-view data in real-world scenarios. To overcome these drawbacks, we propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD). In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency. Specifically, we propose (1) an outlier-aware contrastive loss with a potential-outlier memory bank to eliminate their bias, motivated by a theoretical analysis; (2) a neighbor alignment contrastive loss to capture the view-shared local structural correlation; and (3) a spreading regularization loss to prevent the model from overfitting to outliers. With the Cross-view Relation Transfer technique, we can easily impute the missing view samples based on the features of neighbors. Experimental results on four benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art competitors under different settings.



Paperid:619 Poster
Authors:Siying Xiao,Mao Ye,Qichen He,Shuaifeng Li,Song Tang,Xiatian Zhu
Abstract:
Black-box domain adaptation treats the source domain model as a black box. During the transfer process, the only available information about the target domain is the noisy labels output by the black-box model. This poses significant challenges for domain adaptation. Conventional approaches typically tackle the black-box noisy label problem from two aspects, self-knowledge distillation and pseudo-label denoising, both achieving limited performance due to the limited knowledge available. To mitigate this issue, we explore the potential of off-the-shelf vision-language (ViL) multimodal models with rich semantic information for black-box domain adaptation by introducing an Adversarial Experts Model (AEM). Specifically, our target domain model is designed as one feature extractor and two classifiers, trained over two stages: In the knowledge transferring stage, with a shared feature extractor, the black-box source model and the ViL model act as two distinct experts for joint knowledge contribution, guiding the learning of one classifier each. While contributing their respective knowledge, the experts themselves are also updated, owing to their own limitations and biases. In the adversarial alignment stage, to further distill expert knowledge to the target domain model, adversarial learning is conducted between the feature extractor and the two classifiers. A new consistency-max loss function is proposed to measure the consistency between the two classifiers and further improve the certainty of classifier predictions. Extensive experiments on multiple datasets demonstrate the effectiveness of our approach. Our source code will be released.



Paperid:620 Poster
Authors:Wenqi Ren,Ruihao Xia,Meng Zheng,Ziyan Wu,Yang Tang,Nicu Sebe
Abstract:
This paper addresses the issue of cross-class domain adaptation (CCDA) in semantic segmentation, where the target domain contains both shared and novel classes that are either unlabeled or unseen in the source domain. This problem is challenging, as the absence of labels for novel classes hampers the accurate segmentation of both shared and novel classes. Since Visual Language Models (VLMs) are capable of generating zero-shot predictions without requiring task-specific training examples, we propose a label alignment method by leveraging VLMs to relabel pseudo labels for novel classes. Considering that VLMs typically provide only image-level predictions, we embed a two-stage method to enable fine-grained semantic segmentation and design a threshold based on the uncertainty of pseudo labels to exclude noisy VLM predictions. To further augment the supervision of novel classes, we devise memory banks with an adaptive update scheme to effectively manage accurate VLM predictions, which are then resampled to increase the sampling probability of novel classes. Through comprehensive experiments, we demonstrate the effectiveness and versatility of our proposed method across various CCDA scenarios.
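The uncertainty-thresholded relabeling step could be sketched roughly as follows. The threshold value, the form of the VLM confidence, and the function name are assumptions for illustration only and do not reproduce the paper's exact procedure.

```python
import torch

def relabel_with_vlm(pseudo_labels, pseudo_probs, vlm_labels, vlm_conf,
                     novel_classes, tau: float = 0.8):
    """Toy label-alignment step: replace uncertain pseudo labels with confident
    VLM suggestions for novel classes, and drop VLM predictions that are
    themselves uncertain.

    pseudo_labels: (H, W) long, pseudo labels from the segmentation model
    pseudo_probs:  (H, W) float, max softmax probability of those pseudo labels
    vlm_labels:    (H, W) long, labels suggested by the vision-language model
    vlm_conf:      (H, W) float, confidence of the VLM suggestions
    novel_classes: iterable of class ids unlabeled in the source domain
    tau:           confidence threshold (a hypothetical value)
    """
    is_novel = torch.zeros_like(pseudo_labels, dtype=torch.bool)
    for c in novel_classes:
        is_novel |= (vlm_labels == c)
    # Accept a VLM relabel only where the VLM is confident and the current
    # pseudo label is uncertain.
    accept = is_novel & (vlm_conf > tau) & (pseudo_probs < tau)
    return torch.where(accept, vlm_labels, pseudo_labels)
```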



Paperid:621 Poster
Authors:Ze Yuan,Jinyang Guo,Dakai An,Junran Wu,He Zhu,Jianhao Li,Xueyuan Chen,Ke Xu,Jiaheng Liu
Abstract:
Recently, indoor 3D object detection has shown impressive progress. However, these improvements have come at the cost of increased memory consumption and longer inference times, making it difficult to apply these methods in practical scenarios. To address this issue, knowledge distillation has emerged as a promising technique for model acceleration. In this paper, we propose the VRDistill framework, the first knowledge distillation framework designed for efficient indoor 3D object detection. Our VRDistill framework includes a refinement module and a soft foreground mask operation to enhance the quality of the distillation. The refinement module utilizes trainable layers to improve the quality of the teacher's votes, while the soft foreground mask operation focuses on foreground votes, further enhancing the distillation performance. Comprehensive experiments on the ScanNet and SUN-RGBD datasets demonstrate the effectiveness and generalization ability of our VRDistill framework.
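A hedged sketch of a soft-foreground-masked distillation term is shown below, assuming the teacher provides per-vote objectness scores that reweight a simple L2 vote-matching loss; the refinement module and the exact loss used in VRDistill are not reproduced here, and the names are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_foreground_distill_loss(student_votes, teacher_votes, objectness):
    """Toy distillation loss with a soft foreground mask.

    student_votes, teacher_votes: (B, N, 3) predicted vote offsets / centers
    objectness:                   (B, N) teacher foreground scores in [0, 1]
    The per-point L2 error is reweighted so that foreground votes dominate.
    """
    per_point = F.mse_loss(student_votes, teacher_votes, reduction="none").sum(-1)  # (B, N)
    weights = objectness / (objectness.sum(dim=1, keepdim=True) + 1e-6)
    return (weights * per_point).sum(dim=1).mean()
```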



Paperid:622 Poster
Authors:Xiangyang Luo,Xin Zhang,Yifan Xie,Xinyi Tong,Weijiang Yu,Heng Chang,Fei Ma,Fei Richard Yu
Abstract:
Face swapping, the technique of transferring the identity from one face to another, emerges as a field with significant practical applications. However, previous swapping methods often result in visible artifacts. To address this issue, in our paper, we propose CodeSwap, a symmetrical framework to achieve face swapping with high fidelity and realism. Specifically, our method first utilizes a codebook that captures the knowledge of high-quality facial features. Building on this foundation, face swapping is then converted into a code manipulation task in a code space. To achieve this, we design a Transformer-based architecture to update each code independently, which enables more precise manipulations. Furthermore, we incorporate a mask generator to achieve seamless blending of the generated face with the background of the target image. A distinctive characteristic of our method is its symmetrical approach to processing both target and source images, simultaneously extracting information from each to improve the quality of face swapping. This symmetry also simplifies the bidirectional exchange of faces in a single operation. Through extensive experiments on CelebA-HQ and FF++, our method is proven to not only achieve efficient identity transfer but also substantially reduce visible artifacts.



Paperid:623 Poster
Authors:Zhenni Yu,Xiaoqin Zhang,LiZhao,Yi Bin,Guobao Xiao
Abstract:
This paper introduces a new Segment Anything Model with Depth Perception (DSAM) for Camouflaged Object Detection (COD). DSAM exploits the zero-shot capability of SAM to realize precise segmentation in the RGB-D domain. It consists of the Prompt-Deeper Module and the Finer Module. The Prompt-Deeper Module utilizes knowledge distillation and the Bias Correction Module to achieve the interaction between RGB features and depth features, especially using depth features to correct erroneous parts in RGB features. Then, the interacted features are combined with the box prompt in SAM to create a prompt with depth perception. The Finer Module explores the possibility of accurately segmenting highly camouflaged targets from a depth perspective. It uncovers depth cues in areas missed by SAM through mask reversion, self-filtering, and self-attention operations, compensating for its defects in the COD domain. DSAM represents the first step towards the SAM-based RGB-D COD model. It maximizes the utilization of depth features while synergizing with RGB features to achieve multimodal complementarity, thereby overcoming the segmentation limitations of SAM and improving its accuracy in COD. Experimental results on COD benchmarks demonstrate that DSAM achieves excellent segmentation performance and reaches the state-of-the-art (SOTA) on COD benchmarks with less consumption of training resources.



Paperid:624 Poster
Authors:Junwei He,Qianqian Xu,Yangbangyan Jiang,Zitai Wang,Yuchen Sun,Qingming Huang
Abstract:
With the progressive advancements in deep graph learning, out-of-distribution (OOD) detection for graph data has emerged as a critical challenge. While the efficacy of auxiliary datasets in enhancing OOD detection has been extensively studied for image and text data, such approaches have not yet been explored for graph data. Unlike Euclidean data, graph data exhibits greater diversity but lower robustness to perturbations, complicating the integration of outliers. To tackle these challenges, we propose the introduction of Hybrid External and Internal Graph Outlier Exposure (HGOE) to improve graph OOD detection performance. Our framework involves using realistic external graph data from various domains and synthesizing internal outliers within ID subgroups to address the poor robustness and presence of OOD samples within the ID class. Furthermore, we develop a boundary-aware OE loss that adaptively assigns weights to outliers, maximizing the use of high-quality OOD samples while minimizing the impact of low-quality ones. Our proposed HGOE framework is model-agnostic and designed to enhance the effectiveness of existing graph OOD detection models. Experimental results demonstrate that our HGOE framework can significantly improve the performance of existing OOD detection models across all 8 real datasets.



Paperid:625 Poster
Authors:Zan Chen,Xiao Yu,Yuanjing Feng
Abstract:
Accurate segmentation of cerebrovascular structures from TOF-MRA is vital for treating cerebrovascular diseases. However, existing methods rely on voxel categorization, leading to discontinuities in fine vessel locations. We propose a connectivity-based cerebrovascular segmentation method that considers inter-voxel relationships to overcome this limitation. By modeling connectivity, we transform voxel classification into predicting inter-voxel connectivity. Given cerebrovascular structures' sparse and widely distributed nature, we employ sparse 3D Bi-level routing attention to reduce computational overhead while effectively capturing cerebrovascular features. To enhance directional information extraction, we utilize the 3D-direction excitation block. Additionally, the 3D-direction interactive block continuously augments direction information in the feature map and sends it to the skip connection. We compare our method with current state-of-the-art cerebrovascular segmentation techniques and classical medical image segmentation methods using clinical and open cerebrovascular datasets. Our method demonstrates superior performance, outperforming existing approaches. Ablation experiments further validate the effectiveness of our proposed method.
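To make the connectivity reformulation concrete, the sketch below converts a binary vessel mask into 6-neighborhood connectivity targets. The paper's actual connectivity modeling (neighborhood size, border handling) is not specified in the abstract, so this is only an assumed variant with hypothetical names.

```python
import torch

def connectivity_targets(mask: torch.Tensor) -> torch.Tensor:
    """Toy conversion of a vessel mask (D, H, W), values in {0, 1}, into
    6-neighbourhood connectivity targets: channel k is 1 iff the voxel and its
    k-th neighbour are both vessel, so a network can predict inter-voxel
    connectivity instead of per-voxel class labels.

    Note: torch.roll wraps around volume borders; a real pipeline would
    zero-pad the borders instead.
    """
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    targets = []
    for dz, dy, dx in offsets:
        shifted = torch.roll(mask, shifts=(dz, dy, dx), dims=(0, 1, 2))
        targets.append(mask * shifted)          # connected iff both voxels are vessel
    return torch.stack(targets, dim=0)          # (6, D, H, W)
```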



Paperid:626 Poster
Authors:Xiangyu Sun,Joo Chan Lee,Daniel Rho,Jong Hwan Ko,Usman Ali,Eunbyung Park
Abstract:
The neural radiance field (NeRF) has made significant strides in representing 3D scenes and synthesizing novel views. Despite its advancements, the high computational costs of NeRF have posed challenges for its deployment in resource-constrained environments and real-time applications. As an alternative to NeRF-like neural rendering methods, 3D Gaussian Splatting (3DGS) offers rapid rendering speeds while maintaining excellent image quality. However, as it represents objects and scenes using a myriad of Gaussians, it requires substantial storage to achieve high-quality representation. To mitigate the storage overhead, we propose Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that drastically reduces storage requirements while preserving image quality. Inspired by classical matrix and tensor factorization techniques, our method represents and approximates dense clusters of Gaussians with significantly fewer Gaussians through efficient factorization. We aim to efficiently represent dense 3D Gaussians by approximating them with a limited amount of information for each axis and their combinations. This method allows us to encode a substantially large number of Gaussians along with their essential attributes---such as color, scale, and rotation---necessary for rendering using a relatively small number of elements. Extensive experimental results demonstrate that F-3DGS achieves a significant reduction in storage costs while maintaining comparable quality in rendered images.
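A toy sketch of the factorization idea follows, assuming (for illustration only) that Gaussian centers are formed as the Cartesian product of small per-axis coordinate vectors; F-3DGS's actual factorization of positions and attributes is more elaborate than this.

```python
import torch

def factorized_centers(x_coords, y_coords, z_coords):
    """Toy factorized representation of Gaussian centers.

    Instead of storing Nx*Ny*Nz independent 3-D means, store one small
    coordinate vector per axis and form the dense set as their Cartesian
    product. Storage drops from 3*Nx*Ny*Nz floats to Nx + Ny + Nz floats.
    """
    return torch.cartesian_prod(x_coords, y_coords, z_coords)   # (Nx*Ny*Nz, 3)

if __name__ == "__main__":
    x = torch.linspace(-1, 1, 32)
    y = torch.linspace(-1, 1, 32)
    z = torch.linspace(-1, 1, 16)
    centers = factorized_centers(x, y, z)
    print(centers.shape)          # torch.Size([16384, 3]) from only 80 stored values
```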



Paperid:627 Poster
Authors:Lehao Lin,Hong KANG,Xinyao Sun,Wei Cai
Abstract:
Non-Fungible Tokens (NFTs) have emerged as a pivotal digital asset, offering authenticated ownership of unique digital content. Although NFTs have gained remarkable traction, they face pressing storage and verification challenges stemming from the permanent data costs of blockchains. Existing off-chain or centralized storage solutions, while being alternatives, also introduce notable security vulnerabilities. We present SemNFT, an innovative decentralized framework integrated with blockchain oracle middleware services, addressing these persistent NFT dilemmas. Our approach compresses NFT source data into compact embeddings encapsulating its semantic essence. These arrays are stored on-chain, while facilitating reliable decentralized image reconstruction and ownership verification. We implemented ERC721-compliant smart contracts with supplementary functionalities, demonstrating SemNFT's seamless integrative capabilities within the ecosystem. Extensive evaluations show marked storage optimizations and preservation of the requisite visual fidelity in comparison with existing solutions. The proposed SemNFT framework marks a significant advancement in holistically confronting rising NFT storage and verification challenges without compromising decentralization. It substantively propels the meaningful evolution of NFT infrastructure to achieve digital asset immortality.



Paperid:628 Poster
Authors:ZeHao Qi,Ruixu Zhang,Xinyi Hu,Wenxuan Liu,Zheng Wang
Abstract:
Our paper introduces a novel video dataset specifically designed for Temporal Intention Localization (TIL), aimed at identifying hidden abnormal intention in densely populated and dynamically complex environments. Traditional Temporal Action Localization (TAL) frameworks, focusing on overt actions within constrained temporal intervals, often miss the subtleties of pre-abnormal actions that unfold over extended periods. Our dataset comprises 228 videos with 5790 clips, each annotated to capture fine-grained actions within ambiguous temporal boundaries using a Joint-Linear-Assignment methodology. This comprehensive approach enables detailed analysis of the evolution of abnormal intention over time. To address the detection of subtle, hidden intention, we developed the Intention-Action Fusion module, a creative approach that integrates dynamic feature fusion across 11 behavioral subcategories, significantly enhancing the model's ability to discern nuanced intention. This enhancement has led to performance improvements of up to 139% in specific scenarios, dramatically boosting the model's sensitivity and interpretability, which is crucial for advancing the capabilities of proactive surveillance systems. By pushing the boundaries of current technology, our dataset and methodologies foster the development of proactive surveillance systems capable of preemptively identifying potential threats from nuanced behavioral patterns, encouraging further exploration into the complexities of intention beyond observable actions.



Paperid:629 Poster
Authors:KE LIANG,Lingyuan Meng,Yue Liu,Meng Liu,Wei Wei,Siwei Wang,Suyuan Liu,Wenxuan Tu,sihang zhou,Xinwang Liu
Abstract:
Multi-modal knowledge graphs (MKGs) organize various information from different modalities in an intuitive way and are utilized in different downstream tasks, like recommendation. However, most MKGs are still far from complete, which motivates the flourishing of MKG reasoning models. Recently, with the development of general artificial intelligence, pre-trained transformers have drawn increasing attention, especially in multi-modal scenarios. However, the research of multi-modal pre-trained transformers (MPT) for knowledge graph reasoning (KGR) is still at an early stage. As the biggest difference between MKGs and other multi-modal data, the rich structural information underlying the MKG is still not fully utilized in previous MPTs. Most of them only use the graph structure as a retrieval map for matching images and texts connected with the same entity, which hinders their reasoning performance. To this end, the graph Structure Guided Multi-modal Pre-trained Transformer (SGMPT) is proposed for knowledge graph reasoning. Specifically, a graph structure encoder is adopted for structural feature encoding. Then, a structure-guided fusion module with two simple yet effective strategies, i.e., weighted summation and alignment constraint, is designed to inject the structural information into both the textual and visual features. To the best of our knowledge, SGMPT is the first MPT for multi-modal KGR that mines structural information underlying MKGs. Extensive experiments on FB15k-237-IMG and WN18-IMG demonstrate that our SGMPT outperforms existing state-of-the-art models and prove the effectiveness of the designed strategies.
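The weighted-summation fusion strategy might look roughly like the sketch below, where a learnable gate mixes a projected structural embedding into a token embedding; the alignment-constraint branch and SGMPT's exact gating are omitted, and the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Toy structure-guided fusion by weighted summation.

    A learnable scalar gate mixes the graph-structure embedding into the
    textual (or visual) token embedding of the same entity.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)            # map structural features into the token space
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, token_feat: torch.Tensor, struct_feat: torch.Tensor) -> torch.Tensor:
        # token_feat, struct_feat: (B, dim)
        gate = torch.sigmoid(self.alpha)
        return gate * token_feat + (1.0 - gate) * self.proj(struct_feat)

if __name__ == "__main__":
    fuse = WeightedSumFusion(dim=64)
    out = fuse(torch.randn(4, 64), torch.randn(4, 64))
    print(out.shape)
```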



Paperid:630 Poster
Authors:Shuyuan Wen,Bingrui Hu,wenchaoli
Abstract:
Unsupervised domain adaptation (UDA) aims to adapt a model trained on the source domain (e.g., synthetic data) to the target domain (e.g., real-world data) without requiring further annotations on the target domain. Most previous UDA methods for semantic segmentation focus on minimizing the domain discrepancy at various levels, e.g., pixels and features, to extract domain-invariant knowledge. However, primary domain knowledge, such as context and detail correlation, remains underexplored. To address this problem, we propose a context- and detail-enhanced unsupervised learning framework, called CDEA, for domain adaptive semantic segmentation that facilitates image detail correlation and contextual semantic consistency. Firstly, we propose an adaptive masked image consistency module to enhance UDA by learning spatial context relations of the target domain, which enforces consistency between predictions and masked target images. Secondly, we propose a detail extraction module to enhance UDA by integrating the learning of spatial information into low-level layers, which fuses the low-level detail features with deep semantic features. Extensive experiments verify the effectiveness of the proposed method and demonstrate the superiority of our approach over state-of-the-art methods.



Paperid:631 Poster
Authors:Yujia Wang,Zhongxu Wang,Hua Huang
Abstract:
Sound Effect (SFX) generation primarily aims to automatically produce sound waves for sounding visual objects in images or videos. Rather than learning an automatic solution to this task, we aim to propose a much broader system, AutoSFX, that is widely applicable and less time-consuming, i.e., automating sound design for videos. Our key insight is that ensuring consistency between auditory and visual information, performing seamless transitions between sound clips, and harmoniously mixing sounds playing simultaneously are crucial for creating a unified audiovisual experience. AutoSFX capitalizes on this concept by aggregating multimodal representations via cross-attention and leverages a diffusion model to generate sound with visual information embedded. AutoSFX also optimizes the generated sounds to render the entire soundtrack for the input video, leading to a more immersive and engaging multimedia experience. We have developed a user-friendly interface for AutoSFX enabling users to interactively engage in SFX generation for their videos with particular needs. To validate the capability of our vision-to-sound generation, we conducted comprehensive experiments and analyses using the widely recognized VEGAS and VGGSound test sets, yielding promising results. We also conducted a user study to evaluate the performance of the optimized soundtrack and the usability of the interface. Overall, the results revealed that our AutoSFX provides a viable sound landscape solution for making attractive videos.



Paperid:632 Poster
Authors:Lingyu Xiong,Xize Cheng,Jintao Tan,Xianjia Wu,Xiandong Li,Lei Zhu,Fei Ma,Minglei Li,Huang Xu,Zhihui Hu
Abstract:
Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of the image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. Then we disentangle semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize the video frames. In this way, most of the textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g., hair, lip, eyebrows), our approach enables facial editing seamlessly when generating talking face videos. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative results on the HDTF dataset illustrate the superior performance of our method over existing methods on most metrics.



Paperid:633 Poster
Authors:Lei Liu,Li Liu,Yawen Cui
Abstract:
Even in the era of large models, one of the well-known issues in continual learning (CL) is catastrophic forgetting, which is significantly challenging when the continual data stream exhibits a long-tailed distribution, termed as Long-Tailed Continual Learning (LTCL). Existing LTCL solutions generally require the label distribution of the data stream to achieve re-balance training. However, obtaining such prior information is often infeasible in real scenarios since the model should learn without pre-identifying the majority and minority classes. To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from long-tailed data stream with less forgetting. Concretely, motivated by our experimental finding that the minority classes are more likely to be forgotten due to the higher uncertainty, we newly design an uncertainty-guided reservoir sampling strategy to prioritize rehearsing minority data without using any prior information, which is based on the mutual dependence between the model and samples. Additionally, we incorporate two prior-free components to further reduce the forgetting issue: (1) Boundary constraint is to preserve uncertain boundary supporting samples for continually re-estimating task boundaries. (2) Prototype constraint is to maintain the consistency of learned class prototypes along with training. Our approach is evaluated on three standard long-tailed benchmarks, demonstrating superior performance to existing CL methods and previous SOTA LTCL approach in both task- and class-incremental learning settings, as well as ordered- and shuffled-LTCL settings.
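Uncertainty-guided reservoir sampling can be illustrated with a standard weighted-reservoir sketch (A-Res style), assuming the per-sample uncertainty acts as the sampling weight; PBR's mutual-dependence-based uncertainty measure is not reproduced here, and the names are illustrative.

```python
import heapq
import random

def uncertainty_reservoir(stream, capacity):
    """Toy uncertainty-weighted reservoir sampling (A-Res style).

    `stream` yields (sample, uncertainty) pairs; samples with higher uncertainty
    are kept with higher probability, without ever knowing the label distribution.
    """
    heap = []                                        # min-heap of (key, idx, sample)
    for idx, (sample, uncertainty) in enumerate(stream):
        w = max(float(uncertainty), 1e-8)
        key = random.random() ** (1.0 / w)           # larger weight -> key closer to 1
        if len(heap) < capacity:
            heapq.heappush(heap, (key, idx, sample))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx, sample))
    return [s for _, _, s in heap]

if __name__ == "__main__":
    # Every 10th sample is "minority-like" and assigned high uncertainty.
    data = [(("sample", i), 0.9 if i % 10 == 0 else 0.1) for i in range(1000)]
    kept = uncertainty_reservoir(iter(data), capacity=50)
    minority = sum(1 for (_, i) in kept if i % 10 == 0)
    print(f"{minority} of 50 kept samples are high-uncertainty")
```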



Paperid:634 Poster
Authors:Xiaowen Cai,Yunbo Tao,Daizong Liu,Pan Zhou,Xiaoye Qu,Jianfeng Dong,Keke Tang,Lichao Sun
Abstract:
With the development of depth sensors and 3D vision, the vulnerability of 3D point cloud models has garnered heightened concern. Almost all existing 3D attackers are deployed in the white-box setting, where they access the model details and directly optimize coordinate-wise noises to perturb 3D objects. However, realistic 3D applications would not share any model information (model parameters, gradients, etc.) with users. Although a few recent works try to explore the black-box attack, they still achieve limited attack success rates (ASR) and fail to generate high-quality adversarial samples. In this paper, we focus on designing a transfer-based black-box attack method, called Transferable Frequency-aware 3D GAN, to delve into achieving a high black-box ASR by improving the adversarial transferability while making the adversarial samples more imperceptible. Considering that the 3D imperceptibility depends on whether the shape of the object is distorted, we utilize the spectral tool with the GAN design to explicitly perceive and preserve the 3D geometric structures. Specifically, we design the Graph Fourier Transform (GFT) encoding layer in the GAN generator to extract the geometries as guidance, and develop a corresponding Inverse-GFT decoding layer to decode latent features with this guidance to reconstruct high-quality adversarial samples. To further improve the transferability, we develop a dual learning scheme of discriminator from both frequency and feature perspectives to constrain the generator via adversarial learning. Finally, imperceptible and transferable perturbations are rapidly generated by our proposed attack. Experimental results demonstrate that our attack method achieves the highest transfer ASR while exhibiting stronger imperceptibility.
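A minimal NumPy sketch of a Graph Fourier Transform over a point cloud is given below, assuming a k-NN graph and the combinatorial Laplacian; the GFT encoding and inverse-GFT decoding layers in the proposed GAN are learned modules built around this kind of spectral basis rather than this exact routine.

```python
import numpy as np

def graph_fourier_transform(points: np.ndarray, k: int = 8):
    """Toy Graph Fourier Transform (GFT) of an (N, 3) point cloud.

    Builds a k-NN graph, forms the combinatorial Laplacian L = D - W, and
    projects the coordinates onto its eigenvectors. Low-frequency spectral
    coefficients capture the coarse geometric shape.
    """
    n = points.shape[0]
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]                        # k nearest neighbours (skip self)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i]] = np.exp(-d2[i, idx[i]])
    W = np.maximum(W, W.T)                                          # symmetrize
    L = np.diag(W.sum(1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)                            # graph frequencies / basis
    spectrum = eigvecs.T @ points                                   # GFT coefficients, (N, 3)
    return eigvals, spectrum

if __name__ == "__main__":
    pts = np.random.rand(256, 3)
    freqs, coeffs = graph_fourier_transform(pts)
    print(freqs[:5], coeffs.shape)
```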



Paperid:635 Poster
Authors:Zikai Song,Ying Tang,Run Luo,Lintao Ma,Junqing Yu,Yi-Ping Phoebe Chen,Wei Yang
Abstract:
Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. We recognize that videos typically involve a limited number of objects with specific semantics, allowing us to automatically learn language embeddings. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence in long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves point tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely-used point tracking benchmarks demonstrate the superior performance of our method, showcasing notable enhancements compared to trackers relying solely on visual cues.



Paperid:636 Poster
Authors:Xiaomeng Chu,Jiajun Deng,Guoliang You,Yifan Duan,Yao Li,Yanyong Zhang
Abstract:
The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird’s eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird’s eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.4% NDS. Our codes will be made available.
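The ray-aligned query initialization could be sketched as below, assuming queries are simply reference points placed uniformly along each camera ray on the BEV plane; the sector segmentation, feature sampling, and the extra 2D-box-guided queries of RayFormer are omitted, and the names are hypothetical.

```python
import numpy as np

def ray_queries(cam_center: np.ndarray, yaw_angles, num_per_ray: int = 8,
                max_range: float = 50.0) -> np.ndarray:
    """Toy camera-ray query initialization on the BEV plane.

    For each camera ray (a yaw angle from the camera center), place
    `num_per_ray` query reference points uniformly along the ray, so queries
    on different rays project onto different image columns.
    """
    radii = np.linspace(max_range / num_per_ray, max_range, num_per_ray)   # skip r = 0
    queries = []
    for yaw in yaw_angles:
        direction = np.array([np.cos(yaw), np.sin(yaw)])
        queries.append(cam_center[None, :] + radii[:, None] * direction[None, :])
    return np.concatenate(queries, axis=0)          # (num_rays * num_per_ray, 2)

if __name__ == "__main__":
    rays = np.deg2rad(np.arange(-60, 61, 5))        # a 120-degree field of view
    q = ray_queries(np.zeros(2), rays)
    print(q.shape)
```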



Paperid:637 Poster
Authors:Jia-Li Yin,Menghao chen,jin Han,Bo-Hao Chen,Ximeng Liu
Abstract:
Adversarial examples (AEs), which are maliciously hand-crafted by adding perturbations to benign images, reveal the vulnerability of deep neural networks (DNNs) and have been used as a benchmark for evaluating model robustness. While great efforts have been devoted to generating AEs with stronger attack ability, the visual quality of AEs has generally been neglected in previous studies. The lack of a good quality measure of AEs makes it very hard to compare the relative merits of attack techniques and is hindering technological advancement. How to evaluate the visual quality of AEs remains an understudied and unsolved problem. In this work, we make the first attempt to fill the gap by presenting an image quality assessment method specifically designed for AEs. Towards this goal, we first construct a new database, called AdvDB, developed on diverse adversarial examples with elaborated annotations. We also propose a detection-based structural similarity index (AdvDSS) for adversarial example perceptual quality assessment. Specifically, the visual saliency for capturing the near-threshold adversarial distortions is first detected via human visual system (HVS) techniques and then the structural similarity is extracted to predict the quality score. Moreover, we further propose AEQA for overall adversarial example quality assessment by integrating the perceptual quality and attack intensity of AEs. Extensive experiments validate that the proposed AdvDSS achieves state-of-the-art performance, which is more consistent with human opinions.



Paperid:638 Poster
Authors:Xuan Hai,Xin Liu,Yuan Tan,Gang Liu,Song Li,Weina Niu,Rui Zhou,Xiaokang Zhou
Abstract:
Voice is one of the most widely used media for information transmission in human society. While high-quality synthetic voices are extensively utilized in various applications, they pose significant risks to content security and trust building. Numerous studies have concentrated on fake voice detection to mitigate these risks, with many claiming to achieve promising performance. However, recent research has demonstrated that existing fake voice detectors suffer from serious overfitting to speaker-irrelative features (SiFs) and cannot be used in real-world scenarios. In this paper, we analyze the limitations of existing fake voice detectors and propose a new design philosophy, guiding the detection model to prioritize learning human voice features rather than the difference between the human voice and the synthetic voice. Based on this philosophy, we propose a novel fake voice detection framework named SiFSafer, which uses pre-trained speech representation models to enhance the learning of feature distribution in human voices and the adapter fine-tuning to optimize the performance. The evaluation shows that the average EERs of existing fake voice detectors in the ASVspoof challenge can exceed 20% if the SiFs like silence segments are removed, while SiFSafer achieves an EER of less than 8%, indicating that SiFSafer is robust to SiFs and strongly resistant to existing attacks.



Paperid:639 Poster
Authors:Bowen Zhao,Qianqian Wang,ZHIQIANG TAO,Wei Feng,Quanxue Gao
Abstract:
Existing fair multi-view clustering methods impose a constraint that requires the distribution of sensitive attributes to be uniform within each cluster. However, this constraint can lead to misallocation of samples with sensitive attributes. To solve this problem, we propose a novel Deep Fair Multi-View Clustering (DFMVC) method that learns a consistent and discriminative representation instructed by a fairness constraint constructed from the distribution of clusters. Specifically, we incorporate contrastive constraints on semantic features from different views to obtain consistent and discriminative representations for each view. Additionally, we align the distribution of sensitive attributes with the target cluster distribution to achieve optimal fairness in clustering results. Experimental results on four datasets with sensitive attributes demonstrate that our method improves both the fairness and performance of clustering compared to state-of-the-art multi-view clustering methods.



Paperid:640 Poster
Authors:Yuhui Quan,Xiaoheng Tan,Yan Huang,Yong Xu,Hui Ji
Abstract:
Underwater images, often plagued by complex degradation, pose significant challenges for image enhancement. To address these challenges, the paper redefines underwater image enhancement as an image decomposition problem and proposes a deep invertible neural network (INN) that accurately predicts both the latent image and the degradation effects. Instead of using an explicit formation model to describe the degradation process, the INN adheres to the constraints of the image decomposition model, providing necessary regularization for model training, particularly in the absence of supervision on degradation effects. Taking into account the diverse scales of degradation factors, the INN is structured on a multi-scale basis to effectively manage the varied scales of degradation factors. Moreover, the INN incorporates several asymmetric design elements that are specifically optimized for the decomposition model and the unique physics of underwater imaging. Comprehensive experiments show that our approach provides significant performance improvement over existing methods.



Paperid:641 Poster
Authors:Wen Yin,Bin Benjamin Zhu,Yulai Xie,Pan Zhou,Dan Feng
Abstract:
RGB-Thermal salient object detection (RGBT-SOD) plays a critical role in complex scene recognition fields such as autonomous driving, yet security research in this area remains limited. This paper introduces the first backdoor attack targeting RGBT-SOD, generating saliency maps on triggered inputs that depict non-existent salient objects chosen by the attacker, or designate no salient region (all black pixels) or the entire image as a salient region (all white pixels). We uncover that triggers possess an influence range for generating non-existent salient objects, supported by a theoretical approximation provided in this study. Extensive experimental evaluations validate the efficacy of our attack in both digital domain and physical-world scenarios. Notably, our dual-modality backdoor attack achieves an Attack Success Rate (ASR) of 86.72% with only 5 pairs of images in model training. Despite exploring potential countermeasures, we find them ineffective in thwarting our attacks, underscoring the urgent need for robust defenses against sophisticated backdoor attacks in RGBT-SOD systems.



Paperid:642 Poster
Authors:Yuhan Wu,Xiyu Meng,Yang He,Junru Zhang,Haowen Zhang,Yabo Dong,Dongming Lu
Abstract:
Learning semantic-rich representations from unlabeled time series data with intricate dynamics is a notable challenge. Traditional contrastive learning techniques predominantly focus on segment-level augmentations through time slicing, a practice that, while valuable, often results in sampling bias and suboptimal performance due to the loss of global context. Furthermore, they typically disregard the vital frequency information that could enrich data representations. To this end, we propose a novel self-supervised general-purpose framework called Temporal-Frequency and Contextual Consistency (TFCC). Specifically, this framework first performs two instance-level augmentation families over the entire series to capture nuanced representations alongside critical long-term dependencies. Then, TFCC advances by initiating dual cross-view forecasting tasks between the original series and its augmented counterpart in both time and frequency dimensions to learn robust representations. Finally, three specially designed consistency modules (temporal, frequency, and temporal-frequency) aid in further developing discriminative representations on top of the learned robust representations. Extensive experiments on multiple benchmark datasets demonstrate TFCC's superiority over state-of-the-art classification and forecasting methods and exhibit exceptional efficiency in semi-supervised and transfer learning scenarios. Code, data, and model checkpoints will be released after the review period.
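As an illustration of instance-level time- and frequency-domain views over the entire series, here is a minimal sketch using additive jitter and an rFFT magnitude spectrum; TFCC's actual augmentation families and consistency modules are not specified in the abstract and will differ, so the names and choices below are assumptions.

```python
import torch

def time_frequency_views(x: torch.Tensor, jitter_std: float = 0.05):
    """Toy instance-level views of a batch of series x with shape (B, T).

    Produces a time-domain view (additive jitter over the whole sequence,
    preserving global context) and a frequency-domain view (rFFT magnitude
    spectrum), which downstream consistency objectives can compare.
    """
    time_view = x + jitter_std * torch.randn_like(x)      # whole-series augmentation
    freq_view = torch.fft.rfft(x, dim=-1).abs()           # (B, T // 2 + 1)
    return time_view, freq_view

if __name__ == "__main__":
    series = torch.sin(torch.linspace(0, 12.56, 128)).repeat(4, 1)
    t_view, f_view = time_frequency_views(series)
    print(t_view.shape, f_view.shape)
```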



Paperid:643 Poster
Authors:Junkang Liu,Fanhua Shang,Yuanyuan Liu,Hongying Liu,Yuangang Li,YunXiang Gong
Abstract:
Although federated learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel communication-efficient block coordinate gradient descent (FedBCGD) method. The proposed method splits model parameters into several blocks and enables each client to upload a specific parameter block during training, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction techniques. To the best of our knowledge, this paper is the first parameter block communication work for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence results than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms.
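A toy sketch of the block-coordinate idea follows, assuming a simple round-robin split of the parameter list and a client that trains and uploads only its assigned block; FedBCGD's actual block assignment, client drift control, and variance reduction are not reproduced, and the function names are hypothetical.

```python
import torch
import torch.nn as nn

def split_into_blocks(model: nn.Module, num_blocks: int):
    """Toy round-robin split of a model's parameter list into disjoint blocks."""
    params = list(model.parameters())
    return [params[i::num_blocks] for i in range(num_blocks)]

def client_update(model, block, batches, loss_fn, lr=0.01):
    """One client round: only the assigned block receives gradients and is uploaded."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in block:
        p.requires_grad_(True)
    opt = torch.optim.SGD(block, lr=lr)
    for x, y in batches:                           # local steps on the client's data
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Only the updated block (roughly a 1/num_blocks fraction of the model) is sent back.
    return [p.detach().clone() for p in block]

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    blocks = split_into_blocks(net, num_blocks=2)
    data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(5)]
    uploaded = client_update(net, blocks[0], data, nn.CrossEntropyLoss())
    print(len(uploaded), "tensors uploaded instead of", len(list(net.parameters())))
```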



Paperid:644 Poster
Authors:Hua Yu,Weiming Liu,Jiapeng Bai,Gui Xu,Yaqing Hou,Yew-Soon Ong,Qiang Zhang
Abstract:
Recent generative methods, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DMs), have revolutionized human motion synthesis and gained significant attention in the human motion field. However, there are still challenges in unconditionally generating highly diverse human motions from a given distribution. To enhance the diversity of synthesized human motions, previous methods usually employ deep neural networks (DNNs) to train a transport map that transforms a Gaussian noise distribution into the real human motion distribution. According to Figalli's regularity theory, the optimal transport map computed by DNNs frequently exhibits discontinuities, due to the inherent limitation of DNNs in representing only continuous maps. Consequently, the generated human motions tend to concentrate heavily on densely populated regions of the data distribution, resulting in mode collapse or mode mixture. To address these issues, we propose an efficient method called MOOT for unconditional human motion synthesis. First, we utilize a reconstruction network based on GRU and transformer to map human motions to a latent space. Next, we employ convex optimization to map the noise distribution to the latent space distribution of human motions through the Optimal Transport (OT) map. Then, we combine the extended OT map with the generator of the reconstruction network to generate new human motions, thereby overcoming the issues of mode collapse and mode mixture. MOOT generates a latent code distribution that is well-behaved and highly structured, providing a strong motion prior for various applications in the field of human motion. Through qualitative and quantitative experiments, MOOT achieves state-of-the-art results surpassing the latest methods, validating its superiority in unconditional human motion generation.



Paperid:645 Poster
Authors:RUOFAN WANG,Xingjun Ma,Hanxu Zhou,Chuanjun Ji,Guangnan Ye,Yu-Gang Jiang
Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methodologies mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. In contrast, our methodology adopts a comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Furthermore, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Specifically, we begin by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the fragility of VLMs and the exigency for new alignment strategies.



Paperid:646 Poster
Authors:Kangpeng Hu,Sun Quansen,Yinghui Sun,Tao Wang
Abstract:
The interactive segmentation task aims to take user preferences into account, on top of general semantic segmentation, in order to obtain a specific target of interest. Most related algorithms generate only a single mask, whose robustness might be constrained by the diversity of user intention in the early interaction stage, namely the ambiguous selection of an object part, the whole object, or an adherent object, especially when there is only one click. To handle this, we propose a novel framework called Diversified Interactive Segmentation Network (DISNet), in which we revisit the peculiarity of the first click: given an input image, DISNet outputs multiple candidate masks under the guidance of the first click only; it then utilizes a Dual-attentional Mask Correction (DAMC) module consisting of two branches: a) masked attention based on click propagation; b) mixed attention within the first click, subsequent clicks, and the image w.r.t. position and feature space. Moreover, we design a new sampling strategy to generate GT masks with rich semantic relations. The comparison between DISNet and mainstream algorithms demonstrates the efficacy of our method, which further exemplifies the decisive role of the first click in the realm of interactive segmentation.



Paperid:647 Poster
Authors:Yiren Lu,Jing Ma,Yu Yin
Abstract:
Radiance Fields (RFs) have emerged as a crucial technology for 3D scene representation, enabling the synthesis of novel views with remarkable realism. However, as RFs become more widely used, the need for effective editing techniques that maintain coherence across different perspectives becomes evident. Current methods primarily depend on per-frame 2D image inpainting, which often fails to maintain consistency across views, thus compromising the realism of edited RF scenes. In this work, we introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image. This image is then projected across multiple views using a depth-based approach, effectively reducing the inconsistencies observed with per-frame inpainting. However, projections typically assume photometric consistency across views, which is often impractical in real-world settings. To accommodate realistic variations in lighting and viewpoint, our pipeline adjusts the appearance of the projected views by generating multiple directional variants of the inpainted image, thereby adapting to different photometric conditions. Additionally, we present an effective and robust multi-view object segmentation approach as a valuable byproduct of our pipeline. Extensive experiments demonstrate that our method significantly surpasses existing frameworks in maintaining content consistency across views and enhancing visual quality.



Paperid:648 Poster
Authors:Xuannan Liu,Pei Pei Li,Huaibo Huang,Zekun Li,Xing Cui,jiahao.liang,lixiong Qin,Weihong Deng,Zhaofeng He
Abstract:
The massive generation of multimodal fake news involving both text and images exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training restricts the capability of classical detectors to obtain open-world facts. While Large Vision-Language Models (LVLMs) have encoded rich world knowledge, they are not inherently tailored for combating fake news and struggle to comprehend local forgery details. In this paper, we propose FKA-Owl, a novel framework that leverages forgery-specific knowledge to augment LVLMs, enabling them to reason about manipulations effectively. The augmented forgery-specific knowledge includes semantic correlation between text and images, and artifact trace in image manipulation. To inject these two kinds of knowledge into the LVLM, we design two specialized modules to establish their representations, respectively. The encoded knowledge embeddings are then incorporated into LVLMs. Extensive experiments on the public benchmark demonstrate that FKA-Owl achieves superior cross-domain performance compared to previous methods. Code will be made publicly available.



Paperid:649 Poster
Authors:wenjie li,Heng Guo,Xuannan Liu,Kongming Liang,Jiani Hu,Zhanyu Ma,Jun Guo
Abstract:
Face super-resolution aims to reconstruct a high-resolution face image from a low-resolution face image. Previous methods typically employ an encoder-decoder structure to extract facial structural features, where the direct downsampling inevitably introduces distortions, especially to high-frequency features such as edges. To address this issue, we propose a wavelet-based feature enhancement network, which mitigates feature distortion by losslessly decomposing the input facial feature into high-frequency and low-frequency components using the wavelet transform and processing them separately. To improve the efficiency of facial feature extraction, a full domain Transformer is further proposed to enhance local, regional, and global low-frequency facial features. Such designs allow our method to perform better without stacking many network modules as previous methods did. Extensive experiments demonstrate that our method effectively balances performance, model size, and inference speed. All code and data will be released upon acceptance.
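A minimal sketch of the lossless wavelet split the abstract refers to is shown below, using a single-level 2-D Haar DWT on feature maps; the network's actual wavelet choice and the full-domain Transformer are beyond this illustration, and the function name is hypothetical.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """Toy single-level 2-D Haar DWT of a feature map (B, C, H, W), H and W even.

    Returns the low-frequency band LL and the high-frequency bands (LH, HL, HH).
    The four half-resolution bands together form an exact, invertible split of x,
    so no information is lost as it would be with plain downsampling.
    """
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

if __name__ == "__main__":
    feat = torch.randn(1, 8, 32, 32)
    ll, highs = haar_dwt(feat)
    print(ll.shape, [h.shape for h in highs])   # all (1, 8, 16, 16)
```

The low- and high-frequency bands can then be routed to separate branches and recombined with an inverse transform.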



Paperid:650 Poster
Authors:Litian Zhang,Xiaoming Zhang,Chaozhuo Li,Ziyi Zhou,Jiacheng Liu,Feiran Huang,Xi Zhang
Abstract:
The detection of fake news has emerged as a pressing issue in the era of online social media. To detect meticulously fabricated fake news, propagation paths are introduced to provide nuanced social context to complement the pure semantics within news content. However, existing propagation-enhanced models face a dilemma between detection efficacy and social hazard. In this paper, we investigate the novel problem of early fake news detection via propagation path generation, capable of enjoying the merits of rich social context within propagation paths while alleviating potential social hazards. In contrast to previous discriminative detection models, we further propose a novel generative model, DGA-Fake, by simulating realistic propagation paths based on news content before actual spreading. A guided diffusion module is integrated into DGA-Fake to generate simulated user interaction sequences, guided by historical interactions and news content. Evaluation across three datasets demonstrates the superiority of our proposal. Our code is publicly available at https://anonymous.4open.science/r/DGA-Fake-1D5F/.



Paperid:651 Poster
Authors:Shuo Wang,Yongcai Wang,Zhimin Xu,Yongyu Guo,Wanting Li,Zhe Huang,xuewei Bai,Deying Li
Abstract:
For interacting with mobile objects in unfamiliar environments, simultaneously locating, mapping, and tracking the 3D poses of multiple objects is crucially required. This paper proposes a Tracklet and Query Graph based framework, i.e., GSLAMOT, to address this challenge. GSLAMOT represents the dynamic scene by a combination of a semantic map, the agent trajectory, and an online maintained Tracklet Graph (TG). TG tracks and predicts the 3D poses of the detected active objects. A Query Graph (QG) is constructed in each frame by object detection to query and to update TG, as well as the semantic map and the agent trajectory. For accurate object association, a Multi-criteria Subgraph Similarity Association (MSSA) method is proposed to find matched objects between the detections in QG and the predicted tracklets in TG. Then an Object-centric Graph Optimization (OGO) method is proposed to optimize the TG, the semantic map, and the agent trajectory simultaneously. It triangulates the detected objects into the map to enrich the map's semantic information. We address the efficiency issues to handle the three tightly coupled tasks in parallel. Experiments are conducted on KITTI, Waymo, and an emulated Traffic Congestion dataset that highlights challenging scenarios including congested objects. Experiments show that GSLAMOT enables accurate tracking of crowded objects while conducting SLAM accurately in challenging scenarios, demonstrating better performance than state-of-the-art methods.



Paperid:652 Poster
Authors:Yuning Ding,Sifan Zhang,Shenglan Liu,Jinrong Zhang,Wenyue Chen,Duan Haifei,bingcheng dong,Tao Sun
Abstract:
Human Action Quality Assessment (AQA) is a prominent area of research in human action analysis. Current mainstream methods only consider the RGB modality which results in limited feature representation and insufficient performance due to the complexity of the AQA task. In this paper, we propose a simple and modular framework called the Two-Modality Assessment Framework (2M-AF), which comprises a skeleton stream, an RGB stream and a regression module. For the skeleton stream, we develop the Self-supervised Mask Encoder Graph Convolution Network (SME-GCN) to achieve representation learning, and further implement score assessment. Additionally, we propose a Preference Fusion Module (PFM) to fuse features, which can effectively avoid the disadvantages of different modalities. Our experimental results demonstrate the superiority of the proposed 2M-AF over current state-of-the-art methods on three publicly available datasets: AQA-7, UNLV-Diving, and MMFS-63.



Paperid:653 Poster
Authors:Congqi Cao,Yueran Zhang,Yating Yu,Qinyi Lv,Lingtong Min,Yanning Zhang
Abstract:
Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and design sophisticated temporal alignment modules at feature level. However, simply fully fine-tuning the pre-trained model could cause overfitting due to the scarcity of video samples. Additionally, we argue that the exploration of task-specific information is insufficient when relying solely on well extracted abstract features. In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. By introducing the proposed Task-Adapter into the last several layers of the backbone and keeping the parameters of the original pre-trained model frozen, we mitigate the overfitting problem caused by full fine-tuning and advance the task-specific mechanism into the process of feature extraction. In each Task-Adapter, we reuse the frozen self-attention layer to perform task-specific self-attention across different videos within the given task to capture both distinctive information among classes and shared information within classes, which facilitates task-specific adaptation and enhances subsequent metric measurement between the query feature and support prototypes. Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets. Especially on temporal challenging SSv2 dataset, our method outperforms the state-of-the-art methods by a large margin.



Paperid:654 Poster
Authors:Hamed Alimohammadzadeh,Shahram Ghandeharizadeh
Abstract:
Swarical, a Swarm-based hierarchical localization technique, enables miniature drones, known as Flying Light Specks (FLSs), to accurately and efficiently localize and illuminate complex 2D and 3D shapes. Its accuracy depends on the physical hardware (sensors) of FLSs, which are used to track neighboring FLSs in order to localize themselves. It uses the hardware specification to convert mesh files into point clouds that enable a swarm of FLSs to localize at the highest accuracy afforded by their hardware. Swarical considers a heterogeneous mix of FLSs with different orientations for their tracking sensors, ensuring a line of sight between a localizing FLS and its anchor FLS. We present an implementation using Raspberry cameras and ArUco markers. A comparison of Swarical with a state-of-the-art decentralized localization technique shows that it is as accurate and more than 2x faster.



Paperid:655 Poster
Authors:Kien Trung Pham,Jingye Chen,Qifeng Chen
Abstract:
We present TALE, a novel training-free framework harnessing the power of text-driven diffusion models to tackle the cross-domain image composition task, which aims at seamlessly incorporating user-provided objects into a specific visual context regardless of domain disparity. Previous methods often involve either training auxiliary networks or finetuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pretrained diffusion models. Some recent works attempt to break the barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield desired compositional outcomes. These approaches could only retain some semantic information and usually fall short in preserving identity characteristics of input objects or exhibit limited background-object style adaptation in generated images. In contrast, TALE is a novel method that operates directly on latent space to provide explicit and effective guidance for the composition process to resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, and the latter exploits designated energy functions to further optimize intermediate latents conforming to specific conditions that complement the former to generate desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.



Paperid:656 Poster
Authors:Wei He,Xiang Li,Shengtian Xu,Yuzheng Chen,SIO CHAN IN DEVIN,Ge lin,LIK-HANG LEE
Abstract:
The preservation of cultural heritage, as mandated by the United Nations Sustainable Development Goals (SDGs), is integral to sustainable urban development. This paper focuses on the Dragon Boat Festival, a prominent event in Chinese cultural heritage, and proposes leveraging immersive technologies, particularly Virtual Reality (VR), to enhance its preservation and accessibility. Traditionally, participation in the festival's dragon boat races was limited to elite athletes, excluding broader demographics. Our proposed solution, named MetaDragonBoat, enables virtual participation in dragon boat racing, offering immersive experiences that replicate physical exertion through a cultural journey. Thus, we build a digital twin of a university campus located in a region with a rich dragon boat racing tradition. Coupled with three paddling techniques that are enabled by either commercial controllers or physical paddle controllers with haptic feedback, diversified users can engage in realistic rowing experiences. Our results demonstrate that by integrating resistance into the paddle controls, users could simulate the physical effort of dragon boat racing, promoting a deeper understanding and appreciation of this cultural heritage.



Paperid:657 Poster
Authors:Wenxuan Wang,Chenglei Wang,huihui Qi,Menghao Ye,Xuelin Qian,PENG WANG,Yanning Zhang
Abstract:
With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at exploring model security deeply. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.



Paperid:658 Poster
Authors:Jinfeng Wei,Xiao Feng Zhang
Abstract:
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding-specific weighted layer penalties and redistribution, offering an economical and effective solution without the need for additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.



Paperid:659 Poster
Authors:Yanghao Su,Jie Zhang,Ting Xu,Tianwei Zhang,Weiming Zhang,Nenghai Yu
Abstract:
Backdoor attacks pose a significant security vulnerability for deep neural networks (DNNs), enabling them to operate normally on clean inputs but manipulate predictions when specific trigger patterns occur. In this paper, we consider a practical post-training backdoor defense scenario, where the defender aims to evaluate whether a trained model has been compromised by backdoor attacks. Currently, post-training backdoor detection approaches often operate under the assumption that the defender has knowledge of the attack information, logit output from the model, and knowledge of the model parameters, limiting their implementation in practical scenarios. In contrast, our approach functions as a lightweight diagnostic scanning tool that operates in conjunction with other defense methods, assisting in defense pipelines. We begin by presenting an intriguing observation: the decision boundary of the backdoored model exhibits a greater degree of closeness than that of the clean model. Simultaneously, if only one single label is infected, a larger portion of the regions will be dominated by the attacked label. Leveraging this observation, drawing an analogy to X-rays in disease diagnosis, we propose Model X-ray. This novel backdoor detection approach is based on the analysis of illustrated two-dimensional (2D) decision boundaries, offering interpretability and visualization. Model X-ray can not only identify whether the target model is infected but also determine the target attacked label under the all-to-one attack strategy. Importantly, it accomplishes this solely from the predicted hard labels of clean inputs, without any assumptions about attacks or prior knowledge of the training details of the model. Extensive experiments demonstrate that Model X-ray can be effective and efficient across diverse backdoor attacks, datasets, and architectures.
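
As a rough illustration of how a 2D decision-boundary slice can be rendered from hard labels alone, the sketch below predicts labels on a random plane through a clean input; the random directions, grid span, and resolution are our illustrative assumptions rather than the paper's protocol.

    import numpy as np
    import torch

    def decision_boundary_slice(model, x, span=10.0, steps=41):
        """Return a (steps, steps) grid of hard labels on a random 2D plane around x."""
        model.eval()
        u = torch.randn_like(x); u = u / u.norm()   # two random, normalized directions
        v = torch.randn_like(x); v = v / v.norm()
        coords = np.linspace(-span, span, steps)
        labels = np.zeros((steps, steps), dtype=np.int64)
        with torch.no_grad():
            for i, a in enumerate(coords):
                batch = torch.stack([x + a * u + b * v for b in coords])
                labels[i] = model(batch).argmax(dim=1).cpu().numpy()
        return labels  # inspect, e.g., how much of the plane one label dominates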



Paperid:660 Poster
Authors:Dan Zeng,Yu Zhu,Shuiwang Li,Qijun Zhao,Qiaomu Shen,Bo Tang
Abstract:
In this paper, we are interested in identifying denser and finer animal joints. The lack of standardized joint definitions across various APE datasets, e.g., AnimalPose with 20 joints, AP-10k with 17 joints, and TigDog with 19 joints, presents a significant challenge yet offers an opportunity to fully utilize annotation data. This paper tackles this new non-standardized annotation problem, aiming to learn fine-grained (e.g., 24 or more joints) pose estimators from datasets that lack complete annotations. To combat the unannotated joints, we propose FreeNet, comprising a base network and an adaptation network connected through a circuit feedback learning paradigm. FreeNet enhances the adaptation network's tolerance to unannotated joints via body part-aware learning, optimizing the sampling frequency of joints based on joint detection difficulty, and improves the base network's predictions for unannotated joints using feedback learning. This leverages the cognitive differences of the adaptation network between non-standardized labeled and large-scale unlabeled data. Experimental results on three non-standard datasets demonstrate the effectiveness of our method for fine-grained APE.



Paperid:661 Poster
Authors:Huadai Liu,Rongjie Huang,Yang Liu,Hengyuan Cao,Jialei Wang,Xize Cheng,Siqi Zheng,Zhou Zhao
Abstract:
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce EchoAudio, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that address noise removal through iterative processes, EchoAudio integrates Consistency Models (CMs) into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that EchoAudio needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models using hundreds of steps. EchoAudio enables a sampling speed 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in EchoAudio is effective. \footnote{Audio samples are available at \url{https://Echo-Audio.github.io/.}}



Paperid:662 Poster
Authors:Linhui Xiao,Xiaoshan Yang,Fang Peng,Yaowei Wang,Changsheng Xu
Abstract:
Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilize uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the corresponding multimodal information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase its significant grounding capabilities as well as promising energy efficiency advantages. All of the code and models will be released upon acceptance.



Paperid:663 Poster
Authors:Ruohao Guo,Dantong Niu,Liao Qu,Yanyu Qi,Ji Shi,Wenzhen Yue,Bowei Xing,Taiyan Chen,Xianghua Ying
Abstract:
Panoramic audio-visual saliency detection is to segment the most attention-attractive regions in 360° panoramic videos with sound. To meticulously delineate the detected salient regions and effectively model human attention shift, we extend this task to more fine-grained instance scenarios: identifying salient object instances and inferring their saliency ranks. In this paper, we propose the first instance-level framework that can simultaneously be applied to segmentation and ranking of multiple salient objects in panoramic videos. Specifically, it consists of a distortion-aware pixel decoder to overcome panoramic distortions, a sequential audio-visual fusion module to integrate audio-visual information, and a spatio-temporal object decoder to separate individual instances and predict their saliency scores. Moreover, owing to the absence of such annotations, we create the ground-truth saliency ranks for the PAVS10K benchmark. Extensive experiments demonstrate that our model is capable of achieving state-of-the-art performance on the PAVS10K for both saliency detection and ranking tasks. The code and dataset will be released soon.



Paperid:664 Poster
Authors:Shihua Zhang,Jiayi Ma
Abstract:
As one of the most fundamental computer vision problems, image feature matching aims to establish correct correspondences between two-view images. Existing studies enhance the descriptions of feature points with a graph neural network (GNN), identifying correspondences with the predicted assignment matrix. However, this pipeline easily falls into a suboptimal result during training because the solution space is extremely complex, and it has no access to the prior that can guide the information propagation and network convergence. In this paper, we propose a novel method called DiffGlue that for the first time introduces the Diffusion Model into the image feature matching framework. Concretely, based on the incrementally iterative diffusion and denoising processes, DiffGlue can be guided by the prior from the Diffusion Model and trained step by step on the optimization path, approaching the optimal solution progressively. Besides, it contains a special Assignment-Guided Attention as a bridge to merge the Diffusion Model and sparse image feature matching, which injects the inherent prior into the GNN and thereby ameliorates its message delivery. Extensive experiments reveal that DiffGlue converges faster and better, outperforming state-of-the-art methods on several applications such as homography estimation, relative pose estimation, and visual localization.



Paperid:665 Poster
Authors:Huan Yao,Changxing Ding,Xuanda Xu,Zhifeng Lin
Abstract:
Estimating the 3D poses of interacting hands from a monocular image is challenging due to the similarity in appearance between hand parts. Therefore, utilizing the appearance features alone tends to result in unreliable pose estimation. Existing approaches directly fuse the appearance features with position features, ignoring that the two types of features are heterogeneous. Here, the appearance features are derived from the RGB values of pixels, while the position features are mapped from the coordinates of pixels or joints. To address this problem, we present a novel framework called \textbf{D}ecoupled \textbf{F}eature \textbf{L}earning (\textbf{DFL}) for 3D pose estimation of interacting hands. By decoupling the appearance and position features, we facilitate the interactions within each feature type and those between both types of features. First, we compute the appearance relationships between the joint queries and the image feature maps; we utilize these relationships to aggregate each joint's appearance and position features. Second, we compute the 3D spatial relationships between hand joints using their position features; we utilize these relationships to guide the feature enhancement of joints. Third, we calculate appearance relationships and spatial relationships between the joints and image using the appearance and position features, respectively; we utilize these complementary relationships to promote the joints' location in the image. The two processes mentioned above are conducted iteratively. Finally, only the refined position features are used for hand pose estimation. This strategy avoids the step of mapping heterogeneous appearance features to hand-joint positions. Our method significantly outperforms state-of-the-art methods on the large-scale InterHand2.6M dataset. More impressively, our method exhibits strong generalization ability on in-the-wild images. The code will be released.



Paperid:666 Poster
Authors:Xuefeng Yin,Chenyang Zhu,Shanglai Qu,Yuqi Li,Kai Xu,Baocai Yin,Xin Yang
Abstract:
Simultaneously mapping and exploring a complex unknown scene is an NP-hard problem, which remains challenging even with the rapid development of deep learning techniques. We present CSO, a deep reinforcement learning-based framework for efficient active scene mapping. Constraint-guided space optimization is adopted for both state and critic space to reduce the difficulty of finding the globally optimal exploration path and avoid long-distance round trips while exploring. We first take the frontiers-based entropy as the input constraint with the raw observation into the network, which guides training to start from imitating local greedy searching. However, the entropy-based optimization can easily get stuck in a few local optima or cause inefficient round trips since the entropy space and the real world do not share the same metric. Inspired by constrained reinforcement learning, we then introduce an action mask-based optimization constraint to align the metric of these two spaces. Exploration optimization in aligned spaces can avoid long-distance round trips more effectively. We evaluate our method with a ground robot in 29 complex indoor scenes of different scales. Our method achieves 19.16% higher exploration efficiency and 3.12% higher exploration completeness on average compared to the state-of-the-art alternatives. We also deploy our method in real-world scenes, where it efficiently explores an area of 649 $m^2$. The experiment video can be found in the supplementary material.



Paperid:667 Poster
Authors:Zengsheng Kuang,Changxing Ding,Huan Yao
Abstract:
Achieving 3D hand-object pose estimation in interaction scenarios is challenging due to the severe occlusion generated during the interaction. Existing methods address this issue by utilizing the correlation between the hand and object poses as additional cues. They usually first extract the hand and object features from their respective regions and then refine them with each other. However, this paradigm disregards the role of a broad range of image context. To address this problem, we propose a novel and robust approach that learns a broad range of context by imposing priors. First, we build this approach using stacked transformer decoder layers. These layers are required for extracting image-wide context and regional hand or object features by constraining cross-attention operations. We share the context decoder layer parameters between the hand and object pose estimations to avoid interference in the context-learning process. This imposes a prior, indicating that the hand and object are mutually the most important context for each other, significantly enhancing the robustness of obtained context features. Second, since they play different roles, we provide customized feature maps for the context, hand, and object decoder layers. This strategy facilitates the disentanglement of these layers, reducing the feature learning complexity. Finally, we conduct extensive experiments on the popular HO3D and Dex-YCB databases. The experimental results indicate that our method significantly outperforms state-of-the-art approaches and can be applied to other hand pose estimation tasks. The code will be released.



Paperid:668 Poster
Authors:Linfeng Tang,Yuxin Deng,Xunpeng Yi,Qinglong Yan,Yixuan Yuan,Jiayi Ma
Abstract:
Existing multi-modal image fusion algorithms are typically designed for high-quality images and fail to tackle degradation (e.g., low light, low resolution, and noise), which restricts image fusion from unleashing its potential in practice. In this work, we present Degradation-Robust Multi-modality image Fusion (DRMF), leveraging the powerful generative properties of diffusion models to counteract various degradations during image fusion. Our critical insight is that generative diffusion models driven by different modalities and degradations are inherently complementary during the denoising process. Specifically, we pre-train multiple degradation-robust conditional diffusion models for different modalities to handle degradations. Subsequently, the diffusion prior combination module is devised to integrate generative priors from pre-trained uni-modal models, enabling effective multi-modal image fusion. Extensive experiments demonstrate that DRMF excels in infrared-visible and medical image fusion, even under complex degradations.



Paperid:669 Poster
Authors:Kenan Huang,Junbao Zhuo,Shuhui Wang,Chi Su,Qingming Huang,Huimin Ma
Abstract:
Image-to-Video adaptation is proposed to train a model using labeled images and unlabeled videos to facilitate the classification of unlabeled videos. The latest work synthesizes videos from still images to mitigate the modality gap between images and videos. However, the synthesized videos are not realistic because the camera movements are only simulated in 2D space. Therefore, we generate realistic videos by simulating arbitrary camera movements in 3D scenes, and the model can then be trained using the generated source videos. Unfortunately, the optical flows from the generated videos have unexpected negative impacts, resulting in suboptimal performance. To address this issue, we propose the Category-aware Flow Memory Bank, which replaces optical flows in source videos with real target flows, and the newly composed videos are beneficial for training. In addition, we leverage the video pace prediction task to enhance the speed awareness of the model in order to solve the problem that the model performs poorly in handling some categories with similar appearances but significant speed differences. Our method achieves state-of-the-art or comparable performance on three Image-to-Video benchmarks.



Paperid:670 Poster
Authors:Wenbo Huang,Jinghui Zhang,Xuwei Qian,Zhen Wu,Meng Wang,Lei Zhang
Abstract:
High frame-rate (HFR) videos in action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, which promotes few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called $\underline{\textbf{S}}$patio-temp$\underline{\textbf{O}}$ral fr$\underline{\textbf{A}}$me tu$\underline{\textbf{P}}$le enhancer ($\textbf{SOAP}$) in this paper. The model we design with this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and the spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples with different frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code will be released.



Paperid:671 Poster
Authors:Ziyue Wu,Junyu Gao,Changsheng Xu
Abstract:
Video Scene Graph Generation (VidSGG) plays a crucial role in various visual-language tasks by providing accessible structured visual relation knowledge. However, the requirement of annotating all categories in prevailing VidSGG methods limits their application in real-world scenarios. Despite popular VLMs facilitating preliminary exploration of open-vocabulary VidSGG tasks, the correspondence between visual union regions and relation predicates is usually ignored. Therefore, we propose an Open-vocabulary VidSGG framework named Union-Aware Semantic Alignment Network (UASAN) to explore the alignment between visual union regions and relation predicate concepts in the same semantic space. Specifically, a visual refiner is designed to acquire open-vocabulary knowledge and the ability to bridge different modalities. To achieve better alignment, we first design a semantic-aware context encoder to achieve comprehensive semantic interaction between object trajectories, visual union regions, and trajectory motion information to obtain semantic-aware union region representations. Then, a union-relation alignment decoder is utilized to generate the discriminative relation token for each union region for final relation prediction. Extensive experimental results on two benchmark datasets show that our UASAN achieves significant performance gains over existing methods, which also verifies the necessity of modeling union region-predicate alignment in the VidSGG pipeline. Code is available in the supplementary material.



Paperid:672 Poster
Authors:Zhiru Wang,Shiyun Xie,Chengwei Pan,Guoping Wang
Abstract:
Recently, the 3D Gaussian Splatting (3D-GS) method has achieved great success in novel view synthesis, providing real-time rendering while ensuring high-quality rendering results. However, this method faces challenges in modeling specular reflections and handling anisotropic appearance components, especially in dealing with view-dependent color under complex lighting conditions. Additionally, 3D-GS uses spherical harmonics to learn the color representation, which has limited ability to represent complex scenes. To overcome these challenges, we introduce Latent-SpecGS, an approach that utilizes a universal latent neural descriptor within each 3D Gaussian. This enables a more effective representation of 3D feature fields, including appearance and geometry. Moreover, two parallel CNNs are designed to decode the splatting feature maps into diffuse color and specular color separately. A mask that depends on the viewpoint is learned to merge these two colors, resulting in the final rendered image. Experimental results demonstrate that our method obtains competitive performance in novel view synthesis and extends the ability of 3D-GS to handle intricate scenarios with specular reflections.



Paperid:673 Poster
Authors:Shengxin Chen,Gen Luo,Yiyi Zhou,Xiaoshuai Sun,GUANNAN JIANG,Rongrong Ji
Abstract:
Visual grounding is a task of locating the object referred to by a natural language description. To reduce annotation costs, recent researchers are devoted to one-stage weakly supervised methods for visual grounding, which typically adopt the anchor-text matching paradigm. Despite the efficiency, we identify that anchor representations are often noisy and insufficient to describe object information, which inevitably hinders the vision-language alignments. In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. Different from previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. In this case, QueryMatch re-formulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Based on QueryMatch, we further propose an innovative strategy for effective weakly supervised learning, namely Negative Sample Quality Estimation (NSQE). In particular, NSQE aims to augment negative training samples by actively selecting high-quality query features. Through this strategy, NSQE can greatly benefit the weakly supervised learning of QueryMatch. To validate our approach, we conduct extensive experiments on three benchmark datasets of two grounding tasks, i.e., referring expression comprehension (REC) and segmentation (RES). Experimental results not only show the state-of-the-art performance of QueryMatch in the two tasks, e.g., over +5% IoU@0.5 on RefCOCO in REC and over +20% mIoU on RefCOCO in RES, but also confirm the effectiveness of NSQE in weakly supervised learning. Source codes are available at~\url{https://anonymous.4open.science/r/QueryMatch-A82C}.
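
To make the query-text matching idea concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between query features and text features; the shapes, temperature, and row-wise pairing convention are illustrative assumptions rather than QueryMatch's exact objective.

    import torch
    import torch.nn.functional as F

    def query_text_contrastive_loss(query_feats, text_feats, temperature=0.07):
        """query_feats, text_feats: (N, D) tensors matched row-by-row."""
        q = F.normalize(query_feats, dim=-1)
        t = F.normalize(text_feats, dim=-1)
        logits = q @ t.T / temperature            # (N, N) similarity matrix
        targets = torch.arange(q.size(0), device=q.device)
        # Symmetric InfoNCE over query->text and text->query directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))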



Paperid:674 Poster
Authors:Xianbing Zhao,Lizhen Qu,Tao Feng,Jianfei Cai,Buzhou Tang
Abstract:
This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained either in a single source domain or in multiple source domains using our learning strategy. This strategy starts with learning domain-invariant features in text, followed by learning sparse domain-agnostic features in videos, assisted by the selected features learned in text. Our experimental results demonstrate that our model achieves significantly superior performance to the state-of-the-art approaches in both single-source and multi-source settings. Our feature selection procedure favors the features that are independent of each other and are strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be publicly available upon acceptance.



Paperid:675 Poster
Authors:YiTai Lin,Zhijie Wei,Wanfa Zhang,XiPing Lin,Yudi Dai,Chenglu Wen,Siqi Shen,Lan Xu,Cheng Wang
Abstract:
We introduce HmPEAR, a novel dataset crafted for advancing research in 3D Human Pose Estimation (3D HPE) and Human Action Recognition (HAR), with a primary focus on outdoor environments. This dataset offers a synchronized collection of imagery, LiDAR point clouds, 3D human poses, and action categories. In total, the dataset encompasses over 300,000 frames collected from 10 distinct scenes and 25 diverse subjects. Among these, 250,000 frames of data contain 3D human pose annotations captured using an advanced motion capture system and further optimized for accuracy. Furthermore, the dataset annotates 40 types of daily human actions, resulting in over 6,000 action clips. Through extensive experimentation, we have demonstrated the quality of HmPEAR and highlighted the challenges it presents to current methodologies. Additionally, we propose straightforward baselines leveraging sequential images and point clouds for 3D HPE and HAR, which underscore the mutual reinforcement between them, highlighting the potential for cross-task synergies.



Paperid:676 Poster
Authors:Naibo Wang,Yuchen Deng,Wenjie Feng,Shichen Fan,Jianwei Yin,See-Kiong Ng
Abstract:
Traditional federated learning mainly focuses on parallel settings (PFL), which can suffer significant communication and computation costs. In contrast, one-shot and sequential federated learning (SFL) have emerged as innovative paradigms to alleviate these costs. However, the issue of non-IID (independent and identically distributed) data persists as a significant challenge in one-shot and SFL settings, exacerbated by the restricted communication between clients. In this paper, we improve the one-shot sequential federated learning for non-IID data by proposing a local model diversity-enhancing strategy. Specifically, to leverage the potential of local model diversity for improving model performance, we introduce a local model pool for each client that comprises diverse models generated during local training, and propose two distance measurements to further enhance the model diversity and mitigate the effect of non-IID data. Consequently, our proposed framework can improve the global model performance while maintaining low communication costs. Extensive experiments demonstrate that our method exhibits superior performance to existing one-shot PFL methods and achieves better accuracy compared with state-of-the-art one-shot SFL methods on both label-skew and domain-shift tasks (e.g., 6%+ accuracy improvement on the CIFAR-10 dataset).



Paperid:677 Poster
Authors:Yiluo Wei,Yiming Zhu,Pan Hui,Gareth Tyson
Abstract:
The rise of generative AI is transforming the landscape of digital imagery, and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. Designed in the vein of social networks, they also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet, this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.



Paperid:678 Poster
Authors:Yi Ma,Peiqi Duan,Yuchen Hong,Chu Zhou,Yu Zhang,Jimmy Ren,Boxin Shi
Abstract:
Neuromorphic event sensors are novel visual cameras that feature high-speed illumination-variation sensing and have found widespread application in guiding frame-based imaging enhancement. This paper focuses on color restoration in the event-guided image deblurring task: we fuse blurry images with mosaic color events instead of mono events to avoid artifacts such as color bleeding. The challenges associated with this approach include demosaicing color events for reconstructing full-resolution sampled signals and fusing bimodal signals to achieve image deblurring. To meet these challenges, we propose a novel network called Color4E to enhance the color restoration quality for the image deblurring task. Color4E leverages an event demosaicing module to upsample the spatial resolution of mosaic color events and a cross-encoding image deblurring module for fusing bimodal signals; a refinement module is designed to fuse full-color events and refine the initial deblurred images. Furthermore, to avoid the gap between real and simulated events, we implement a display-filter-camera system that enables mosaic and full-color event data to be captured synchronously, and collect a real-captured dataset used for network training and validation. The results on the public dataset and our collected dataset show that Color4E enables high-quality event-based image deblurring compared to state-of-the-art methods.



Paperid:679 Poster
Authors:Shilong Tian,Hong Chen,Chengtao Lv,Yu Liu,Jinyang Guo,Xianglong Liu,Shengxi Li,Hao Yang,Tao Xie
Abstract:
Recently, video diffusion models (VDMs) have garnered significant attention due to their notable advancements in generating coherent and realistic video content. However, processing multiple frame features concurrently, coupled with the considerable model size, results in high latency and extensive memory consumption, hindering their broader application. Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency. Unlike image diffusion, we observe that the temporal features, which are integrated into all frame features, exhibit pronounced skewness. Furthermore, we investigate significant inter-channel disparities and asymmetries in the activation of video diffusion models, resulting in low coverage of quantization levels by individual channels and increasing the challenge of quantization. To address these issues, we introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. Specifically, we propose the High Temporal Discriminability Quantization (HTDQ) method, designed for temporal features, which retains the high discriminability of quantized features, providing precise temporal guidance for all video frames. In addition, we present the Scattered Channel Range Integration (SCRI) method which aims to improve the coverage of quantization levels across individual channels. Experimental validations across various models, datasets, and bit-width settings demonstrate the effectiveness of our QVD in terms of diverse metrics. In particular, we achieve near-lossless performance degradation on W8A8, outperforming the current methods by 205.12 in FVD.
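
As background, the sketch below shows a generic per-channel asymmetric fake-quantizer of the kind PTQ pipelines simulate; it is not the paper's HTDQ or SCRI procedure, only an illustration of why per-channel ranges matter when channels have very different distributions.

    import torch

    def quantize_per_channel(x, n_bits=8, channel_dim=1):
        """Fake-quantize x (e.g., an activation of shape (N, C, ...)) with a
        separate (scale, zero_point) per channel."""
        qmax = 2 ** n_bits - 1
        dims = tuple(d for d in range(x.dim()) if d != channel_dim)
        x_min = x.amin(dim=dims, keepdim=True)
        x_max = x.amax(dim=dims, keepdim=True)
        scale = (x_max - x_min).clamp(min=1e-8) / qmax
        zero_point = torch.round(-x_min / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
        return (q - zero_point) * scale  # de-quantized tensor for simulation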



Paperid:680 Poster
Authors:Tao Huang,Xinjia Ou,Yanghuali,Shengze Hu,Jing Geng,Junjie Hu,Zhuoran Xu
Abstract:
Knowledge Tracing (KT) is a critical service in distance education, predicting students' future performance based on their responses to learning resources. The reasonable assessment of the knowledge state, along with accurate response prediction, is crucial for KT. However, existing KT methods prioritize fitting results and overlook attention to the problem-solving process. They equate the knowledge students memorize before problem-solving with the knowledge that can be acquired or applied during problem-solving, leading to dramatic fluctuations in knowledge states between mastery and non-mastery, with low interpretability. This paper explores knowledge transformation in problem-solving and proposes an interpretable model, Problem-Solving Knowledge Tracing (PSKT). Specifically, we first present a knowledge-centered problem representation that enhances its expression by adjusting problem variability. Then, we meticulously designed a Sequential Neural Network (SNN) with three stages: (1) Before problem-solving, we model students' personalized problem space and simulate their acquisition of problem-related knowledge through a gating mechanism. (2) During problem-solving, we evaluate knowledge application and calculate response with a four-parameter IRT. (3) After problem-solving, we quantify student knowledge internalization and forgetting using an incremental indicator. The SNN, inspired by problem-solving and constructivist learning theories, is an interpretable model that attributes learner performance to subjective problems (difficulty, discrimination), objective knowledge (knowledge acquisition and application), and behavior (guessing and slipping). Finally, extensive experimental results demonstrate that PSKT has certain advantages in predicting accuracy, assessing knowledge states reasonably, and explaining the learning process.
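
The four-parameter IRT response model referred to above is standard; a minimal version (using common IRT notation for discrimination a, difficulty b, guessing c, and the slipping-related upper asymptote d, which may differ from the paper's exact parameterization) is:

    import math

    def four_param_irt(theta, a, b, c, d):
        """Probability of a correct response given ability theta."""
        return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

    # Example: an able student (theta=1.2) on a moderately hard item.
    p = four_param_irt(theta=1.2, a=1.5, b=0.4, c=0.2, d=0.95)
    print(f"P(correct) = {p:.3f}")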



Paperid:681 Poster
Authors:Xibiao Wang,Hang Gao,Xindian Wei,Liang Peng,Rui Li,Cheng Liu,Si Wu,Hau-San Wong
Abstract:
Partially View-aligned Clustering (PVC) presents a challenge as it requires a comprehensive exploration of complementary and consistent information in the presence of partial alignment of view data. Existing PVC methods typically learn view correspondence based on latent features that are expected to contain common semantic information. However, latent features obtained from heterogeneous spaces, along with the enforcement of alignment into the same feature dimension, can introduce cross-view discrepancies. In particular, partially view-aligned data lacks sufficient shared correspondences for the critical common semantic feature learning, resulting in inaccuracies in establishing meaningful correspondences between latent features across different views. While feature representations may differ across views, instance relationships within each view could potentially encode consistent common semantics across views. Motivated by this, our aim is to learn view correspondence based on graph distribution metrics that capture semantic view-invariant instance relationships. To achieve this, we utilize similarity graphs to depict instance relationships and learn view correspondence by aligning semantic similarity graphs through optimal transport with graph distribution. This facilitates the precise learning of view alignments, even in the presence of heterogeneous view-specific feature distortions. Furthermore, leveraging well-established cross-view correspondence, we introduce a cross-view contrastive learning to learn semantic features by exploiting consistency information. The resulting meaningful semantic features effectively isolate shared latent patterns, avoiding the inclusion of irrelevant private information. We conduct extensive experiments on several real datasets, demonstrating the effectiveness of our proposed method for the PVC task.
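
As a rough sketch of aligning views by optimal transport over instance-relation discrepancies, the code below runs log-domain Sinkhorn on a cost built from each instance's similarity profile to a few already-aligned anchor pairs; the anchor-based cost and all hyper-parameters are simplifying assumptions of ours, not the paper's formulation.

    import math
    import torch

    def sinkhorn(cost, eps=0.05, n_iter=100):
        """Entropic OT in the log domain; returns a soft correspondence plan."""
        n, m = cost.shape
        log_k = -cost / eps
        log_r = torch.full((n,), -math.log(n))   # uniform row marginal
        log_c = torch.full((m,), -math.log(m))   # uniform column marginal
        log_u = torch.zeros(n)
        log_v = torch.zeros(m)
        for _ in range(n_iter):
            log_u = log_r - torch.logsumexp(log_k + log_v[None, :], dim=1)
            log_v = log_c - torch.logsumexp(log_k + log_u[:, None], dim=0)
        return torch.exp(log_u[:, None] + log_k + log_v[None, :])

    def relation_cost(feat_a, feat_b, anchors_a, anchors_b):
        """Compare each instance's similarity profile to K aligned anchor pairs."""
        sim_a = torch.softmax(feat_a @ anchors_a.T, dim=1)   # (Na, K)
        sim_b = torch.softmax(feat_b @ anchors_b.T, dim=1)   # (Nb, K)
        return torch.cdist(sim_a, sim_b)                     # (Na, Nb) cost matrix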



Paperid:682 Poster
Authors:Changgu Chen,Libing Yang,Xiaoyan Yang,Lianggangxu Chen,Gaoqi He,Changbo Wang,Yang Li
Abstract:
In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific intervals of the initial noise distribution corresponding to the prompt. Moreover, it is challenging to directly optimize the initial distribution, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm to utilize historical data for network training and prevent the optimized distribution from deviating too far from the original policy to restrain excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method is 10 times faster than the SOTA approach.
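
The ratio clipping mentioned above is reminiscent of the clipped surrogate objective from proximal policy optimization; a minimal sketch (with placeholder log-probabilities and advantages, not the paper's exact reward formulation) is:

    import torch

    def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        """logp_*: (N,) log-probs of sampled initial noises; advantages: (N,)."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        # Maximize the smaller of the two terms -> minimize its negation.
        return -torch.min(unclipped, clipped).mean()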



Paperid:683 Poster
Authors:Zhichao Liao,Fengyuan Piao,Di Huang,Xinghui Li,Yue Ma,Pingfa Feng,Heming Fang,Long ZENG
Abstract:
Drawing freehand sketches of mechanical components on multimedia devices for AI-based engineering modeling has become a new trend. However, its development is being impeded because existing works cannot produce suitable sketches for data-driven research. These works either generate sketches lacking a freehand style or utilize generative models not originally designed for this task, resulting in poor effectiveness. To address this issue, we design a two-stage generative framework mimicking the human sketching behavior pattern, called MSFormer, which is the first to produce human-like freehand sketches tailored for mechanical components. The first stage employs Open CASCADE technology to obtain multi-view contour sketches from mechanical components, filtering perturbing signals for the ensuing generation process. Meanwhile, we design a view selector to simulate viewpoint selection tasks during human sketching for picking out information-rich sketches. The second stage translates contour sketches into freehand sketches by a transformer-based generator. To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. Furthermore, we utilize a CLIP vision encoder and a new loss function incorporating the Hausdorff distance to enhance the generalizability and robustness of the model. Extensive experiments demonstrate that our approach achieves state-of-the-art performance for generating freehand sketches in the mechanical domain.
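
For reference, the Hausdorff distance such a loss can incorporate is computed between two point sets roughly as follows (treating sketches as sets of 2D stroke sample points is our illustrative assumption):

    import torch

    def hausdorff_distance(p, q):
        """p: (N, 2), q: (M, 2) stroke sample points."""
        d = torch.cdist(p, q)                         # (N, M) pairwise distances
        return torch.max(d.min(dim=1).values.max(),   # sup over p of inf over q
                         d.min(dim=0).values.max())   # sup over q of inf over p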



Paperid:684 Poster
Authors:Hongjian Zhan,yangfu Li,Xiong Yu-Jie,Umapada Pal,Yue Lu
Abstract:
Lightweight models play an important role in real-life applications, especially in the recent mobile device era. However, due to limited network scale and low-quality images, the performance of lightweight models on Scene Text Recognition (STR) tasks is still much to be improved. Recently, contrastive learning has shown its power in many areas, with promising performances without additional computational cost. Based on these observations, we propose a new efficient and effective frame-level contrastive learning (FLCL) framework for lightweight STR models. The FLCL framework consists of a backbone to extract basic features, a Text Perceiver Module (TPM) to focus on text-relevant representations, and a FLCL loss to update the network. The backbone can be any feature extraction architecture. The TPM is an innovative Mamba-based structure that is designed to suppress features irrelevant to the text content from the backbone. Unlike existing word-level contrastive learning, we look into the nature of the STR task and propose the frame-level contrastive learning loss, which can work well with the famous Connectionist Temporal Classification loss. We conduct experiments on six well-known STR benchmarks as well as a new low-quality dataset. Compared to vanilla contrastive learning and other non-parameter methods, the FLCL framework significantly outperforms others on all datasets, especially the low-quality dataset. In addition, character feature visualization demonstrates that the proposed method can yield more discriminative character features for visually similar characters, which also substantiates the efficacy of the proposed methods. Codes and the low-quality dataset will be available soon.
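
To illustrate what a frame-level (rather than word-level) contrastive objective can look like, here is a minimal sketch pairing frame features from two augmented views of the same text image; the shapes, temperature, and symmetric InfoNCE form are illustrative assumptions, not the paper's exact FLCL loss.

    import torch
    import torch.nn.functional as F

    def frame_level_contrastive_loss(f1, f2, temperature=0.1):
        """f1, f2: (B, T, D) frame-wise features from two augmented views."""
        b, t, d = f1.shape
        z1 = F.normalize(f1.reshape(b * t, d), dim=-1)
        z2 = F.normalize(f2.reshape(b * t, d), dim=-1)
        logits = z1 @ z2.T / temperature          # (B*T, B*T) frame similarities
        targets = torch.arange(b * t, device=f1.device)  # matching frames are positives
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))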



Paperid:685 Poster
Authors:Runkai Zhao,Heng Wang,Weidong Cai
Abstract:
Detecting 3D lane lines from monocular images is garnering increasing attention in the Autonomous Driving (AD) area due to its cost-effective deployment solution. However, current monocular image models capture road scenes in a single-view perspective lacking 3D spatial awareness, which is error-prone to adverse circumstance changes such as curved roads, occlusion, and low illumination. In this work, we design a novel cross-modal knowledge transfer scheme, namely LaneCMKT, to address this issue by transferring 3D geometric cues learned from a pre-trained LiDAR model to the image model. Performing on the unified Bird's-Eye-View (BEV) grid, our monocular image model acts as a student network and benefits from the spatial guidance of the 3D LiDAR teacher model over the intermediate feature space. Since LiDAR points and image pixels are intrinsically two different modalities, to facilitate such heterogeneous feature transfer learning at matching levels, we propose a dual-path knowledge transfer mechanism. We divide the feature space into shallow and deep paths where the image student model is prompted to focus on lane-favored geometric cues from the LiDAR teacher model. We conduct extensive experiments and thorough analysis on the large-scale public benchmark OpenLane. Our model achieves notable improvements over the image baseline by 5.3% and the current BEV-driven SoTA method by 2.7% in the F1 score, without introducing any extra computational overhead. We also observe that the 3D abilities grabbed from the teacher model are critical for dealing with complex spatial lane properties from a 2D perspective.



Paperid:686 Poster
Authors:Jiaming Lei,Lin Li,Chunping Wang,Jun Xiao,Long Chen
Abstract:
Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model’s poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model’s comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX’s effectiveness and interpretability in zero-shot GSR.



Paperid:687 Poster
Authors:Zhijun Zhai,Zengmao Wang,Xiaoxiao Long,Kaixuan Zhou,Bo Du
Abstract:
GAN-based image editing task aims at manipulating image attributes in the latent space of generative models. Most of the previous 2D and 3D-aware approaches mainly focus on editing attributes in images with ambiguous semantics or regions from a reference image, which fail to achieve photographic semantic attribute transfer, such as the beard from a photo of a man. In this paper, we propose an image-driven Semantic Attribute Transfer method in 3D (SAT3D) by editing semantic attributes from a reference image. For the proposed method, the exploration is conducted in the style space of a pre-trained 3D-aware StyleGAN-based generator by learning the correlations between semantic attributes and style code channels. For guidance, we associate each attribute with a set of phrase-based descriptor groups, and develop a Quantitative Measurement Module (QMM) to quantitatively describe the attribute characteristics in images based on descriptor groups, which leverages the image-text comprehension capability of CLIP. During the training process, the QMM is incorporated into attribute losses to calculate attribute similarity between images, guiding target semantic transferring and irrelevant semantics preserving. We present our 3D-aware attribute transfer results across multiple domains and also conduct comparisons with classical 2D image editing methods, demonstrating the effectiveness and customizability of our SAT3D.



Paperid:688 Poster
Authors:Wenshuo Chen,Hongru Xiao,Erhang Zhang,Lijie Hu,Lei Wang,Mengyuan Liu,Chen Chen
Abstract:
Is the Text-to-Motion model robust? Recent advancements in Text-to-Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonym and other slight perturbations while keeping its high accuracy.



Paperid:689 Poster
Authors:Xudong Zhou,Tianxiang Chen
Abstract:
Medical image segmentation is of great significance to disease diagnosis and treatment planning. Despite much progress, most existing methods (1) pay insufficient attention to suppressing the background noise disturbance that impacts segmentation accuracy and (2) are not efficient enough, especially when the images are of large resolutions. To address the two challenges, we turn to a traditional de-noising method and a new efficient network structure and propose BSBP-RWKV for accurate and efficient medical image segmentation. Specifically, we combine the advantages of Perona-Malik Diffusion (PMD) in noise suppression without losing boundary details and RWKV in its efficient structure, and devise the DWT-PMD RWKV Block across one of our encoder branches to preserve boundary details of lesion areas while suppressing background noise disturbance in an efficient structure. Then we feed the de-noised lesion boundary cues to our proposed Multi-Step Runge-Kutta convolutional Block to supplement the cues with more local details. We also propose a novel loss function for shape refinement that can align the shape of predicted lesion areas with GT masks in both spatial and frequency domains. Experiments on ISIC 2016 and Kvasir-SEG show the superior accuracy and efficiency of our BSBP-RWKV. Specifically, BSBP-RWKV reduces complexity by 5.8 times compared with the SOTA while also cutting down GPU memory usage by over 62.7% for each 1024×1024 image during inference.
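
Perona-Malik Diffusion, which the DWT-PMD RWKV Block builds on, is a classical scheme; a small textbook-style sketch (the conductance function and step size are the usual choices, not the paper's exact DWT-PMD formulation) is:

    import numpy as np

    def perona_malik(img, n_iter=20, kappa=30.0, step=0.2):
        """img: 2D float array; returns an edge-preserving smoothed copy."""
        u = img.astype(np.float64).copy()

        def conductance(g):
            # Decays with gradient magnitude, so strong edges diffuse little.
            return np.exp(-(g / kappa) ** 2)

        for _ in range(n_iter):
            # Finite differences toward the four neighbours.
            dn = np.roll(u, -1, axis=0) - u
            ds = np.roll(u, 1, axis=0) - u
            de = np.roll(u, -1, axis=1) - u
            dw = np.roll(u, 1, axis=1) - u
            u += step * (conductance(dn) * dn + conductance(ds) * ds +
                         conductance(de) * de + conductance(dw) * dw)
        return u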



Paperid:690 Poster
Authors:Xin Wang,Kai Chen,Xingjun Ma,Zhineng Chen,Jingjing Chen,Yu-Gang Jiang
Abstract:
Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvDet that can detect 7 state-of-the-art query-based attacks with >99% detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks.



Paperid:691 Poster
Authors:Shilong Jia,Tingting WU,Yingying Fang,Tieyong Zeng,Guixu Zhang,Zhi Li
Abstract:
Incremental Object Detection (IOD) simulates the dynamic data flow in real-world applications, which require detectors to learn new classes or adapt to domain shifts while retaining knowledge from previous tasks. Most existing IOD methods focus only on class incremental learning, assuming all data comes from the same domain. However, this is hardly achievable in practical applications, as images collected under different conditions often exhibit completely different characteristics, such as lighting, weather, style, etc. Class IOD methods suffer from severe performance degradation in these scenarios with domain shifts. To bridge domain shifts and category gaps in IOD, we propose Purified Distillation (PD), where we use a set of trainable queries to transfer the teacher's attention on old tasks to the student and adopt the gradient reversal layer to guide the student to learn the teacher's feature space structure from a micro perspective. This strategy further explores the features extracted by the teacher during incremental learning, which has not been extensively studied in previous works. Meanwhile, PD combines classification confidence with localization confidence to purify the most meaningful output nodes, so that the student model inherits a more comprehensive teacher knowledge. Extensive experiments across various IOD settings on six widely used datasets show that PD significantly outperforms state-of-the-art methods. Even after five steps of incremental learning, our method can preserve 60.6% mAP on the first task, while compared methods can only maintain up to 55.9%.



Paperid:692 Poster
Authors:Guoliang Zou,Yangdong Ye,Tongji Chen,Shizhe Hu
Abstract:
Contrastive multi-view clustering is widely recognized for its effectiveness in mining feature representation across views via contrastive learning (CL), gaining significant attention in recent years. Most existing methods mainly focus on feature-level and/or cluster-level CL, but there are still two shortcomings. Firstly, feature-level CL is limited by the influence of anomalies and large noise data, resulting in insufficient mining of discriminative feature representation. Secondly, cluster-level CL lacks the guidance of global information and is always restricted by local diversity information. We in this paper Learn dUal enhanCed rEpresentation for Contrastive Multi-view Clustering (LUCE-CMC) to effectively address the above challenges; it mainly contains two parts, i.e., enhanced feature-level CL (En-FeaCL) and enhanced cluster-level CL (En-CluCL). Specifically, we first adopt a shared encoder to learn shared feature representations between multiple views and then obtain cluster-relevant information that is beneficial to the clustering results. Moreover, we design a reconstitution approach to force the model to concentrate on learning features that are critical to reconstructing the input data, reducing the impact of noisy data and maximizing the discriminative information of different views to help the En-FeaCL part. Finally, instead of contrasting the view-specific clustering results as most existing methods do, we in the En-CluCL part make the information at the cluster level richer by contrasting the cluster assignment from each view and the cluster assignment obtained from the shared fused features. The end-to-end training of the proposed model is mutually reinforcing and beneficial. Extensive experiments conducted on multi-view datasets show that the proposed LUCE-CMC outperforms established baselines to a considerable extent.



Paperid:693 Poster
Authors:Jingchao Wang,Zhengnan Deng,Tongxu Lin,Wenyuan Li,Shaobin Ling,Junyu Lin
Abstract:
Multi-label image classification is crucial for a wide range of multimedia applications. To address the resource limitation issue, various knowledge distillation (KD) methods have been developed to transfer knowledge from a large network (referred to as the "teacher") to a small network (referred to as the "student"). However, existing KD methods do not explicitly distill the dependencies between labels, which limits the model ability to capture multi-label correlation. Furthermore, although existing methods for multi-label image classification have utilized the second-order label pair dependency (direct dependency between two labels), the high-order label pair dependency, which captures the indirect dependency between two labels, remains unexplored. In this paper, we propose a \textbf{\underline{M}}ulti-Order Label Pair \textbf{\underline{D}}ependencies \textbf{\underline{K}}nowledge \textbf{\underline{D}}istillation (MDKD) framework. MDKD explicitly distills the knowledge to capture multi-order dependencies between labels, including the label pair dependencies from second-order and high-order, thus transferring the insight of label correlations from different perspectives. Extensive experiments on Pascal VOC2007, MSCOCO2014, and NUS-WIDE demonstrate the superior performances of MDKD.
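
A minimal sketch of distilling second-order label-pair dependencies could look like the following, where the student matches the teacher's label co-activation matrix; the outer-product statistic and MSE are our illustrative choices, not MDKD's actual formulation (which also covers high-order dependencies).

    import torch
    import torch.nn.functional as F

    def label_pair_dependency_kd(student_logits, teacher_logits):
        """logits: (B, C) multi-label outputs from student and teacher."""
        ps = torch.sigmoid(student_logits)
        pt = torch.sigmoid(teacher_logits)
        # (C, C) co-activation matrices averaged over the batch.
        dep_s = torch.einsum('bi,bj->ij', ps, ps) / ps.size(0)
        dep_t = torch.einsum('bi,bj->ij', pt, pt) / pt.size(0)
        return F.mse_loss(dep_s, dep_t.detach())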



Paperid:694 Poster
Authors:Cong Wang,Liyan Wang,Jie Mu,Chengjin Yu,Wei Wang
Abstract:
In this paper, we develop a progressive local and non-local interactive network with multi-scale cross-content deeply discriminative learning to solve image deraining. The proposed model contains two key techniques: 1) Progressive Local and Non-Local Interactive Network (PLNLIN) and 2) Multi-Scale Cross-Content Deeply Discriminative Learning (MCDDL). The PLNLIN is a U-shaped encoder-decoder network, where the proposed new Progressive Local and Non-Local Interactive Module (PLNLIM) is the basic unit in the encoder-decoder framework. The PLNLIM fully explores local and non-local learning in convolution and Transformer operations, respectively, and the local and non-local content is further interactively learned in a progressive manner. The proposed MCDDL not only discriminates the output of the generator but also receives the deep content from the generator to distinguish real and fake features at each side layer of the discriminator in a multi-scale manner. We show that the proposed MCDDL has fast and stable convergence properties that are lacking in existing discriminative learning schemes. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods on five public synthetic datasets and one real-world dataset.



Paperid:695 Poster
Authors:Shichen Lu,Longteng Guo,Wenxuan Wang,Zijia Zhao,Tongtian Yue,Jing Liu,Si Liu
Abstract:
In recent years, large vision language models (LVLMs) have significantly advanced artificial intelligence, especially in integrating visual and linguistic data for complex tasks like visual conversation, image captioning and visual question answering. These advancements have pushed the boundaries of multimodal comprehension and generative capabilities, setting the stage for advanced human-computer interaction. Existing LVLMs are primarily divided into two independent research paths: one involves scaling up the model size to enhance performance, while the other focuses on reducing parameters through pruning, etc., to accommodate environments with limited computational resources. However, we believe that both large and tiny models have their respective advantages and that collaborative training could yield better results compared to independent training strategies. Therefore, we propose a novel collaborative framework named Collaborative Training of Tiny-Large Vision Language Models (CTVLMs) that connects large and tiny models via a projection layer, utilizing a synergistic training strategy to leverage the respective strengths of each model and enhance their performance. Our framework offers several advantages over previous methods: it strengthens the interconnection between large and tiny models, improving training efficiency. The mutual assistance between the models enhances their performance. In our collaborative training strategy, by leveraging the parameter efficiency of tiny models, we effectively align image-text features. Subsequently, using knowledge distillation, we assist large models in better aligning cross-modal information. During the fine-tuning phase, we utilize the extensive knowledge base of the large models to enhance the performance of tiny models. Through our collaborative training approach, we achieve closely integrated large and tiny models, whose capabilities mutually enhance each other, allowing for direct adaptation to various computational resource scenarios. Extensive experiments on vision-language benchmarks demonstrate that our approach significantly outperforms existing methods.
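A minimal sketch of the two generic ingredients the framework relies on, a projection layer bridging the two feature spaces and temperature-scaled logit distillation; the module names (TinyLargeBridge, distill_step) and loss weighting are illustrative assumptions, not the authors' design.

```python
# Hedged sketch: a generic large-to-tiny distillation step with projection-based
# feature alignment plus temperature-scaled logit distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLargeBridge(nn.Module):
    """Projects tiny-model features into the large model's feature space."""
    def __init__(self, tiny_dim: int, large_dim: int):
        super().__init__()
        self.proj = nn.Linear(tiny_dim, large_dim)

    def forward(self, tiny_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(tiny_feat)

def distill_step(tiny_feat, tiny_logits, large_feat, large_logits,
                 bridge: TinyLargeBridge, tau: float = 2.0, alpha: float = 0.5):
    # feature-space alignment through the shared projection layer
    feat_loss = F.mse_loss(bridge(tiny_feat), large_feat.detach())
    # temperature-scaled KL divergence on the output distributions
    kd_loss = F.kl_div(F.log_softmax(tiny_logits / tau, dim=-1),
                       F.softmax(large_logits.detach() / tau, dim=-1),
                       reduction="batchmean") * tau * tau
    return alpha * feat_loss + (1 - alpha) * kd_loss

# toy usage with made-up dimensions
bridge = TinyLargeBridge(tiny_dim=512, large_dim=4096)
loss = distill_step(torch.randn(8, 512), torch.randn(8, 100),
                    torch.randn(8, 4096), torch.randn(8, 100), bridge)
print(loss)
```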



Paperid:696 Poster
Authors:Qian Li,Yucheng Zhou,Cheng Ji,Feihong Lu,Jianian Gong,Shangguang Wang,Jianxin Li
Abstract:
Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing methods are limited by their ability to understand and connect different modalities, resulting in increased retrieval difficulty. In this paper, we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval, amidst the rapid evolution of Large Language Models. Specifically, we first design a fine-tuned large vision-language model that leverages the knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. It also incorporates an inductive reasoning mechanism, which focuses on integrating important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts, considering their contribution to improving retrieval performance. Experimental results show that our approach achieves excellent performance on two benchmark datasets compared to its competitors.



Paperid:697 Poster
Authors:Yiran Cheng,Bintao He,Renmin Han,Fa Zhang
Abstract:
Volume electron microscopy (vEM) is becoming a prominent technique in three-dimensional (3D) cellular visualization. vEM collects a series of two-dimensional (2D) images and reconstructs ultra-structures at the nanometer scale by rational axial interpolation between neighboring sections. However, section damage inevitably occurs in the sample preparation and imaging process, resulting from manual operational errors or occasional mechanical failures. The damaged regions present blurry and contaminated structural information, or even local blank holes. Despite significant progress in single-image inpainting, it is still a great challenge to recover missing biological structures that satisfy 3D structural continuity among sections. In this paper, we propose an optical flow-based serial section inpainting architecture to effectively combine the 3D structure information from neighboring sections and 2D image features from surrounding regions. We design a two-stage reference generation strategy to predict a rational and detailed intermediate state image from coarse to fine. Then, a GAN-based inpainting network is adopted to integrate all reference information and guide the restoration of missing structures, while ensuring a consistent distribution of pixel values across the 2D image. Extensive experimental results demonstrate the superiority of our method over existing inpainting tools.



Paperid:698 Poster
Authors:Penglei Sun,Yaoxian Song,Xiang Liu,Xiaofei Yang,Qiang Wang,tiefeng li,Yang YANG,Xiaowen Chu
Abstract:
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes, due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce the spatial semantic. A new benchmark is reported and our proposed Sg-CityU achieves accuracy of 63.94 % and 63.76 % in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization. Our dataset and code are available on our project website\footnote{\url{https://sites.google.com/view/city3dqa/}}.



Paperid:699 Poster
Authors:Yuhan Liu,Qianxin Huang,Siqi Hui,Jingwen Fu,Sanping Zhou,Kangyi Wu,Pengna Li,Jinjun Wang
Abstract:
Homography estimation is the task of determining the transformation from an image pair. Our approach focuses on employing detector-free feature matching methods to address this issue. Previous work has underscored the importance of incorporating semantic information; however, an efficient way to utilize semantic information is still lacking. Previous methods treat semantics as a pre-processing step, which makes the utilization of semantics overly coarse-grained and lacking in adaptability when dealing with different tasks. In our work, we seek another way to use semantic information, namely a semantic-aware feature representation learning framework. Based on this, we propose SRMatcher, a new detector-free feature matching method, which encourages the network to learn integrated semantic feature representations. Specifically, to capture precise and rich semantics, we leverage the capabilities of recently popularized vision foundation models (VFMs) trained on extensive datasets. Then, a cross-image Semantic-aware Fusion Block (SFB) is proposed to integrate their fine-grained semantic features into the feature representation space. In this way, by reducing errors stemming from semantic inconsistencies in matching pairs, our proposed SRMatcher is able to deliver more accurate and realistic outcomes. Extensive experiments show that SRMatcher surpasses solid baselines and attains SOTA results on multiple real-world datasets. Compared to the previous SOTA approach GeoFormer, SRMatcher increases the area under the cumulative curve (AUC) by about 11% on HPatches. Additionally, SRMatcher can serve as a plug-and-play framework for other matching methods like LoFTR, yielding substantial precision improvement.



Paperid:700 Poster
Authors:Kangzheng Liu,Feng Zhao,Yu Yang,Guandong Xu
Abstract:
Multimodal knowledge graph (MKG) reasoning has attracted significant attention since impressive performance has been achieved by adding multimodal auxiliary information (i.e., texts and images) to the entities of traditional KGs. However, existing studies heavily rely on path-based methods for learning structural modality, failing to capture the complex structural interactions among multimodal entities beyond the reasoning path. In addition, existing studies have largely ignored the dynamic impact of different multimodal features on different decision facts for reasoning, which utilize asymmetric coattention to independently learn the static interplay between different modalities without dynamically joining the reasoning process. We propose a novel Dynamic Structure-aware representation learning method, namely DySarl, to overcome this problem and significantly improve the MKG reasoning performance. Specifically, we devise a dual-space multihop structural learning module in DySarl, aggregating the multihop structural features of multimodal entities via a novel message-passing mechanism. It integrates the message paradigms in Euclidean and hyperbolic spaces, effectively preserving the neighborhood information beyond the limited multimodal query paths. Furthermore, DySarl has an interactive symmetric attention module to explicitly learn the dynamic impacts of unimodal attention senders and multimodal attention targets on decision facts through a newly designed symmetric attention component and fact-specific gated attention unit, equipping DySarl with the dynamic associations between the multimodal feature learning and later reasoning. Extensive experiments show that DySarl achieves significantly improved reasoning performance on two public MKG datasets compared with that of the state-of-the-art baselines. Source codes are available at https://anonymous.4open.science/r/DySarl.
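To make the dual-space idea concrete, the sketch below computes distances in both a Euclidean space and the Poincaré ball model of hyperbolic space and blends them into a single plausibility score. This is a generic illustration of dual-space geometry under assumed names (to_ball, dual_space_score), not the paper's message-passing or attention design.

```python
# Hedged sketch: combining Euclidean and Poincare-ball (hyperbolic) distances.
import torch

def to_ball(x: torch.Tensor) -> torch.Tensor:
    """Map arbitrary vectors into the open unit ball (norm < 1)."""
    return x / (1.0 + x.norm(dim=-1, keepdim=True))

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance in the Poincare ball model of hyperbolic space."""
    sq = ((u - v) ** 2).sum(dim=-1)
    du = 1.0 - (u ** 2).sum(dim=-1)
    dv = 1.0 - (v ** 2).sum(dim=-1)
    x = 1.0 + 2.0 * sq / (du * dv).clamp_min(eps)
    return torch.acosh(x.clamp_min(1.0 + eps))

def dual_space_score(e_u, e_v, h_u, h_v, weight: float = 0.5) -> torch.Tensor:
    """Blend closeness in both spaces into one plausibility score (higher = closer)."""
    d_euc = (e_u - e_v).norm(dim=-1)
    d_hyp = poincare_distance(h_u, h_v)
    return -(weight * d_euc + (1.0 - weight) * d_hyp)

# toy usage: 16 candidate facts, 32-dim embeddings in each space
e_u, e_v = torch.randn(16, 32), torch.randn(16, 32)
h_u, h_v = to_ball(torch.randn(16, 32)), to_ball(torch.randn(16, 32))
print(dual_space_score(e_u, e_v, h_u, h_v).shape)  # torch.Size([16])
```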



Paperid:701 Poster
Authors:Zherui Qiu,Chenqu Ren,Kaiwen Song,Xiaoyi Zeng,Leyuan Yang,Juyong Zhang
Abstract:
While neural radiance fields (NeRF) have shown promise in novel view synthesis, their implicit representation limits explicit control over object manipulation. Existing research has proposed the integration of explicit geometric proxies to enable deformation. However, these methods face two primary challenges: first, the tetrahedralization process is time-consuming and computationally demanding; second, handling complex or thin structures often leads to either an excessive, storage-intensive tetrahedral mesh or a poor-quality one that impairs deformation capabilities. To address these challenges, we propose Manipulable NeRF, a method that seamlessly integrates the manipulability of tetrahedral meshes with the high-quality rendering capabilities of feature grid representations. To avoid ill-shaped tetrahedra and tetrahedralization for each object, we propose a two-stage training strategy. Starting with an almost-regular tetrahedral grid, our model initially retains key tetrahedra surrounding the object and subsequently refines object details using a finer-granularity mesh in the second stage. We also present the concept of recursively subdivided tetrahedra to create higher-resolution meshes implicitly. This enables multi-resolution encoding while only necessitating the storage of the coarse tetrahedral mesh generated in the first training stage. We conducted a comprehensive evaluation of our Manipulable NeRF on both synthetic and real-captured datasets. Both quantitative and qualitative results demonstrate the effectiveness of our method for novel view synthesis and deformation tasks.



Paperid:702 Poster
Authors:Jintao Chen,Fan Wang,Shengye Pang,Siwei Tan,Mingshuai Chen,Tiancheng Zhao,Meng Xi,Jianwei Yin
Abstract:
Recent years have witnessed remarkable advances in graph representation learning using Graph Neural Networks (GNNs). To fully exploit unlabeled graphs, researchers pre-train GNNs on large-scale graph databases and then fine-tune these pre-trained Graph Models (GMs) for better performance in downstream tasks. Because different GMs are developed with diverse pre-training tasks or datasets, they can be complementary to each other for a more complete knowledge base. Naturally, a compelling question emerges: How can we exploit the diverse knowledge captured by different GMs simultaneously in downstream tasks? In this paper, we make one of the first attempts to exploit multiple GMs to advance performance in downstream tasks. More specifically, for homogeneous GMs that share the same model architecture but are obtained with different pre-training tasks or datasets, we align each layer of these GMs and then aggregate them adaptively on a per-sample basis with a tailored Recurrent Aggregation Policy Network (RAPNet). For heterogeneous GMs with different model architectures, we design an alignment module to align the output of diverse GMs and a meta-learner to decide the importance of each GM conditioned on each sample automatically before aggregating the GMs. Extensive experiments on various downstream tasks from 3 domains reveal the superiority of our approach over each single GM. Additionally, our methods (UniGM) can achieve better performance with moderate computational overhead compared to alternative approaches including ensemble and model fusion. Also, we verify that our methods are not limited to graph data but could be flexibly applied to image and text data. The code is available at the anonymous link: https://anonymous.4open.science/r/UniGM-DA65.



Paperid:703 Poster
Authors:Xiaodi Li
Abstract:
Portrait video editing has attracted wide attention thanks to its practical applications. Existing methods either target fixed-length clips or perform temporally inconsistent per-frame editing. In this work, we present a brand new system, StreamEdit, which is primarily designed to edit streaming videos. Our system follows the ideology of editing propagation to ensure temporal consistency. Concretely, we choose to edit only one reference frame and warp the outcome to obtain the editing results of other frames. For this purpose, we employ a warping module, aided by a probabilistic pixel correspondence estimation network, to help establish the pixel-wise mapping between two frames. However, such a pipeline requires the reference frame to contain all contents appearing in the video, which is scarcely possible especially when there exist large motions and occlusions. To address this challenge, we propose to adaptively replace the reference frame, benefiting from a heuristic strategy referring to the overall pixel mapping uncertainty. That way, we can easily align the editing of the before- and after-replacement reference frames via image inpainting. Extensive experimental results demonstrate the effectiveness and generalizability of our approach in editing streaming portrait videos. Code will be made public.



Paperid:704 Poster
Authors:Jinfu Liu,Chen Chen,Mengyuan Liu
Abstract:
Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model’s robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code will be publicly available and can be found in the supplementary files.



Paperid:705 Poster
Authors:Tao Wu,Mengze Li,Jingyuan Chen,Wei Ji,Wang Lin,Jinyang Gao,Kun Kuang,Zhou Zhao,Fei Wu
Abstract:
Research on Multi-modal Large Language Models (MLLMs) towards the multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step process in their pipelines: first, extracting visual tokens independently for each input image, and then aligning these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, the independent extraction of visual tokens for each image may result in different semantics being prioritized for different images in the first step, leading to a lack of preservation of linking information among images for subsequent LLM analysis. This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis and align the semantics of different images before feeding them into LLM. As the test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Different from most existing datasets for MLLMs fine-tuning, our MmLINK dataset comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task prove the effectiveness of our SAM model, surpassing the state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://anonymous.4open.science/r/SAM-F596.



Paperid:706 Poster
Authors:Deji Zhao,Donghong Han,Ye Yuan,Bo Ning,Li Mengxiang,Zhongjiang He,Shuangyong Song
Abstract:
Open-domain multi-modal dialogue systems heavily rely on visual information to generate contextually relevant responses. Existing open-domain multi-modal dialogue generation methods ignore the complementary relationship between multiple modalities and are difficult to integrate with LLMs. To address these issues, we propose an automatically constructed visual context graph method, called AutoGraph. We aim to structure complex information and seamlessly integrate it with large language models (LLMs), aligning information from multiple modalities at both semantic and structural levels. Specifically, we fully connect the text graphs and scene graphs, and then trim unnecessary edges via LLMs to automatically construct a visual context graph. Next, we design, for the first time, several graph sampling grammars to convert graph structures into sequences suitable for LLMs. Finally, we propose a two-stage fine-tuning method to allow LLMs to understand the graph sampling grammars and generate responses. The AutoGraph method is a general approach that can enhance the visual capabilities of LLMs. We validate our proposed method on text-based LLMs and visual-based LLMs, respectively. Experimental results show that our proposed method achieves state-of-the-art performance on multiple public datasets. Relevant code can be found at https://anonymous.4open.science/r/AutoGraph-26BE.



Paperid:707 Poster
Authors:Mahiro Ukai,Shuhei Kurita,Atsushi Hashimoto,Yoshitaka Ushiku,Nakamasa Inoue
Abstract:
Visual question answering aims to provide responses to natural language questions given visual input. Recently, visual programmatic models (VPMs), which generate executable programs to answer questions through large language models (LLMs), have attracted research interest. However, they often require long input prompts to provide the LLM with sufficient API usage details to generate relevant code. To address this limitation, we propose AdaCoder, an adaptive prompt compression framework for VPMs. AdaCoder operates in two phases: a compression phase and an inference phase. In the compression phase, given a preprompt that describes all API definitions in the Python language with example snippets of code, a set of compressed preprompts is generated, each depending on a specific question type. In the inference phase, given an input question, AdaCoder predicts the question type and chooses the appropriate corresponding compressed preprompt to generate code to answer the question. Notably, AdaCoder employs a single frozen LLM and pre-defined prompts, negating the necessity of additional training and maintaining adaptability across different powerful black-box LLMs such as GPT and Claude. In experiments, we apply AdaCoder to ViperGPT and demonstrate that it reduces token length by 71.1%, while maintaining or even improving the performance of visual question answering.
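The routing idea can be shown in miniature: predict a question type, then prepend only the corresponding compressed preprompt instead of the full API specification. The preprompt contents, type names, and the generate_code() stub below are hypothetical placeholders, not AdaCoder's actual prompts or classifier.

```python
# Hedged sketch: question-type-conditioned preprompt selection for a visual
# programmatic model; all strings and the LLM call are illustrative stand-ins.
COMPRESSED_PREPROMPTS = {
    "counting":  "# API subset: find(name) -> list of boxes; use len() for counts ...",
    "spatial":   "# API subset: find(name); box.left_of(other), box.above(other) ...",
    "attribute": "# API subset: find(name); query_attribute(box, question) ...",
}

def predict_question_type(question: str) -> str:
    """A deliberately simple stand-in for the question-type classifier."""
    q = question.lower()
    if q.startswith("how many") or "count" in q:
        return "counting"
    if any(w in q for w in ("left", "right", "above", "below", "next to")):
        return "spatial"
    return "attribute"

def build_prompt(question: str) -> str:
    """Compose the compressed preprompt with the question for the frozen LLM."""
    preprompt = COMPRESSED_PREPROMPTS[predict_question_type(question)]
    return f"{preprompt}\n# Question: {question}\n# Write Python code that answers it.\n"

def generate_code(prompt: str) -> str:
    """Placeholder for a call to a black-box LLM (e.g., via an API client)."""
    raise NotImplementedError

print(build_prompt("How many red cups are on the table?"))
```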



Paperid:708 Poster
Authors:Wenhan Wu,Ce Zheng,Zihao Yang,Chen Chen,Srijan Das,Aidong Lu
Abstract:
Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and thereby gained ever-increasing attention in skeleton action recognition. However, the existing transformer-based approaches heavily rely on the naive attention mechanism for capturing the spatiotemporal features, which falls short in learning discriminative representations for actions that exhibit similar motion patterns. To address this challenge, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer), specifically designed for recognizing similar skeletal actions with subtle discriminative motions. First, we introduce a frequency-aware attention module to unweave skeleton frequency representations by embedding joint features into frequency attention maps, aiming to distinguish the discriminative movements based on their frequency coefficients. Subsequently, we develop a mixed transformer architecture to incorporate spatial features with frequency features to model the comprehensive frequency-spatial patterns. Additionally, a temporal transformer is proposed to extract the global correlations across frames. Extensive experiments show that FreqMixFormer outperforms SOTA methods on three popular skeleton action recognition datasets, including NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Code will be publicly available.
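A minimal sketch of the general frequency-aware idea: transform joint features along time with an orthonormal DCT, attend over the resulting frequency coefficients, then transform back. The attention layout and shapes are assumptions for illustration, not the paper's module.

```python
# Hedged sketch: DCT-based frequency attention over skeleton joint features.
import math
import torch
import torch.nn.functional as F

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = torch.arange(n).unsqueeze(1).float()      # frequency index
    i = torch.arange(n).unsqueeze(0).float()      # time index
    m = torch.cos(math.pi / n * (i + 0.5) * k)
    m[0] *= 1.0 / math.sqrt(2.0)
    return m * math.sqrt(2.0 / n)

def frequency_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, time, joints, channels) -> frequency-attended features, same shape."""
    b, t, j, c = x.shape
    D = dct_matrix(t).to(x.device)                           # (T, T)
    freq = torch.einsum("ft,btjc->bfjc", D, x)               # temporal DCT coefficients
    q = freq.reshape(b, t, j * c)
    attn = F.softmax(q @ q.transpose(1, 2) / math.sqrt(j * c), dim=-1)  # (B, F, F)
    mixed = (attn @ q).reshape(b, t, j, c)
    return torch.einsum("ft,bfjc->btjc", D, mixed)           # inverse DCT back to time

x = torch.randn(2, 64, 25, 16)                               # 64 frames, 25 joints
print(frequency_attention(x).shape)                          # torch.Size([2, 64, 25, 16])
```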



Paperid:709 Poster
Authors:Lei Lu,Yanyue Xie,Wei Jiang,Wei Wang,Xue Lin,Yanzhi Wang
Abstract:
This paper investigates the challenging problem of learned image compression (LIC) at extremely low bitrates. Previous LIC methods based on transmitting quantized continuous features often yield blurry and noisy reconstructions due to the severe quantization loss, while previous LIC methods based on learned codebooks that discretize the visual space usually give poor-fidelity reconstructions due to the insufficient representation power of limited codewords in capturing faithful details. We propose a novel dual-stream framework, HybridFlow, which combines the continuous-feature-based and codebook-based streams to achieve both high perceptual quality and high fidelity at extremely low bitrates. The codebook-based stream benefits from the high-quality learned codebook priors to provide high quality and clarity in reconstructed images. The continuous feature stream aims at maintaining fidelity details. To achieve ultra-low bitrates, a masked token-based transformer is further proposed, where we only transmit a masked portion of codeword indices and recover the missing indices through token generation guided by information from the continuous feature stream. We also develop a bridging correction network to merge the two streams in pixel decoding for final image reconstruction, where the continuous stream features rectify biases of the codebook-based pixel decoder to impose reconstructed fidelity details. Experimental results demonstrate superior performance across several datasets under extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.



Paperid:710 Poster
Authors:Libo Long,Xiao Hu,Jochen Lang
Abstract:
Video frame interpolation based on optical flow has made great progress in recent years. Most of the previous studies have focused on improving the quality of clean videos. However, many real-world videos contain large obstructions which cause blur and artifacts making the video discontinuous. To address this challenge, we propose our Obstruction Robustness Framework (ORF) that enhances the robustness of existing VFI networks in the face of large obstructions. The ORF contains two components: (1) A feature repair module that first captures ambiguous pixels in the synthetic frame by a region similarity map, then repairs them with a cross-overlap attention module. (2) A data augmentation strategy that enables the network to handle dynamic obstructions without extra data. To the best of our knowledge, this is the first work that explicitly addresses the error caused by large obstructions in video frame interpolation. By using previous state-of-the-art methods as backbones, our method not only improves the results in original benchmarks but also significantly enhances the interpolation quality for videos with obstructions.



Paperid:711 Poster
Authors:Patrick Steinert,Stefan Wagenpfeil,Ingo Frommholz,Matthias Hemmje
Abstract:
The metaverse is an evolving field and the subject of multimedia research. In this paper, we introduce the 256-MetaverseRecords dataset, a novel and extensive collection of annotated screen recordings in the form of videos from various virtual worlds of the metaverse. We describe the process of creating the dataset, the quality criteria for the annotations, and the exploration of the dataset. We also show four experiments to evaluate the performance of different feature extraction methods for Metaverse Recordings (MVRs): MVR segmentation, audio event detection, and object and interaction detection based on this dataset. Our results demonstrate that existing methods have limitations and leave challenges in dealing with the diversity and complexity of metaverse data, and that more research is needed to develop metaverse-specific techniques. Our dataset can serve as a valuable resource for the research community and foster the development of new applications and solutions for the metaverse.



Paperid:712 Poster
Authors:Zheng WEI,Yuzheng Chen,Wai Tong,Xuan Zong,Huamin Qu,Xian Xu,LIK-HANG LEE
Abstract:
In film education, high expenses and limited space significantly challenge teaching synchronized sound recording (SSR). Traditional methods, which emphasize theory with limited practical experience, often fail to bridge the gap between theoretical understanding and practical application. As such, we introduce MetaEcho, an educational virtual reality leveraging the presence theory for teaching SSR. MetaEcho provides realistic simulations of various recording equipment and facilitates communication between learners and instructors, offering an immersive learning experience that closely mirrors actual practices. An evaluation with 24 students demonstrated that MetaEcho surpasses the traditional method in presence, collaboration, usability, realism, comprehensibility, and creativity. Three experts also commented on the benefits of MetaEcho and the opportunities for promoting SSR education in the metaverse era.



Paperid:713 Poster
Authors:Chengyi Yang,Wentao Liu,Shisong Chen,Jiayin Qi,Aimin Zhou
Abstract:
Continual learning emerges as a framework that trains the model on a sequence of tasks without forgetting previously acquired knowledge, which has been applied in multiple multimodal scenarios. Recently, prompt-based continual learning has achieved excellent domain adaptability and knowledge transfer through prompt generation. However, existing methods mainly focus on designing the architecture of a generator, neglecting the importance of providing effective guidance for training the generator. To address this issue, we propose Generating Prompts in Latent Space (GPLS), which considers prompts as latent variables to account for the uncertainty of prompt generation and aligns with the fact that prompts are inserted into the hidden layer outputs and exert an implicit influence on classification. GPLS adopts a trainable encoder to encode task and feature information into prompts with the reparameterization technique, and provides refined and targeted guidance for the training process through an evidence lower bound (ELBO) related to the Mahalanobis distance. Extensive experiments demonstrate that GPLS achieves state-of-the-art performance on various benchmarks.
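A minimal sketch of treating prompts as latent variables and sampling them with the standard VAE-style reparameterization trick, paired with the usual KL term that would enter an ELBO-style objective. The architecture and dimensions are illustrative assumptions, not the paper's generator or its Mahalanobis-based bound.

```python
# Hedged sketch: latent prompt sampling via the reparameterization trick.
import torch
import torch.nn as nn

class LatentPromptEncoder(nn.Module):
    def __init__(self, feat_dim: int, prompt_len: int, prompt_dim: int):
        super().__init__()
        out = prompt_len * prompt_dim
        self.mu = nn.Linear(feat_dim, out)
        self.logvar = nn.Linear(feat_dim, out)
        self.prompt_len, self.prompt_dim = prompt_len, prompt_dim

    def forward(self, task_feat: torch.Tensor):
        mu, logvar = self.mu(task_feat), self.logvar(task_feat)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps                 # reparameterized sample
        prompts = z.view(-1, self.prompt_len, self.prompt_dim)
        # KL(q(z|x) || N(0, I)) regularizer for an ELBO-style training objective
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return prompts, kl

enc = LatentPromptEncoder(feat_dim=768, prompt_len=5, prompt_dim=768)
prompts, kl = enc(torch.randn(4, 768))
print(prompts.shape, float(kl))                                # torch.Size([4, 5, 768])
```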



Paperid:714 Poster
Authors:Aoqi Li,Saihui Hou,Chenye Wang,Qingyuan Cai,Yongzhen Huang
Abstract:
In this work, we present AerialGait, a comprehensive dataset for aerial-ground gait recognition. This dataset comprises 82,454 sequences totaling over 10 million frames from 533 subjects, captured from both aerial and ground perspectives. To align with real-life scenarios of aerial and ground surveillance, we utilize a drone and a ground surveillance camera for data acquisition. The drone is operated at various speeds, directions, and altitudes. Meanwhile, we conduct data collection across five diverse surveillance sites to ensure a comprehensive simulation of real-world settings. AerialGait has several unique features: 1) The gait sequences exhibit significant variations in views, resolutions, and illumination across five distinct scenes. 2) It incorporates challenges of motion blur and frame discontinuity due to drone mobility. 3) The dataset reflects the domain gap caused by the view disparity between aerial and ground views, presenting a realistic challenge for drone-based gait recognition. Moreover, we perform a comprehensive analysis of existing gait recognition methods on AerialGait dataset and propose the Aerial-Ground Gait Network (AGG-Net). AGG-Net effectively learns discriminative features from aerial views by uncertainty learning and clusters features across aerial and ground views through prototype learning. Our model achieves state-of-the-art performance on both AerialGait and DroneGait datasets. The dataset and code will be made available upon acceptance.



Paperid:715 Poster
Authors:PEIYONG WANG,Bohan Xiao,Qisheng He,Carri Glide-Hurst,Ming Dong
Abstract:
Image-to-image translation is defined as the process of learning a mapping between images from a source domain and images from a target domain. The probabilistic structure that maps a fixed initial state to a pinned terminal state through a standard Wiener process is a Brownian bridge. In this paper, we propose a score-based Stochastic Differential Equation (SDE) approach via Brownian bridges, termed Amenable Brownian Bridges (A-Bridges), for image-to-image translation tasks as an unconditional diffusion model. Our framework embraces a large family of Brownian bridge models, while the discretization of the linear A-Bridge exploits the advantage that it provides an explicit closed-form solution and thus facilitates model training. Our model enables accelerated sampling and achieves record-breaking performance in sample quality and diversity on benchmark datasets, following the guidance of its SDE structure.
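For reference, the textbook Brownian bridge the abstract builds on can be simulated directly: the mean interpolates linearly between the pinned endpoints and the variance t(T - t)/T vanishes at both ends. This is the standard bridge only, not the paper's A-Bridge training or sampling procedure.

```python
# Hedged sketch: sampling a standard Brownian bridge between a source image x0
# (at t=0) and a target image xT (at t=T).
import torch

def brownian_bridge_sample(x0: torch.Tensor, xT: torch.Tensor, t: float, T: float = 1.0):
    """Sample x_t from the bridge pinned at x0 and xT."""
    mean = x0 + (t / T) * (xT - x0)          # linear interpolation of the endpoints
    std = (t * (T - t) / T) ** 0.5           # variance vanishes at t=0 and t=T
    return mean + std * torch.randn_like(x0)

# toy usage on image-shaped tensors
x0, xT = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
samples = [brownian_bridge_sample(x0, xT, t) for t in (0.1, 0.5, 0.9)]
print([s.shape for s in samples])
```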



Paperid:716 Poster
Authors:Shuting He,Henghui Ding
Abstract:
3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge behind this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore the comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention to integrate language with geometrically coherent sub-clouds through cross-modal group-word attention, which effectively addresses the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance the vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and enhance the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and also 2D referring image segmentation. Especially, RefMask3D outperforms previous state-of-the-art method by a large margin of 5.36% mIoU on the challenging ScanRefer dataset.



Paperid:717 Poster
Authors:Dayu Hu,Suyuan Liu,Jun Wang,Junpu Zhang,Siwei Wang,Xingchen Hu,Xinzhong Zhu,Chang Tang,Xinwang Liu
Abstract:
Multi-view clustering (MVC) constitutes a distinct approach to data mining within the field of machine learning. Due to limitations in the data collection process, missing attributes are frequently encountered. However, existing MVC methods primarily focus on missing instances, showing limited attention to missing attributes. A small number of studies employ the reconstruction of missing instances to address missing attributes, potentially overlooking the synergistic effects between the instance and feature spaces, which could lead to distorted imputation outcomes. Furthermore, current methods uniformly treat all missing attributes as zero values, thus failing to differentiate between real and technical zeroes, potentially resulting in data over-imputation. To mitigate these challenges, we introduce a novel Reliable Attribute-Missing Multi-View Clustering method (RAM-MVC). Specifically, feature reconstruction is utilized to address missing attributes, while similarity graphs are simultaneously constructed within the instance and feature spaces. By leveraging structural information from both spaces, RAM-MVC learns a high-quality feature reconstruction matrix during the joint optimization process. Additionally, we introduce a reliable imputation guidance module that distinguishes between real and technical attribute-missing events, enabling discriminative imputation. The proposed RAM-MVC method outperforms nine baseline methods, as evidenced by real-world experiments using single-cell multi-view data.



Paperid:718 Poster
Authors:Jiyuan Zhang,Kang Chen,Shiyan Chen,Yajing Zheng,Tiejun Huang,Zhaofei Yu
Abstract:
Novel View Synthesis plays a crucial role by generating new 2D renderings from multi-view images of 3D scenes. However, capturing high-speed scenes with conventional cameras often leads to motion blur, hindering the effectiveness of 3D reconstruction. To address this challenge, high-frame-rate dense 3D reconstruction emerges as a vital technique, enabling detailed and accurate modeling of real-world objects or scenes in various fields, including Virtual Reality and embodied AI. Spike cameras, a novel type of neuromorphic sensor, continuously record scenes with an ultra-high temporal resolution, showing potential for accurate 3D reconstruction. Despite their promise, existing approaches, such as applying Neural Radiance Fields (NeRF) to spike cameras, encounter challenges due to the time-consuming rendering process. To address this issue, we make the first attempt to introduce 3D Gaussian Splatting (3DGS) to spike cameras for high-speed capture, using 3DGS to provide dense and continuous view clues, and construct SpikeGS. Specifically, to train SpikeGS, we establish computational equations between the rendering process of 3DGS and the processes of instantaneous imaging and exposing-like imaging of the continuous spike stream. Besides, we build a very lightweight but effective mapping process from spikes to instant images to support training. Furthermore, we introduce a new spike-based 3D rendering dataset for validation. Extensive experiments demonstrate that our method achieves high-quality novel view rendering, proving the tremendous potential of spike cameras for modeling 3D scenes.



Paperid:719 Poster
Authors:Liang He,Hongke Wang,Zhen Wu,Jianbing Zhang,Xinyu Dai,Jiajun Chen
Abstract:
With the rise of multimedia-driven content on the internet, multimodal relation extraction has gained significant importance in various domains, such as intelligent search and multimodal knowledge graph construction. Social media, as a rich source of image-text data, plays a crucial role in populating knowledge bases. However, the noisy information present in social media data poses a challenge in multimodal relation extraction. Current methods focus on extracting relevant information from images to improve model performance but often overlook the importance of global image information. In this paper, we propose a novel multimodal relation extraction method, named FocalMRE, which leverages image focal augmentation, focal attention, and gating mechanisms. FocalMRE enables the model to concentrate on the image's focal regions while effectively utilizing the global information in the image. Through gating mechanisms, FocalMRE optimizes the multimodal fusion strategy, allowing the model to select the most relevant augmented regions for overcoming noise interference in relation extraction. The experimental results on the public MNRE dataset reveal that our proposed method exhibits robust and significant performance advantages in the multimodal relation extraction task, especially in scenarios with high noise, long-tail distributions, and limited resources.



Paperid:720 Poster
Authors:Yukang Lin,Haonan Han,Chaoqun Gong,Zunnan Xu,Yachao Zhang,Xiu Li
Abstract:
Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods.



Paperid:721 Poster
Authors:Zibin Liu,Banglei Guan,Yang Shang,Shunkun Liang,Zhenbao Yu,Qifeng Yu
Abstract:
Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is firstly utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results of both simulated and real events demonstrate that our methods outperform event-based state-of-the-art methods in terms of both accuracy and robustness.



Paperid:722 Poster
Authors:Xiuquan Du,Jiajia Chen,XuejunZhang
Abstract:
Missed polyps are the major risk factor for colorectal cancer. To minimize misdiagnosis, many methods have been developed. However, they either rely on laborious instance-level annotations, require labeling of prompt points, or lack the ability to filter noise proposals and detect polyps integrally, resulting in severe challenges in this area. In this paper, we propose a novel Cooperation-Based network (CBNet), a two-stage polyp detection framework supervised by image labels that removes wrong proposals through classification in collaboration with segmentation and obtains a more accurate detector by aggregating adaptive multi-level regional features. Specifically, we design a Cooperation-Based Region Proposal Network (CBRPN) to reduce the negative impact of noise by deleting proposals without polyps, enabling our network to capture polyp features. Moreover, to enhance the location integrity and classification precision of polyps, we aggregate multi-level region of interest (ROI) features under the guidance of the backbone classification layer, namely the Adaptive ROI Fusion Module (ARFM). Extensive experiments on public and private datasets show that our method achieves state-of-the-art performance among weakly supervised methods and even outperforms full supervision on some metrics. All code is available in the repository.



Paperid:723 Poster
Authors:Renshu Gu,Jiajun Zhu,Yixuan Si,Fei Gao,Jiamin Xu,Gang Xu
Abstract:
3D human pose estimation from multiple cameras with unknown calibration has received less attention than it should. The few existing data-driven solutions do not fully exploit 3D training data that are available on the market, and typically train from scratch for every novel multi-view scene, which impedes both accuracy and efficiency. We show how to exploit 3D training data to the fullest and associate multiple dynamic views efficiently to achieve high precision on novel scenes using a simple yet effective framework, dubbed \textit{Multiple Dynamic View Pose estimation} (MDVPose). MDVPose utilizes data from novel scenarios to finetune a single-view pretrained motion encoder in a multi-view setting, aligns an arbitrary number of views in a unified coordinate frame via Procrustes alignment, and imposes multi-view consistency. The proposed method achieves 22.1 mm P-MPJPE or 34.2 mm MPJPE on the challenging in-the-wild Ski-Pose PTZ dataset, which outperforms the state-of-the-art method by 24.8% P-MPJPE (-7.3 mm) and 19.0% MPJPE (-8.0 mm). It also outperforms the state-of-the-art methods by a large margin (-18.2mm P-MPJPE and -28.3mm MPJPE) on the EgoBody dataset. In addition, MDVPose achieves robust performance on the Human3.6M dataset featuring multiple static cameras. Code will be released upon acceptance.
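For reference, a similarity Procrustes fit (rotation, scale, translation via SVD, in the spirit of Kabsch/Umeyama) is the standard way to align one camera's 3D pose estimate to a reference frame; the sketch below is that generic alignment step, not the authors' implementation.

```python
# Hedged sketch: similarity Procrustes alignment of two (N, 3) point sets.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray):
    """Find s, R, t minimizing || s * X @ R.T + t - Y || for (N, 3) point sets."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / (Xc ** 2).sum()
    t = mu_y - s * R @ mu_x
    return s, R, t

# toy usage: recover a known similarity transform on 17 joints
rng = np.random.default_rng(0)
X = rng.normal(size=(17, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                              # ensure a proper rotation
Y = 1.3 * X @ R_true.T + np.array([0.1, -0.2, 0.5])
s, R, t = procrustes_align(X, Y)
print(np.allclose(s * X @ R.T + t, Y, atol=1e-6))   # True
```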



Paperid:724 Poster
Authors:Zhenqi Dai,Ting Liu,Xingxing Zhang,Yunchao Wei,Yanning Zhang
Abstract:
In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when faced with scenarios where the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach by minimizing the intra-class distance for better exploiting these two features, thereby enhancing the discriminatory power of the extracted features for the fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings.



Paperid:725 Poster
Authors:Yuhan Wang,Mofei Song
Abstract:
3D novelty detection plays a crucial role in various real-world applications, especially in safety-critical fields such as autonomous driving and intelligent surveillance systems. However, existing 3D novelty detection methods are constrained by the scarcity of 3D data, which may impede the model's ability to learn adequate representations, thereby impacting detection accuracy. To address this challenge, we propose a Unified Learning Framework (UniL) for facilitating novelty detection. During the pretraining phase, UniL assists the point cloud encoder in learning information from other modalities, aligning visual, textual, and 3D features within the same feature space. Additionally, we introduce a novel Multimodal Supervised Contrastive Loss (MSC Loss) to improve the model's ability to cluster samples from the same category in feature space by leveraging label information during pretraining. Furthermore, we propose a straightforward yet powerful scoring method, Depth Map Error (DME), which assesses the discrepancy between projected depth maps before and after point cloud reconstruction during novelty detection. Extensive experiments conducted on 3DOS have demonstrated the effectiveness of our approach, significantly enhancing the performance of the unsupervised VAE method in 3D novelty detection. The code will be made available.



Paperid:726 Poster
Authors:Wuyou Xia,Shengzhe Liu,Qin Rong,Guoli Jia,Eunil Park,Jufeng Yang
Abstract:
In online chatting, people are increasingly favoring the use of stickers to supplement or replace text for replies, as sticker images can express more vivid and varied emotions. The Sticker Response Selection (SRS) task aims to predict the sticker image that is most relevant to the dialogue history. Previous research explores the semantic similarity between context and stickers while ignoring the role of both unimodal and cross-modal emotional information. In this paper, we propose a “Perceive before Respond” training paradigm. We perceive the emotion of stickers through a knowledge distillation module, which acquires emotion knowledge from an existing large-scale sticker emotion recognition dataset and distills it into our framework to enhance the understanding of sticker emotion. To further distinguish stickers with similar subject elements within the same topic, we perform contrastive learning at both inter-topic and intra-topic levels to obtain discriminative and diverse sticker representations. In addition, we improve the hard negative sampling method for image-text matching based on cross-modal sentiment association, conducting hard sample mining from both semantic similarity and sentiment polarity similarity for sticker-to-dialogue and dialogue-to-sticker. Extensive experiments verify the effectiveness of each proposed component. Ablation experiments on different backbone networks demonstrate the generality of our approach. The code is provided in the supplementary material and will be released to the public.



Paperid:727 Poster
Authors:Yin Wang,Hao LU,Ying-Cong Chen,Li Kuang,Mengchu Zhou,Shuiguang Deng
Abstract:
Remote photoplethysmography (rPPG) is a promising technique for non-contact physiological signal measurement, which has great potential in health monitoring and emotion analysis. However, existing methods for the rPPG task ignore the long-tail phenomenon of physiological signal data, especially under joint training on multiple domains. In addition, we find that the long-tail problem of the physiological label (phys-label) exists in different datasets, and the long-tail problem of domain exists under the same phys-label. To tackle these problems, in this paper, we propose a Hierarchical Balanced framework (rPPG-HiBa), which mitigates the bias caused by domain and phys-label imbalance. Specifically, we propose anti-spurious domain center learning tailored to learning a domain-balanced embedding space. Then, we adopt compact-aware continuity regularization to estimate phys-label-wise imbalances and construct continuity between embeddings. Extensive experiments demonstrate that our method outperforms the state-of-the-art in cross-dataset and intra-dataset settings.



Paperid:728 Poster
Authors:Haojie Wei,Yuan Jun,Rui Zhang,Quanyu Dai,Yueguo Chen
Abstract:
Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Therefore, existing methods have tried to perform these two tasks simultaneously, so as to leverage the mutually beneficial relationship between both tasks. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and joint learning optimization. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework and can use variant models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which addresses the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate the great generality of MAJL in adapting to different model architectures.
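The abstract names a "Dynamic Weights on Hard Samples" idea without defining it, so the sketch below only illustrates the general notion of up-weighting harder samples in a joint separation-plus-pitch objective, using a softmax over detached per-sample losses as the (assumed) weighting rule; this is not the paper's DWHS definition.

```python
# Hedged sketch: a generic "weight hard samples more" joint loss for two tasks.
import torch

def dynamic_hard_sample_loss(sep_losses: torch.Tensor,
                             pitch_losses: torch.Tensor,
                             tau: float = 1.0) -> torch.Tensor:
    """sep_losses, pitch_losses: per-sample losses of shape (N,)."""
    def weighted(per_sample: torch.Tensor) -> torch.Tensor:
        # larger loss -> larger weight; detach so the weights are not differentiated
        w = torch.softmax(per_sample.detach() / tau, dim=0)
        return (w * per_sample).sum()
    return weighted(sep_losses) + weighted(pitch_losses)

# toy usage with 8 samples
print(dynamic_hard_sample_loss(torch.rand(8), torch.rand(8)))
```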



Paperid:729 Poster
Authors:Shengran Cheng,Chuhang Ma,Ye Pan
Abstract:
Facial landmark detection forms the foundation for numerous face-related tasks. Recently, this field has gained substantial attention and made significant advancements. Nonetheless, detecting facial landmarks for stylized characters remains a challenge. Existing approaches, which are mostly trained on real-human face datasets, struggle to perform well due to the structural variations between real and stylized characters. Additionally, a comprehensive dataset for analyzing stylized characters' facial features is lacking. This study proposes a novel dataset, the Facial Landmark Dataset for Stylized Characters (FLSC), which contains 2674 images and 4086 faces. The data are selected from 16 cartoon video clips and annotated with 98 landmarks per image by professionals. Besides, we propose StylizedFacePoint: a deep-learning-based method for stylized facial landmark detection that outperforms the existing approaches. This method has also proven to work well for characters with styles outside the training domain. Moreover, we outline two primary types of applications for our dataset and method. For each, we provide a detailed illustrative example.



Paperid:730 Poster
Authors:Shaokun Wang,Yifan Yu,Yuhang He,Yihong Gong
Abstract:
Vision Transformers (ViTs) excel in extracting global information from image patches. However, their inherent limitation lies in effectively extracting information within local regions, hindering their applicability and performance. Particularly, fully supervised pre-trained ViTs, such as Vanilla ViT and CLIP, face the challenge of locality vanishing when adapting to downstream tasks. To address this, we introduce a novel LOcality-aware pRompt lEarning (LORE) method, aiming to improve the adaptation of pre-trained ViTs to downstream tasks. LORE integrates a data-driven Black Box module (i.e.,a pre-trained ViT encoder) with a knowledge-driven White Box module. The White Box module is a locality-aware prompt learning mechanism to compensate for ViTs’ deficiency in incorporating local information. More specifically, it begins with the design of a Locality Interaction Network (LIN), which treats an image as a neighbor graph and employs graph convolution operations to enhance local relationships among image patches. Subsequently, a Knowledge-Locality Attention (KLA) mechanism is proposed to capture critical local regions from images, learning Knowledge-Locality (K-L) prototypes utilizing relevant semantic knowledge. Afterwards, K-L prototypes guide the training of a Prompt Generator (PG) to generate locality-aware prompts for images. The locality-aware prompts, aggregating crucial local information, serve as additional input for our Black Box module. Combining pre-trained ViTs with our locality-aware prompt learning mechanism, our Black-White Box model enables the capture of both global and local information, facilitating effective downstream task adaptation. Experimental evaluations across four downstream tasks demonstrate the effectiveness and superiority of our LORE.



Paperid:731 Poster
Authors:Zining Wang,Jinyang Guo,Ruihao Gong,Yang Yong,Aishan Liu,Yushi Huang,Jiaheng Liu,Xianglong Liu
Abstract:
With the increased attention to model efficiency, model sparsity technologies have developed rapidly in recent years, among which post-training sparsity (PTS) has become more and more prevalent because of its effectiveness and efficiency. However, there remain questions on better fine-grained PTS algorithms and the sparsification ability of models, which hinders the further development of this area. Therefore, a benchmark to comprehensively investigate the issues above is urgently needed. In this paper, we propose the first comprehensive post-training sparsity benchmark called PTSBench towards PTS algorithms and models. We benchmark 10+ PTS general-pluggable fine-grained algorithms on 3 typical computer vision tasks using over 40 off-the-shelf model architectures. Through extensive experiments and analyses, we obtain valuable conclusions and provide several insights from both PTS fine-grained algorithms and model aspects, which can comprehensively address the aforementioned questions. Our PTSBench can provide (1) in-depth and comprehensive evaluations for the sparsification abilities of models, (2) new observations for a better understanding of the PTS method toward algorithms and models, and (3) an upcoming well-structured and easy-integrate open-source framework for model sparsification ability evaluation. We hope this work will provide illuminating conclusions and advice for future studies of post-training sparsity methods and sparsification-friendly model design.



Paperid:732 Poster
Authors:Qian Qiao,Yu Xie,Jun Gao,Tianxiang Wu,Shaoyao Huang,Jiaqing Fan,Ziqiang Cao,Zili Wang,Yue Zhang
Abstract:
More and more end-to-end text spotting methods based on Transformer architecture have demonstrated superior performance. These methods utilize a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and actual objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to solve the problem of bipartite graph matching instability in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting tasks, as these tasks need to perform irregular shape detection and text recognition tasks that are more complex than classification. To alleviate this issue, in this paper, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text detection and recognition. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that outputting the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize content queries to assist the alignment of text content and position. To improve the model's perception of the background, we further utilize an additional loss function for background character classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms the state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), especially yielding an improvement of 11.3% against the best approach on Inverse-Text.
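As a concrete illustration of how noised positional queries can be derived from the four control points of a Bezier center curve, the snippet below perturbs the control points with Gaussian noise and samples the resulting cubic curve; the function names, noise scale, and number of sampled points are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bezier curve at parameters t. ctrl: (4, 2) control points, t: (m,)."""
    t = t[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])   # (m, 2) points

def noised_positional_query(ctrl, sigma=0.02, n_pts=8, rng=None):
    """Add Gaussian noise to the four control points and sample the noised centre curve."""
    rng = rng or np.random.default_rng()
    noisy_ctrl = ctrl + rng.normal(0.0, sigma, size=ctrl.shape)
    return cubic_bezier(noisy_ctrl, np.linspace(0.0, 1.0, n_pts))
```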



Paperid:733 Poster
Authors:Zixuan Wang,Jiayi Li,Xiaoyu Qin,Shikun Sun,Songtao Zhou,Jia Jia,Jiebo Luo
Abstract:
Synthesizing camera movement from music and dance is highly challenging due to the contradictions and complexities of dance cinematography. Unlike human movement, which is always continuous, dance camera movements involve continuous sequences of varying lengths and sudden drastic changes to simulate the switching of multiple cameras. However, in previous works, every camera frame is treated equally, which results in jittering and unavoidable smoothing post-processing. To solve this problem, in this paper, we propose to formalize animator dance cinematography knowledge by formulating this problem as a three-stage process: keyframe detection, keyframe synthesis, and tween curve prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework \textbf{DanceCamAnimator}, which imitates the human animation procedure and offers powerful keyframe-based controllability with variable length. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous works quantitatively and qualitatively. We will make our code publicly available to promote future research.



Paperid:734 Poster
Authors:Junming Hou,Zihan Cao,Naishan Zheng,Xuan Li,Xiaoyu Chen,Xinyang Liu,Xiaofeng Cong,Danfeng Hong,Man Zhou
Abstract:
The vision transformer family has dominated the satellite pan-sharpening field, driven by the global-wise spatial information modeling mechanism of the core self-attention ingredient. The standard modeling rule within these promising pan-sharpening methods is to roughly stack transformer variants in a cascaded manner. Despite the remarkable advancement, their success may come at the huge cost of model parameters and FLOPs, thus preventing their application to low-resource satellites. To address this trade-off between favorable performance and expensive computation, we tailor an efficient linearly-evolved transformer variant and employ it to construct a lightweight pan-sharpening framework. In detail, we delve into the popular cascaded transformer modeling adopted by cutting-edge methods and develop an alternative 1-order linearly-evolved transformer variant with a 1-dimensional linear convolution chain to achieve the same function. In this way, our proposed method benefits from the cascaded modeling rule while achieving favorable performance in an efficient manner. Extensive experiments over multiple satellite datasets suggest that our proposed method achieves competitive performance against other state-of-the-art methods with fewer computational resources. Further, the consistently favorable performance has been verified on the hyper-spectral image fusion task. Our main focus is to provide an alternative global modeling framework with an efficient structure. The code will be publicly available.



Paperid:735 Poster
Authors:Yuanchen Shi,Fang Kong
Abstract:
With the popularity of the internet and social media, a growing number of online chats and comment replies are presented in the form of multimodal dialogues that contain stickers. Automatically summarizing these dialogues can effectively reduce content overload and save reading time. However, existing datasets and works either address unimodal text dialogue summarization, or handle articles with real photos by separately performing text summarization and key image extraction; they have not simultaneously considered multimodal dialogue summarization with sticker images in online chat scenarios. To compensate for the lack of datasets and research in this field, we propose a brand-new Multimodal Chat Dialogue Summarization Containing Stickers (MCDSCS) task and dataset. It consists of 5,527 Chinese multimodal chat dialogues and 14,356 different sticker images, with each dialogue interspersed with stickers in the text to reflect real social media chat scenarios. MCDSCS can also contribute to filling the gap in Chinese multimodal dialogue data. We use the most advanced GPT4 model and carefully designed Chain-of-Thought (COT) prompting, supplemented with manual review, to generate dialogues and extract summaries. We also propose a novel method that integrates the visual information of stickers with the text descriptions of emotions and intentions (TEI). Experiments show that our method can effectively improve the performance of various mainstream summary generation models, even outperforming ChatGPT and some other multimodal models. Our data and code will be publicly available.



Paperid:736 Poster
Authors:Shiwei Zhang,Wei Ke,Shuai Liu,Xiaopeng Hong,Tong Zhang
Abstract:
The core of active semi-supervised crowd counting is the sample selection criteria. However, the scale factor has been neglected in active learning approaches despite the fact that the scale of heads varies drastically in the crowd images. In this paper, we propose a simple yet effective active labeling strategy to explicitly select informative unlabeled images, guided by the intra-scale uncertainty and inter-scale inconsistency metrics. The intra-scale uncertainty is quantified through the sum of the query-level entropy of images at different scales. Images are initially ranked based on this uncertainty for preselection. Inter-scale inconsistency is measured by the divergence between the query-level predictions of upscaled and downscaled images, allowing for the identification of the most informative images exhibiting the highest inconsistency. Additionally, we implement a progressive updating scheme for the semi-supervised crowd counting framework, in which the pseudo-labels for unlabeled images are refined iteratively. It further improves the counting accuracy. Through extensive experiments on widely used benchmarks, the proposed approach has demonstrated superior performance compared to previous state-of-the-art semi-supervised and active semi-supervised crowd counting methods.
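To make the two selection metrics concrete, a minimal sketch of the scoring functions is given below, assuming the network produces per-query softmax distributions at each scale; the entropy/KL formulations and the symmetric averaging are assumptions for illustration, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    """Shannon entropy of probability vectors along the last dimension."""
    return -(p * (p + eps).log()).sum(-1)

def intra_scale_uncertainty(preds_per_scale):
    """Sum of query-level entropies over scales.
    preds_per_scale: list of (num_queries, num_bins) softmax outputs, one per scale."""
    return sum(entropy(p).sum() for p in preds_per_scale)

def inter_scale_inconsistency(pred_up, pred_down):
    """Symmetric KL divergence between predictions on up-scaled and down-scaled inputs."""
    kl = lambda a, b: F.kl_div((b + 1e-8).log(), a, reduction="batchmean")
    return 0.5 * (kl(pred_up, pred_down) + kl(pred_down, pred_up))
```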



Paperid:737 Poster
Authors:Yuqi Sun,Qing Lin,Weimin Tan,Bo Yan
Abstract:
Recent advances in multimodal artificial intelligence have greatly improved the integration of vision-language-audio cues to enrich the content creation process. Inspired by these developments, in this paper, we first integrate audio into the face inpainting task to facilitate identity manipulation. Our main insight is that a person's voice carries distinct identity markers, such as age and gender, which provide an essential supplement for identity-aware face inpainting. By extracting identity information from audio as guidance, our method can naturally support the tasks of identity preservation and identity swapping in face inpainting. Specifically, we introduce a dual-stream network architecture comprising a face branch and an audio branch. The face branch is tasked with extracting deterministic information from the visible parts of the input masked face, while the audio branch is designed to capture heuristic identity priors from the speaker's voice. The identity codes from the two streams are integrated using a multi-layer perceptron (MLP) to create a virtual unified identity embedding that represents comprehensive identity features. In addition, to explicitly exploit the information from audio, we introduce an audio-face generator to generate a `fake' audio face directly from audio and fuse the multi-scale intermediate features from the audio-face generator into the face inpainting network through an audio-visual feature fusion (AVFF) module. Extensive experiments demonstrate the positive impact of extracting identity information from audio on the face inpainting task, especially in identity preservation.
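A minimal sketch of the MLP fusion of the two identity codes is shown below; the layer widths and input dimensions are placeholder assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class IdentityFusionMLP(nn.Module):
    """Fuse a face identity code and an audio identity code into one unified embedding."""
    def __init__(self, face_dim=512, audio_dim=256, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(face_dim + audio_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, out_dim),
        )

    def forward(self, face_code, audio_code):
        # concatenate the two identity codes and project them to a single embedding
        return self.mlp(torch.cat([face_code, audio_code], dim=-1))
```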



Paperid:738 Poster
Authors:Zuoyan Zhao,Hui Xue,Pengfei Fang,Shipeng Zhu
Abstract:
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly. To mitigate the effects from these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, an attention-based modulation module is leveraged to understand scene text images by neatly perceiving the local and global dependence of images, despite the shape of the text. Meanwhile, a diffusion-based module is developed to enhance the text prior, hence offering better guidance for the SR network to generate SR images with higher semantic accuracy. Additionally, a multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code will be made available.



Paperid:739 Poster
Authors:Ruonan Zhang,Ziwei Shang,Fengjuan Wang,Zhaoqilin Yang,Shan Cao,Yigang Cen,Gaoyun An
Abstract:
Scene Graph Generation (SGG) is an important cross-modal task in scene understanding, aiming to detect visual relations in an image. However, due to varied appearance features, the feature distributions of different categories severely overlap, which makes the decision boundaries ambiguous. Current SGG methods mainly attempt to re-balance the data distribution, which is dataset-dependent and limits generalization. To solve this problem, a Synergetic Prototype Learning Network (SPLN) is proposed here, where the generalized semantic space is modeled and the synergetic effect among different semantic subspaces is explored. In SPLN, a Collaboration-induced Prototype Learning method is proposed to model the interaction of visual semantics and structural semantics. Conventional visual semantics is handled with a residual-driven representation enhancement module to capture details. The intersection of structural semantics and visual semantics is explicitly modeled as conceptual semantics, which has been ignored by existing methods. Meanwhile, to alleviate the noise of unrelated and meaningless words, an Intersection-induced Prototype Learning method is also proposed specifically for conceptual semantics, with an essence-driven prototype enhancement module. Moreover, a Selective Fusion Module is proposed to synergetically integrate the results of the visual, structural, and conceptual branches and the generalized semantics projection. Experiments on the VG and GQA datasets show that our method achieves state-of-the-art performance on unbiased metrics, and ablation studies validate the effectiveness of each component.



Paperid:740 Poster
Authors:Zhichao Yang,Leida Li,Pengfei Chen,Jinjian Wu,Weisheng Dong
Abstract:
The perception of image aesthetics is built upon the understanding of semantic content. However, how to evaluate the aesthetic quality of images with diversified semantic backgrounds remains challenging in image aesthetics assessment (IAA). To address the dilemma, this paper presents a semantics-aware image aesthetics assessment approach, which first analyzes the semantic content of images and then models the aesthetic distinctions among images from two perspectives, i.e., aesthetic attribute and aesthetic level. Concretely, we propose two strategies, dubbed tag matching and contrastive ranking, to extract knowledge pertaining to image aesthetics. The tag matching identifies the semantic category and the dominant aesthetic attributes based on predefined tag libraries. The contrastive ranking is designed to uncover the comparative relationships among images with different aesthetic levels but similar semantic backgrounds. In the process of contrastive ranking, the impact of long-tailed distribution of aesthetic data is also considered by balanced sampling and traversal contrastive learning. Extensive experiments and comparisons on three benchmark IAA databases demonstrate the superior performance of the proposed model in terms of both prediction accuracy and alleviating long-tailed effect. The code of the proposed method will be public.



Paperid:741 Poster
Authors:Hebaixu Wang,Hao Zhang,Xunpeng Yi,Xinyu Xiang,Leyuan Fang,Jiayi Ma
Abstract:
The fusion of visible and infrared images aims to produce high-quality fused images with rich textures and salient target information. Existing methods lack interactivity and flexibility in the execution of fusion: users cannot express requirements to modify the fusion effect, and different regions in the source images are treated equally by the identical fusion model, which causes fusion homogenization and low distinctiveness. Besides, their pre-defined fusion strategies invariably lead to monotonous effects that are insufficiently comprehensive: they fail to adequately consider data credibility, scene illumination, and noise degradation inherent in the source information. To address these issues, we propose Text-driven and Region-aware Flexible visible and infrared image fusion, termed TeRF. On the one hand, we propose a flexible image fusion framework with multiple large language and vision models, which facilitates visual-text interaction. On the other hand, we aggregate comprehensive fine-tuning paradigms for the different fusion requirements to build a unified fine-tuning pipeline. This allows the linguistic selection of regions and effects, yielding visually appealing fusion outcomes. Extensive experiments demonstrate the competitiveness of our method both qualitatively and quantitatively compared to existing state-of-the-art methods.



Paperid:742 Poster
Authors:Yingchun Wang,Jingcai Guo,Song Guo,LIU Yi,Jie ZHANG,Weizhan Zhang
Abstract:
Recent studies reveal that even highly biased dense networks can contain an invariant substructure with superior out-of-distribution (OOD) generalization. While existing works commonly seek these substructures using global sparsity constraints, the uniform imposition of sparse penalties across samples with diverse levels of spurious content renders such methods suboptimal. The precise adaptation of model sparsity, specifically tailored to spurious features, remains a significant challenge. Motivated by the insight that in-distribution (ID) data containing spurious features may exhibit lower empirical risk, we propose a novel Spurious Feature-targeted Pruning framework, dubbed SFP, to induce the authentic invariant substructures without suffering from the above concerns. Specifically, SFP distinguishes spurious features within ID instances during training via a theoretically validated threshold. It then penalizes the corresponding feature projections onto the model space, steering the optimization towards subspaces spanned by the invariant factors. Moreover, we conduct a detailed theoretical analysis to provide a rationality guarantee and a proof framework for OOD structures based on model sparsity. Experiments on various OOD datasets show that SFP can significantly outperform both structure-based and non-structure-based OOD generalization SOTAs by large margins.



Paperid:743 Poster
Authors:Haosen Sun,Yiming Li,Xixiang Lyu,Jing Ma
Abstract:
Deep neural networks (DNNs) are susceptible to backdoor attacks due to their black-box nature and lack of interpretability. Backdoor attacks intend to manipulate the model's prediction when hidden backdoors are activated by predefined triggers. Although considerable progress has been made in backdoor detection and removal at the model deployment stage, an effective defense against backdoor attacks during the training time is still under-explored. In this paper, we propose a novel training-time backdoor defense method called Learning from Distinction (LfD), allowing training a backdoor-free model on the backdoor-poisoned data. LfD uses a low-capacity model as a teacher to guide the learning of a backdoor-free student model via a dynamic weighting strategy. Extensive experiments on CIFAR-10, GTSRB and ImageNet-subset datasets show that LfD significantly reduces attack success rates to 0.67%, 6.14% and 1.42%, respectively, with minimal impact on clean accuracy (less than 1%, 3% and 1%).
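The dynamic weighting idea, letting a low-capacity teacher flag samples it fits suspiciously easily (a typical symptom of backdoor shortcuts) and down-weighting them for the student, can be sketched as below; the softmax-based weighting and the temperature are illustrative assumptions, not LfD's actual strategy.

```python
import torch
import torch.nn.functional as F

def distinction_weighted_loss(student_logits, teacher_logits, labels, tau=2.0):
    """Weight each sample by how hard it is for the low-capacity teacher.
    Samples the teacher fits very well (low teacher loss, likely poisoned shortcuts)
    receive small weights; hard clean samples receive large weights. Illustrative only."""
    with torch.no_grad():
        t_loss = F.cross_entropy(teacher_logits, labels, reduction="none")
        weights = torch.softmax(t_loss / tau, dim=0) * len(t_loss)  # mean weight ~ 1
    s_loss = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * s_loss).mean()
```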



Paperid:744 Poster
Authors:Xu Zhang,Fan Ni,Guan-Nan Dong,Aichun Zhu,Jianhui Wu,Mingcheng Ni,Hui Liu
Abstract:
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, due to the lack of dynamic information provided by isolated frames, the performance is hampered when the person is obscured or variable motion details are missed in isolated frames. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, such as the person's appearance, actions, and interactions with the environment, etc., termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages cross-modal text-video representations to provide strong text-visual and text-motion matching information, tackling occlusion conflicts and variable motion details. Specifically, we establish two potential cross-modal spaces for collaborative text and video feature learning to progressively reduce the semantic difference between text and video. To evaluate the effectiveness of the proposed MFGF, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, MFGF is the first successful attempt to use video for the text-based person retrieval task and has achieved state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.



Paperid:745 Poster
Authors:Li Zheng,Boyu Chen,Hao Fei,Fei Li,Shengqiong Wu,Lizi Liao,Donghong Ji,Chong Teng
Abstract:
Coreference resolution, an essential task in natural language processing, is particularly challenging in multi-modal scenarios where data comes in various forms and modalities. Despite advancements, limitations due to scarce labeled data and underleveraged unlabeled data persist. We address these issues with a self-adaptive fine-grained multi-modal data augmentation framework for semi-supervised multi-modal coreference resolution (MCR), focusing on enriching training data from labeled datasets and tapping into the untapped potential of unlabeled data. Regarding the former issue, we first leverage text coreference resolution datasets and diffusion models to perform fine-grained text-to-image generation with aligned text entities and image bounding boxes. We then introduce a self-adaptive selection strategy, meticulously curating the augmented data to enhance the diversity and volume of the training set without compromising its quality. For the latter issue, we design a self-adaptive threshold strategy that dynamically adjusts the confidence threshold based on the model's learning status and performance, enabling effective utilization of valuable information from unlabeled data. Additionally, we incorporate a distance smoothing term, which smooths distances between positive and negative samples, enhancing the discriminative power of the model's feature representations and addressing noise and uncertainty in the unlabeled data. Our experiments on the widely-used CIN dataset show that our framework significantly outperforms state-of-the-art baselines by at least 9.57% on MUC F1 score and 4.92% on CoNLL F1 score. Remarkably, against weakly-supervised baselines, our framework achieves a staggering 22.24% enhancement in MUC F1 score. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MCR tasks. Our code is available at https://anonymous.4open.science/r/SLUDA.



Paperid:746 Poster
Authors:HUAN CHEN,Tingfa Xu,Zhenxiang Chen,Peifu Liu,Huiyan Bai,Jianan Li
Abstract:
Change detection identifies differences between images captured at different times. Real-world change detection faces challenges from the diverse and intricate nature of change areas, while current datasets and algorithms are often limited to simpler, uniform changes, reducing their effectiveness in practical applications. Existing dual-branch methods process images independently, risking the loss of change information due to insufficient early interaction. In contrast, single-stream approaches, though improving early integration, lack efficacy in capturing complex changes. To address these issues, we introduce a novel single-stream architecture, the Multi-scale Change-Aware Transformer (MACT), which features a Dynamic Change-Aware Attention module and a Multi-scale Change-Enhanced Aggregator. The Dynamic Change-Aware Attention module, integrating local self-attention and cross-temporal attention, iterates dynamically on image differences, thereby targeting feature extraction to change areas. The Multi-scale Change-Enhanced Aggregator enables the model to adapt to various scales and complex shapes through local change enhancement and multi-scale aggregation strategies. To overcome the limitations of existing datasets regarding the scale diversity and morphological complexity of change areas, we construct the Mining Area Change Detection dataset. The dataset offers a diverse array of change areas that span multiple scales and exhibit complex shapes, providing a robust benchmark for change detection. Extensive experiments demonstrate that our model outperforms existing methods, especially for irregular and multi-scale changes.



Paperid:747 Poster
Authors:Yingjie Zhou,Zicheng Zhang,Wei Sun,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
Abstract:
In recent years, immersive communication has emerged as a compelling alternative to traditional video communication methods. One prospective avenue for immersive communication involves augmenting the user's immersive experience through the transmission of three-dimensional (3D) talking heads (THs). However, transmitting 3D THs poses significant challenges due to its complex and voluminous nature, often leading to pronounced distortion and a compromised user experience. Addressing this challenge, we introduce the 3D Talking Heads Quality Assessment (THQA-3D) dataset, comprising 1,000 sets of distorted and 50 original TH mesh sequences (MSs), to facilitate quality assessment in 3D TH transmission. A subjective experiment, characterized by a novel interactive approach, is conducted with recruited participants to assess the quality of MSs in THQA-3D dataset. Leveraging this dataset, we also propose a multimodal Quality-of-Experience (QoE) method incorporating a Large Quality Model (LQM). This method involves frontal projection of MSs and subsequent rendering into videos, with quality assessment facilitated by the LQM and a variable-length video memory filter (VVMF). Additionally, tone-lip coherence and silence detection techniques are employed to characterize audio-visual coherence in 3D MS streams. Experimental evaluation demonstrates the proposed method's superiority, achieving state-of-the-art performance on the THQA-3D dataset and competitiveness on other QoE datasets. Both the THQA-3D dataset and the QoE model will be publicly available.



Paperid:748 Poster
Authors:Haoyu Tong,Xiaoyu Zhang,Jin Yulin,Jian Lou,Kai Wu,Xiaofeng Chen
Abstract:
Adversarial training (AT) is a fundamental method to enhance the robustness of Deep Neural Networks (DNNs) against adversarial examples. While AT achieves improved robustness on adversarial examples, it often leads to reduced accuracy on clean examples. Considerable effort has been devoted to handling the trade-off from the perspective of the \textit{input space}. However, we demonstrate that the trade-off can also be illustrated from the perspective of the \textit{gradient space}. In this paper, we propose Adversarial Training with Adaptive Gradient Reconstruction (AGR), a novel approach that balances generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) in adversarial training by steering through clean and adversarial gradient directions. We first introduce an ingenious technique named Gradient Orthogonal Projection that, in the case of negatively correlated gradients, adjusts the adversarial gradient direction to reduce the degradation of generalization. Then we present a gradient interpolation scheme for the case of positively correlated gradients that efficiently increases generalization without compromising the robustness of the final model. Rigorous theoretical analysis proves that our AGR has a lower generalization error upper bound, indicating its effectiveness. Comprehensive experiments empirically demonstrate that AGR achieves an excellent balance between generalization and robustness, and is compatible with various adversarial training methods to achieve superior performance.
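The two gradient-space cases can be sketched in a few lines: the projection follows the standard formula for removing the component of one gradient along another, while the interpolation coefficient below is an assumed hyperparameter (the paper's exact reconstruction rule is not specified in the abstract).

```python
import torch

def reconcile_gradients(g_clean, g_adv, alpha=0.5):
    """Combine clean and adversarial gradients (flattened 1-D tensors).
    Negatively correlated: project the adversarial gradient onto the subspace
    orthogonal to the clean gradient before summing. Positively correlated: interpolate."""
    dot = torch.dot(g_clean, g_adv)
    if dot < 0:
        # remove the component of g_adv that opposes g_clean
        g_adv = g_adv - dot / (g_clean.norm() ** 2 + 1e-12) * g_clean
        return g_clean + g_adv
    return alpha * g_clean + (1 - alpha) * g_adv
```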



Paperid:749 Poster
Authors:Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu
Abstract:
Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable feature from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting broadly applicable and effective domain adaptation capability of our DDT.
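The teacher-to-student hand-off in this kind of framework typically reduces to confidence-filtering the teacher's detections on unlabeled target images; the sketch below shows that step only, with the score threshold and dictionary layout as assumptions rather than DDT's actual interface.

```python
def make_pseudo_labels(teacher_detections, score_thresh=0.8):
    """Keep only confident teacher detections as pseudo ground truth for the student.
    teacher_detections: list of dicts with 'boxes', 'scores', 'labels' tensors (one per image)."""
    pseudo = []
    for det in teacher_detections:
        keep = det["scores"] >= score_thresh
        pseudo.append({"boxes": det["boxes"][keep], "labels": det["labels"][keep]})
    return pseudo
```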



Paperid:750 Poster
Authors:Zonglin Lyu,Ming Li,Jianbo Jiao,Chen Chen
Abstract:
Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given a random noise and neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed as the conditional generation model, where the autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latent representations. Such a formulation poses a crucial challenge: VFI expects that the output is $\textit{deterministically}$ equal to the ground truth intermediate frame, but LDMs $\textit{randomly}$ generate a diverse set of different images when the model runs multiple times. The reason for the diverse generation is that the cumulative variance (variance accumulated at each step of generation) of generated latent representations in LDMs is large. This makes the sampling trajectory random, resulting in diverse rather than deterministic generations. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations. Our experiments suggest that our method can improve together with the improvement of the autoencoder and achieve state-of-the-art performance in VFI, leaving strong potential for further enhancement.
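For reference, the standard Brownian-bridge marginal (which a consecutive Brownian Bridge formulation builds on; the paper's exact parameterization may differ) pins both endpoints and therefore keeps the accumulated variance bounded and vanishing at the endpoints:

```latex
x_t \mid x_0, x_T \;\sim\; \mathcal{N}\!\Big( \big(1-\tfrac{t}{T}\big)\,x_0 \;+\; \tfrac{t}{T}\,x_T,\;\; \sigma^2\,\tfrac{t\,(T-t)}{T}\,\mathbf{I} \Big)
```

The variance term $\sigma^2 t(T-t)/T$ is zero at $t=0$ and $t=T$ and never exceeds $\sigma^2 T/4$, in contrast to a standard diffusion whose accumulated variance grows with $t$, which is consistent with the deterministic-initial-value argument above.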



Paperid:751 Poster
Authors:Lianghui Zhu,Junwei Zhou,Yan Liu,Hao Xin,Wenyu Liu,Xinggang Wang
Abstract:
Weakly-supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM and solves weakly-supervised object detection (WSOD) and segmentation by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses SAM's shortcomings in object detection and segmentation, namely its reliance on human prompts and its lack of category awareness. Our results indicate that WeakSAM significantly surpasses previous state-of-the-art methods on WSOD and WSIS benchmarks with large margins, i.e., average improvements of 7.4% and 8.5%, respectively.



Paperid:752 Poster
Authors:Ziyi Gao,Kai Chen,Zhipeng Wei,Tingshu Mou,Jingjing Chen,Zhiyu Tan,Hao Li,Yu-Gang Jiang
Abstract:
Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models' latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.



Paperid:753 Poster
Authors:Can Cui,Siteng Huang,Wenxuan Song,Pengxiang Ding,Zhang Min,Donglin Wang
Abstract:
To address the occlusion issues in person Re-Identification (ReID) tasks, many methods have been proposed to extract part features by introducing external spatial information. However, due to missing part appearance information caused by occlusion and noisy spatial information from the external model, these purely vision-based approaches fail to correctly learn the concepts of human body parts from limited training data and struggle to accurately locate body parts, ultimately leading to misaligned part features. To tackle these challenges, we propose a Prompt-guided Feature Disentangling method (ProFD), which leverages the rich pre-trained knowledge in the textual modality to facilitate the model in generating well-aligned part features. ProFD first designs part-specific prompts and utilizes noisy segmentation masks to preliminarily align visual and textual embeddings, enabling the textual prompts to have spatial awareness. Furthermore, to alleviate the noise from external masks, ProFD adopts a hybrid-attention decoder, ensuring spatial and semantic consistency during the decoding process to minimize noise impact. Additionally, to avoid catastrophic forgetting, we employ a self-distillation strategy, retaining the pre-trained knowledge of CLIP with memory banks to mitigate over-fitting during training. Evaluation results on the Market1501, DukeMTMC-reID, Occluded-Duke, Occluded-ReID, and P-DukeMTMC datasets demonstrate that ProFD achieves competitive performance, surpassing previous methods and achieving state-of-the-art results.



Paperid:754 Poster
Authors:Yu Liu,Longhan Feng,Qi Jia,Zezheng Liu,Zihuang Cao
Abstract:
Elliptical Object Detection (EOD) is crucial yet challenging due to complex scenes and varying object characteristics. Existing methods often struggle with parameter configurations and lack adaptability in label-scarce scenarios. To address this, a new semi-supervised teacher-student framework, Dual-Teacher Collaborative Guidance (DTCG), is proposed, comprising a five-parameter teacher detector, a six-parameter teacher detector, and a student detector. This allows the two teachers, specializing in different regression approaches, to co-instruct the student within a unified model, preventing errors and enhancing performance. Additionally, a feature correlation module (FCM) highlights differences between teacher features and employs deformable convolution to select advantageous features for final parameter regression. A collaborative training strategy (CoT) updates the teachers asynchronously, breaking through training and performance bottlenecks. Extensive experiments conducted on two widely recognized datasets affirm the superior performance of our DTCG over other leading competitors across various semi-supervised scenarios. Notably, our method achieves a 5.61% higher performance than the second best method when utilizing only 10% annotated data.



Paperid:755 Poster
Authors:Wonwoo Cho,Kangyeol Kim,Saemee Choi,Jaegul Choo
Abstract:
Despite the growing prevalence of black-box pre-trained models (PTMs) such as prediction API services and proprietary software, there remains a significant challenge in directly applying general models to real-world scenarios due to the data distribution gap. Considering a data deficiency and constrained computational resource scenario, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two novel training techniques. First, we align the input space (i.e., image) of PTMs to the target data distribution by generating visual prompts of spatial and frequency domain. Along with the novel spatial-frequency hybrid visual prompter, we design a novel training technique based on probabilistic clusters, which can enhance class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in a few-shot transfer learning setting across extensive visual recognition datasets, surpassing state-of-the-art baselines. Additionally, the proposed method efficiently reduces computational costs for training and inference phases.
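One way such spatial- and frequency-domain visual prompts can be realized is shown below: a learnable additive prompt in pixel space plus a learnable additive perturbation of the image spectrum. The module name, parameter shapes, and the rfft2-based parameterisation are assumptions for illustration, not the paper's prompter.

```python
import torch
import torch.nn as nn

class SpatialFrequencyPrompter(nn.Module):
    """Learnable visual prompts applied in both the pixel and the frequency domain (sketch)."""
    def __init__(self, c=3, h=224, w=224):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(1, c, h, w))
        # rfft2 keeps w // 2 + 1 frequency columns; store real/imag parts in the last dim
        self.freq = nn.Parameter(torch.zeros(1, c, h, w // 2 + 1, 2))

    def forward(self, x):                        # x: (B, C, H, W)
        x = x + self.spatial                     # pixel-domain prompt
        spec = torch.fft.rfft2(x, norm="ortho")  # frequency-domain prompt
        spec = spec + torch.view_as_complex(self.freq)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```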



Paperid:756 Poster
Authors:Taoyu Su,Jiawei Sheng,Shicheng Wang,Xinghua Zhang,Hongbo Xu,Tingwen Liu
Abstract:
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between multi-modal knowledge graphs (MMKGs), where the entities can be associated with related images. Most existing studies integrate multi-modal information heavily relying on the automatically-learned fusion module, rarely suppressing the redundant information for MMEA explicitly. This characteristic would involve alignment-irrelevant misleading clues from modalities, and hampers model prediction especially in low-resource or high-noise data scenarios. To this end, we explore variational information bottleneck for multi-modal entity alignment (IBMEA), which emphasizes the alignment-relevant information and suppresses the alignment-irrelevant information in generating entity representations. Specifically, we devise multi-modal variational encoders to generate modal-specific entity representations as probability distributions. Then, we propose four modal-specific information bottleneck regularizers, limiting the misleading clues in refining modal-specific entity representations. Finally, we propose a modal-hybrid information contrastive regularizer to integrate all the refined modal-specific representations, enhancing the entity similarity between MMKGs to achieve MMEA. We conduct extensive experiments on two cross-KG and three bilingual MMEA datasets. Experimental results demonstrate that our model consistently outperforms previous state-of-the-art methods, and also shows promising and robust performance in low-resource and high-noise data scenarios. Our code is available at https://anonymous.4open.science/r/IBMEA.
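A typical variational information-bottleneck regularizer of this kind is the KL divergence between each modal-specific Gaussian encoder and a standard normal prior, combined with reparameterised sampling. The snippet below is a generic sketch of that penalty; the weighting and the exact form of IBMEA's four regularizers are not specified here.

```python
import torch

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims, averaged over the batch."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, diag(exp(logvar))) with the reparameterisation trick."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```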



Paperid:757 Poster
Authors:Fangming Cui,Xun Yang,Chao Wu,Liang Xiao,Xinmei Tian
Abstract:
Prompt learning represents a promising method for adapting pre-trained vision-language models (VLMs) to various downstream tasks by learning a set of text embeddings. One challenge inherent to these methods is the poor generalization performance due to the invalidity of the learned text embeddings for unseen tasks. A straightforward approach to bridge this gap is to freeze the text embeddings in prompts, which results in a lack of capacity to adapt VLMs for downstream tasks. To address this dilemma, we propose a paradigm called EnPrompt with a novel External Layer (EnLa). Specifically, we propose a textual external layer and learnable visual embeddings for adapting VLMs to downstream tasks. The learnable external layer is built upon valid embeddings of pre-trained CLIP. This design considers the balance of learning capabilities between the two branches. To align the textual and visual features, we propose a novel two-pronged approach: i) we introduce the optimal transport as the discrepancy metric to align the vision and text modalities, and ii) we introduce a novel strengthening feature to enhance the interaction between these two modalities. Four representative experiments (i.e., base-to-novel generalization, few-shot learning, cross-dataset generalization, domain shifts generalization) across 15 datasets demonstrate that our method outperforms the existing prompt learning method.



Paperid:758 Poster
Authors:Francesco Tonini,Nicola Dall'Asen,Lorenzo Vaquero,Cigdem Beyan,Elisa Ricci
Abstract:
Gaze target detection aims at determining the image location where a person is looking. While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor. In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during the training phase. AL-GTD achieves the best of all AUC results by utilizing only 40-50% of the training data, in contrast to state-of-the-art (SOTA) gaze target detectors requiring the entire training dataset to achieve the same performance. Importantly, AL-GTD quickly reaches satisfactory performance with 10-20% of the training data, showing the effectiveness of our acquisition function, which is able to acquire the most informative samples. We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime. The code for the proposed method and counterpart AL methods will be released upon acceptance.



Paperid:759 Poster
Authors:Xiangbo Yin,Jiangming Shi,Yachao Zhang,Yang Lu,zhizhong zhang,Yuan Xie,Yanyun Qu
Abstract:
Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) presents a formidable challenge, which aims to match pedestrian images across visible and infrared modalities without any annotations. Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle. Most existing works primarily focus on shielding the model from the harmful effects of noise, neglecting to calibrate noisy pseudo-labels usually associated with hard samples, which will compromise the robustness of the model. To address this issue, we design a Robust Pseudo-label Learning with Neighbor Relation (RPNR) framework for USVI-ReID. To be specific, we first introduce a straightforward yet potent Noisy Pseudo-label Calibration module to correct noisy pseudo-labels. Due to the high intra-class variations, noisy pseudo-labels are difficult to calibrate completely. Therefore, we introduce a Neighbor Relation Learning module to reduce high intra-class variations by modeling potential interactions between all samples. Subsequently, we devise an Optimal Transport Prototype Matching module to establish reliable cross-modality correspondences. On that basis, we design a Memory Hybrid Learning module to jointly learn modality-specific and modality-invariant information. Comprehensive experiments conducted on two widely recognized benchmarks, SYSU-MM01 and RegDB, demonstrate that RPNR outperforms the current state-of-the-art GUR with an average Rank-1 improvement of 10.3%. The source codes will be released soon.



Paperid:760 Poster
Authors:Xiangcheng Zhai,Yingqi Jie,Xueguang Xie,Aimin Hao,Na Jiang,Yang Gao
Abstract:
Generating photorealistic animations from a single still photo represents a significant advancement in multimedia editing and artistic creation. While existing AIGC methods have reached milestone successes, they often struggle with maintaining consistency with real-world physical laws, particularly in fluid dynamics. To address this issue, this paper introduces ANFluid, a physics solver and data-driven coupled framework that combines physics-aware simulation (PAS) and dual-flow texture learning (DFTL) to animate natural fluid photos effectively. The PAS component of ANFluid ensures that motion guides adhere to physical laws, and can be automatically tailored with specific numerical solver to meet the diversities of different fluid scenes. Concurrently, DFTL focuses on enhancing texture prediction. It employs bidirectional self-supervised optical flow estimation and multi-scale wrapping to strengthen dynamic relationships and elevate the overall animation quality. Notably, despite being built on a transformer architecture, the innovative encoder-decoder design in DFTL does not increase the parameter count but rather enhances inference efficiency. Extensive quantitative experiments have shown that our ANFluid surpasses most current methods on the Holynski and CLAW datasets. User studies further confirm that animations produced by ANFluid maintain better physical and content consistency with the real world and the original input, respectively. Moreover, ANFluid supports interactive editing during the simulation process, enriching the animation content and broadening its application potential.



Paperid:761 Poster
Authors:Zixian Gao,Disen Hu,Xun Jiang,Huimin Lu,Heng Tao Shen,Xing Xu
Abstract:
Multimodal sentiment analysis, which has garnered widespread attention in recent years, aims to predict human emotional states using multimodal data. Previous studies have primarily focused on enhancing multimodal fusion and integrating information across different modalities while overlooking the impact of noisy data on the internal features of each single modality. In this paper, we propose the Enhanced experts with Uncertainty-Aware Routing (EUAR) method to address the influence of noisy data on multimodal sentiment analysis by capturing uncertainty and dynamically altering the network. Specifically, we introduce the Mixture of Experts approach into multimodal sentiment analysis for the first time, leveraging its properties under conditional computation to dynamically alter the network in response to different types of noisy data. In particular, we refine the experts within the MoE framework to capture uncertainty in the data and extract clearer features. Additionally, a novel routing mechanism is introduced. Through our proposed U-loss, which utilizes the uncertainty quantified by the experts, the network learns to route different samples to experts with lower uncertainty for processing, thus obtaining clearer, noise-free features. Experimental results demonstrate that our method achieves state-of-the-art performance on three widely used multimodal sentiment analysis datasets. Moreover, experiments on noisy datasets show that our approach outperforms existing methods in handling noisy data. Our anonymous implementation code is available at https://anonymous.4open.science/r/EUAR-7BF6.
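If each expert outputs a Gaussian over features, routing by uncertainty can be sketched as picking, per sample, the expert with the smallest predicted variance; the routing function and the toy uncertainty penalty below are illustrative assumptions and not EUAR's actual U-loss.

```python
import torch

def route_by_uncertainty(expert_means, expert_logvars):
    """Pick, per sample, the expert with the lowest predicted uncertainty.
    expert_means / expert_logvars: (num_experts, B, D)."""
    uncertainty = expert_logvars.exp().mean(-1)            # (num_experts, B)
    best = uncertainty.argmin(dim=0)                       # (B,) index of chosen expert
    idx = best.view(1, -1, 1).expand(1, -1, expert_means.size(-1))
    return torch.gather(expert_means, 0, idx).squeeze(0)   # (B, D) selected features

def uncertainty_penalty(expert_logvars):
    """Toy penalty keeping expert variances small; the paper's U-loss is likely more involved."""
    return expert_logvars.exp().mean()
```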



Paperid:762 Poster
Authors:Xun Jiang,Zhuoyuan Wei,Shenshen Li,Xing Xu,Jingkuan Song,Heng Tao Shen
Abstract:
Temporal Sentence Grounding (TSG), which aims to localize events in untrimmed videos with a given language query, has been widely studied in the last decades. However, researchers have recently demonstrated that previous approaches are severely limited in out-of-distribution generalization, thus proposing the De-biased TSG challenge, which requires models to overcome weaknesses towards outlier test samples. In this paper, we design a novel framework, termed Counterfactually-Augmented Event Matching (CAEM), which incorporates counterfactual data augmentation to learn event-query joint representations that resist the training bias. Specifically, it consists of three components: (1) A Temporal Counterfactual Augmentation module that generates counterfactual video-text pairs by temporally delaying events in the untrimmed video, enhancing the model's capacity for counterfactual thinking. (2) An Event-Query Matching model that is used to learn joint representations and predict corresponding matching scores for each event candidate. (3) A Counterfact-Adaptive Framework (CAF) that imposes counterfactual consistency rules on the matching process of the same event-query pairs, further mitigating the bias learned from training sets. We conduct thorough experiments on two widely used De-biased TSG datasets, i.e., Charades-CD and ActivityNet-CD, to evaluate our proposed CAEM method. Extensive experimental results show that our proposed CAEM method outperforms recent state-of-the-art methods on all datasets. Our anonymous implementation code is available at https://anonymous.4open.science/r/CAEM-6734.



Paperid:763 Poster
Authors:Leqi Shen,Sicheng Zhao,Yifeng Zhang,Hui Chen,Jundong Zhou,pengzhang liu,Yongjun Bao,Guiguang Ding
Abstract:
Collecting large-scale multi-label data with full labels is difficult for real-world scenarios. Many existing studies have tried to address the issue of missing labels caused by annotation but ignored the difficulties encountered during the annotation process. We find that the high annotation workload can be attributed to two reasons: learning and annotating. In this paper, we propose a new setting, i.e., block diagonal labels, to reduce the workload on both sides. The numerous categories can be divided into different subsets based on semantics and relevance. Each annotator can only focus on its own subset of labels so that only a small set of highly relevant labels are required to be annotated per image. To deal with the issue of such missing labels, we introduce a simple yet effective method that does not require any prior knowledge of the dataset. In practice, we propose an Adaptive Pseudo-Labeling method to predict the unknown labels with less noise. Formal analysis is conducted to evaluate the superiority of our setting. Extensive experiments are conducted to verify the effectiveness of our method on multiple widely used benchmarks: VOC2012, COCO2014, NUSWIDE, and OpenImages. Specifically, our method achieves 81.92% mAP on COCO2014 with 9.31% annotated labels and 63.26% mAP on NUSWIDE with 15.60% annotated labels.



Paperid:764 Poster
Authors:Kaixin Shen,Ruijie Quan,Linchao Zhu,Jun Xiao,Yi Yang
Abstract:
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model. We will release our code.



Paperid:765 Poster
Authors:Wanru Xu,Zhenjiang Miao,Yi Tian,Yigang Cen,Lili Wan,Ma Xiaole
Abstract:
Visual abduction reasoning aims to find the most plausible explanation for incomplete observations, and suffers from inherent uncertainties and ambiguities, which mainly stem from the latent causal relations, incomplete observations, and the reasoning itself. To address this, we propose a probabilistic model named Uncertainty-Guided Probabilistic Distillation Transformer (UPD-Trans) to model uncertainties for Visual Abductive Reasoning. In order to better discover the correct cause-effect chain, we model all the potential causal relations into a unified reasoning framework, so that both the direct relations and the latent relations are considered. In order to reduce the effect of stochasticity and uncertainty on reasoning: (1) we extend the deterministic Transformer to a probabilistic Transformer by considering those uncertain factors as Gaussian random variables and explicitly modeling their distributions; (2) we introduce a distillation mechanism between the posterior branch with complete observations and the prior branch with incomplete observations to transfer posterior knowledge. Evaluation results on the benchmark datasets consistently demonstrate the commendable performance of our UPD-Trans, with significant improvements after latent relation modeling and uncertainty modeling.
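The posterior-to-prior distillation step typically amounts to a KL divergence between two diagonal Gaussians, with gradients blocked on the posterior side; the helper below is a generic sketch of that term (UPD-Trans's exact distillation loss is not given in the abstract).

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over dimensions and averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(-1).mean()

# Posterior branch (complete observations) teaches the prior branch (incomplete observations);
# gradients flow only into the prior parameters:
# distill_loss = kl_diag_gaussians(mu_post.detach(), logvar_post.detach(), mu_prior, logvar_prior)
```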



Paperid:766 Poster
Authors:Zewen Du,Zhenjiang Hu,Guiyu Zhao,Ying Jin,Hongbin Ma
Abstract:
Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or necessitate the utilization of high-resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that local self-attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (i.e., feature reassembly of neighboring points). Therefore, we introduce local self-attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self-attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self-attention, thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU utilizes the feature of queries to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA-AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA-AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA-AQU consistently outperforms previous state-of-the-art upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU compared to the baseline models in the aforementioned four tasks, respectively. The code will be released soon.
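The "upsampling as local self-attention" view can be sketched without the deformation mechanism: every high-resolution output is reassembled from a k x k low-resolution neighbourhood, with attention weights computed against an interpolated query. The function below is a minimal, non-deformable sketch of that idea, not LDA-AQU itself.

```python
import torch
import torch.nn.functional as F

def local_attention_upsample(x, scale=2, k=3):
    """Upsample (B, C, H, W) features by attending over each output location's
    k x k low-resolution neighbourhood, using a bilinearly interpolated query."""
    b, c, h, w = x.shape
    q = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    # k x k neighbourhood of every low-resolution position
    kv = F.unfold(x, kernel_size=k, padding=k // 2)          # (B, C*k*k, H*W)
    kv = kv.view(b, c, k * k, h, w)
    # assign each high-resolution query the neighbourhood of its nearest low-res position
    kv = kv.repeat_interleave(scale, dim=3).repeat_interleave(scale, dim=4)
    attn = torch.einsum("bchw,bckhw->bkhw", q, kv) / c ** 0.5
    attn = attn.softmax(dim=1)
    return torch.einsum("bkhw,bckhw->bchw", attn, kv)
```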



Paperid:767 Poster
Authors:Xingyuan Mao,Yuwen Liu,Lianyong Qi,Li Duan,Xiaolong Xu,Xuyun Zhang,Wanchun Dou,Amin Beheshti,Xiaokang Zhou
Abstract:
Federated learning addresses privacy concerns in multimedia recommender systems by enabling collaborative model training without exchanging raw data. However, existing federated recommendation models are mainly based on basic backbones like Matrix Factorization (MF), which is inadequate to capture complex implicit interactions between users and multimedia contents. Graph Convolutional Networks (GCNs) offer a promising method by utilizing the information from high-order neighbors, but face challenges in federated settings due to problems such as over-smoothing, data heterogeneity, and elevated communication expenses. To resolve these problems, we propose a Cluster-driven Personalized Federated Recommender System with Interest-aware Graph Convolution Network (CPF-GCN) for multimedia recommendation. CPF-GCN comprises a local interest-aware GCN module that optimizes node representations through subgraph-enhanced adaptive graph convolution operations, mitigating the over-smoothing problem by adaptively extracting information from layers and selectively utilizing high-order connectivity based on user interests. Simultaneously, a cluster-driven aggregation approach at the server significantly reduces communication costs by selectively aggregating models from clusters. The aggregation produces a global model and cluster-level models, combining them with the user's local model allows us to tailor the recommendation model for the user, achieving personalized recommendations. Moreover, we propose an adversarial optimization technique to further augment the robustness of CPF-GCN. Experiments on three multimedia datasets demonstrate that CPF-GCN significantly outperforms the state-of-the-art models.



Paperid:768 Poster
Authors:Jie Liang,Rongjie Wang,Rui Peng,Zhe Zhang,Kaiqiang Xiong,Ronggang Wang
Abstract:
The quality of 3D models reconstructed by PatchMatch Multi-View Stereo remains a challenging problem due to unreliable photometric consistency in object boundaries and textureless areas. Since textureless areas usually exhibit strong planarity, previous methods used planar priors and significantly improved reconstruction performance. However, their planar prior ignores the depth discontinuity at object boundaries, making the boundaries inaccurate (not sharp). In addition, due to unreliable planar models in large-scale low-textured objects, the reconstruction results are incomplete. To address the above issues, we introduce the segmentation generated by the Segment Anything Model into PatchMatch (PM) pipelines for the first time. We use segmentation to determine whether the depth is continuous, based on the characteristic that segmentation and depth share boundaries. Then we segment planes at object boundaries and enhance the consistency of planes within objects. Specifically, we construct a $\textbf{Boundary Plane}$ that fits the object boundary and an $\textbf{Object Plane}$ to increase the consistency of planes in large-scale textureless objects. Finally, we use a probabilistic graphical model to calculate the $\textbf{Aggregated Prior guided by Multiple Planes}$ and embed it into the matching cost. The experimental results indicate that our method achieves state-of-the-art boundary sharpness on ETH3D. It also significantly improves the completeness of weakly textured objects. We also validated the generalization of our method on Tanks&Temples.



Paperid:769 Poster
Authors:Qinfeng Li,Zhiqiang Shen,Zhenghan Qin,Yangfan Xie,Xuhong Zhang,Tianyu Du,Sheng Cheng,Xun Wang,Jianwei Yin
Abstract:
Proprietary large language models (LLMs) have been widely applied in various scenarios. Additionally, deploying LLMs on edge devices is trending for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security challenges: edge-deployed models are white-box accessible to users, enabling adversaries to conduct effective model stealing (MS) attacks. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify four critical protection properties that existing methods fail to simultaneously satisfy: (1) maintaining protection after a model is physically copied; (2) authorizing model access at the request level; (3) safeguarding against runtime reverse engineering; (4) achieving high security with negligible runtime overhead. To address the above issues, we propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices. The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment, e.g., TEE. The authorization module can freshly authorize each request based on its input. Extensive experiments show that TransLinkGuard achieves security protection equivalent to black-box security guarantees with negligible overhead.



Paperid:770 Poster
Authors:Ran Yi,Haokun Zhu,Teng Hu,Yu-Kun Lai,Paul L Rosin
Abstract:
Recent studies have shown impressive progress in universal style transfer which can integrate arbitrary styles into content images. However, existing approaches struggle with low aesthetics and disharmonious patterns in the final results. To address this problem, we propose AesStyler, a novel Aesthetic Guided Universal Style Transfer method. Specifically, our approach introduces the aesthetic assessment model, trained on a dataset with human-assessed aesthetic scores, into the universal style transfer task to accurately capture aesthetic features that universally resonate with human aesthetic preferences. Unlike previous methods which only consider aesthetics of specific style images, we propose to build a Universal Aesthetic Codebook (UAC) to harness universal aesthetic features that encapsulate the global aspects of aesthetics. Aesthetic features are fed into a novel Universal and Style-specific Aesthetic-Guided Attention (USAesA) module to guide the style transfer process. USAesA empowers our model to integrate the aesthetic attributes of both universal and style-specific aesthetic features with style features and facilitates the fusion of these aesthetically enhanced style features with content features. Extensive experiments and user studies have demonstrated that our approach generates aesthetically more harmonious and pleasing results than the state-of-the-art methods, both aesthetic-free and aesthetic-aware.



Paperid:771 Poster
Authors:Yalan Qin,Li Qian
Abstract:
Multi-view clustering methods have been extensively explored over the past decades. These methods are built on the assumption that the data are sampled from multiple low-dimensional subspaces and that each group fits into one of these subspaces. The quadratic or cubic computational complexity of these methods is inevitable, making it difficult to cluster large-scale multi-view datasets. Some efforts have been presented to select key anchors beforehand to capture the data distributions in different views. Despite significant progress, these methods pay little attention to deriving a provably scalable and correct method for finding the optimal shared anchor graph from the geometric interpretation perspective. They also fail to strike a good balance between the connectedness and subspace-preserving properties of the shared anchor graph. In this paper, we propose Fast Elastic-Net Multi-view Clustering (FENMC) from a geometric interpretation perspective. We provide a geometric analysis for determining the optimal shared anchor graph based on the introduced elastic-net regularizer for fast multi-view clustering, where the elastic-net regularizer is built on a mixture of $L_2$ and $L_1$ norms. We also give a theoretical justification for the balance between the connectedness and subspace-preserving properties of the shared anchor graph for multi-view clustering. Our experiments on different datasets show that the proposed method not only obtains satisfactory clustering performance, but also deals with large-scale datasets with high efficiency.
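
For intuition about the elastic-net regularizer, the following sketch (our own simplification, not the FENMC solver) estimates an anchor graph Z by proximal gradient descent on a reconstruction loss plus the $L_1$/$L_2$ mixture; anchor selection, the multi-view coupling, and the paper's theoretical guarantees are all omitted:

```python
import numpy as np

def elastic_net_anchor_graph(X, A, lam1=0.1, lam2=0.1, lr=0.01, iters=200):
    """Solve min_Z ||X - A Z||_F^2 + lam1*||Z||_1 + lam2*||Z||_F^2 by
    proximal gradient descent.  X: (d, n) data, A: (d, m) anchors."""
    Z = np.zeros((A.shape[1], X.shape[1]))
    for _ in range(iters):
        grad = 2 * A.T @ (A @ Z - X) + 2 * lam2 * Z                 # smooth part
        Z = Z - lr * grad
        Z = np.sign(Z) * np.maximum(np.abs(Z) - lr * lam1, 0.0)     # L1 proximal step
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, A = rng.standard_normal((5, 20)), rng.standard_normal((5, 8))
    print(elastic_net_anchor_graph(X, A).shape)                     # (8, 20)
```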



Paperid:772 Poster
Authors:Jian Chen,Wei Wang,Yuzhu Hu,Junxin Chen,Han Liu,Xiping Hu
Abstract:
Online chatting has become an essential aspect of our daily interactions, with stickers emerging as a prevalent tool for conveying emotions more vividly than plain text. While conventional image emotion recognition focuses on global features, sticker emotion recognition necessitates incorporating both global and local features, along with additional modalities like text. To address this, we introduce a topic ID-guided transformer method to facilitate a more nuanced analysis of the stickers. Considering that each sticker will have a topic, and stickers with the same topic will have the same object, we introduce a topic ID and regard the stickers with the same topic ID as topic context. Our approach encompasses a novel topic-guided context-aware module and a topic-guided attention mechanism, enabling the extraction of comprehensive topic context features from stickers sharing the same topic ID, significantly enhancing emotion recognition accuracy. Moreover, we integrate a frequency linear attention module to leverage frequency domain information to better capture the object information of the stickers and a locally enhanced re-attention mechanism for improved local feature extraction. Extensive experiments and ablation studies on the large-scale sticker emotion dataset SER30k validate the efficacy of our method. Experimental results show that our proposed method obtains the best accuracy on both single-modal and multi-modal sticker emotion recognition.



Paperid:773 Poster
Authors:Yubo Wang,Chaohu Liu,yanqiuqu,Haoyu Cao,Deqiang Jiang,Linli Xu
Abstract:
Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities. However, the visual module introduces new challenges in terms of robustness for LVLMs, as attackers can craft adversarial images that are visually clean but may mislead the model to generate incorrect answers. In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively. Therefore, we are curious about one question: can LVLMs still generate correct responses when the encoded visual tokens are attacked, disrupting the visual information? To this end, we propose a non-targeted attack method referred to as VT-Attack (Visual Tokens Attack), which constructs adversarial examples from multiple perspectives, with the goal of comprehensively disrupting feature representations and inherent relationships as well as the semantic properties of visual tokens output by image encoders. Using only access to the image encoder in the proposed attack, the generated adversarial examples exhibit transferability across diverse LVLMs utilizing the same image encoder and generality across different tasks. Extensive experiments validate the superior attack performance of VT-Attack over baseline methods, demonstrating its effectiveness in attacking LVLMs with image encoders, which in turn can provide guidance on the robustness of LVLMs, particularly in terms of the stability of the visual feature space.



Paperid:774 Poster
Authors:Zihao Liu,Xiaoyu Wu,Shengjin Wang,Jiayao Qian
Abstract:
Large-scale pretrained image-language models have shown remarkable performance recently. However, building a video-language model is more challenging due to the complexity of video and the difficulty of collecting high-quality data. This paper builds a video-language model in an adaptive manner, which transfers the knowledge from the image domain and can achieve state-of-the-art performance without any further massive video pretraining. The main contributions include a Visual Perception Adapter that seamlessly and efficiently adapts a pretrained image-language model to the video domain and a fine-grained contrastive learning with Inter-modal Token Alignment that bridges semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and retrieval. Experiments demonstrate that the proposed model exhibits competitive performance compared to models pretrained on millions of video-text pairs. Notably, our model's CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state-of-the-art by 6.3% and 1.3%.



Paperid:775 Poster
Authors:Xiao Yu,Kejiang Chen,Kai Zeng,Han Fang,Zijin Yang,Xiuwei Shang,Yuang Qi,Weiming Zhang,Nenghai Yu
Abstract:
The rapid development of image generative models has lowered the threshold for image creation but also raised security concerns related to the propagation of false information, urgently necessitating the development of detection technologies for AI-generated images. Presently, text-to-image generation stands as the predominant approach to image generation, where the rendering of generated images hinges on two primary factors: text prompts and the inherent characteristics of the model. However, the variety of semantic text prompts yields diverse generated images, posing significant challenges to existing detection methodologies that rely solely on learning from image features, particularly in scenarios with limited samples. To tackle these challenges, this paper presents a novel perspective on the AI-generated image detection task, advocating for detection under semantic-decoupling conditions. Building upon this insight, we propose SemGIR, a semantic-guided image regeneration based method for AI-generated image detection. SemGIR first regenerates images through image-to-text followed by a text-to-image generation process, subsequently utilizing these re-generated image pairs to derive discriminative features. This regeneration process effectively decouples semantic features organically, allowing the detection process to concentrate more on the inherent characteristics of the generative model. Such an efficient detection scheme can also be effectively applied to attribution. Experimental findings demonstrate that in realistic scenarios with limited samples, SemGIR achieves an average detection accuracy 15.76% higher than state-of-the-art (SOTA) methods. Furthermore, in attribution experiments on the SDv2.1 model, SemGIR attains an accuracy exceeding 98%, affirming the effectiveness and practical utility of the proposed method.



Paperid:776 Poster
Authors:Shen Lin,Xiaoyu Zhang,Willy Susilo,Xiaofeng Chen,Jun Liu
Abstract:
As concerns over privacy protection grow and relevant laws come into effect, machine unlearning (MU) has emerged as a pivotal research area. Due to the complexity of the forgetting data distribution, sample-wise MU remains an open challenge. Gradient ascent, as the inverse of gradient descent, is naturally applied to machine unlearning, which is also the inverse process of machine learning. However, the straightforward gradient ascent MU method suffers from the trade-off between effectiveness, fidelity, and efficiency. In this work, we analyze the gradient ascent MU process from a multi-task learning (MTL) view. This perspective reveals two problems that cause the trade-off, i.e., the gradient direction problem and the gradient dominant problem. To address these problems, we propose a novel MU method, namely GDR-GMA, consisting of Gradient Direction Rectification (GDR) and Gradient Magnitude Adjustment (GMA). For the gradient direction problem, GDR rectifies the direction between the conflicting gradients by projecting a gradient onto the orthonormal plane of the conflicting gradient. For the gradient dominant problem, GMA dynamically adjusts the magnitude of the update gradients by assigning a dynamic magnitude weight parameter to the update gradients. Furthermore, we evaluate GDR-GMA against several baseline methods in three sample-wise MU scenarios: random data forgetting, sub-class forgetting, and class forgetting. Extensive experimental results demonstrate the superior performance of GDR-GMA in effectiveness, fidelity, and efficiency. Code is available at https://github.com/RUIYUN-ML/GDR-GMA.
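
A minimal sketch of the gradient-direction-rectification idea follows (illustrative only; the dynamic magnitude weight below is our own placeholder, not the paper's GMA formula). When the ascent gradient on the forgetting data conflicts with the descent gradient on the retained data (negative inner product), each is projected onto the plane orthogonal to the other before combining the update:

```python
import torch

def rectify(g_a, g_b):
    """Project g_a onto the plane orthogonal to g_b when the two gradients conflict."""
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a = g_a - dot / (g_b.norm() ** 2 + 1e-12) * g_b
    return g_a

def unlearning_update(g_forget, g_retain, beta=0.5):
    """Combine rectified directions with an illustrative magnitude-balancing weight."""
    g_f = rectify(g_forget, g_retain)
    g_r = rectify(g_retain, g_forget)
    w = beta * g_r.norm() / (g_f.norm() + 1e-12)   # hypothetical dynamic magnitude weight
    return w * g_f + g_r

if __name__ == "__main__":
    torch.manual_seed(0)
    print(unlearning_update(torch.randn(10), torch.randn(10)).shape)  # torch.Size([10])
```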



Paperid:777 Poster
Authors:Kun Wang,Hao Liu,Lirong Jie,Zixu Li,Yupeng Hu,Liqiang Nie
Abstract:
Video moment localization (VML) aims to identify the temporal boundary of the target moment semantically matching the given query. Existing approaches fall into three paradigms: fully-supervised, weakly-supervised, and point-supervised. Compared to the other two paradigms, point-supervised VML strikes a balance between localization accuracy and annotation cost. However, it is still in its infancy due to the following two challenges: explicit granularity alignment and implicit scale perception, especially when facing complex cross-modal correspondences. To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework aimed at modeling the semantic alignment between video and text, leveraging limited single-frame annotation information for correspondence learning. It explicitly models semantic relations of different feature granularities and adaptively mines the implicit semantic scale, thereby enhancing and utilizing modal feature representations of varying granularities and scales. SG-SCI employs a granularity correspondence alignment module to align semantic information by leveraging latent prior knowledge. Then we develop a scale correspondence learning strategy to identify and address semantic scale differences. Extensive comparison experiments, ablation studies, and necessary hyperparameter analyses on benchmark datasets have demonstrated the promising performance of our model over several state-of-the-art competitors.



Paperid:778 Poster
Authors:Xulu Zhang,Wengyu Zhang,Xiaoyong Wei,Jinlin Wu,Zhaoxiang Zhang,Zhen Lei,Qing Li
Abstract:
This paper presents a pilot study that explores the application of active learning, traditionally studied in the context of discriminative models, to generative models. We specifically focus on image synthesis personalization tasks. The primary challenge in conducting active learning on generative models lies in the open-ended nature of querying, which differs from the closed form of querying in discriminative models that typically target a single concept. We introduce the concept of anchor directions to transform the querying process into a semi-open problem. We propose a direction-based uncertainty sampling strategy to enable generative active learning and tackle the exploitation-exploration dilemma. Extensive experiments are conducted to validate the effectiveness of our approach, demonstrating that an open-source model can achieve superior performance compared to closed-source models developed by large companies, such as Google's StyleDrop. The source code is available at https://github.com/(open_upon_acceptance).



Paperid:779 Poster
Authors:Qi Chen,Wenjie Liu,Hu Ding
Abstract:
The Conditional Generative Adversarial Network (cGAN) is an important type of GAN that is often equipped with an auxiliary classifier. However, existing cGANs usually suffer from mode collapse, which can incur unstable performance in practice. In this paper, we propose a novel stable training method for cGANs that well preserves generation fidelity and diversity. Our key ideas are to design efficient adversarial training strategies for the auxiliary classifier and to mitigate the overconfidence issue caused by the cross-entropy loss. We propose a classifier-based cGAN called Confidence Guided Generative Adversarial Networks (CG-GAN) by introducing adversarial training to a $K$-way classifier. In particular, we show in theory that the obtained $K$-way classifier can encourage the generator to learn the real joint distribution. To further enhance the performance and stability, we propose to establish a high-entropy prior label distribution for the generated data and incorporate a reverse KL divergence term into the minimax loss of CG-GAN. Through a comprehensive set of experiments on the popular benchmark datasets, including the large-scale dataset ImageNet, we demonstrate the advantages of our proposed method over several state-of-the-art cGANs.
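
One plausible reading of the high-entropy prior term is sketched below under our own assumptions (the paper's exact loss and KL direction may differ): a reverse KL divergence between the classifier's predictive distribution on generated samples and a uniform prior, which is minimized when the prediction has maximum entropy.

```python
import math
import torch
import torch.nn.functional as F

def high_entropy_prior_loss(logits):
    """KL(p || u) between the predictive distribution p on generated samples and
    the uniform prior u over K classes; equals log K - H(p), so minimizing it
    pushes p toward maximum entropy.  Illustrative sketch only."""
    log_p = F.log_softmax(logits, dim=-1)
    k = logits.shape[-1]
    return (log_p.exp() * (log_p + math.log(k))).sum(dim=-1).mean()

if __name__ == "__main__":
    print(high_entropy_prior_loss(torch.randn(4, 10)).item())
```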



Paperid:780 Poster
Authors:Long Tian,Hongyi Zhao,Ruiying Lu,Rongrong Wang,YuJie Wu,Liming Wang,Xiongpeng He,Xiyang Liu
Abstract:
Few-Shot Industrial Anomaly Detection (FS-IAD) has drawn great attention most recently since data efficiency and the ability of designing algorithms for fast migration across products become the main concerns. The difficulty of memory-based IAD in low-data regimes primarily lies in inefficient measurement between the memory bank and query images. We address such pivot issues from a new perspective of optimal matching between features of image regions. Taking the unbalanced nature of patch-wise industrial image features into consideration, we adopt Conditional Transport (CT) as a metric to compute the structural distance between representations of the memory bank and query images to determine feature relevance. The CT generates the optimal matching flows between unbalanced structural elements that achieve the minimum matching cost, which can be directly used for IAD since it well reflects the differences of query images compared with the normal memory. Given that query images usually arrive one-by-one or batch-by-batch, we further propose an Online Conditional Transport (OCT) by making full use of current and historical query images for IAD via simultaneously calibrating the memory bank and matching features between the calibrated memory and the current query features. Going one step further, for products with sparse foregrounds, we employ a predominant segmentation model to implement Foreground-aware OCT (FOCT) to improve the effectiveness and efficiency of OCT by forcing the model to pay more attention to diverse targets rather than redundant background when calibrating the memory bank. FOCT can improve the diversity of calibrated memory during the whole IAD process, which is critical for robust FS-IAD in practice. Besides, FOCT is flexible since it works in a plug-and-play manner with any pre-trained backbone, such as WRN, and any pre-trained segmentation model, such as SAM. The effectiveness of our model is demonstrated across diverse datasets, including benchmarks of MVTec and MPDD, achieving SOTA performance.
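
To give a flavor of a conditional-transport style distance between query patch features and the memory bank, here is a generic balanced-CT sketch (our own simplification; the softmax "navigators", temperature, and mixing weight are assumptions, and the online memory calibration and foreground weighting of OCT/FOCT are not shown):

```python
import torch

def conditional_transport(x, y, rho=0.5, temperature=1.0):
    """Illustrative bidirectional conditional-transport cost between two point
    sets x: (n, d) and y: (m, d).  Forward: each x_i distributes mass over y
    via a softmax navigator; backward: each y_j distributes mass over x."""
    cost = torch.cdist(x, y) ** 2                           # (n, m) pairwise costs
    pi_fwd = torch.softmax(-cost / temperature, dim=1)      # rows: x -> y plans
    pi_bwd = torch.softmax(-cost / temperature, dim=0)      # cols: y -> x plans
    forward = (pi_fwd * cost).sum(dim=1).mean()
    backward = (pi_bwd * cost).sum(dim=0).mean()
    return rho * forward + (1 - rho) * backward

if __name__ == "__main__":
    torch.manual_seed(0)
    print(conditional_transport(torch.randn(50, 32), torch.randn(40, 32)).item())
```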



Paperid:781 Poster
Authors:Rui-Chen Zheng,Yang Ai,Zhen-Hua Ling
Abstract:
This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing real speech. This task falls under the umbrella of articulatory-to-acoustic (A2A) conversion and may also be referred to as a silent speech interface. To overcome the domain discrepancy between silent and standard vocalized articulation, we introduce a novel pseudo target generation strategy. It integrates the text modality to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation. Furthermore, we propose to employ a denoising diffusion probabilistic model as the fundamental architecture for the A2A conversion task and train the model using a combined training approach with the generated pseudo acoustic features. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in the silent speaking mode compared to all baseline methods. Specifically, the word error rate of the reconstructed speech decreases by approximately 5% when measured using an automatic speech recognition engine for intelligibility assessment, and the subjective mean opinion score for naturalness improves by 0.14. Moreover, analytical experiments reveal that the proposed pseudo target generation strategy can generate pseudo acoustic features that synchronize better with articulatory movements than previous strategies. Samples are available at our project page.



Paperid:782 Poster
Authors:Wenxiao Zhang,Ziqi Wang,Li Xu,Xun Yang,Jun Liu
Abstract:
Point clouds play a significant role in recent learning-based vision tasks, as they contain additional information about the physical space compared to 2D images. However, such a 3D data format also results in more expensive computation costs when training a sophisticated network with large 3D datasets. Previous methods for point cloud compression focus on compacting the representation of each point cloud for better storage and transmission. In this paper, we introduce a new open problem in the point cloud field: can we compress a large point cloud dataset into a much smaller synthetic dataset while preserving the important information of the original large dataset? In other words, we explore the possibility of training a network on a smaller dataset of informative point clouds extracted from the original large dataset while maintaining similar network performance. Training on this small synthetic dataset could largely improve the training efficiency. To explore this new open problem, we formulate it as a parameter-matching issue where a network should obtain similar parameters after training on the original set and the generated synthetic set, respectively. We find that we could achieve this goal by moving the critical points within each initial point cloud through an iterative gradient matching strategy. We conduct extensive experiments on various synthetic and real-scanned 3D object classification benchmarks, showing that training on our synthetic dataset achieves almost the same performance with only 5% of the point clouds of the ScanObjectNN dataset compared to training with the full dataset.
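
A toy single step of an iterative gradient matching strategy is sketched below (illustrative only, not the paper's code; it uses a tiny MLP on flat 3-D features rather than an actual point cloud network, and the outer loop over iterations and point clouds is omitted): the synthetic samples are moved so that the gradient they induce on the network matches the gradient induced by a real batch.

```python
import torch
import torch.nn as nn

def gradient_match_step(model, loss_fn, real_x, real_y, syn_x, syn_y, lr=0.1):
    """One gradient-matching update of the synthetic batch."""
    params = [p for p in model.parameters() if p.requires_grad]
    syn_x = syn_x.clone().requires_grad_(True)
    g_real = torch.autograd.grad(loss_fn(model(real_x), real_y), params)
    g_syn = torch.autograd.grad(loss_fn(model(syn_x), syn_y), params, create_graph=True)
    match = sum(((a.detach() - b) ** 2).sum() for a, b in zip(g_real, g_syn))
    (grad_x,) = torch.autograd.grad(match, syn_x)
    return (syn_x - lr * grad_x).detach()                   # updated synthetic samples

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 4))
    ce = nn.CrossEntropyLoss()
    real_x, real_y = torch.randn(32, 3), torch.randint(0, 4, (32,))
    syn_x, syn_y = torch.randn(8, 3), torch.randint(0, 4, (8,))
    print(gradient_match_step(net, ce, real_x, real_y, syn_x, syn_y).shape)  # (8, 3)
```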



Paperid:783 Poster
Authors:Xinwei Zhang,Aishan Liu,Tianyuan Zhang,Siyuan Liang,Xianglong Liu
Abstract:
Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, outperforming other baselines significantly (+25.15% on average in Attack Success Rate). Our code is available on the anonymous website.



Paperid:784 Poster
Authors:Jun Dan,Weiming Liu,Mushui Liu,Chunfeng Xie,Shunjie Dong,Guofang Ma,Yanchao Tan,Jiazheng Xing
Abstract:
Semi-supervised graph domain adaptation, as a subfield of graph transfer learning, seeks to precisely annotate unlabeled target graph nodes by leveraging transferable features acquired from the limited labeled source nodes. However, most existing studies directly utilize GCN-based feature extractors to capture domain-invariant node features, while neglecting the fact that GCNs are insufficient for collecting complex structural information in graphs. Considering the importance of graph structure information in encoding the complex relationship among nodes and edges, this paper aims to utilize such powerful information to assist graph transfer learning. To achieve this goal, we develop a novel framework called HOGDA. Concretely, HOGDA introduces a high-order structure information mixing module to effectively assist the feature extractor in capturing transferable node features. Moreover, to achieve fine-grained feature distribution alignment, the AWDA strategy is proposed to dynamically adjust the node weights during the adversarial domain adaptation process, effectively boosting the model's transfer ability. Furthermore, to mitigate the overfitting phenomenon caused by limited source labeled nodes, we also design a TNC strategy to guide the unlabeled nodes to achieve discriminative clustering. Extensive experimental results show that our HOGDA outperforms the state-of-the-art methods on various transfer tasks.



Paperid:785 Poster
Authors:Buyu Liu,Kai Wang,Yansong Liu,Jun Bao,Tingting Han,Jun Yu
Abstract:
This work aims to address multi-view perspective RGB generation from text prompts given Bird-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen viewpoints, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test time. Specifically, MVPbev firstly projects given BEV semantics to perspective view with camera parameters, empowering the model to generalize to unseen viewpoints. Then we introduce a multi-view attention module where special initialization and de-noising processes are introduced to explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Last but not least, MVPbev further allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Our extensive experiments on NuScenes demonstrate that our method is capable of generating high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing the state-of-the-art methods under various evaluation metrics. We further demonstrate the advances of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis. Our code and model will be made available.



Paperid:786 Poster
Authors:Xin Jiang,Hao Tang,Rui Yan,Jinhui Tang,Zechao Li
Abstract:
Fine-grained image retrieval (FGIR) aims to learn to generate visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose various techniques to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design high-performance FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain visual transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, our DVF achieves state-of-the-art performance on three widely-used fine-grained datasets in closed-set and open-set settings.



Paperid:787 Poster
Authors:Tianyuan Zhang,Lu Wang,Hainan Li,Yisong Xiao,Siyuan Liang,Aishan Liu,Xianglong Liu,Dacheng Tao
Abstract:
Lane detection (LD) is an essential component of autonomous driving systems, providing fundamental functionalities like adaptive cruise control and automated lane centering. Existing LD benchmarks primarily focus on evaluating common cases, neglecting the robustness of LD models against environmental illusions such as shadows and tire marks on the road. This research gap poses significant safety challenges since these illusions exist naturally in real-world traffic situations. For the first time, this paper studies the potential threats caused by these environmental illusions to LD and establishes the first comprehensive benchmark LanEvil for evaluating the robustness of LD against this natural corruption. We systematically design 14 prevalent yet critical types of environmental illusions (e.g., shadow, reflection) that cover a wide spectrum of real-world influencing factors in LD tasks. Based on real-world environments, we create 94 realistic and customizable 3D cases using the widely used CARLA simulator, resulting in a dataset comprising 90,292 sampled images. Through extensive experiments, we benchmark the robustness of popular LD methods using LanEvil, revealing substantial performance degradation (-5.37% Accuracy and -10.70% F1-Score on average), with shadow effects posing the greatest risk (-7.39% Accuracy). Additionally, we assess the performance of commercial auto-driving systems OpenPilot and Apollo through collaborative simulations, demonstrating that proposed environmental illusions can lead to incorrect decisions and potential traffic accidents. To defend against environmental illusions, we propose the Attention Area Mixing (AAM) approach using hard examples, which witness significant robustness improvement (+3.76%) under illumination effects. We hope our paper can contribute to advancing more robust auto-driving systems in the future. Part of our dataset and demos can be found at the anonymous website.



Paperid:788 Poster
Authors:RUI Liu,Mingjie Li,Shen Zhao,Ling Chen,Xiaojun Chang,Lina Yao
Abstract:
Medical report generation (MRG) has emerged as a pivotal research topic in the medical multi-modal field, given its potential to alleviate the heavy workloads of radiologists. Recently, advancements have been made with MRG systems that leverage large multimodal models (LMMs) to generate high-quality reports. To address the challenge of collecting large amounts of paired medical image-report data for training, this paper proposes a zero-shot report generation model based on in-context learning, which we call MCVGen. Departing from traditional in-context learning approaches that directly feed all demonstrations to a pre-trained large model, this work innovates by employing a multi-modal contextual vector (MCV) to represent the contextual information of demonstrations. Initially, we pre-train a medical large multi-modal model (Med-LMM) and secure the last hidden state of each demonstration through the forward pass in Med-LMM. Benefiting from the auto-regressive mechanism, the last hidden state garners information critical to the targeted scenarios. Subsequently, we average the multiple MCVs and integrate them with the first hidden state on the new query, thereby shifting the latent states and guiding the model toward acquiring previously unlearned multi-modal contextual information. This approach has the advantage of regulating the number of prompts, thus reducing computational costs. We tested our model on the publicly available Open-IU and MIMIC datasets, demonstrating its exceptional zero-shot capability on both cross-center and cross-disease evaluations. We hope it could be a viable solution for practical clinical applications.



Paperid:789 Poster
Authors:Ji Qiu,Peng Lu,Xujun Peng,Wenhao Guo,Zhaoran Zhao,XiangTao Dong
Abstract:
This paper presents a pioneering method for teaching computer sketching that transforms input images into sequential, parameterized strokes. However, two challenges are raised for this sketching task: weak stimuli during stroke decomposition and maintaining semantic correctness, stylistic consistency, and detail integrity in the final drawings. To tackle the challenge of weak stimuli, our method incorporates an attention agent, which enhances the algorithm's sensitivity to subtle canvas changes by focusing on smaller, magnified areas. Moreover, in enhancing the perceived quality of drawing outcomes, we integrate a sketching style feature extractor to seamlessly capture semantic information and execute style adaptation, alongside a drawing agent that decomposes strokes under the guidance of the XDoG reward, thereby ensuring the integrity of sketch details. Based on dual intelligent agents, we have constructed an efficient sketching model. Experimental results attest to the superiority of our approach in both visual effects and perceptual metrics when compared to state-of-the-art techniques, confirming its efficacy in achieving realistic sketching.



Paperid:790 Poster
Authors:Wencan Huang,Daizong Liu,Wei Hu
Abstract:
As a widely explored multi-modal task, 3D object grounding endeavors to localize a unique pre-existing object within a single 3D scene given a natural language description. However, such a strict setting is unnatural as it is not always possible to know whether a target object actually exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available, some of which may not contain the described object while some potentially contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, to simultaneously process a group of related 3D scenes, allowing a flexible number of target objects to exist in each scene. Instead of localizing target objects in each scene individually, we argue that ignoring the rich visual information contained in other related 3D scenes within the same group may lead to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting, which extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism to explicitly exploit the intra-group visual connections. Specifically, based on context-aware spatial-semantic alignment, a Language-guided Consensus Aggregation Module (LCAM) is developed to aggregate the visual features of target objects in each 3D scene to form a visual consensus representation, which is then distributed and injected into a consensus-modulated feature refinement module for refining visual features, thus benefiting the subsequent multi-modal reasoning. Furthermore, we design a curriculum strategy to promote the LCAM to learn step by step how to extract effective visual consensus with the existence of negative 3D scenes where no target object exists. To validate the effectiveness of the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate GNL3D achieves state-of-the-art results on both the group-wise setting and the traditional 3D object grounding task.



Paperid:791 Poster
Authors:Xiaoyu Han,Shunyuan Zheng,Zonglin Li,Chenyang Wang,Xin Sun,Quanling Meng
Abstract:
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively.



Paperid:792 Poster
Authors:Meiqi Cao,Rui Yan,Xiangbo Shu,Guangzhao Dai,Yazhou Yao,Guosen Xie
Abstract:
Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) heavily rely on manually annotated detection boxes in training and inference, hindering further practical deployment; or 2) directly employ normal detectors to detect multiple persons with varying size and spatial occlusion in panoramic scenes, blocking the performance gain of PAR. To this end, we consider learning a detector adapting varying-size occluded persons, which is optimized along with the recognition module in the all-in-one framework. Therefore, we propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework to jointly recognize individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as the pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, achieving the size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through original detections. In addition, to mitigate information loss due to inaccurate individual localizations, we introduce a bi-propagation prototyper that promotes closed-loop interaction and informative consistency across different granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the significant performance of AdaFPP and emphasize its powerful applicability for PAR.



Paperid:793 Poster
Authors:Shijie Li,Yunbin Tu,Qingyuan Xiang,Zheng Li
Abstract:
Recently, dynamic convolution has shown performance boosts for CNN-related networks in medical image segmentation. The core idea is to replace a static convolutional kernel with a linear combination of multiple convolutional kernels, conditioned on an input-dependent attention function. However, the existing dynamic convolution design suffers from two limitations: i) The convolutional kernels are weighted by enforcing a single-dimensional attention function upon the input maps, overlooking the synergy in multi-dimensional information. This results in sub-optimal computations of convolution kernels. ii) The linear kernel aggregation is inefficient, restricting the model's capacity to learn more intricate patterns. In this paper, we rethink the dynamic convolution design to address these limitations and propose multi-dimensional aggregation dynamic convolution (MAGIC). Specifically, our MAGIC introduces a dimensional-reciprocal fusion module to capture correlations among input maps across the spatial, channel, and global dimensions simultaneously for computing convolutional kernels. Furthermore, we design a kernel recalculation module, which enhances the efficiency of aggregation through learning the interaction between kernels. As a drop-in replacement for regular convolution, our MAGIC can be flexibly integrated into prevalent pure CNN or hybrid CNN-Transformer backbones. The extensive experiments on four benchmarks demonstrate that our MAGIC outperforms regular convolution and existing dynamic convolution.
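
For reference, the baseline dynamic convolution that such methods build on can be sketched as follows (a generic CondConv/DyConv-style implementation under our own assumptions; it does not include the paper's dimensional-reciprocal fusion or kernel recalculation modules): K candidate kernels are mixed per sample by attention weights derived from globally pooled features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Generic dynamic convolution: per-sample linear combination of K kernels."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_kernels))

    def forward(self, x):                                    # x: (B, C_in, H, W)
        B = x.shape[0]
        a = torch.softmax(self.attn(x), dim=1)               # (B, K) kernel weights
        w = torch.einsum("bk,koihw->boihw", a, self.weight)  # per-sample aggregated kernels
        # grouped-conv trick: fold the batch into groups so each sample uses its own kernel
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                       w.reshape(-1, *w.shape[2:]),
                       padding=self.k // 2, groups=B)
        return out.reshape(B, -1, *out.shape[2:])

if __name__ == "__main__":
    print(DynamicConv2d(8, 16)(torch.randn(2, 8, 14, 14)).shape)  # (2, 16, 14, 14)
```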



Paperid:794 Poster
Authors:Tingfeng Cao,Junsheng Kong,Xue Zhao,Wenqing Yao,Junwei Ding,Jinhui Zhu,Jian Dong Zhang
Abstract:
In e-commerce platforms, visual content plays a pivotal role in capturing and retaining audience attention. A high-quality and aesthetically designed product background image can quickly grab consumers' attention, and increase their confidence in taking actions, such as making a purchase. Recently, diffusion models have achieved profound advancements, rendering product background generation a promising avenue for exploration. However, text-guided diffusion models require meticulously crafted prompts. The diverse range of products makes it challenging to compose prompts that result in visually appealing and semantically appropriate background scenes. Current work has made great efforts to create prompts through expert-crafted rules or specialized fine-tuning of large language models, but it still relies on detailed human inputs and often falls short in generating desirable results by e-commerce standards. In this paper, we propose Product2Img, a novel prompt-free diffusion model with an automatic training data refinement strategy for product background generation. Product2Img employs Contrastive Background Alignment (CBA) for the text encoder to enhance the relevant background perception ability in the diffusion generation process, without the need for specific background prompts. Meanwhile, we develop the Iterative Data Refinement with Self-improved LMM (IDR-LMM), a framework that iteratively enhances the data selection capability of the LMM for diffusion model training, thereby yielding continuous performance improvements. Furthermore, we establish an E-commerce Product Background Dataset (EPBD) for the research in this paper and future work. Experimental results indicate that our approach significantly outperforms current prevalent methods in terms of automatic metrics and human evaluation, yielding improved background aesthetics and relevance.



Paperid:795 Poster
Authors:Hengde Zhu,Xiangyu Kong,Weicheng Xie,Xin Huang,Linlin Shen,Lu Liu,Hatice Gunes,Siyang Song
Abstract:
Human facial reactions play crucial roles in dyadic human-human interactions, where individuals (i.e., listeners) with varying cognitive process styles may display different but appropriate facial reactions in response to an identical behaviour expressed by their conversational partners. While several existing facial reaction generation approaches are capable of generating multiple appropriate facial reactions (AFRs) in response to each given human behaviour, they fail to take humans' personalised cognitive processes into account in AFR generation. In this paper, we propose the first online personalised multiple appropriate facial reaction generation (MAFRG) approach which learns a unique personalised cognitive style from the target human listener's previous facial behaviours and represents it as a set of network weight shifts. These personalised weight shifts are then applied to edit the weights of a pre-trained generic MAFRG model, allowing the obtained personalised model to naturally mimic the target human listener's cognitive process in its reasoning for generating multiple AFRs. Experimental results show that our approach not only largely outperforms all existing approaches in generating more appropriate and diverse generic AFRs, but also serves as the first reliable personalised MAFRG solution. Our code is provided in the Supplementary Material.



Paperid:796 Poster
Authors:Yutong Wang,Sidan Zhu,Hongteng Xu,Dixin Luo
Abstract:
Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements.
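
The Sinkhorn-style matching at the core of the framework can be illustrated with a generic entropic optimal-transport sketch (our own simplification; the paper's attention-assisted network, partial transport, and inverse/bi-level learning are not shown): given grounding distances between movie-shot and music-shot embeddings, it returns a soft correspondence plan.

```python
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropic Sinkhorn iterations for a soft matching plan.
    cost: (n, m) grounding distances between movie shots and music shots."""
    n, m = cost.shape
    mu, nu = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)   # uniform marginals
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]                              # (n, m) transport plan

if __name__ == "__main__":
    plan = sinkhorn(torch.cdist(torch.randn(6, 8), torch.randn(4, 8)))
    print(round(plan.sum().item(), 4))                              # ~1.0
```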



Paperid:797 Poster
Authors:Jinxu Zhang,Yongqi Yu,Yu Zhang
Abstract:
Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. Existing works are confined to locating information within a single page and lack support for cross-page question-and-answer interactions. Furthermore, the token length limitation on model inputs can lead to the truncation of answer-relevant segments. In this study, we present CREAM, an innovative methodology that focuses on high-performance retrieval and integrates relevant multimodal document information to effectively address this critical issue. To overcome the limitations of current text embedding similarity methods, we first employ a coarse-to-fine retrieval and ranking approach. The coarse phase calculates the similarity between the query and text chunk embeddings, while the fine phase involves multiple rounds of grouping and ordering with a large language model to identify the text chunks most relevant to the query. Subsequently, integrating an attention pooling mechanism for multi-page document images into the vision encoder allows us to effectively merge the visual information of multi-page documents, enabling the multimodal large language model (MLLM) to simultaneously process both single-page and multi-page documents. Finally, we apply various parameter-efficient tuning methods to enhance document visual question-answering performance. Experiments demonstrate that our approach secures state-of-the-art results across various document datasets.



Paperid:798 Poster
Authors:Bo Dong,Pichao WANG,Hao Luo,Fan Wang
Abstract:
Camouflaged instance segmentation is a challenging task due to the various aspects such as color, structure, lighting, etc., of object instances embedded in complex backgrounds. Although the current DETR-based scheme simplifies the pipeline, it suffers from a large number of object queries, leading to many false positive instances. To address this issue, we propose an adaptive query selection mechanism. Our research reveals that a large number of redundant queries scatter the extracted features of the camouflaged instances. To remove these redundant queries with weak correlation, we evaluate the importance of the object query from the perspectives of information entropy and volatility. Moreover, we observed that occlusion and overlapping instances significantly impact the accuracy of the selection mechanism. Therefore, we design a boundary location embedding mechanism that incorporates fake instance boundaries to obtain better location information for more accurate query instance matching. We conducted extensive experiments on two challenging camouflaged instance segmentation datasets, namely COD10K and NC4K, and demonstrated the effectiveness of our proposed model. Compared with the OSFormer, our model significantly improves the performance by 3.8% AP and 5.6% AP with less computational cost, achieving the state-of-the-art of 44.8 AP and 48.1 AP with ResNet-50 on the COD10K and NC4K test-dev sets, respectively.
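
A highly simplified view of importance-based query selection is sketched below (illustrative only; the paper scores queries using both information entropy and volatility, while this toy keeps only the entropy part and a hypothetical keep ratio):

```python
import torch

def select_queries(query_logits, keep_ratio=0.25):
    """Keep the most confident (lowest-entropy) object queries.
    query_logits: (num_queries, num_classes)."""
    p = torch.softmax(query_logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-12)).sum(dim=-1)       # (num_queries,)
    k = max(1, int(keep_ratio * query_logits.shape[0]))
    return torch.topk(entropy, k, largest=False).indices    # indices of kept queries

if __name__ == "__main__":
    print(select_queries(torch.randn(100, 80)).shape)       # torch.Size([25])
```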



Paperid:799 Poster
Authors:Yue Jiang,Yueming Lyu,Ziwen He,Bo Peng,Jing Dong
Abstract:
Recent advancements in text-to-image generative models have showcased remarkable capabilities across various tasks. However, these powerful models have revealed the inherent risks of social biases related to gender, race, and their intersections. Such biases can propagate distorted real-world perspectives and spread unforeseen prejudice and discrimination. Current debiasing methods are primarily designed for scenarios with a single individual in the image and exhibit homogenous race or gender when multiple individuals are involved, harming the diversity of social groups within the image. To address this problem, we consider the semantic consistency between text prompts and generated images in text-to-image diffusion models to identify how biases are generated. We propose a novel method to locate where the biases are based on different tokens and then mitigate them for each individual. Specifically, we introduce a Linguistic-aligned Attention Guidance module consisting of Block Voting and Linguistic Alignment, to effectively locate the semantic regions related to biases. Additionally, we employ Fair Inference in these regions to generate fair attributes across arbitrary distributions while preserving the original structural and semantic information. Extensive experiments and analyses demonstrate our method outperforms existing methods for debiasing with multiple individuals across various scenarios.



Paperid:800 Poster
Authors:Yuxiang Yang,Lu Wen,Xinyi Zeng,Yuanyuan Xu,Xi Wu,Jiliu Zhou,Yan Wang
Abstract:
Facial Expression Recognition (FER) holds significant importance in human-computer interactions. Existing cross-domain FER methods often transfer knowledge solely from a single labeled source domain to an unlabeled target domain, neglecting the comprehensive information across multiple sources. Nevertheless, cross-multidomain FER (CMFER) is very challenging due to (i) the inherent inter-domain shifts across multiple domains and (ii) the intra-domain shifts stemming from ambiguous expressions and low inter-class distinctions. In this paper, we propose a novel Learning with Alignments CMFER framework, named LA-CMFER, to handle both inter- and intra-domain shifts. Specifically, LA-CMFER is constructed with a global branch and a local branch to extract features from the full images and local subtle expressions, respectively. Based on this, LA-CMFER presents a dual-level inter-domain alignment method to force the model to prioritize hard-to-align samples in knowledge transfer at a sample level while gradually generating a well-clustered feature space with the guidance of class attributes at a cluster level, thus narrowing the inter-domain shifts. To address the intra-domain shifts, LA-CMFER introduces a multi-view intra-domain alignment method with a multi-view clustering consistency constraint where a prediction similarity matrix is built to pursue consistency between the global and local views, thus refining pseudo labels and eliminating latent noise. Extensive experiments on six benchmark datasets have validated the superiority of our LA-CMFER.



Paperid:801 Poster
Authors:Xiaojiao Guo,Xuhang Chen,Shenghong Luo,Shuqiang Wang,Chi-Man Pun
Abstract:
Specular highlight removal plays a pivotal role in multimedia applications, as it enhances the quality and interpretability of images and videos, ultimately improving the performance of downstream tasks such as content-based retrieval, object recognition, and scene understanding. Despite significant advances in deep learning-based methods, current state-of-the-art approaches often rely on additional priors or supervision, limiting their practicality and generalization capability. In this paper, we propose the Dual-Hybrid Attention Network for Specular Highlight Removal (DHAN-SHR), an end-to-end network that introduces novel hybrid attention mechanisms to effectively capture and process information across different scales and domains without relying on additional priors or supervision. DHAN-SHR consists of two key components: the Adaptive Local Hybrid-Domain Dual Attention Transformer (L-HD-DAT) and the Adaptive Global Dual Attention Transformer (G-DAT). The L-HD-DAT captures local inter-channel and inter-pixel dependencies while incorporating spectral domain features, enabling the network to effectively model the complex interactions between specular highlights and the underlying surface properties. The G-DAT models global inter-channel relationships and long-distance pixel dependencies, allowing the network to propagate contextual information across the entire image and generate more coherent and consistent highlight-free results. To evaluate the performance of DHAN-SHR and facilitate future research in this area, we compile a large-scale benchmark dataset comprising a diverse range of images with varying levels of specular highlights. Through extensive experiments, we demonstrate that DHAN-SHR outperforms 18 state-of-the-art methods both quantitatively and qualitatively, setting a new standard for specular highlight removal in multimedia applications. The code and dataset will be available.



Paperid:802 Poster
Authors:Xiaole Zhao,Linze Li,Chengxing Xie,XIAOMING ZHANG,Ting Jiang,Wenjie Lin,Shuaicheng Liu,Tianrui Li
Abstract:
Transformer-based deep models for single image super-resolution (SISR) have greatly improved the performance of lightweight SISR tasks in recent years. However, they often suffer from heavy computational burden and slow inference due to the complex calculation of multi-head self-attention (MSA), seriously hindering their practical application and deployment. In this work, we present an efficient SR model to mitigate the dilemma between model efficiency and SR performance, which is dubbed the Entropy Attention and Receptive Field Augmentation network (EARFA), and composed of a novel entropy attention (EA) and a shifting large kernel attention (SLKA). From the perspective of information theory, EA increases the entropy of intermediate features conditioned on a Gaussian distribution, providing more informative input for subsequent reasoning. On the other hand, SLKA extends the receptive field of SR models with the assistance of channel shifting, which also helps boost the diversity of hierarchical features. Since the implementation of EA and SLKA does not involve complex computations (such as extensive matrix multiplications), the proposed method can achieve faster nonlinear inference than Transformer-based SR models while maintaining better SR performance. Extensive experiments show that the proposed model can significantly reduce the delay of model inference while achieving SR performance comparable with other advanced models.
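
As a rough picture of how channel shifting can enlarge the receptive field of a large-kernel attention block, consider the following sketch (our own construction, not EARFA's exact SLKA; the entropy attention branch is omitted and the channel count is assumed divisible by four):

```python
import torch
import torch.nn as nn

class ShiftedLargeKernelAttention(nn.Module):
    """Channel groups are spatially rolled in four directions, then a depthwise
    large-kernel convolution produces an attention map that gates the input."""
    def __init__(self, channels, kernel=7):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel, padding=kernel // 2,
                            groups=channels)                 # depthwise large kernel
        self.pw = nn.Conv2d(channels, channels, 1)           # pointwise mixing

    def forward(self, x):                                    # x: (B, C, H, W)
        g = x.chunk(4, dim=1)
        shifted = torch.cat([torch.roll(g[0], 1, dims=2),    # shift down
                             torch.roll(g[1], -1, dims=2),   # shift up
                             torch.roll(g[2], 1, dims=3),    # shift right
                             torch.roll(g[3], -1, dims=3)],  # shift left
                            dim=1)
        return x * torch.sigmoid(self.pw(self.dw(shifted)))  # attention-gated features

if __name__ == "__main__":
    print(ShiftedLargeKernelAttention(16)(torch.randn(1, 16, 24, 24)).shape)
```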



Paperid:803 Poster
Authors:Satoshi Kosugi
Abstract:
In this paper, we delve into the concept of interpretable image enhancement, a technique that enhances image quality by adjusting filter parameters with easily understandable names such as “Exposure” and “Contrast”. Unlike using predefined image editing filters, our framework utilizes learnable filters that acquire interpretable names through training. Our contribution is two-fold. Firstly, we introduce a novel filter architecture called an image-adaptive neural implicit lookup table, which uses a multilayer perceptron to implicitly define the transformation from input feature space to output color space. By incorporating image-adaptive parameters directly into the input features, we achieve highly expressive filters. Secondly, we introduce a prompt guidance loss to assign interpretable names to each filter. We evaluate visual impressions of enhancement results, such as exposure and contrast, using a vision and language model along with guiding prompts. We define a constraint to ensure that each filter affects only the targeted visual impression without influencing other attributes, which allows us to obtain the desired filter effects. Experimental results show that our method outperforms existing predefined filter-based methods, thanks to the filters optimized to predict target results. We will make our code publicly available upon acceptance.
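
A minimal sketch of an image-adaptive neural implicit lookup table follows (our own toy construction, not the paper's architecture; the parameter dimension and residual formulation are assumptions): an MLP maps a pixel's RGB value, image-adaptive parameters, and a named slider value such as "Exposure" to a residual color adjustment.

```python
import torch
import torch.nn as nn

class NeuralImplicitLUT(nn.Module):
    """Toy image-adaptive neural implicit lookup table for one named filter."""
    def __init__(self, param_dim=8, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + param_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, rgb, params, strength):
        # rgb: (N, 3) pixel colors, params: (N, param_dim) image-adaptive features,
        # strength: (N, 1) user-facing slider value (e.g., an "Exposure" knob)
        return rgb + self.mlp(torch.cat([rgb, params, strength], dim=-1))

if __name__ == "__main__":
    lut = NeuralImplicitLUT()
    out = lut(torch.rand(1024, 3), torch.randn(1024, 8), torch.rand(1024, 1))
    print(out.shape)   # torch.Size([1024, 3])
```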



Paperid:804 Poster
Authors:Stanislav Frolov,Brian Bernhard Moser,Sebastian Palacio,Andreas Dengel
Abstract:
We present ObjBlur, a novel curriculum learning approach to improve layout-to-image generation models, where the task is to produce realistic images from layouts composed of boxes and labels. Our method is based on progressive object-level blurring, which effectively stabilizes training and enhances the quality of generated images. This curriculum learning strategy systematically applies varying degrees of blurring to individual objects or the background during training, starting from strong blurring to progressively cleaner images. Our findings reveal that this approach yields significant performance improvements, stabilized training, smoother convergence, and reduced variance between multiple runs. Moreover, our technique demonstrates its versatility by being compatible with generative adversarial networks and diffusion models, underlining its applicability across various generative modeling paradigms. With ObjBlur, we reach new state-of-the-art results on the complex COCO and Visual Genome datasets.
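A minimal sketch of a progressive object-level blurring schedule of the kind described above, assuming a binary object mask is available as a tensor; the linear annealing and maximum blur strength are illustrative choices, not ObjBlur's exact schedule:

```python
import torch
import torchvision.transforms.functional as TF

def curriculum_object_blur(image, mask, epoch, total_epochs, max_sigma=8.0):
    # image: (C, H, W) tensor in [0, 1]; mask: (1, H, W) binary object mask.
    # Anneal blur strength from strong to none, applying it only inside the mask.
    sigma = max_sigma * (1.0 - epoch / max(total_epochs - 1, 1))
    if sigma <= 0.1:
        return image                                   # late training: clean images
    kernel = int(2 * round(3 * sigma) + 1)             # odd kernel covering ~3 sigma
    blurred = TF.gaussian_blur(image, kernel_size=kernel, sigma=sigma)
    return mask * blurred + (1 - mask) * image         # blur the object, keep the rest
```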



Paperid:805 Poster
Authors:Zhen Wang,Dongyuan Li,Guang Li,Ziqing Zhang,Renhe Jiang
Abstract:
Low-light image enhancement has been researched for several years. However, current image restoration methods predominantly focus on recovering images from RGB inputs, overlooking the potential of incorporating more modalities. With the advancements in personal handheld devices, we can now easily capture images with depth information using devices such as mobile phones. The integration of depth information into image restoration is a research question worthy of exploration. Therefore, in this paper, we propose a multimodal low-light image enhancement task based on depth information and establish a dataset named LED (Low-light image Enhanced with Depth map), consisting of 1,365 samples. Each sample in our dataset includes a low-light image, a normal-light image, and the corresponding depth map. Moreover, for the LED dataset, we design a corresponding multimodal method, which processes the input images and depth map information simultaneously to generate the predicted normal-light images. Experimental results and detailed ablation studies prove the effectiveness of our method, which exceeds previous single-modal state-of-the-art methods from the relevant field.



Paperid:806 Poster
Authors:Lutao Jiang,Hangyu Li,Lin Wang
Abstract:
Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (3D GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. To achieve initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical yet challenging problems: 1) the final shapes are still similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., 'a dog', not from lexically richer (or harder) texts, e.g., 'a dog is sitting on the top of the airplane'. To address these problems, this paper proposes a novel general framework to boost 3D GS initialization for text-to-3D generation upon lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling the spatial interaction among the 3D Gaussians and semantic interaction between Gaussians and texts. Specifically, we first construct a voxelized representation, where each voxel holds a 3D Gaussian with its position, scale, and rotation fixed while setting opacity as the sole factor to determine a position's occupancy. We then design an initialization network mainly consisting of two novel components: 1) the Global Information Perception (GIP) block and 2) the Gaussians-Text Fusion (GTF) block. Such a design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments show the superiority of our framework for high-quality 3D GS initialization against existing methods, e.g., Shap-E, on lexically simple, medium, and hard texts. Also, our framework can be seamlessly plugged into state-of-the-art training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation. Our code will be released upon acceptance.



Paperid:807 Poster
Authors:Linfei Li,Lin Zhang,Zhong Wang,Ying Shen
Abstract:
Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in the domain of dense Simultaneous Localization and Mapping (SLAM), also known as dense semantic SLAM. Yet a prerequisite for generating consistent and continuous semantic maps is the availability of dense, efficient, and scalable scene representations. To date, existing semantic SLAM systems based on explicit scene representations (points/meshes/surfels) are limited by their resolutions and inabilities to predict unknown areas, thus failing to generate dense maps. Contrarily, the few implicit scene representations (Neural Radiance Fields) that deal with these problems rely on the time-consuming ray-tracing-based volume rendering technique, which cannot meet the real-time rendering requirements of SLAM. Fortunately, the Gaussian Splatting scene representation has recently emerged, which inherits the efficiency and scalability of point/surfel representations while smoothly representing geometric structures in a continuous manner, showing promise in addressing the aforementioned challenges. To this end, we propose $\textbf{GS$^3$LAM}$, a $\textbf{G}$aussian $\textbf{S}$emantic $\textbf{S}$platting $\textbf{SLAM}$ framework, which takes multimodal data as input and can render consistent, continuous dense semantic maps in real-time. To fuse multimodal data, GS$^3$LAM models the scene as a Semantic Gaussian Field (SG-Field), and jointly optimizes camera poses and the field by establishing error constraints between observed and predicted data. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is proposed to tackle the problem of misalignment between scale-invariant Gaussians and geometric surfaces within the SG-Field. To mitigate the forgetting phenomenon, we propose an effective Random Sampling-based Keyframe Mapping (RSKM) strategy, which exhibits notable superiority over local covisibility optimization strategies commonly utilized in 3DGS-based SLAM systems. Extensive experiments conducted on the benchmark datasets reveal that compared with state-of-the-art competitors, GS$^3$LAM demonstrates increased tracking robustness, superior real-time rendering quality, and enhanced semantic reconstruction precision. To make the results reproducible, the source code will be publicly released.



Paperid:808 Poster
Authors:Jingjia Huang,Jingyan Tu,Ge Meng,Yingying Wang,Yuhang Dong,Xiaotong Tu,Xinghao Ding,Yue Huang
Abstract:
Multi-focus image fusion (MFIF) aims to combine multiple images with different focused regions into a single all-in-focus image. Existing unsupervised deep learning-based methods only fuse structural information of images in the spatial domain, neglecting potential solutions from the frequency domain exploration. In this paper, we make the first attempt to integrate spatial-frequency information to achieve high-quality MFIF. We propose a novel unsupervised spatial-frequency interaction MFIF network named SFIMFN, which consists of three key components: Adaptive Frequency Domain Information Interaction Module (AFIM), Ret-Attention-Based Spatial Information Extraction Module (RASEM), and Invertible Dual-domain Feature Fusion Module (IDFM). Specifically, in AFIM, we interactively explore global contextual information by combining the amplitude and phase information of multiple images separately. In RASEM, we design a customized transformer to encourage the network to capture important local high-frequency information by redesigning the self-attention mechanism with a bidirectional, two-dimensional form of explicit decay. Finally, we employ IDFM to fuse spatial-frequency information without information loss to generate the desired all-in-focus image. Extensive experiments on different datasets demonstrate that our method significantly outperforms state-of-the-art unsupervised methods in terms of qualitative and quantitative metrics as well as the generalization ability.



Paperid:809 Poster
Authors:Zhaoyang Li,Zhu Teng,Baopeng Zhang,Jianping Fan
Abstract:
The rapid advancement of generation methods has sparked significant concerns about potential misuse, emphasizing the urgency to detect new types of forgeries in open-world settings. Although pioneering works have explored the classification of open-world deepfakes (OW-DF), they neglect the influence of new forgery techniques, which struggle to handle a greater variety of manipulable objects and increasingly realistic artifacts. To align research with the evolving technologies of forgery, we propose a new task named Open-World Deepfake Interpretation (OW-DFI). This task involves the localization of imperceptible artifacts across diverse manipulated objects and deciphering forgery methods, especially new forgery techniques. To this end, we leverage non-causal semantics from large visual models (LVMs) and eliminate them from the nuanced manipulated artifacts. Our proposed model includes Semantic Intervention Learning (SIL) and Correlation-based Incremental Learning (CIL). SIL enhances the inconsistency of forgery artifacts with refined semantics from LVMs, while CIL combats catastrophic forgetting and semantic overfitting through an inter-forgery inheritance transpose and a targeted semantic intervention. Exploiting LVMs, our proposed method adopts an unconventional strategy that aligns with the semantic direction of LVMs, moving beyond just uncovering limited forgery-related features for deepfake detection. To assess the effectiveness of our approach in discovering new forgeries, we construct an Open-World Deepfake Interpretation (OW-DFI) benchmark and conduct experiments in an incremental form. Comprehensive experiments demonstrate our method's superiority on the OW-DFI benchmark, showcasing outstanding performance in localizing forgeries and decoding new forgery techniques. The source code and benchmark will be made publicly accessible on [website].



Paperid:810 Poster
Authors:Cheng Xin,Hao Wang,Jinwei Wang,Xiangyang Luo,Bin Ma
Abstract:
In Joint Photographic Experts Group (JPEG) image steganalysis and forensics, the quantization step can reveal the history of image operations. Several methods for estimating the quantization step have been proposed by researchers. However, existing algorithms fail to account for robustness, which limits the application of these algorithms. To solve the above problems, we propose a two-stream network structure based on Swin Transformer. The spatial domain features of JPEG images exhibit strong robustness but low accuracy. Conversely, frequency domain features demonstrate high accuracy but weak robustness. Therefore, we design a two-stream network with the multi-scale feature of Swin Transformer to extract spatial domain features with high robustness and frequency domain features with high accuracy, respectively. Furthermore, to adaptively fuse features in both the frequency domain and spatial domain, we design a Spatial-frequency Information Dynamic Fusion (SIDF) module to dynamically allocate weights. Finally, we modify the network from a regression model to a classification model to speed up convergence and improve the accuracy of the algorithm. The experimental results show that the accuracy of the proposed method is higher than 98% on clean images. Meanwhile, in robust environments, the algorithm proposed maintains an average accuracy of over 81%.



Paperid:811 Poster
Authors:Xiying Zheng,Yukang Zhang,Yang Lu,Hanzi Wang
Abstract:
Semi-supervised visible-infrared person re-identification (SSVI-ReID) aims to match pedestrian images of the same identity from different modalities (visible and infrared) while only annotating visible images, which is highly related to multimedia and multi-modal processing. Existing works primarily focus on assigning accurate pseudo-labels to infrared images but overlook the two key challenges: erroneous pseudo-labels and large modality discrepancy. To alleviate these issues, this paper proposes a novel Modality-Unified and Confidence-Guided (MUCG) semi-supervised learning framework. Specifically, we first propose a Dynamic Intermediate Modality Generation (DIMG) module, which transfers knowledge from labeled visible images to unlabeled infrared images, enhancing the pseudo-label quality and bridging the modality discrepancy. Meanwhile, we propose a Weighted Identification Loss (WIL) that can reduce the model's dependence on erroneous labels by using confidence weighting. Moreover, an effective Modality Consistency Loss (MCL) is proposed to narrow the distribution of visible and infrared features, further narrowing the modality discrepancy and enabling the learning of modality-unified features. Extensive experiments show that the proposed MUCG has significant advantages in improving the performance of the SSVI-ReID task, surpassing the current state-of-the-art methods by a significant margin. The code will be available.



Paperid:812 Poster
Authors:Zhongyi Fan,Zixin Yin,Gang Li,Yibing Zhan,Heliang Zheng
Abstract:
DreamBooth has demonstrated significant potential in subject-driven text-to-image generation, especially in scenarios requiring precise preservation of a subject's appearance. However, it still suffers from inefficiency and requires extensive iterative training to customize concepts using a small set of reference images. To address these issues, we introduce DreamBooth++, a region-level training strategy designed to significantly improve the efficiency and effectiveness of learning specific subjects. In particular, our approach employs a region-level data re-formulation technique that packs a set of reference images into a single sample, significantly reducing computational costs. Moreover, we adapt convolution and self-attention layers to ensure their processing is restricted within individual regions. Thus their operational scope (i.e., receptive field) can be preserved within a single subject, avoiding the generation of multiple sub-images within a single image. Last but not least, we design a text-guided prior regularization between our model and the pretrained one to preserve the original semantic generation ability. Comprehensive experiments demonstrate that our training strategy not only accelerates the subject-learning process but also significantly boosts fidelity to both subject and prompts in subject-driven generation.



Paperid:813 Poster
Authors:Chenrui Wu,Haishuai Wang,Xiang Zhang,Zhen Fang,Jiajun Bu
Abstract:
Federated learning (FL) is gaining significant traction due to its ability to perform privacy-preserving training on decentralized data. In this work, we focus on sensitive time series data collected by distributed sensors in real-world applications. However, time series data introduce the challenge of dual spatial-temporal feature skew due to their dynamic changes across domains and time, differing from computer vision. This key challenge includes inter-client spatial feature skew caused by heterogeneous sensor collection and intra-client temporal feature skew caused by dynamics in time series distribution. We follow the framework of Personalized Federated Learning (pFL) to handle the dual feature skew and enhance the capabilities of customized local models. Therefore, in this paper, we propose a method, FedST, to solve these key challenges through orthogonal feature decoupling and regularization in both the training and testing stages. During training, we combine the time view and frequency view of time series data to enrich the mutual information, and adopt orthogonal projection to disentangle and align the shared and personalized features between views and between clients. During testing, we apply prototype-based predictions and model-based predictions to achieve model consistency based on shared features. Extensive experiments on multiple real-world classification datasets and multimodal time series datasets show our method consistently outperforms state-of-the-art baselines with clear advantages.



Paperid:814 Poster
Authors:Xiaobo Shen,GaoyaoYu,YinFan Chen,Xichen Yang,Yuhui Zheng
Abstract:
Cross-modal hashing encodes different modalities of multi-modal data into a low-dimensional Hamming space for fast cross-modal retrieval. Most existing cross-modal hashing methods heavily rely on label semantics to boost retrieval performance; however, semantics are expensive to collect in real applications. To mitigate the heavy reliance on semantics, this work proposes a new semi-supervised deep cross-modal hashing method, namely, Graph Convolutional Semi-Supervised Cross-Modal Hashing (GCSCH), which is trained with limited label supervision. The proposed GCSCH first generates pseudo-multi-labels of the unlabeled samples using the simple yet effective idea of consistency regularization and pseudo-labeling. GCSCH designs a fusion network that merges the two modalities and employs Graph Convolutional Network (GCN) to capture semantic information among ground-truth-labeled and pseudo-labeled multi-modal data. Using the idea of knowledge distillation, GCSCH employs a teacher-student learning scheme that can successfully transfer knowledge from the fusion module to the image and text hashing networks. Empirical studies on three multi-modal benchmark datasets demonstrate the superiority of the proposed GCSCH over state-of-the-art cross-modal hashing methods with limited label supervision.



Paperid:815 Poster
Authors:Yang Du,Yuqi Liu,Qin Jin
Abstract:
Video-text retrieval is an important task in the multimodal understanding field. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in assessing models' retrieval ability, especially in temporal understanding, such that large-scale image-text pre-trained models can already achieve zero-shot performance comparable with video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset, constructed through a top-down three-step process. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We recruit annotators to judge the significance and reversibility of candidate videos, and then write captions for qualified videos. We further adopt GPT-4 to extend more captions based on the human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enforce leveraging harder negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experiment analysis proves that RTime indeed poses new and higher challenges to video-text retrieval. We will release our RTime benchmarks to further advance video-text retrieval and multimodal understanding research.



Paperid:816 Poster
Authors:Lv Tang,Peng-Tao Jiang,Zhihao Shen,Hao Zhang,Jinwei Chen,Bo Li
Abstract:
In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) aimed at handling zero-shot Camouflaged Object Detection (COD) by leveraging the powerful capabilities of Multimodal Large Language Models (MLLMs). Recognizing the inherent limitations of current COD methodologies, which predominantly rely on supervised learning models demanding extensive and accurately annotated datasets, resulting in weak generalization, our research proposes a zero-shot MMCPF that circumvents these challenges. Although MLLMs hold significant potential for broad applications, their effectiveness in COD is hindered, and they tend to misinterpret camouflaged objects. To address this challenge, we further propose a strategic enhancement called the Chain of Visual Perception (CoVP), which significantly improves the perceptual capabilities of MLLMs in camouflaged scenes by leveraging both linguistic and visual cues more effectively. We validate the effectiveness of MMCPF on five widely used COD datasets, including CAMO, COD10K, NC4K, MoCA-Mask, and OVCamo. Experiments show that MMCPF can outperform all existing state-of-the-art zero-shot COD methods, and achieve competitive performance compared to weakly-supervised and fully-supervised methods, which demonstrates the potential of MMCPF. The Github link of this paper is \url{https://github.com/luckybird1994/MMCPF}.



Paperid:817 Poster
Authors:Chengpei Xu,Hao Fu,Long Ma,Wenjing Jia,Chengqi Zhang,Feng Xia,Xiaoyu Ai,Binghao Li,Wenjie Zhang
Abstract:
Localizing text in low-light environments is challenging due to visual degradations. Although a straightforward solution involves a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by detection, LLE is primarily designed for human vision rather than machine vision and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing text in the dark that circumvents the need for LLE. We introduce a constrained learning module as an auxiliary mechanism during the training stage of the text detector. This module is designed to guide the text detector in preserving textual spatial features amidst feature map resizing, thus minimizing the loss of spatial information in texts under low-light visual degradations. Specifically, we incorporate spatial reconstruction and spatial semantic constraints within this module to ensure the text detector acquires essential positional and contextual range knowledge. Our approach enhances the original text detector's ability to identify text's local topological features using a dynamic snake feature pyramid network and adopts a bottom-up contour shaping strategy with a novel rectangular accumulation technique for accurate delineation of streamlined text features. In addition, we present a comprehensive low-light dataset for arbitrary-shaped text, encompassing diverse scenes and languages. Notably, our method achieves state-of-the-art results on this low-light dataset and exhibits comparable performance on standard normal light datasets. The code and dataset will be released.



Paperid:818 Poster
Authors:Wenhao Guo,Peng Lu,Xujun Peng,Zhaoran Zhao,Ji Qiu,XiangTao Dong
Abstract:
Single Image Super-Resolution (SISR) is a pivotal challenge in computer vision, aiming to restore high-resolution (HR) images from their low-resolution (LR) counterparts. The presence of diverse degradation kernels creates a significant domain gap, limiting the effective generalization of models in real-world scenarios. This study introduces the Bézier Curve basis-based Sparse Coding Network (BCSCN), a preprocessing network designed to mitigate input distribution discrepancies between the training and testing phases of super-resolution networks. BCSCN achieves this by removing visual defects associated with the degradation kernel in LR images, such as artifacts, residual structures, and noise. Additionally, we propose a set of rewards to guide the search for basis coefficients in BCSCN, enhancing the preservation of main content while eliminating information related to degradation. The experimental results highlight the importance of BCSCN, showcasing its capacity to effectively reduce domain gaps and enhance the generalization of super-resolution networks.



Paperid:819 Poster
Authors:Xiangcheng Du,Zhao Zhou,Xingjiao Wu,Yanlong Wang,Zhuoyao Wang,Yingbin Zheng,Cheng Jin
Abstract:
Deep networks have shown impressive performance in image restoration tasks, such as image colorization. However, we find that previous approaches rely on the digital representation from a single color model with a specific mapping function, a.k.a., color space, during the colorization pipeline. In this paper, we first investigate the modeling of different color spaces, and find that each of them exhibits distinctive characteristics with a unique distribution of colors. The complementarity among multiple color spaces leads to benefits for the image colorization task. We present MultiColor, a new learning-based approach to automatically colorize grayscale images that combines clues from multiple color spaces. Specifically, we employ a set of dedicated colorization modules for individual color spaces. Within each module, a transformer decoder is first employed to refine color query embeddings and then a color mapper produces color channel predictions using the embeddings and semantic features. With these predicted color channels representing various color spaces, a complementary network is designed to exploit the complementarity and generate pleasing and reasonable colorized images. We conduct extensive experiments on real-world datasets, and the results demonstrate superior performance over state-of-the-art methods. The code will be available.



Paperid:820 Poster
Authors:ZiYi Dong,Yao Xiao,Pengxu Wei,Liang Lin
Abstract:
Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining desired generation outcomes often necessitates repetitive trials of manipulating text prompts, just like casting spells on a magic mirror, and the reason behind that is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the input text prompt with a pre-trained encoder structure, which is usually trained on a limited amount of image-caption pairs. State-of-the-art large language models (LLMs) based on the decoder-only structure have shown very powerful semantic understanding capability as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models (LLMs), resulting in a simple yet effective adapter that allows the diffusion models to be compatible with the decoder-only structure. In the evaluation, we provide not only extensive empirical results but also supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only). The experimental results show that the enhanced models with our adapter module are superior to state-of-the-art models in terms of text-to-image generation quality and reliability.



Paperid:821 Poster
Authors:Linli Yao,Yuanmeng Zhang,Ziheng Wang,Xinglin Hou,Tiezheng Ge,Yuning Jiang,Xu Sun,Qin Jin
Abstract:
Automatically narrating videos in natural language complying with user requests, i.e. Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained which can not satisfy diverse user intentions; 2) the video description is generated in a single round which can not be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. Further, we propose a specialized small-scale model (i.e., OPA) compared with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics encompassing caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the task challenges of fine-grained multi-modal semantics understanding and processing. Our datasets, codes, and evaluation tools are ready to be open-sourced.



Paperid:822 Poster
Authors:Ting Zhe,Jing Zhang,Yongqian Li,Yong Luo,Han Hu,Dacheng Tao
Abstract:
Detecting hand actions in videos is crucial for understanding video content and has diverse real-world applications. Existing approaches often focus on whole-body actions or coarse-grained action categories, lacking fine-grained hand-action localization information. To fill this gap, we introduce the FHA-Kitchens (Fine-Grained Hand Actions in Kitchen Scenes) dataset, providing both coarse- and fine-grained hand action categories along with localization annotations. This dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories. Evaluation of existing action detection methods on FHA-Kitchens reveals varying generalization capabilities across different granularities. To handle multi-granularity in hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. It incorporates two new designs: Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising. Extensive experiments demonstrate MG-HAD's effectiveness for multi-granularity hand action detection, highlighting the significance of FHA-Kitchens for future research and real-world applications. The dataset and source code will be released.



Paperid:823 Poster
Authors:Xiaohuan Ding,Gong Yangrui,Tianyi Shi,Zihang Huang,Gangwei Xu,Xin Yang
Abstract:
Restoring low-quality fundus images, especially the recovery of vessel structures, is crucial for clinical observation and diagnosis. Existing state-of-the-art methods use standard convolution and window-based self-attention blocks to recover low-quality fundus images, but these feature-capturing approaches do not effectively match the slender and tortuous structure of retinal vessels. Therefore, these methods struggle to accurately restore vessel structures. To overcome this challenge, we propose a novel low-quality fundus image restoration method called Masked Snake Attention Network (MSANet). It is designed specifically for accurately restoring vessel structures. Specifically, we introduce the Snake Attention module (SA) to adaptively aggregate vessel features based on the morphological structure of the vessels. Due to the small proportion of vessel pixels in the image, we further present the Masked Snake Attention module (MSA) to more efficiently capture vessel features. MSA enhances vessel features by constraining snake attention within regions predicted by segmentation methods. Extensive experimental results demonstrate that our MSANet outperforms the state-of-the-art methods in enhancement evaluation and downstream segmentation tasks.



Paperid:824 Poster
Authors:Qizhi Xie,Kun Yuan,Yunpeng Qu,Mingda Wu,Ming Sun,Chao Zhou,Jihong Zhu
Abstract:
Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Recently, masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification and detection). In this work, we take on a novel perspective to investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware PreTraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution for quality and aesthetics assessment. Specifically, QPT V2 incorporates the following key designs: to perceive high-level semantics and fine-grained details, the pretraining data is curated; to comprehensively encompass quality- and aesthetics-related factors, degradation is introduced; to capture multi-scale quality and aesthetics information, the model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms.



Paperid:825 Poster
Authors:Xuze Hao,Wenqian Ni,Xuhao Jiang,Weimin Tan,Bo Yan
Abstract:
Deep convolutional neural networks have made significant breakthroughs in medical image classification, under the assumption that training samples from all classes are simultaneously available. However, in real-world medical scenarios, there's a common need to continuously learn about new diseases, leading to the emerging field of class incremental learning (CIL) in the medical domain. Typically, CIL suffers from catastrophic forgetting when trained on new classes. This phenomenon is mainly caused by the imbalance between old and new classes, and it becomes even more challenging with imbalanced medical datasets. In this work, we introduce two simple yet effective plug-in methods to mitigate the adverse effects of the imbalance. First, we propose a CIL-balanced classification loss to mitigate the classifier bias toward majority classes via logit adjustment. Second, we propose a distribution margin loss that not only alleviates the inter-class overlap in embedding space but also enforces the intra-class compactness. We evaluate the effectiveness of our method with extensive experiments on three benchmark datasets (CCH5000, HAM10000, and EyePACS). The results demonstrate that our approach outperforms state-of-the-art methods.
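As an illustration of the logit-adjustment idea used above to counter classifier bias toward majority classes, here is a minimal sketch; it follows the standard class-prior offset form, and the exact CIL-balanced loss in the paper may differ:

```python
import torch
import torch.nn.functional as F

def logit_adjusted_cross_entropy(logits, targets, class_counts, tau=1.0):
    # Offset each logit by the log of its class prior so majority classes no
    # longer dominate the softmax; tau controls the adjustment strength.
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * torch.log(prior + 1e-12)
    return F.cross_entropy(adjusted, targets)
```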



Paperid:826 Poster
Authors:li yuan,Yi Cai,Junsheng Huang
Abstract:
Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. Initially, we construct diverse and comprehensive multimodal few-shot datasets tailored to the data distribution. To address the insufficient information in the few-shot setting, we introduce the \textbf{K}nowledge-\textbf{E}nhanced \textbf{C}ross-modal \textbf{P}rompt \textbf{M}odel (KECPM) for JMERE. This method can effectively address the problem of insufficient information in the few-shot setting by guiding a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity and employs self-reflection to refine the knowledge generated by ChatGPT; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to execute JMERE tasks. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F$_1$ scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.



Paperid:827 Poster
Authors:Shoutong Luo,Zhengxing Sun,Yi Wang,Yunhan Sun,Chendi Zhu
Abstract:
Large-scale point cloud semantic segmentation is a challenging task in 3D computer vision. A key challenge is how to resolve ambiguities arising from locally high inter-class similarity. In this study, we introduce a solution by modeling long-distance contextual information to understand the scene's overall layout. The context sensitivity of previous methods is typically constrained to small blocks(e.g. $2m \times 2m$) and cannot be directly extended to the entire scene. For this reason, we propose \textbf{L}ong-\textbf{D}istance \textbf{C}ontext Modeling Network(LDCNet). Our key insight is that keypoints are enough for inferring the layout of a scene. Therefore, we represent the entire scene using keypoints along with local descriptors and model long-distance context on these keypoints. Finally, we propagate the long-distance context information from keypoints back to non-keypoints. This allows our method to model long-distance context effectively. We conducted experiments on six datasets, demonstrating that our approach can effectively mitigate ambiguities. Our method performs well on large, irregular objects and exhibits good generalization for typical scenarios.



Paperid:828 Poster
Authors:Hui Zeng,Minrui Xu,Tongqing Zhou,Xinyi Wu,Jiawen Kang,Zhiping Cai,Dusit Niyato
Abstract:
Transforming the multi-round vanilla Federated Learning (FL) into one-shot FL (OFL) significantly reduces the communication burden and makes a big leap toward practical deployment. However, we note that existing OFL methods all build on model lossy reconstruction (i.e., aggregating while partially discarding local knowledge in clients’ models), which attains one-shot at the cost of degraded inference performance. By identifying the root cause of stressing too much on finding a one-fit-all model, this work proposes a novel one-shot FL framework by embodying each local model as an independent expert and leveraging a Mixture-of-Experts network to maintain all local knowledge intact. A dedicated self-supervised training process is designed to tune the network, where the sample generation is guided by approximating underlying distributions of local data and making distinct predictions among experts. Notably, the framework also fuels FL with flexible, data-free aggregation and heterogeneity tolerance. Experiments on 4 datasets show that the proposed framework maintains the one-shot efficiency, facilitates superior performance compared with 8 OFL baselines (+5.54% on CIFAR-10), and even attains over $\times$4 performance gain compared with 3 multi-round FL methods, while only requiring less than 85% trainable parameters.
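A minimal sketch of the "each local model as an independent expert" idea, assuming the uploaded client models are classifiers over a shared label space; the gating network operating on flattened inputs is an illustrative simplification, not the paper's architecture:

```python
import torch
import torch.nn as nn

class OneShotMoE(nn.Module):
    # Keep every client model intact as a frozen expert and learn only a gate
    # that mixes their predictions per input sample.
    def __init__(self, client_models, feat_dim, num_classes):
        super().__init__()
        self.experts = nn.ModuleList(client_models)
        for p in self.experts.parameters():
            p.requires_grad_(False)                      # local knowledge stays intact
        self.gate = nn.Linear(feat_dim, len(client_models))

    def forward(self, x):
        weights = torch.softmax(self.gate(x.flatten(1)), dim=-1)   # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, C)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)           # weighted mixture
```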



Paperid:829 Poster
Authors:Yeqing Shen,Shang Li,Kun Song
Abstract:
Due to its high speed and low latency, DVS is frequently employed in motion deblurring. Ideally, high-quality events would adeptly capture intricate motion information. However, real-world events are generally degraded, thereby introducing significant artifacts into the deblurred results. In response to this challenge, we model the degradation of events and propose RDNet to improve the quality of image deblurring. Specifically, we first analyze the mechanisms underlying degradation and simulate paired events based on that. These paired events are then fed into the first stage of the RDNet for training the restoration model. The events restored in this stage serve as a guide for the second-stage deblurring process. To better assess the deblurring performance of different methods on real-world degraded events, we present a new real-world dataset named DavisMCR. This dataset incorporates events with diverse degradation levels, collected by manipulating environmental brightness and target object contrast. Our experiments are conducted on synthetic datasets (GOPRO), real-world datasets (REBlur), and the proposed dataset (DavisMCR). The results demonstrate that RDNet outperforms classical event denoising methods in event restoration. Furthermore, RDNet exhibits better performance in deblurring tasks compared to state-of-the-art methods. DavisMCR is available at https://github.com/Yeeesir/DVS_RDNet.



Paperid:830 Poster
Authors:Ruiyang Xia,Dawei Zhou,Decheng Liu,Lin Yuan,Shuodi Wang,Jie Li,Nannan Wang,Xinbo Gao
Abstract:
One of the serious impacts brought by artificial intelligence is the abuse of deepfake techniques. Despite the proliferation of deepfake detection methods aimed at safeguarding the authenticity of media across the Internet, they mainly consider the improvement of detector architecture or the synthesis of forgery samples. The forgery perceptions, including the feature responses and prediction scores for forgery samples, have not been well considered. As a result, the generalization across multiple deepfake techniques always comes with complicated detector structures and expensive training costs. In this paper, we shift the focus to real-time perception analysis in the training process and generalize deepfake detectors through an efficient method dubbed Forgery Perception Guidance (FPG). In particular, after investigating the deficiencies of forgery perceptions, FPG adopts a sample refinement strategy to pertinently train the detector, thereby elevating the generalization efficiently. Moreover, FPG introduces more sample information as explicit optimizations, which makes the detector further adapt the sample diversities. Experiments demonstrate that FPG improves the generality of deepfake detectors with small training costs, minor detector modifications, and the acquirement of real data only. In particular, our approach not only outperforms the state-of-the-art on both the cross-dataset and cross-manipulation evaluation but also surpasses the baseline that needs more than 3$\times$ training time. Code is available in the supplementary material.



Paperid:831 Poster
Authors:Hao Fang,Haoyuan Zhao,Jianxin Shi,Miao Zhang,Guanzhen Wu,Yi Ching Chou,FENG WANG,Jiangchuan Liu
Abstract:
Live streaming has experienced significant growth recently. Yet this rise in popularity contrasts with the reality that a substantial segment of the global population still lacks Internet access. The emergence of Low Earth orbit Satellite Networks (LSNs), such as SpaceX's Starlink and Amazon's Project Kuiper, presents a promising solution to this issue. Nevertheless, our measurement study reveals that existing live streaming platforms may not be able to deliver a smooth viewing experience on LSNs due to frequent satellite handovers, leading to frequent rebuffering events. Current state-of-the-art learning-based Adaptive Bitrate (ABR) algorithms, even when trained on satellite network traces, fail to manage the abrupt network variations associated with these handovers effectively. To address these challenges, for the first time, we introduce Satellite-Aware Rate Adaptation (SARA), a versatile and lightweight middleware that can be seamlessly integrated with various ABR algorithms to enhance the performance of live streaming over LSNs. SARA intelligently modulates video playback speed and furnishes ABR algorithms with key insights derived from the distinctive network characteristics of LSNs, thereby aiding ABR algorithms in making informed bitrate selections and effectively minimizing rebuffering events that occur during satellite handovers. Our extensive evaluation shows that SARA can effectively reduce the rebuffering time by an average of 39.41% and slightly improve latency by 0.65% while only introducing an overall loss in bitrate by 0.13%.



Paperid:832 Poster
Authors:Muquan Li,Dongyang Zhang,Tao He,Xiurui Xie,Yuan-Fang Li,Ke Qin
Abstract:
Data-free knowledge distillation (DFKD) has emerged as a pivotal technique in the domain of model compression, substantially reducing the dependency on the original training data. Nonetheless, conventional DFKD methods that employ synthesized training data are prone to the limitations of inadequate diversity and discrepancies in distribution between the synthesized and original datasets. To address these challenges, this paper introduces an innovative approach to DFKD through diverse diffusion augmentation (DDA). Specifically, we revise the paradigm of common data synthesis in DFKD to a composite process through leveraging diffusion models subsequent to data synthesis for self-supervised augmentation, which generates a spectrum of data samples with similar distributions while retaining controlled variations. Furthermore, to mitigate excessive deviation in the embedding space, we introduce an image filtering technique grounded in cosine similarity to maintain fidelity during the knowledge distillation process. Comprehensive experiments conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets showcase the superior performance of our method across various teacher-student network configurations, outperforming the contemporary state-of-the-art DFKD methods.
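A small sketch of the cosine-similarity-based filtering step described above, assuming embeddings for the synthesized samples and their diffusion-augmented versions have already been extracted; the threshold value is illustrative, not the paper's setting:

```python
import torch
import torch.nn.functional as F

def filter_augmented_samples(orig_feats, aug_feats, threshold=0.8):
    # orig_feats, aug_feats: (N, d) embeddings of source and augmented samples.
    # Keep only augmented samples whose embeddings stay close to their source,
    # discarding those that drift too far in the embedding space.
    sims = F.cosine_similarity(orig_feats, aug_feats, dim=-1)   # (N,)
    return sims >= threshold                                    # boolean keep-mask
```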



Paperid:833 Poster
Authors:Yi Zhang,Ke Yu,Angelica I Aviles-Rivero,Jiyuan Jia,Yushun Tang,Zhihai He
Abstract:
In this paper, we address the challenge of adapting vision-language models (VLMs) to few-shot image recognition in a training-free manner. We observe that existing methods are not able to effectively characterize the semantic relationship between support and query samples in a training-free setting. We recognize that, in the semantic feature space, the feature of the query image is a linear and sparse combination of support image features since support-query pairs are from the same class and share the same small set of distinctive visual attributes. Motivated by this interesting observation, we propose a novel method called Training-free Feature ReConstruction with Sparse optimization (TaCo), which formulates the few-shot image recognition task as a feature reconstruction and sparse optimization problem. Specifically, we exploit the VLM to encode the query and support images into features. We utilize sparse optimization to reconstruct the query feature from the corresponding support features. The feature reconstruction error is then used to define the reconstruction similarity. Coupled with the text-image similarity provided by the VLM, our reconstruction similarity analysis accurately characterizes the relationship between support and query images. This results in significantly improved performance in few-shot image recognition. Our extensive experimental results on few-shot recognition demonstrate that the proposed method outperforms existing state-of-the-art approaches by substantial margins.
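A minimal sketch of the reconstruction-similarity idea: recover the query feature as a sparse linear combination of support features and score the class by the reconstruction error. The Lasso solver and penalty weight are illustrative assumptions, not the paper's optimizer:

```python
import numpy as np
from sklearn.linear_model import Lasso

def reconstruction_similarity(query_feat, support_feats, alpha=0.05):
    # query_feat: (d,) query embedding; support_feats: (n_support, d) class supports.
    # Solve min_w ||q - S^T w||^2 + alpha * ||w||_1 for a sparse combination w.
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(support_feats.T, query_feat)          # design matrix: (d, n_support)
    recon = support_feats.T @ lasso.coef_           # reconstructed query feature
    return -np.linalg.norm(query_feat - recon)      # lower error -> higher similarity
```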



Paperid:834 Poster
Authors:Yiming Zhong,Xiaolin Zhang,Yao Zhao,Yunchao Wei
Abstract:
Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smoothing issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages out the details of the 3D contents. In this paper, to address the issue, we propose DreamLCM, which incorporates the Latent Consistency Model (LCM). DreamLCM leverages the powerful image generation capabilities inherent in LCM, enabling the generation of consistent and high-quality guidance, i.e., predicted noises or images. Powered by the improved guidance, the proposed method can provide accurate and detailed gradients to optimize the target 3D models. In addition, we propose two strategies to enhance the generation quality further. Firstly, we propose a guidance calibration strategy, utilizing the Euler solver to calibrate the guidance distribution and accelerate the convergence of 3D models. Secondly, we propose a dual-timestep strategy, which helps DreamLCM increase the consistency of guidance and optimize 3D models from geometry to appearance. Experiments show that DreamLCM achieves state-of-the-art results in both generation quality and training efficiency.



Paperid:835 Poster
Authors:Xinwei Liu,Xiaojun Jia,Yuan Xun,Siyuan Liang,Xiaochun Cao
Abstract:
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet. However, this reliance poses privacy risks, as hackers may unauthorizedly exploit image-text data for model training, potentially including personal and privacy-sensitive information. Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection. However, they are designed for unimodal classification, which remains largely unexplored in MCL. We first explore this context by evaluating the performance of existing methods on image-caption pairs, and they fail to effectively build shortcuts due to the lack of labels and the dispersion of pairs in MCL. In this paper, we propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples. It extends the Error-Minimization (EM) framework to optimize both image noise and an additional text trigger, thereby enlarging the optimized space and effectively misleading the model to learn the shortcut between the noise features and the text trigger. Specifically, we adopt projected gradient descent to solve the noise minimization problem and use HotFlip to approximate the gradient and replace words to find the optimal text trigger. Extensive experiments demonstrate the effectiveness of MEM, with post-protection retrieval results nearly half of random guessing, and its high transferability across different models.
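A rough sketch of one projected-gradient step for the image-noise part of the error-minimization described above; contrastive_loss is a hypothetical placeholder for whatever image-caption loss the protected model exposes, and the budget and step size are illustrative values:

```python
import torch

def error_minimizing_step(model, images, captions, delta, epsilon=8/255, alpha=1/255):
    # Descend on the loss with respect to the perturbation (error-minimizing
    # noise), then project back into the L_inf ball of radius epsilon.
    delta = delta.detach().requires_grad_(True)
    loss = model.contrastive_loss(images + delta, captions)   # placeholder loss API
    loss.backward()
    with torch.no_grad():
        delta = delta - alpha * delta.grad.sign()             # minimize, not maximize
        delta = delta.clamp(-epsilon, epsilon)                # stay within the budget
        delta = (images + delta).clamp(0, 1) - images         # keep pixel values valid
    return delta
```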



Paperid:836 Poster
Authors:Jiaqi Wang,Pichao WANG,Yi Feng,Huafeng Liu,Chang Gao,Liping Jing
Abstract:
Most works of interpretable neural networks strive for learning the semantics concepts merely from single modal information such as images. However, humans usually learn semantic concepts from multiple modalities and the semantics is encoded by the brain from fused multi-modal information. Inspired by cognitive science and vision-language learning, we propose a Prototype-Concept Alignment Network (ProCoNet) for learning visual prototypes under the guidance of textual concepts. In the ProCoNet, we have designed a visual encoder to decompose the input image into regional features of prototypes, while also developing a prompt generation strategy that incorporates in-context learning to prompt large language models to generate textual concepts. To align visual prototypes with textual concepts, we leverage the multimodal space provided by the pre-trained CLIP as a bridge. Specifically, the regional features from the vision space and the cropped regions of prototypes encoded by CLIP reside on different but semantically highly correlated manifolds, i.e. follow a multi-manifold distribution. We transform the multi-manifold distribution alignment problem into optimizing the projection matrix by Cayley transform on the Stiefel manifold. Through the learned projection matrix, visual prototypes can be projected into the multimodal space to align with semantically similar textual concept features encoded by CLIP. We conducted two case studies on the CUB-200-2011 and Oxford Flower dataset. Our experiments show that the ProCoNet provides higher accuracy and better interpretability compared to the single-modality interpretable model. Furthermore, ProCoNet offers a level of interpretability not previously available in other interpretable methods.
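A small sketch of the Cayley-transform parameterization mentioned above: an unconstrained matrix is mapped to an orthogonal one, whose leading columns give a point on the Stiefel manifold usable as a projection matrix. This is the generic construction, not ProCoNet's full alignment objective:

```python
import torch

def cayley_projection(raw, k):
    # raw: (d, d) unconstrained parameter; returns a (d, k) column-orthonormal matrix.
    A = raw - raw.T                                   # skew-symmetric part
    I = torch.eye(raw.shape[0], dtype=raw.dtype)
    Q = (I - A) @ torch.linalg.inv(I + A)             # Cayley transform: Q is orthogonal
    return Q[:, :k]                                   # point on the Stiefel manifold
```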



Paperid:837 Poster
Authors:Xiao Han,Zhenduo zhang,Yiling Wu,Xinfeng Zhang,Zhe Wu
Abstract:
With the development of deep learning, traffic forecasting technology has made significant progress and is being applied in many practical scenarios. However, various events held in cities, such as sporting events, exhibitions, concerts, etc., have a significant impact on traffic patterns of surrounding areas, causing current advanced prediction models to fail in this case. In this paper, to broaden the applicable scenarios of traffic forecasting, we focus on modeling the impact of events on traffic patterns and propose an event traffic forecasting problem with multimodal inputs. We outline the main challenges of this problem: diversity and sparsity of events, as well as insufficient data. To address these issues, we first use textual modal data containing rich semantics to describe the diverse characteristics of events. Then, we propose a simple yet effective multi-modal event traffic forecasting model that uses pre-trained text and traffic encoders to extract the embeddings and fuses the two embeddings for prediction. Encoders pre-trained on large-scale data have powerful generalization abilities to cope with the challenge of sparse data. Next, we design an efficient LLM-based event description text generation pipeline to build SZCEC, a multi-modal event traffic forecasting dataset. Experiments on this real-world dataset show that our method achieves state-of-the-art performance compared with eight baselines, reducing mean absolute error during the event peak period by 4.26%.



Paperid:838 Poster
Authors:Tao Wang,Yushu Zhang,Xiangli Xiao,Lin Yuan,Zhihua Xia,Jian Weng
Abstract:
The significant advancement in face recognition drives face privacy protection into a prominent research direction. Unlike de-identification, a recent class of face privacy protection schemes preserves identifiable information for face recognition. However, these schemes fail to support the revocation of the leaked identity, allowing attackers to potentially identify individuals and pose security threats. In this paper, we explore the possibility of generating privacy-preserving faces (not features) supporting cancelable biometric recognition. Specifically, we propose a cancelable face generator (CanFG), which removes the physical identity for privacy protection and embeds the virtual identity for face recognition. Particularly, when leaked, the virtual identity can be revoked and renewed as another one, preventing re-identification by attackers. Benefiting from the designed distance-preserving identity transformation, CanFG can guarantee separability and preserve recognizability of virtual identities. Moreover, to make CanFG lightweight, we design a simple but effective training strategy, which allows CanFG to require only one (rather than two) network for achieving stable multi-objective learning. Extensive experimental results and sufficient security analyses demonstrate the ability of CanFG to effectively protect physical identity and support cancelable biometric recognition. Our code is available at https://xxxxxxx/xxxx/xxxx.
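To illustrate the distance-preserving identity transformation concept, a toy sketch using a key-dependent orthogonal rotation of identity embeddings: orthogonal maps preserve pairwise distances (so separability and recognizability are retained), and re-issuing the key revokes the old virtual identity. This is a generic construction for illustration, not CanFG itself:

```python
import numpy as np

def key_rotation(dim, key):
    # Derive a key-dependent random orthogonal matrix via QR decomposition.
    rng = np.random.default_rng(key)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return Q

def virtual_identity(embedding, key):
    # Orthogonal transforms preserve Euclidean distances between embeddings,
    # so matching behaviour is kept; switching the key cancels the identity.
    return key_rotation(embedding.shape[-1], key) @ embedding
```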



Paperid:839 Poster
Authors:Junlin Fang,Wenya Wang,Guosheng Lin,Fengmao Lv
Abstract:
Sarcasm is an intricate expression phenomenon and has garnered increasing attentions over the recent years, especially for multimodal contexts such as videos. Nevertheless, despite being a significant aspect of human sentiment, the effect of sarcasm is consistently overlooked in sentiment analysis. Videos with sarcasm often convey sentiments that diverge or even contradict their explicit messages. Prior works mainly concentrate on simply modeling sarcasm and sentiment features by utilizing the Multi-Task Learning (MTL) framework, which we found introduces detrimental interplays between the sarcasm detection task and sentiment analysis task. Therefore, this study explores the effective enhancement of video sentiment analysis through the incorporation of sarcasm information. To this end, we propose the Progressively Sentiment-oriented Sarcasm Refinement and Integration (PS2RI) framework, which focuses on modeling sentiment-oriented sarcasm features to enhance sentiment prediction. Instead of naively combining sarcasm detection and sentiment prediction under an MTL framework, PS2RI iteratively performs the sentiment-oriented sarcasm refinement and sarcasm integration operations within the sentiment recognition framework, in order to progressively learn sarcasm-aware sentiment feature without suffering the detrimental interplays caused by information irrelevant to the sentiment analysis task. Extensive experiments are conducted to validate both the effectiveness and scalability of our approach.



Paperid:840 Poster
Authors:Xue Li,YU Jiong,Ziyang Li,Hongchun Lu,Ruifeng Yuan
Abstract:
The field of Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is currently undergoing a paradigm shift, transitioning from specialized models designed for individual tasks to more general retrieval models capable of managing various specialized scenarios. Inspired by the impressive generalization ability of the Contrastive Language-Image Pretraining (CLIP) model, we propose a CLIP-driven universal framework (Dr. CLIP), which leverages prompt learning to guide the synergy between CLIP and ZS-SBIR. Specifically, Dr. CLIP is a multi-branch network based on the CLIP image encoder and text encoder, which can cover all four variants of ZS-SBIR tasks (inter-category, intra-category, cross-dataset, and generalization). Moreover, we decompose the synergy into classification learning, metric learning, and ranking learning, and introduce three key components to enhance learning effectiveness: i) a forgetting suppression idea is applied to prevent catastrophic forgetting and constrain the feature distribution of the new categories in classification learning; ii) a domain-balanced loss is proposed to address sample imbalance and establish effective cross-domain correlations in metric learning; iii) a pair-relation strategy is introduced to capture relevance and ranking relationships between instances in ranking learning. Finally, we reorganize and redivide three coarse-grained datasets and two fine-grained datasets to accommodate the training settings for the four ZS-SBIR tasks. Comparison experiments confirm that our method surpasses state-of-the-art (SOTA) methods by a significant margin (1.95%-19.14% mAP), highlighting its generality and superiority.



Paperid:841 Poster
Authors:Yuanbin Wang,Weilun Dai,Long Chan,Huanyu Zhou,Aixi Zhang,Si Liu
Abstract:
Video Virtual Try-On aims to transfer a garment onto a person in the video. Previous methods typically focus on image-based virtual try-on, but directly applying these methods to videos often leads to temporal discontinuity due to inconsistencies between frames. Limited attempts in video virtual try-on also suffer from unrealistic results and poor generalization ability. In light of previous research, we posit that the task of video virtual try-on can be decomposed into two key aspects: (1) single-frame results are realistic and natural, while retaining consistency with the garment; (2) the person's actions and the garment are coherent throughout the entire video. To address these two aspects, we propose a novel two-stage framework based on Latent Diffusion Model, namely Garment-Preserving Diffusion for Video Virtual Try-On (GPD-VVTO). In the first stage, the model is trained on single-frame data to improve the ability of generating high-quality try-on images. We integrate both low-level texture features and high-level semantic features of the garment into the denoising network to preserve garment details while ensuring a natural fit between the garment and the person. In the second stage, the model is trained on video data to enhance temporal consistency. We devise a novel Garment-aware Temporal Attention (GTA) module that incorporates garment features into temporal attention, enabling the model to maintain the fidelity to the garment during temporal modeling. Furthermore, we collect a video virtual try-on dataset containing high-resolution videos from diverse scenes, addressing the limited variety of current datasets in terms of video background and human actions. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods in both image-based and video-based virtual try-on tasks, indicating the effectiveness of our proposed framework.



Paperid:842 Poster
Authors:Wang Jiangyi,Zhongyao Cheng,Na Zhao,Jun Cheng,Xulei Yang
Abstract:
Point cloud analysis is challenging due to its unordered, sparse, and irregular nature. Prior works attempt to capture local relationships by convolution operations or attention mechanisms, exploiting geometric information from coordinates implicitly. These methods, however, are insufficient to describe the explicit local geometry, e.g., curvature and orientation. In this paper, we propose On-the-fly Point Feature Representation (OPFR), which captures abundant geometric information explicitly through the Curve Feature Generator module. This is inspired by the Point Feature Histogram (PFH) from the computer vision community. However, the utilization of vanilla PFH encounters great difficulties when applied to large datasets and dense point clouds, as it demands considerable time for feature generation. To mitigate this cost, we introduce the Local Reference Constructor module, which approximates the local coordinate systems based on triangle sets. Owing to this, our OPFR only requires an extra 1.56 ms for inference (65$\times$ faster than vanilla PFH) and 0.012M additional parameters, and it can serve as a versatile plug-and-play module for various backbones, particularly the MLP-based and Transformer-based backbones examined in this study. Additionally, we introduce the novel Hierarchical Sampling module aimed at enhancing the quality of triangle sets, thereby ensuring the robustness of the obtained geometric features. Our proposed method improves overall accuracy (OA) on ModelNet40 from 90.7% to 94.5% (+3.8%) for classification, and OA on S3DIS Area-5 from 86.4% to 90.0% (+3.6%) for semantic segmentation, respectively, building upon the PointNet++ backbone. When integrated with the Point Transformer backbone, we achieve state-of-the-art results on both tasks: 94.8% OA on ModelNet40 and 91.7% OA on S3DIS Area-5.
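Since OPFR takes the classical Point Feature Histogram as its starting point, a short background sketch of the standard PFH angular features (alpha, phi, theta) for a single oriented point pair may help; this reproduces the textbook formulation, not the paper's Curve Feature Generator or Local Reference Constructor.

```python
import numpy as np

# Background only: the classical PFH Darboux-frame angles for one point pair,
# computed from positions p1, p2 and unit normals n1, n2. OPFR's modules are
# built on top of, but are not identical to, this formulation.
def pfh_angles(p1, n1, p2, n2):
    d = p2 - p1
    dist = np.linalg.norm(d)
    u = n1                                  # first frame axis
    v = np.cross(u, d / dist)               # assumes u is not parallel to d
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    alpha = float(np.dot(v, n2))
    phi = float(np.dot(u, d / dist))
    theta = float(np.arctan2(np.dot(w, n2), np.dot(u, n2)))
    return alpha, phi, theta

p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
p2 = np.array([1.0, 0.0, 0.5])
n2 = np.array([0.0, 1.0, 0.1]); n2 /= np.linalg.norm(n2)
print(pfh_angles(p1, n1, p2, n2))
```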



Paperid:843 Poster
Authors:Song Wu,Xiaoyu Wei,Xinyue Chen,Yazhou Ren,Jing He,Xiaorong Pu
Abstract:
Semi-supervised medical image segmentation (SSMIS) has gained increasing attention due to its potential to alleviate the manual annotation burden. Mainstream methods typically involve two subnets and enforce a consistency objective to ensure that they produce consistent predictions for unlabeled data. However, they often ignore that the complementarity of model predictions is equally crucial for SSMIS. To realize the potential of the multi-subnet architecture, we propose a novel cross-view mutual learning method with a two-branch co-training framework. Specifically, we first introduce a new conflict-based feature learning (CFL) paradigm that encourages the two subnets to learn distinct features from the same input. These distinct features are then decoded into complementary model predictions, allowing both subnets to understand the input from different views. More importantly, we propose a cross-view mutual learning (CML) method to maximize the effectiveness of CFL. This approach requires only modifications to the model inputs and supervisory signals, and implements a heterogeneous consistency objective to fully explore the complementarity of model predictions. Consequently, the aggregated predictions can effectively capture both consistency and complementarity across all views. Experimental results on three public datasets demonstrate the superiority of our CML over previous state-of-the-art methods.



Paperid:844 Poster
Authors:Shicheng Yang,Xiaoxu Li,Dongliang Chang,Zhanyu Ma,Jing-Hao Xue
Abstract:
Few-shot fine-grained image classification aims to use only a few labelled samples to successfully recognize subtle sub-classes within the same parent class. This task is extremely challenging, due to the co-occurrence of large inter-class similarity, low intra-class similarity, and only a few labelled samples. In this paper, to address these challenges, we propose a new Channel-Spatial Cross-Attention Module (CSCAM), which can effectively drive a model to extract discriminative fine-grained feature representations with only a few shots. CSCAM collaboratively integrates a channel cross-attention module and a spatial cross-attention module, for the attentions across support and query samples. In addition, to fit the characteristics of fine-grained images, a support averaging method is proposed in CSCAM to reduce the intra-class distance and increase the inter-class distance. Extensive experiments on four few-shot fine-grained classification datasets validate the effectiveness of CSCAM. Furthermore, CSCAM is a plug-and-play module, conveniently enabling effective improvement of state-of-the-art methods for few-shot fine-grained image classification.



Paperid:845 Poster
Authors:Yunfeng FAN,Wenchao Xu,Haozhao Wang,Junhong Liu,Song Guo
Abstract:
Recently, Multimodal Learning (MML) has gained significant interest as it compensates for single-modality limitations through comprehensive complementary information within multimodal data. However, traditional MML methods generally use the joint learning framework with a uniform learning objective that can lead to the modality competition issue, where feedback predominantly comes from certain modalities, limiting the full potential of others. In response to this challenge, this paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities under the premise of avoiding modality competition. Specifically, DI-MML addresses competition by separately training each modality encoder with isolated learning objectives. It further encourages cross-modal interaction via a shared classifier that defines a common feature space and by employing a dimension-decoupled unidirectional contrastive (DUC) loss to facilitate modality-level knowledge transfer. Additionally, to account for varying reliability in sample pairs, we devise a certainty-aware logit weighting strategy to effectively leverage complementary information at the instance level during inference. Extensive experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method.
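As a rough illustration of certainty-aware logit weighting at inference (the paper's exact weighting may differ), the sketch below scores each modality's logits with one plausible certainty proxy, the negative normalized entropy, and fuses them accordingly; the function names and the proxy itself are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of certainty-aware logit fusion for two modalities.
# Certainty is approximated by 1 - normalized entropy of the softmax output;
# the actual DI-MML weighting strategy may be defined differently.
def fuse_logits(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    def certainty(logits: torch.Tensor) -> torch.Tensor:
        p = F.softmax(logits, dim=-1)
        ent = -(p * p.clamp_min(1e-12).log()).sum(dim=-1, keepdim=True)
        return 1.0 - ent / torch.log(torch.tensor(float(logits.shape[-1])))
    w = torch.softmax(
        torch.cat([certainty(logits_a), certainty(logits_b)], dim=-1), dim=-1)
    return w[..., :1] * logits_a + w[..., 1:] * logits_b

audio_logits, visual_logits = torch.randn(4, 10), torch.randn(4, 10)
print(fuse_logits(audio_logits, visual_logits).shape)  # torch.Size([4, 10])
```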



Paperid:846 Poster
Authors:Xudong Lv,Zhiwei He,Yuxiang Yang,Jiahao Nie,Jing Zhang
Abstract:
Neural implicit representations have recently revolutionized simultaneous localization and mapping (SLAM), giving rise to a groundbreaking paradigm known as NeRF-based SLAM. However, existing methods often fall short in accurately estimating poses and reconstructing scenes. This limitation largely stems from their reliance on volume rendering techniques, which oversimplify the modeling process. In this paper, we introduce a novel neural implicit SLAM system designed to address these shortcomings. Our approach reconstructs Neural Radiance Fields (NeRFs) using a self-attentive architecture and represents scenes through neural point cloud encoding. Unlike previous NeRF-based SLAM methods, which depend on traditional volume rendering equations for scene representation and view synthesis, our method employs a self-attentive rendering framework with the Transformer architecture during mapping and tracking stages. To enable incremental mapping, we anchor scene features within a neural point cloud, striking a balance between estimation accuracy and computational cost. Experimental results across three challenging datasets demonstrate the superior performance and robustness of our proposed approach compared to recent NeRF-based SLAM systems. The code will be released.



Paperid:847 Poster
Authors:Weitao Tang,Jianqiang Li,Meijie Du,Die Hu,Qingyun Liu
Abstract:
Some video traffic carries harmful content, such as hate speech and child abuse, and is primarily encrypted and transmitted through Dynamic Adaptive Streaming over HTTP (DASH). Promptly identifying and intercepting traffic of harmful videos is crucial in network regulation. However, QUIC is becoming another DASH transport protocol in addition to TCP. On the other hand, complex network environments and diverse playback modes lead to significant distortions in traffic. The issues above have not been effectively addressed. This paper proposes a real-time identification method for DASH encrypted video traffic with distortion, named Zenith. We extract stable video segment sequences under various itags as video fingerprints to tackle resolution changes and propose a method of traffic fingerprint extraction under QUIC and VPN. Subsequently, casting the sequence matching problem as a natural language problem, we propose the Traffic Language Model (TLM), which can effectively address video data loss and retransmission. Finally, we propose a frequency dictionary to further accelerate Zenith. Zenith significantly improves accuracy and speed compared to other SOTA methods in various complex scenarios, especially under QUIC, VPN, automatic resolution, and low bandwidth. Zenith requires traffic for just half a minute of video content to achieve precise identification, demonstrating its real-time effectiveness. The project page is available at https://anonymous.4open.science/r/Zenith-Anonymous.



Paperid:848 Poster
Authors:Shuhuang Chen,Dingjie Fu,Shiming Chen,shuo Ye,Wenjin Hou,Xinge You
Abstract:
Zero-Shot Learning (ZSL) correlates visual samples and shared semantic information to transfer knowledge from seen classes to unseen classes. Existing methods typically establish visual-semantic correlation by aligning visual and semantic features, which are extracted from visual samples and semantic information, respectively. However, instance-level images, owing to singular observation perspectives and diverse individuals, cannot exactly match the comprehensive semantic information defined at the class level. Direct feature alignment imposes correlation between mismatched vision and semantics, resulting in spurious visual-semantic correlation. To address this, we propose a novel method termed Causal Visual-semantic Correlation (CVsC) to learn substantive visual-semantic correlation for ZSL. Specifically, we utilize a Visual Semantic Attention module to facilitate interaction between vision and semantics, thereby identifying attribute-related visual features. Furthermore, we design a Conditional Correlation Loss to properly utilize semantic information as supervision for establishing visual-semantic correlation. Moreover, we introduce counterfactual intervention applied to attribute-related visual features, and maximize their impact on semantic and target predictions to enhance substantive visual-semantic correlation. Extensive experiments conducted on three benchmark datasets (i.e., CUB, SUN, and AWA2) demonstrate that our CVsC outperforms existing state-of-the-art methods.



Paperid:849 Poster
Authors:Wenquan Lu,Yufei Xu,Jing Zhang,Chaoyue Wang,Dacheng Tao
Abstract:
Diffusion models have achieved remarkable success in generating realistic images but struggle to generate accurate human hands, producing, for example, incorrect finger counts or irregular shapes. This difficulty arises from the complex task of learning the physical structure and pose of hands from training images, which involves extensive deformations and occlusions. For correct hand generation, our paper introduces a lightweight post-processing solution called $\textbf{HandRefiner}$. HandRefiner employs a conditional inpainting approach to rectify malformed hands while leaving other parts of the image untouched. We leverage a hand mesh reconstruction model that consistently adheres to the correct number of fingers and hand shape, while also being capable of fitting the desired hand pose in the generated image. Given an image whose generation failed due to malformed hands, we utilize ControlNet modules to re-inject the correct hand information. Additionally, we uncover a phase transition phenomenon within ControlNet as we vary the control strength. It enables us to take advantage of more readily available synthetic data without suffering from the domain gap between realistic and synthetic hands. Experiments demonstrate that HandRefiner can significantly improve the generation quality quantitatively and qualitatively. The code will be released.



Paperid:850 Poster
Authors:Wenjie Xuan,Yufei Xu,Shanshan Zhao,Chaoyue Wang,Juhua Liu,Bo Du,Dacheng Tao
Abstract:
ControlNet excels at creating content that closely matches precise contours in user-provided masks. However, when these masks contain noise, a frequent occurrence with non-expert users, the output includes unwanted artifacts. This paper first highlights the crucial role of controlling the impact of these inexplicit masks with diverse deterioration levels through in-depth analysis. Subsequently, to enhance controllability with inexplicit masks, an advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised. The deterioration estimator assesses the deterioration factor of the provided masks. This factor is then utilized in the modulation block to adaptively modulate the model's contour-following ability, which helps it ignore the noisy parts of the inexplicit masks. Extensive experiments prove its effectiveness in encouraging ControlNet to interpret inaccurate spatial conditions robustly rather than blindly following the given contours, making it suitable for diverse kinds of conditions. We showcase application scenarios like modifying shape priors and composable shape-controllable generation. Code will be available soon.



Paperid:851 Poster
Authors:Weifeng Chen,Tao Gu,Yuhao Xu,Arlene Chen
Abstract:
We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. When generating customized characters wearing the target garments under diverse text prompts, image controllability is the most critical issue, i.e., preserving the garment details while maintaining faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage the joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image to the source garment. Extensive experiments demonstrate that our Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is publicly available (for the review process, please refer to our supplementary material).



Paperid:852 Poster
Authors:Chengyou Jia,Minnan Luo,Xiaojun Chang,Zhuohang Dang,Mingfei Han,Mengmeng Wang,Guang Dai,Sizhe Dang,Jingdong Wang
Abstract:
Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we propose the Action-Centric generation strategy to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.



Paperid:853 Poster
Authors:Yi Wang,Ningze Zhong,Minglin Chen,Longguang Wang,Yulan Guo
Abstract:
With the growth of the VR and AR industries, 3D reconstruction has become an increasingly important topic in multimedia. Although 3D Gaussian Splatting is the state-of-the-art method for 3D reconstruction, it needs a large number of Gaussians to fit a 3D scene due to the Gibbs phenomenon. The pursuit of compressing 3D Gaussian Splatting and reducing memory overhead has long been a focal point. Embarking on this trajectory, our study delves into this domain, aiming to mitigate these challenges. Inspired by the tangram, an ancient Chinese puzzle, we introduce a novel methodology (Tangram-Splatting) that leverages shape priors to optimize 3D scene fitting. Central to our approach is a pioneering technique that diversifies Gaussian function types while preserving algorithmic efficiency. Through exhaustive experimentation, we demonstrate that our method achieves a remarkable average reduction of 62.4% in the memory consumption used to store optimized parameters and decreases the training time by at least 10 minutes, with only marginal sacrifices in PSNR performance, typically under 0.3 dB, and on some datasets our algorithm even performs better. This reduction in memory burden is of paramount significance for real-world applications, mitigating the substantial memory footprint and transmission burden traditionally associated with such algorithms. Our algorithm underscores the profound potential of Tangram-Splatting in advancing multimedia applications.



Paperid:854 Poster
Authors:Wu Chen,Hehe Fan,Qiuping Jiang,Chao Huang,Yi Yang
Abstract:
Due to the limitations of collection devices and unstable scanning processes, point cloud data is usually noisy. This noise deforms the underlying structures of point clouds and inevitably affects downstream tasks such as rendering, reconstruction and analysis. In this paper, we propose a Cross-stage Cross-coder Adaptive Edge Graph Convolution Network (C$^{2}$AENet) to denoise point clouds. Our network uses multiple stages to progressively and iteratively denoise points. To improve the effectiveness, we add connections between two stages and between the encoder and decoder, leading to the cross-stage cross-coder architecture. Additionally, existing graph-based point cloud learning methods tend to capture local structure. They typically construct a semantic graph based on semantic distance, which may ignore Euclidean neighbors and lead to insufficient geometry perception. Therefore, we introduce a geometric graph and adaptively calculate edge attention based on the local and global structural information of the points. This results in a novel graph convolution module that allows the network to capture richer contextual information and focus on more important parts. Extensive experiments demonstrate that the proposed method is competitive compared with other state-of-the-art methods. The code will be made publicly available.



Paperid:855 Poster
Authors:Heng Jia,Yunqiu Xu,Linchao Zhu,Guang Chen,Yufei Wang,Yi Yang
Abstract:
Video captioning is a challenging task and typically requires video-text paired data for training. However, manually annotating coherent textual descriptions for videos is laborious and time-consuming. To address this problem, we propose to utilize solely text data to enhance video captioning models. Drawing inspiration from the exceptional text generation capabilities demonstrated by large language models (LLMs), we aim to leverage these models to generate high-quality and high-diversity video captions for the target domain. Specifically, we prompt GPT-4 with few-shot target-domain captions to generate a limited set of plausible video captions. Subsequently, we continue to prompt GPT-4 with the generated captions to acquire large-scale captions. To fully exploit the generated captions, we propose a Mixture of Scale and Shift experts (MoS$^2$) for efficient adaptation of pre-trained image captioning models for video captioning. MoS$^2$ estimates a probability distribution over a collection of experts by a lightweight routing network, determining the allocation of tokens to appropriate experts. This dynamic adjustment mechanism allows for specific responses to input features, thereby enhancing the model's ability to handle data variations. Our approach not only customizes model responses to input variations, effectively addressing the distribution shift between synthetic and actual captions but also significantly reduces the number of learnable parameters, allowing for more efficient adaptations. With only text data, we achieve superior performance and significantly narrow the performance gap between zero-shot and fine-tuned models. Our method boosts video captioning performance with the synthetic text data, thus substantially alleviating the dependence on paired and large-scale real data of the target domain. The code will be publicly available.
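To give a flavour of what a mixture of scale-and-shift experts can look like, here is a hedged sketch in which each expert is a per-channel affine transform and a lightweight linear router mixes them per token; the expert form, placement, and dimensions are assumptions rather than the paper's MoS$^2$ specification.

```python
import torch
import torch.nn as nn

# Hypothetical mixture of scale-and-shift experts: a linear router produces
# per-token mixing weights over E experts, each holding a channel-wise
# (scale, shift) pair. Not the authors' exact MoS^2 module.
class MoScaleShift(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(num_experts, dim))
        self.shifts = nn.Parameter(torch.zeros(num_experts, dim))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, T, D)
        probs = self.router(x).softmax(dim=-1)              # (B, T, E)
        scale = probs @ self.scales                          # (B, T, D)
        shift = probs @ self.shifts
        return x * scale + shift

x = torch.randn(2, 16, 768)            # token features from a frozen captioner
print(MoScaleShift(768)(x).shape)      # torch.Size([2, 16, 768])
```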



Paperid:856 Poster
Authors:Zhaopeng Gu,Bingke Zhu,Guibo Zhu,Yingying Chen,Hao Li,Ming Tang,Jinqiao Wang
Abstract:
Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly recognition. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both recognition and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset.
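For readers unfamiliar with how ZSAD methods turn text prompts into an anomaly map, the sketch below scores each patch feature against assumed precomputed "normal" and "abnormal" text embeddings and keeps the abnormal probability; FiLo's fine-grained LLM descriptions, Grounding DINO localization, and MMCI module are all omitted.

```python
import numpy as np

# Generic patch-vs-text anomaly scoring used as a baseline mental model for
# ZSAD; inputs are assumed to be precomputed CLIP-style embeddings.
def anomaly_map(patch_feats, normal_txt, abnormal_txt, temp=0.07):
    """patch_feats: (H*W, D); normal_txt, abnormal_txt: (D,)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    p = l2norm(patch_feats)
    sims = np.stack([p @ l2norm(normal_txt), p @ l2norm(abnormal_txt)], axis=-1)
    e = np.exp(sims / temp)
    return e[..., 1] / e.sum(axis=-1)        # per-patch probability of "abnormal"

H, W, D = 16, 16, 512
amap = anomaly_map(np.random.randn(H * W, D),
                   np.random.randn(D), np.random.randn(D)).reshape(H, W)
print(amap.shape)                            # (16, 16)
```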



Paperid:857 Poster
Authors:Xiaofeng Mao,Zhengkai Jiang,Qilin Wang,Chencan Fu,Jiangning Zhang,Jiafu Wu,Yabiao Wang,Chengjie Wang,Wei Li,Mingmin Chi
Abstract:
Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple transformer layers. In an attempt to bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly implements the denoising process on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, the Masked Diffusion Transformer employs a mask modeling scheme specifically designed to strengthen temporal relation learning among gesture sequences, thereby expediting the learning process and leading to coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that diminishes the denoising computation by leveraging previously calculated results, thereby achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6$\times$ faster than traditional diffusion transformers and an inference speed that is 5.7$\times$ faster than the standard diffusion model.



Paperid:858 Poster
Authors:Tianyi Lu,Xing Zhang,Jiaxi Gu,Hang Xu,Renjing Pei,Songcen Xu,Xingjun Ma,Zuxuan Wu
Abstract:
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis. Yet, compared to text-to-image (T2I) editing, text-to-video (T2V) editing suffers from a lack of decent temporal consistency and structure, due to insufficient pre-training data, limited model editability, or extensive tuning costs. To address this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework that achieves high-quality T2V editing by integrating various T2I and T2V LDMs. Specifically, FLDM utilizes a hyper-parameter with an update schedule to effectively fuse image and video latents during the denoising process. This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency, ultimately generating high-quality videos. It is worth noting that FLDM can serve as a versatile plugin, applicable to off-the-shelf image and video LDMs, to significantly enhance the quality of video editing. Extensive quantitative and qualitative experiments on popular T2I and T2V LDMs demonstrate FLDM's superior editing quality than state-of-the-art T2V editing methods.



Paperid:859 Poster
Authors:Jinyong Wen
Abstract:
Enlightened by the InfoMax principle, Graph Contrastive Learning (GCL) has achieved remarkable performance in processing large amounts of unlabeled graph data. Due to the impracticality of precisely calculating mutual information (MI), conventional contrastive methods turn to approximate its lower bound using parametric neural estimators, which inevitably introduces additional parameters and leads to increased computational complexity. Building upon a common Gaussian assumption on the distribution of node representations, a computationally tractable surrogate for the original MI can be rigorously derived, termed Gaussian Mutual Information (GMI). Leveraging multi-view priors of GCL, we induce an efficient contrastive objective based on GMI with performance guarantees, eliminating the reliance on parameterized estimators and negative samples. Decorrelation-based self-supervised learning has emerged as a branch parallel to contrastive-based approaches. By positioning the proposed GMI-based objective as a pivot, we bridge the gap between these two research areas from two aspects, approximate form and consistent solution, which contributes to the advancement of a unified theoretical framework for self-supervised learning. Extensive comparison experiments and visual analysis provide compelling evidence for the effectiveness and efficiency of our method while supporting our theoretical achievements.
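For context, the closed form of mutual information under a joint Gaussian assumption is the standard identity below; the paper's GMI surrogate is presumably built on this identity, though its exact multi-view form may differ.

```latex
% Mutual information of jointly Gaussian X and Y with joint covariance
% \Sigma_{XY} (a standard identity; the paper's GMI objective may refine it):
I(X;Y) = \tfrac{1}{2}\log\frac{\det\Sigma_X \,\det\Sigma_Y}{\det\Sigma_{XY}},
\qquad
\Sigma_{XY} = \begin{pmatrix} \Sigma_X & C \\ C^{\top} & \Sigma_Y \end{pmatrix}.
```

In the scalar case with correlation $\rho$, this reduces to $-\tfrac{1}{2}\log(1-\rho^2)$, which is cheap to evaluate and needs neither a parametric estimator nor negative samples.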



Paperid:860 Poster
Authors:Shibo Hong,Xuhong Zhang,Tianyu Du,Sheng Cheng,Xun Wang,Jianwei Yin
Abstract:
The field of floorplan generation has attracted significant interest from the community. Remarkably, recent progress in methods based on generative models has substantially promoted the development of floorplan generation. However, generating floorplans that satisfy various conditions remains a challenging task. This paper proposes a learning framework, named Cons2Plan, for automatically and high-quality generating vector floorplans from various conditions. The input conditions can be graphs, boundaries, or a combination of both. The conditional diffusion model is the core component of our Cons2Plan. The denoising network uses a conditional embedding module to incorporate the conditions as guidance during the reverse process. Additionally, Cons2Plan incorporates a two-stage approach that generates graph conditions based on boundaries. It utilizes three regression models for node prediction and a novel conditional edge generation diffusion model, named CEDM, for edge generation. We conduct qualitative evaluations, quantitative comparisons, and ablation studies to demonstrate that our method can produce higher-quality floorplans than those generated by state-of-the-art methods.



Paperid:861 Poster
Authors:Zhenhao Yang,Xin Liu,Deqiang Ouyang,Guiduo Duan,Dongyang Zhang,Tao He,Yuan-Fang Li
Abstract:
Open-vocabulary human-object interaction (Ov-HOI) detection aims to identify both base and novel categories of human-object interactions while only base categories are available during training. Existing Ov-HOI methods commonly leverage knowledge distilled from CLIP to extend their ability to detect previously unseen interaction categories. However, our empirical observations indicate that the inherent noise present in CLIP has a detrimental effect on HOI prediction. Moreover, the absence of novel human-object position distributions often leads to overfitting on the base categories within their learned queries. To address these issues, we propose a two-step framework named CaM-LQ, which Calibrates vision-language Models (e.g., CLIP) for open-vocabulary HOI detection with Locality-aware Queries. By injecting fine-grained HOI supervision from the calibrated CLIP into the HOI decoder, our model can achieve the goal of predicting novel interactions. Extensive experimental results demonstrate that our approach performs well in open-vocabulary human-object interaction detection, surpassing state-of-the-art methods across multiple metrics on mainstream datasets and showing superior open-vocabulary HOI detection performance, e.g., a 4.54-point improvement on the HICO-DET dataset over the SoTA CLIP4HOI on the UV task with the same ResNet-50 backbone.



Paperid:862 Poster
Authors:Shengyu Hao,Wenhao Chai,Zhonghan Zhao,Meiqi Sun,Wendi Hu,Jieyang Zhou,Yixian Zhao,Qi Li,Yizhou Wang,Xi Li,Gaoang Wang
Abstract:
The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from ego-centric videos. We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment. Utilizing information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view using a pre-trained 3D scene reconstruction model. Additionally, we have innovated a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. Moreover, the efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, achieving 1.04×–2.90× gains in HOTA, showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.



Paperid:863 Poster
Authors:Yushun Tang,Shuoshuo Chen,Jiyuan Jia,Yi Zhang,Zhihai He
Abstract:
Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.
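A minimal sketch of the described mechanism, under our own assumptions about shapes and module boundaries, is shown below: three conditioning vectors generated from the class token are added to the query, key, and value inputs of a self-attention layer. It is meant as an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical domain-conditioned self-attention block: a small generator maps
# the class token to three domain conditioners that shift q, k and v.
class DomainConditionedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond_gen = nn.Linear(dim, 3 * dim)   # class token -> 3 conditioners

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        cls = tokens[:, :1]                               # (B, 1, D) class token
        dq, dk, dv = self.cond_gen(cls).chunk(3, dim=-1)  # domain conditioners
        q, k, v = tokens + dq, tokens + dk, tokens + dv
        out, _ = self.attn(q, k, v)
        return out

x = torch.randn(2, 197, 768)                     # e.g. a ViT-B/16 token sequence
print(DomainConditionedAttention(768)(x).shape)  # torch.Size([2, 197, 768])
```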



Paperid:864 Poster
Authors:Zichen Wen,Tianyi Wu,Yazhou Ren,Yawen Ling,Chenhang Cui,Xiaorong Pu,Lifang He
Abstract:
Multi-view clustering is an important machine learning task for multi-media data, encompassing various domains such as images, videos, and texts. Moreover, with the growing abundance of graph data, the significance of multi-view graph clustering (MVGC) has become evident. Most existing methods focus on graph neural networks (GNNs) to extract information from both graph structure and feature data to learn distinguishable node representations. However, traditional GNNs are designed with the assumption of homophilous graphs, making them unsuitable for widely prevalent heterophilous graphs. Several techniques have been introduced to enhance GNNs for heterophilous graphs. While these methods partially mitigate the heterophilous graph issue, they often neglect the advantages of traditional GNNs, such as their simplicity, interpretability, and efficiency. In this paper, we propose a novel multi-view graph clustering method based on dual-optimized adaptive graph reconstruction, named DOAGC. It mainly aims at reconstructing the graph structure adapted to traditional GNNs to deal with the heterophilous graph issues while maintaining the advantages of traditional GNNs. Specifically, we first develop an adaptive graph reconstruction mechanism that accounts for node correlation and original structural information. To further optimize the reconstruction graph, we design a dual optimization strategy and demonstrate the feasibility of our optimization strategy through mutual information theory. Numerous experiments demonstrate that DOAGC effectively mitigates the heterophilous graph problem.



Paperid:865 Poster
Authors:Jin Liu,Huaibo Huang,Jie Cao,Ran He
Abstract:
Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We observed that Latent Consistency Models employing consistency distillation can effectively extract representative Consistency Features from noisy images. To blend the Consistency Features extracted from both content and style images, we introduce a Style Enhancement Attention Control technique that meticulously merges content and style features within the attention space of the target image. Moreover, we propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control. Extensive experiments have validated the effectiveness of our proposed framework in enhancing stylization efficiency and fidelity.



Paperid:866 Poster
Authors:Qiang Wang,Yuning Cui,Yawen Li,paulruan,zhuben,Wenqi Ren
Abstract:
Low-light environments introduce high-intensity noise into images. Containing fine details with reduced noise, near-infrared/flash images can serve as guidance to facilitate noise removal. However, existing fusion-based methods fail to effectively suppress artifacts caused by inconsistency between guidance/noisy image pairs and do not fully exploit the useful information contained in guidance images. In this paper, we propose a robust and flexible fusion network (RFFNet) for low-light image denoising. Specifically, we present a multi-scale inconsistency calibration module to address inconsistency before fusion by first mapping the guidance features to multi-scale spaces and calibrating them with the aid of pre-denoising features in a coarse-to-fine manner. Furthermore, we develop a dual-domain adaptive fusion module to adaptively extract useful high-/low-frequency signals from the guidance features and then highlight the informative frequencies. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on NIR-guided RGB image denoising and flash-guided no-flash image denoising.



Paperid:867 Poster
Authors:Yang Liu,Xiang.Huang,Minghan Qin,Qinwei Lin,Haoqian Wang
Abstract:
Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render and not suitable for multi-human scenes with complex shadows. To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time (1/60), less GPU memory (1/4), and faster rendering speed ($7\times$). Our method can be easily extended to multi-human scenes and achieve comparable novel view synthesis results on a scene with ten people in only 25 seconds of training. We will release the code and dataset.
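As a simplified view of how skinned Gaussians can be carried from canonical to posed space, the sketch below applies plain linear blend skinning to Gaussian centers; the paper additionally handles covariances, pose-dependent appearance, and ambient occlusion, which are omitted here, and the skeleton size is an assumption.

```python
import numpy as np

# Hedged sketch: linear blend skinning (LBS) of canonical Gaussian centers.
# The skeleton size and weights are placeholders; covariance deformation and
# appearance modules from the paper are not modeled.
def lbs(points, skin_weights, bone_transforms):
    """points: (N, 3); skin_weights: (N, J); bone_transforms: (J, 4, 4)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    blended = np.einsum('nj,jab->nab', skin_weights, bone_transforms)    # (N, 4, 4)
    posed = np.einsum('nab,nb->na', blended, homo)
    return posed[:, :3]

N, J = 1000, 24                              # e.g. an SMPL-like 24-joint skeleton
pts = np.random.randn(N, 3)                  # canonical Gaussian centers
w = np.random.dirichlet(np.ones(J), size=N)  # per-point skinning weights
T = np.tile(np.eye(4), (J, 1, 1))            # identity pose leaves points unchanged
assert np.allclose(lbs(pts, w, T), pts)
```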



Paperid:868 Poster
Authors:Haicheng Liao,Haoyu Sun,Zhenning Li,HuanmingShen,Chengyue Wang,KaHou Tam,Chunlin Tian,Li Li,Cheng-zhong Xu
Abstract:
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs). This task presents substantial challenges stemming from the unpredictable nature of traffic accidents, their long-tail distribution, the intricacies of traffic scene dynamics, and the inherently constrained field of vision of onboard cameras. To address these challenges, this study introduces a novel accident anticipation framework for AVs, termed CRASH. It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion. Specifically, we develop the object-aware module to prioritize high-risk objects in complex and ambiguous environments by calculating the spatial-temporal relationships between traffic agents. In parallel, the context-aware module is devised to extend global visual information from the temporal to the frequency domain using the Fast Fourier Transform (FFT) and capture fine-grained visual features of potential objects and broader context cues within traffic scenes. To capture a wider range of visual cues, we further propose a multi-layer fusion that dynamically computes the temporal dependencies between different scenes and iteratively updates the correlations between different visual features for accurate and timely accident prediction. Evaluated on real-world datasets—Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), and AnAn Accident Detection (A3D) datasets—our model surpasses existing top baselines in critical evaluation metrics like Average Precision (AP) and mean Time-To-Accident (mTTA). Importantly, its robustness and adaptability are particularly evident in challenging driving scenarios with missing or limited training data, demonstrating significant potential for application in real-world autonomous driving systems.
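To illustrate the temporal-to-frequency step in isolation (the full context-aware module is richer than this), the sketch below applies an FFT along the frame axis of a feature sequence and keeps the low-frequency magnitudes; the feature sizes and the number of retained frequencies are assumptions.

```python
import numpy as np

# Hypothetical frequency-domain context features: FFT over time of per-frame
# visual features, keeping low-frequency magnitudes as global context cues.
def frequency_features(frame_feats: np.ndarray, keep: int = 8) -> np.ndarray:
    """frame_feats: (T, D) -> (keep, D) low-frequency magnitude spectrum."""
    spectrum = np.fft.rfft(frame_feats, axis=0)     # (T//2 + 1, D), complex
    return np.abs(spectrum)[:keep]

feats = np.random.randn(64, 256)                    # 64 frames, 256-dim features
print(frequency_features(feats).shape)              # (8, 256)
```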



Paperid:869 Poster
Authors:Yupeng Zhang,Shuqi Zheng,Ruize Han,Yuzhong Feng,Junhui Hou,Linqi Song,Wei Feng,Liang Wan
Abstract:
One-shot object detection (OSOD) uses a query patch to identify the same category of object in a target image. In the OSOD setting, the target images are required to contain the object category of the query patch, and the image styles (domains) of the query patch and target images are assumed to be similar. However, in practical applications, these requirements are often not satisfied. Therefore, we propose a new problem, namely Cross-Domain Object Search (CDOS), where the object categories of the query patch and target image are decoupled, and the image styles between them may also be significantly different. For this problem, we develop a new method, which incorporates both foreground-background contrastive learning heads and a domain-generalized feature augmentation technique. This makes our method effectively handle the object category gap and domain distribution gap between the query patch and target image in the training and testing datasets. We further build a new benchmark for the proposed CDOS problem, on which our method shows significant performance improvements over the comparison methods.



Paperid:870 Poster
Authors:Bin Wang,Meishan Zhang,Hao Fei,Yu Zhao,Bobo Li,Shengqiong Wu,Wei Ji,Min Zhang
Abstract:
Event extraction (EE) is a critical direction in the field of information extraction, laying an important foundation for the construction of structured knowledge bases. EE from text has received ample research attention for years, yet numerous real-world applications require direct information acquisition from speech signals, online meeting minutes, interview summaries, press releases, etc. While EE from speech has remained under-explored, this paper fills the gap by pioneering SpeechEE, defined as detecting event predicates and arguments from a given speech audio. To benchmark the SpeechEE task, we first construct a large-scale high-quality dataset. Based on textual EE datasets under the sentence, document, and dialogue scenarios, we convert texts into speeches through both manual real-person narration and automatic synthesis, empowering the data with diverse scenarios, languages, domains, ambiances, and speaker styles. Further, to effectively address the key challenges in the task, we tailor an E2E SpeechEE system based on the encoder-decoder architecture, where a novel Shrinking Unit module and a retrieval-aided decoding mechanism are devised. Extensive experimental results on all SpeechEE subsets demonstrate the efficacy of the proposed model, offering a strong baseline for the task. Finally, being the first work on this topic, we shed light on key directions for future research. All our data and codes will be open to the community upon acceptance.



Paperid:871 Poster
Authors:Qianyu Guo,Jieji Ren,Haofen Wang,Tianxing Wu,Weifeng Ge,Wenqiang Zhang
Abstract:
Visual-language models based on CLIP have shown remarkable abilities in general few-shot image classification. However, their performance drops in specialized fields such as healthcare or agriculture, because CLIP's pre-training does not cover all category data. Existing methods excessively depend on the multi-modal information representation and alignment capabilities acquired from CLIP pre-training, which hinders accurate generalization to unfamiliar domains. To address this issue, this paper introduces a novel visual-language collaborative representation network (MCRNet), aimed at acquiring a generalized capability for collaborative fusion and representation of multi-modal information. Specifically, MCRNet learns to generate relational matrices from an information fusion perspective to acquire aligned multi-modal features. This relationship generation strategy is category-agnostic, so it can be generalized to new domains. A class-adaptive fine-tuning inference technique is also introduced to help MCRNet efficiently learn alignment knowledge for new categories using limited data. Additionally, the paper establishes a new broad-domain few-shot image classification benchmark containing seven evaluation datasets from five domains. Comparative experiments demonstrate that MCRNet outperforms current state-of-the-art models, achieving an average improvement of 13.06% and 13.73% in the 1-shot and 5-shot settings, highlighting the superior performance and applicability of MCRNet across various domains.



Paperid:872 Poster
Authors:Tran Dang Trung Duc,Byeongkeun Kang,Yeejin Lee
Abstract:
Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces twin-attention mechanisms to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200, and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods. The code will be released upon paper publication.



Paperid:873 Poster
Authors:Lijun Zhang,Wei Suo,PENG WANG,Yanning Zhang
Abstract:
Human-object interaction (HOI) detection aims at capturing human-object pairs in images and their corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias from the real world, existing methods mostly struggle with rare human-object pairs and lead to suboptimal results. Recently, with the development of generative models, a straightforward approach is to construct a more balanced dataset based on a group of supplementary samples. Unfortunately, there is a significant domain gap between the generated data and the original data, so simply merging the generated images into the original dataset cannot significantly boost performance. To alleviate the above problem, we present a novel model-agnostic framework called the Context-Enhanced Feature Alignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On the one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the issue of losing important context information caused by the traditional discriminator-style alignment method, we employ a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. Extensive experiments have shown that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.



Paperid:874 Poster
Authors:Daqin Luo,Chengjian Feng,Yuxuan Nong,Yiqing Shen
Abstract:
Automated Machine Learning (AutoML) offers a promising approach to streamline the training of machine learning models. However, existing AutoML frameworks are often limited to unimodal scenarios and require extensive manual configuration. Recent advancements in Large Language Models (LLMs) have showcased their exceptional abilities in reasoning, interaction, and code generation, presenting an opportunity to develop a more automated and user-friendly framework. To this end, we introduce AutoM3L, an innovative Automated Multimodal Machine Learning framework that leverages LLMs as controllers to automatically construct multimodal training pipelines. AutoM3L comprehends data modalities and selects appropriate models based on user requirements, providing automation and interactivity. By eliminating the need for manual feature engineering and hyperparameter optimization, our framework simplifies user engagement and enables customization through directives, addressing the limitations of previous rule-based AutoML approaches. We evaluate the performance of AutoM3L on six diverse multimodal datasets spanning classification, regression, and retrieval tasks, as well as a comprehensive set of unimodal datasets. The results demonstrate that AutoM3L achieves competitive or superior performance compared to traditional rule-based AutoML methods. Furthermore, a user study highlights the user-friendliness and usability of our framework, compared to the rule-based AutoML methods. Code is available at: https://anonymous.4open.science/r/anonymization_code.



Paperid:875 Poster
Authors:Chen Feng,Georgios Tzimiropoulos,Ioannis Patras
Abstract:
Learning with Noisy labels (LNL) poses a significant challenge for the Machine Learning community. Some of the most widely used approaches, which select as clean those samples for which the model itself (the in-training model) has high confidence, e.g., 'small loss', can suffer from the so-called 'self-confirmation' bias. This bias arises because the in-training model is at least partially trained on the noisy labels. Furthermore, in the classification case, an additional challenge arises because some of the label noise is between classes that are visually very similar ('hard noise'). This paper addresses these challenges by proposing a method (CLIPCleaner) that leverages CLIP, a powerful Vision-Language (VL) model, for constructing a zero-shot classifier for efficient, offline, clean sample selection. This has the advantage that the sample selection is decoupled from the in-training model and that the sample selection is aware of the semantic and visual similarities between the classes due to the way that CLIP is trained. We provide theoretical justifications and empirical evidence to demonstrate the advantages of CLIP for LNL compared to conventional pre-trained models. Compared to current methods that combine iterative sample selection with various techniques, CLIPCleaner offers a simple, single-step approach that achieves competitive or superior performance on benchmark datasets. To the best of our knowledge, this is the first time a VL model has been used for sample selection to address the problem of Learning with Noisy Labels (LNL), highlighting their potential in the domain.
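The core selection rule can be pictured with a few lines of Python: build a zero-shot classifier from (assumed precomputed) CLIP image and per-class text embeddings and keep samples whose zero-shot prediction agrees with the given label. This is a simplified reading of the idea, not the paper's full criterion.

```python
import numpy as np

# Hedged sketch of offline clean-sample selection with a zero-shot classifier.
# image_feats and text_feats stand in for precomputed CLIP embeddings.
def select_clean(image_feats, text_feats, noisy_labels):
    """image_feats: (N, D); text_feats: (C, D), one prompt embedding per class."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    zero_shot_pred = (img @ txt.T).argmax(axis=1)
    return zero_shot_pred == noisy_labels            # boolean "clean" mask

N, C, D = 8, 10, 512
mask = select_clean(np.random.randn(N, D), np.random.randn(C, D),
                    np.random.randint(0, C, size=N))
print(mask)
```

Because the selector never sees the in-training model, its mistakes are not correlated with that model's own errors, which is the decoupling the abstract emphasizes.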



Paperid:876 Poster
Authors:Yifan Li,Yuhang Bai,Shuai Yang,Jiaying Liu
Abstract:
Language-based image colorization aims to convert grayscale images to plausible and visually pleasing color images with language guidance, enjoying wide applications in historical photo restoration and film industry. Existing methods mainly leverage large language models and diffusion models to incorporate language guidance into the colorization process. However, it is still a great challenge to build accurate correspondence between the gray image and the semantic instructions, leading to mismatched, overflowing and under-saturated colors. In this paper, we introduce a novel coarse-to-fine framework, COlorfulness COntrollable Language-based Colorization (COCO-LC), that effectively reinforces the image-text correspondence with a coarsely colorized results. In addition, a multi-level condition that leverages both low-level and high-level cues of the gray image is introduced to realize accurate semantic-aware colorization without color overflows. Furthermore, we condition COCO-LC with a scale factor to determine the colorfulness of the output, flexibly meeting the different needs of users. We validate the superiority of COCO-LC over state-of-the-art image colorization methods in accurate, realistic and controllable colorization through extensive experiments.



Paperid:877 Poster
Authors:Weiye Xu,Min Wang,Wengang Zhou,Houqiang Li
Abstract:
Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. Compared to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations. We will release the source code to the public.
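The progressive update loop can be sketched as follows, with a toy in-memory database and a stand-in text encoder; the record schema, encoder, and retrieval size are assumptions, and the LLM/agent rollout is reduced to a placeholder string.

```python
import numpy as np

# Conceptual sketch of progressive retrieval-augmented generation: after each
# interaction the experience is written back, so later retrievals can draw on
# previously accumulated, task-specific knowledge. Not the authors' code.
class ExperienceDB:
    def __init__(self):
        self.keys, self.records = [], []

    def add(self, embedding, record):
        self.keys.append(embedding / np.linalg.norm(embedding))
        self.records.append(record)

    def retrieve(self, query, k=3):
        if not self.keys:
            return []
        sims = np.stack(self.keys) @ (query / np.linalg.norm(query))
        return [self.records[i] for i in np.argsort(-sims)[:k]]

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence encoder (stable within one run)."""
    return np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=64)

db = ExperienceDB()
for step, task in enumerate(["find a mug", "heat the mug", "find a mug"]):
    refs = db.retrieve(embed(task))           # experience from earlier iterations
    outcome = f"trajectory-{step}"            # placeholder for the agent rollout
    db.add(embed(task), {"task": task, "outcome": outcome, "n_refs": len(refs)})
print(db.records[-1])
```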



Paperid:878 Poster
Authors:Peibin Chen,Xijin Zhang,Daniel Kang Du
Abstract:
Polygonal meshes are widely used to represent complex geometries. However, the increasing complexity of models often leads to large meshes with millions of triangles, raising significant challenges for storage, transmission, and computation. Mesh simplification, a process of reducing the number of triangles in a mesh while preserving its overall shape and important features, has emerged as an indispensable technique to address these challenges. In this work, we focus on the problem of obtaining a visually consistent ultra-low-polygon mesh for complex meshes. Unlike previous methods, we design a robust simplification framework, SimpliGuard, to handle any meshes in the wild. Firstly, a reconstruction module is used to construct a low-polygon mesh with a similar shape but a manifold topology. Then, a texture initialization module is employed to quickly initialize the entire texture map. After that, a differentiable rendering module is utilized to optimize the overall structure and texture details, ensuring high-quality results. For meshes with skeletons, the correctness of motion can be preserved with our designed motion post-processing module. Experimental results demonstrate that SimpliGuard significantly outperforms previous methods and various featured software, including Blender and Simplygon.



Paperid:879 Poster
Authors:Yitong Sun,Yao Huang,Xingxing Wei
Abstract:
As physical adversarial attacks become extensively applied to unearth potential risks in security-critical scenarios, especially dynamic scenarios, their vulnerability to environmental variations has also been brought to light. The non-robust nature of physical adversarial attack methods consequently leads to unstable performance. Although methods such as Expectation over Transformation (EOT) have enhanced the robustness of traditional contact attacks like adversarial patches, they fall short in practicality and concealment within dynamic environments such as traffic scenarios. Meanwhile, non-contact laser attacks, while offering enhanced adaptability, face constraints due to a limited optimization space for their attributes, rendering EOT less effective. This limitation underscores the necessity of developing a new strategy to augment the robustness of such practices. To address these issues, this paper introduces the Embodied Laser Attack (ELA), a novel framework that leverages the embodied intelligence paradigm of Perception-Decision-Control to dynamically tailor non-contact laser attacks. For the perception module, given the challenge of simulating the victim's view by full-image transformation, ELA develops a local perspective transformation network based on the intrinsic prior knowledge of traffic scenes, which enables effective and efficient estimation. For the decision and control module, ELA trains an attack agent with data-driven reinforcement learning instead of adopting time-consuming heuristic algorithms, making it capable of instantaneously determining a valid attack strategy from the perceived information via well-designed rewards, which is then executed by a controllable laser emitter. Experimentally, we apply our framework to diverse traffic scenarios in both the digital and physical world, verifying the effectiveness of our method under dynamic successive scenes.



Paperid:880 Poster
Authors:Henglei Lv,Jiayu Xiao,Liang Li
Abstract:
Diffusion-based text-to-image personalization has achieved great success in generating subjects specified by users among various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms generative diversity, especially when given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. As for the former, we construct an appearance palette with visual features from the reference image, where we pick local patterns for generating the specified subject with consistent identity. As for layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit the strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.



Paperid:881 Poster
Authors:Jiawei Ge,Jiuxin Cao,Xuelin Zhu,Xinyu Zhang,Chang Liu,Kun Wang,Bo Liu
Abstract:
Vision-Language Tracking (VLT) requires locating a specific target in video sequences, given a natural language prompt and an initial object box. Despite recent advancements, existing approaches heavily rely on expensive and time-consuming human annotations. To mitigate this limitation, directly generating pseudo labels from raw videos seems to be a straightforward solution; however, it inevitably introduces undesirable noise during the training process. Moreover, we insist that an efficient tracker should excel in tracking the target, regardless of the temporal direction. Building upon these insights, we propose the pioneering semi-supervised learning scheme for VLT task, representing a crucial step towards reducing the dependency on high-quality yet costly labeled data. Specifically, drawing inspiration from the natural attributes of a video (i.e., space, time, and semantics), our approach progressively leverages inherent consistencies from these aspects: (1) Spatially, each frame and any object cropped from it naturally form an image-bbox (bounding box) pair for self-training; (2) Temporally, bidirectional tracking trajectories should exhibit minimal differences; (3) Semantically, the correlation between visual and textual features is expected to remain consistent. Furthermore, the framework is validated with a simple yet effective tracker we devised, named ATTracker (Asymmetrical Transformer Tracker). It modifies the self-attention operation in an asymmetrical way, striving to enhance target-related features while suppressing noise. Extensive experiments confirm that our ATTracker serves as a robust baseline, outperforming fully supervised base trackers. By unveiling the potential of learning with limited annotations, this study aims to attract attention and pave the way for Semi-supervised Vision-Language Tracking (SS-VLT).



Paperid:882 Poster
Authors:Shidi Chen,Lili Wei,Liqian Liang,Congyan Lang
Abstract:
3D Object Detection (3DOD) aims to accurately locate and identify 3D objects in point clouds, facing the challenge of balancing model performance with computational efficiency. Knowledge distillation emerges as a vital method for model compression in 3DOD, transferring knowledge from complex, larger models to smaller, efficient ones. However, the effectiveness of these methods is constrained by the intrinsic sparsity and structural complexity of point clouds. In this paper, we propose a novel methodology termed Joint Homophily and Heterophily Relational Knowledge Distillation (H2RKD) to distill robust relational knowledge in point clouds, thereby enhancing intra-object similarity and refining inter-object distinction. This unified strategy encompasses the integration of Collaborative Global Distillation (CGD) for distilling global relational knowledge across both distance and angular dimensions, and Separate Local Distillation (SLD) for a focused distillation of local relational dynamics. By seamlessly leveraging the relational dynamics within point clouds, H2RKD facilitates comprehensive knowledge transfer, significantly advancing 3D object detection capabilities. Extensive experiments on the KITTI and nuScenes datasets demonstrate the effectiveness of the proposed H2RKD.
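
As a rough illustration of distilling relational knowledge across distance and angular dimensions, the sketch below implements generic distance-wise and angle-wise relation losses between teacher and student object embeddings. It shows only the general relational-distillation pattern, not the paper's CGD/SLD modules, and the loss weights are arbitrary choices.

```python
# Generic distance-wise and angle-wise relational distillation losses: the student is
# trained to mimic the teacher's pairwise-distance and triplet-angle structure rather
# than its raw features. Illustrative sketch only.
import torch
import torch.nn.functional as F

def pairwise_distances(x: torch.Tensor) -> torch.Tensor:
    """Normalized pairwise L2 distances between embeddings of shape (N, D)."""
    d = torch.cdist(x, x, p=2)
    mean = d[d > 0].mean().clamp_min(1e-6)
    return d / mean

def angle_relations(x: torch.Tensor) -> torch.Tensor:
    """Cosine of the angle at anchor i between directions toward j and k, shape (N, N, N)."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)           # diff[i, j] = x[i] - x[j]
    diff = F.normalize(diff, dim=-1)
    return torch.einsum("ijd,ikd->ijk", diff, diff)

def relational_kd_loss(student_feats, teacher_feats, w_dist=1.0, w_angle=2.0):
    l_dist = F.smooth_l1_loss(pairwise_distances(student_feats),
                              pairwise_distances(teacher_feats))
    l_angle = F.smooth_l1_loss(angle_relations(student_feats),
                               angle_relations(teacher_feats))
    return w_dist * l_dist + w_angle * l_angle

# Toy usage: 8 object embeddings from a teacher and a student.
t = torch.randn(8, 256)
s = torch.randn(8, 256, requires_grad=True)
loss = relational_kd_loss(s, t)
loss.backward()
```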



Paperid:883 Poster
Authors:Qi Xu,Yaxin Li,Xuanye Fang,Jiangrong Shen,Qiang Zhang,Gang Pan
Abstract:
Spiking neural networks (SNNs) have superb characteristics in sensory information recognition tasks due to their biological plausibility. However, the performance of some current spiking-based models is limited by their structures: either fully connected or overly deep structures introduce too much redundancy. This redundancy from both connections and neurons is one of the key factors hindering the practical application of SNNs. Although some pruning methods have been proposed to tackle this problem, they typically ignore the fact that the neural topology in the human brain can be adjusted dynamically. Inspired by this, this paper proposes an evolutionary-based structure construction method for constructing more reasonable SNNs. By integrating knowledge distillation and connection pruning, the synaptic connections in SNNs can be optimized dynamically to reach an optimal state. As a result, the structure of SNNs can not only absorb knowledge from the teacher model but also search for a deep yet sparse network topology. Experimental results on CIFAR-100, Tiny-ImageNet and DVS-Gesture show that the proposed structure learning method achieves competitive performance while reducing connection redundancy. The proposed method explores a novel dynamic way of learning structures from scratch in SNNs, which could build a bridge to close the gap between deep learning and bio-inspired neural dynamics.



Paperid:884 Poster
Authors:Daiqing Wu,Dongbao Yang,Yu Zhou,Can Ma
Abstract:
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive accomplishments in simultaneously harnessing image and text information, they lack consideration of possible low-quality and missing modalities. In real-world applications, these issues might frequently occur, leading to urgent needs for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.
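
The feature-queue idea can be pictured as a FIFO bank per modality from which a simple Gaussian estimate of the feature distribution is derived. The quality weighting and recovery below are illustrative choices under that assumption, not the paper's exact formulation.

```python
# Per-modality feature queue: approximate the modality's feature distribution with a
# FIFO of recent features, then (i) down-weight features that lie far from the
# distribution and (ii) sample a surrogate when the modality is missing.
from collections import deque
import torch

class FeatureQueue:
    def __init__(self, dim: int, maxlen: int = 1024):
        self.queue = deque(maxlen=maxlen)
        self.dim = dim

    def push(self, feats: torch.Tensor):             # feats: (B, dim)
        for f in feats.detach():
            self.queue.append(f)

    def stats(self):
        bank = torch.stack(list(self.queue))          # (Q, dim)
        return bank.mean(0), bank.std(0).clamp_min(1e-6)

    def quality_weight(self, feats: torch.Tensor) -> torch.Tensor:
        """Higher weight for features close to the estimated distribution."""
        mean, std = self.stats()
        z = ((feats - mean) / std).pow(2).mean(dim=1)  # (B,) average squared z-score
        return torch.exp(-z)                           # weights in (0, 1]

    def sample_recovery(self, batch_size: int) -> torch.Tensor:
        """Draw surrogate features from a Gaussian fit to the queue (missing modality)."""
        mean, std = self.stats()
        return mean + std * torch.randn(batch_size, self.dim)

# Toy usage: an image-modality queue weights a noisy batch and recovers a missing one.
img_queue = FeatureQueue(dim=128)
img_queue.push(torch.randn(256, 128))
weights = img_queue.quality_weight(torch.randn(8, 128) * 3.0)  # noisy batch -> low weights
recovered = img_queue.sample_recovery(batch_size=8)
```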



Paperid:885 Poster
Authors:Shanshan Wang,ALuSi,Xun Yang,Ke Xu,Huibin Tan,Xingyi Zhang
Abstract:
The domain generalization (DG) task aims to learn a robust model from source domains that can handle the out-of-distribution (OOD) issue. In order to improve the generalization ability of the model in unseen domains, increasing the diversity of training samples is an effective solution. However, existing augmentation approaches have some limitations. On the one hand, the augmentation in most DG methods is insufficient, as the model may not see perturbed features that approximate the worst case due to randomness, so the transferability of features cannot be fully explored. On the other hand, the causality in discriminative features is not involved in these methods, which harms the generalization of the model due to spurious correlations. To address these issues, we propose a Dual-stream Feature Augmentation (DFA) method that constructs hard features from two perspectives. Firstly, to improve transferability, we construct targeted features with a domain-related augmentation manner. Through the guidance of uncertainty, some hard cross-domain fictitious features are generated to simulate domain shift. Secondly, to take causality into consideration, the spuriously correlated non-causal information is disentangled by an adversarial mask, and more discriminative features can then be extracted from the remaining hard causal information. Different from previous fixed synthesizing strategies, the two augmentations are integrated into a unified learnable model with a disentangled feature strategy. Based on these hard features, contrastive learning is employed to keep the semantics consistent and improve the robustness of the model. Extensive experiments on several datasets demonstrate that our approach achieves state-of-the-art performance for domain generalization.



Paperid:886 Poster
Authors:Shixuan Gao,Pingping Zhang,Tianyu Yan,Huchuan Lu
Abstract:
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Existing methods on SOD utilize various Transformer-based models for feature extraction. However, due to the scale of training datasets and training methods, these Transformer-based models still lack performance and generalization in segmentation. Segment Anything Model (SAM) is trained on a large-scale segmentation dataset, which gives it strong generalization and segmentation capabilities. Nonetheless, SAM requires accurate prompts of target objects, which are unavailable in SOD. Additionally, SAM lacks the utilization of multi-scale and multi-layer information, as well as the incorporation of fine-grained details. In order to apply SAM to SOD and address its shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM). Specifically, we introduce a Lightweight Multi-scale Adapter (LMSA), which allows SAM to learn multi-scale information with few trainable parameters. Moreover, we propose a Multi-Layer Fusion Block (MLFB) to comprehensively utilize the multi-layer information from the SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to incorporate SAM with fine-grained details. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization to other segmentation tasks. The source code will be publicly available.
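
The adapter idea can be pictured with the common bottleneck-adapter pattern: freeze the encoder block and train only a small down-project/activation/up-project residual path. The sketch below shows only this generic pattern, with a stand-in block; the multi-scale design of LMSA and the other modules are not reproduced here.

```python
# Generic bottleneck adapter on top of a frozen transformer block: only the adapter's
# few parameters are trainable; zero-initializing the up-projection keeps the frozen
# model's behavior at the start of training. Stand-in block for illustration.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity: adapter output is initially zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (B, N, dim) token features
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wrap a frozen block; only the adapter's parameters require gradients."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Toy usage with a stand-in "encoder block".
block = nn.Sequential(nn.LayerNorm(256), nn.Linear(256, 256))
adapted = AdaptedBlock(block, dim=256)
out = adapted(torch.randn(2, 196, 256))
```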



Paperid:887 Poster
Authors:Yipo Huang,Xiangfei Sheng,Zhichao Yang,Quan Yuan,Zhichao Duan,Pengfei Chen,Leida Li,Weisi Lin,Guangming Shi
Abstract:
The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, resulting in MLLMs falling short of aesthetics perception capabilities. To address the above challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K human natural-language feedback entries, which are collected via progressive questions, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune open-sourced general foundation models, achieving multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. The dataset, code and models will be made publicly available.



Paperid:888 Poster
Authors:Mingjin Zhang,Longyi Li,Wenxuan SHI,Jie Guo,Yunsong Li,Xinbo Gao
Abstract:
Snapshot spectral compressive imaging can capture spectral information across multiple wavelengths in a single shot. Coded aperture snapshot spectral imaging (CASSI) aims to recover 3D spectral cubes from 2D measurements. Most existing methods employ a deep unfolding framework based on Transformers, which alternately addresses a data subproblem and a prior subproblem. However, these frameworks lack flexibility regarding the sensing matrix and inter-stage interactions. In addition, the quadratic computational complexity of global Transformers and the restricted receptive field of local Transformers impact reconstruction efficiency and accuracy. In this paper, we propose a dynamic deep unfolding network with Mamba for compressive spectral imaging, called VmambaSCI. We integrate spatial-spectral information of the sensing matrix into the data module and employ spatially adaptive operations in the stage interaction of the prior module. Furthermore, we develop a dual-domain scanning Mamba (DSMamba), featuring a novel spatial-channel scanning method for enhanced efficiency and accuracy. To our knowledge, this is the first Mamba-based model for compressive spectral imaging. Experimental results on the public database demonstrate the superiority of the proposed VmambaSCI over the state-of-the-art approaches.



Paperid:889 Poster
Authors:Zhenyu Zhang,Guangyao Chen,Yixiong Zou,Yuhua Li,Ruixuan Li
Abstract:
Few-shot open-set recognition (FSOR) is a challenging task that requires a model to recognize known classes and identify unknown classes with limited labeled data. Existing approaches, particularly Negative-Prototype-Based methods, generate negative prototypes based solely on known class data. However, as the unknown space is infinite while the known space is limited, these methods suffer from limited representation capability. To address this limitation, we propose a novel approach, termed Diversified Negative Prototypes Generator (DNPG), which adopts the principle of "learning unknowns from unknowns." Our method leverages the unknown space information learned from base classes to generate more representative negative prototypes for novel classes. During the pre-training phase, we learn the unknown space representation of the base classes. This representation, along with inter-class relationships, is then utilized in the meta-learning process to construct negative prototypes for novel classes. To prevent prototype collapse and ensure adaptability to varying data compositions, we introduce the Swap Alignment (SA) module. Our DNPG model, by learning from the unknown space, generates negative prototypes that cover a broader unknown space, thereby achieving state-of-the-art performance on three standard FSOR datasets. We provide the source code in the supplementary materials for reproducibility.



Paperid:890 Poster
Authors:Yuanbin Fu,Jie Ying,Houlei Lv,Xiaojie Guo
Abstract:
Most previous camouflaged object detection methods heavily lean upon large-scale manually-labeled training samples, which are notoriously difficult to obtain. Even worse, the reliability of labels is compromised by the inherent challenges in accurately annotating concealed targets that exhibit high similarities with their surroundings. To overcome these shortcomings, this paper develops the first semi-supervised camouflaged object detection framework, which requires merely a small number of samples, even with noisy/incorrect annotations. Specifically, on the one hand, we introduce an innovative pixel-level loss re-weighting technique to reduce possible negative impacts from imperfect labels, through a window-based voting strategy. On the other hand, we take advantage of ensemble learning to explore robust features against noises/outliers, thereby generating relatively reliable pseudo labels for unlabelled images. Extensive experiments on benchmark datasets have been conducted to verify the effectiveness of our design. Our codes will be made publicly available.
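
One way to picture pixel-level loss re-weighting with window-based voting is to measure, for each pixel, how often the hard prediction agrees with the (possibly noisy) label inside a local window, and to scale the loss by that agreement ratio. The window size and the use of the ratio itself as the weight are illustrative assumptions, not the paper's exact scheme.

```python
# Window-based voting for loss re-weighting: local agreement between the binarized
# prediction and the label acts as a per-pixel weight, so pixels in disagreeing
# neighborhoods (likely mislabeled) contribute less to the loss.
import torch
import torch.nn.functional as F

def window_vote_weights(pred: torch.Tensor, label: torch.Tensor, win: int = 11) -> torch.Tensor:
    """pred, label: (B, 1, H, W) in [0, 1]. Returns per-pixel weights in [0, 1]."""
    agree = (pred.round() == label.round()).float()               # 1 where hard predictions match
    votes = F.avg_pool2d(agree, win, stride=1, padding=win // 2)  # local agreement ratio
    return votes

def reweighted_bce(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    w = window_vote_weights(pred.detach(), label)
    loss = F.binary_cross_entropy(pred, label, reduction="none")
    return (w * loss).sum() / w.sum().clamp_min(1e-6)

# Toy usage: a random prediction against a noisy mask.
pred = torch.sigmoid(torch.randn(2, 1, 64, 64))
label = (torch.rand(2, 1, 64, 64) > 0.7).float()
print(reweighted_bce(pred, label).item())
```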



Paperid:891 Poster
Authors:Xiangping Zheng,Xiuxin Hao,Bo Wu,Xigang Bao,Xuan Zhang,Wei Li,Xun Liang
Abstract:
Graph Contrastive Learning (GCL) applied in real-world scenarios aims to alleviate label scarcity by harnessing graph structures to disseminate labels from a limited set of labeled data to a broad spectrum of unlabeled data. Recent advancements in amalgamating neural network capabilities with graph structures have demonstrated promising progress. However, prevalent GCL methodologies often overlook the fundamental issue of semi-supervised learning (SSL), relying on uniform negative sample selection schemes such as random sampling, thus yielding suboptimal performance in such contexts. To address this challenge, we present GraphSaSe, a tailored approach designed specifically for graph representation tasks. Our model consists of two pivotal components: a Graph Contrastive Learning Framework (GCLF) and a Selection Distribution Generator (SDG) propelled by reinforcement learning to derive selection probabilities. We introduce an innovative strategy whereby the divergence between positive graph representations is translated into a reward mechanism, dynamically guiding the selection of negative samples during training. This adaptive methodology aims to minimize the divergence between augmented positive pairs, thereby enriching graph representation learning for downstream applications. Comprehensive experimentation across diverse real-world datasets validates the effectiveness of our algorithm, positioning it favorably against contemporary state-of-the-art methodologies.



Paperid:892 Poster
Authors:Yang Zhao,Gangwei Xu,Gang Wu
Abstract:
Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making these methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy to separate the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information for HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while ensuring high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and real-world 4K (2160 × 3840) resolution images. Extensive experiments show that our HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy.
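
The memory argument can be made concrete with a small sketch: the all-pairs correlation of two feature maps is reduced to two k-channel volumes by collapsing one target dimension and keeping the top-k responses along the other, so storage drops from O((HW)^2) to O(kHW). This shows only the general Top-k reduction idea under assumed shapes; HCVFlow's exact construction, its local 4D volume, and chunked computation are omitted.

```python
# Reduce a 4D all-pairs cost volume to two global 3D volumes with a Top-k strategy.
# Note: the full 4D correlation is materialized here for clarity; a practical
# implementation would compute the top-k responses chunk-wise to keep memory low.
import torch

def hybrid_cost_volumes(feat1: torch.Tensor, feat2: torch.Tensor, k: int = 8):
    """feat1, feat2: (B, C, H, W). Returns two 3D volumes of shape (B, k, H, W) each."""
    B, C, H, W = feat1.shape
    f1 = feat1.flatten(2)                                   # (B, C, H*W)
    f2 = feat2.flatten(2)                                   # (B, C, H*W)
    corr = torch.einsum("bci,bcj->bij", f1, f2) / C ** 0.5  # (B, H1*W1, H2*W2)
    corr = corr.view(B, H * W, H, W)

    # Collapse one target dimension, then keep only the k strongest responses.
    top_u, _ = corr.max(dim=2)[0].topk(k, dim=-1)   # collapse H2, top-k over W2: (B, H*W, k)
    top_v, _ = corr.max(dim=3)[0].topk(k, dim=-1)   # collapse W2, top-k over H2: (B, H*W, k)

    vol_u = top_u.permute(0, 2, 1).reshape(B, k, H, W)
    vol_v = top_v.permute(0, 2, 1).reshape(B, k, H, W)
    return vol_u, vol_v   # O(k*H*W) memory each instead of O(H*W*H*W)

f1, f2 = torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64)
u, v = hybrid_cost_volumes(f1, f2)
print(u.shape, v.shape)   # torch.Size([1, 8, 48, 64]) twice
```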



Paperid:893 Poster
Authors:Zishuo Wang,Wenhao Zhou,Jinglin Xu,Yuxin Peng
Abstract:
Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLM) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in the gap between image and region representations. Directly using CLIP for OVD causes inaccurate region classification. We find the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations can align better with text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves the classification accuracy for regions by addressing the gap between images and regions caused by shape deformation. SIA-OVD achieves substantial improvements over representative methods on the COCO-OVD benchmark.



Paperid:894 Poster
Authors:Wangguandong Zheng,Haifeng Xia,Rui Chen,Libo Sun,Ming Shao,Siyu Xia,Zhengming Ding
Abstract:
Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access such enriched color inputs in practical applications, where only sketches are available. Existing sketch-to-3D research suffers from limited applicability due to the challenges of lacking color information and multi-view content. To overcome these limitations, this paper proposes a novel generation paradigm, Sketch3D, to generate realistic 3D assets with shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in the reference image through a shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize the 3D Gaussians, i.e., structural optimization via a distribution transfer mechanism, color optimization with a straightforward MSE loss, and sketch similarity optimization with a CLIP-based geometric similarity loss. Extensive visual comparisons and quantitative analysis illustrate the advantage of our Sketch3D in generating realistic 3D assets while preserving consistency with the input.



Paperid:895 Poster
Authors:Hefei Huang,Xu Jia,Xinyu Zhang,Shengming Li,Huchuan Lu
Abstract:
Many consumer cameras with rolling-shutter (RS) CMOS sensors suffer from undesired distortion and artifacts, particularly when objects experience fast motion. Neuromorphic event cameras, with high temporal resolution events, can bring substantial benefit to the RS correction process. In this work, we explore the characteristics of RS images and event data for the design of a rolling shutter correction (RSC) model. Specifically, the relationship between RS images and event data is modeled by incorporating time encoding into the computation of cross-attention in the transformer encoder to achieve time-aware multi-modal information fusion. Features from RS images enhanced by event data are adopted as keys and values in the transformer decoder, providing the source for appearance, while features from event data enhanced by RS images are adopted as queries, providing spatial transition information. By embedding the time information of the desired GS image into the query, the transformer with deformable attention is capable of producing the target GS image. To enhance the model's generalization ability, we propose to further self-supervise the model by cycling between the time coordinate systems corresponding to RS images and GS images. Extensive evaluations over both synthetic and real datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches.



Paperid:896 Poster
Authors:Keming Wu,Man Yao,Yuhong Chou,Xuerui Qiu,Rui Yang,Bo XU,Guoqi Li
Abstract:
Spiking Neural Networks (SNNs) have received widespread attention due to their unique neuronal dynamics and low-power nature. Previous research empirically shows that SNNs with Poisson coding are more robust than Artificial Neural Networks (ANNs) on small-scale datasets. However, it is still unclear in theory how the adversarial robustness of SNNs is derived, and whether SNNs can still maintain their adversarial robustness advantage on large-scale dataset tasks. This work theoretically demonstrates that SNNs' inherent adversarial robustness stems from their Poisson coding. We reveal the conceptual equivalence of Poisson coding and randomized smoothing in defense strategies, and analyze in depth the trade-off between accuracy and adversarial robustness in SNNs via the proposed Randomized Smoothing Coding (RSC) method. Experiments demonstrate that the proposed RSC-SNNs show remarkable adversarial robustness, surpassing ANNs and achieving state-of-the-art robustness results on the large-scale ImageNet dataset. Our open-source implementation code is available at https://github.com/KemingWu/RSC-SNN.
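
The Poisson coding the analysis builds on can be sketched in a few lines: each pixel intensity becomes the firing probability of a Bernoulli spike train, and averaging spikes over time recovers a noisy estimate of the input. This shows only the standard rate-coding step, not the proposed RSC training method.

```python
# Standard Poisson (rate) coding: intensities in [0, 1] drive independent Bernoulli
# spikes at each time-step; the per-step randomness is the source of the input noise
# that the paper connects to randomized smoothing.
import torch

def poisson_encode(images: torch.Tensor, time_steps: int = 8) -> torch.Tensor:
    """images: (B, C, H, W) with values in [0, 1]. Returns spikes of shape (T, B, C, H, W)."""
    probs = images.clamp(0.0, 1.0).unsqueeze(0).expand(time_steps, *images.shape)
    return (torch.rand(probs.shape) < probs).float()  # spike with probability = intensity

def rate_decode(spikes: torch.Tensor) -> torch.Tensor:
    """Averaging over time recovers a noisy estimate of the input."""
    return spikes.mean(dim=0)

x = torch.rand(2, 3, 32, 32)
spikes = poisson_encode(x, time_steps=16)
recon = rate_decode(spikes)
print(spikes.shape, (recon - x).abs().mean().item())  # reconstruction error shrinks as T grows
```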



Paperid:897 Poster
Authors:Yun Xing,Qing Guo,Xiaofeng Cao,Ivor Tsang,Lei Ma
Abstract:
Repairing deep neural networks (DNNs) to maintain their performance during deployment presents significant challenges due to the potential occurrence of unknown but common environmental corruptions. Most existing DNN repair methods only focus on repairing the DNN for each corruption separately, lacking the ability to generalize to the myriad corruptions arising from an ever-changing deployment environment. In this work, we propose to repair DNNs from a novel perspective, i.e., Learning to Repair (L2R), where the repairing of a target DNN is realized as a general learning-to-learn, a.k.a. meta-learning, process. Specifically, observing that different corruptions are correlated in their data distributions, we propose to utilize previous DNN repair experiences as tasks for meta-learning how to repair the target corruption. Through meta-learning from different tasks, L2R learns meta-knowledge that summarizes how the DNN is repaired under various environmental corruptions. The meta-knowledge essentially serves as a general repairing prior which enables the DNN to quickly adapt to unknown corruptions, thus making our method generalizable to different types of corruptions. Practically, L2R benefits DNN repair with a general pipeline, yet tailoring meta-learning for repairing DNNs is not trivial. By re-designing the meta-learning components under the DNN repair context, we further instantiate the proposed L2R strategy into a concrete model named MetaRepair with a pragmatic assumption of experience availability. We conduct comprehensive experiments on the corrupted CIFAR-10 and tiny-ImageNet by applying MetaRepair to repair DenseNet, ConvNeXt and VAN. The experimental results confirm the superior repairing and generalization capability of our proposed L2R strategy under various environmental corruptions.
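
To make the learning-to-learn framing concrete, the sketch below runs a generic first-order meta-learning (Reptile-style) loop in which each task is a short fine-tune on one corruption, and the meta-parameters are nudged toward each task's repaired parameters. This is a generic stand-in under stated assumptions (synthetic loaders, an SGD inner loop), not the MetaRepair design itself.

```python
# Generic first-order meta-learning over "repair tasks": the inner loop repairs a copy
# of the model on one corruption; the outer loop moves the shared parameters toward
# each repaired copy, accumulating a repairing prior across corruptions.
import copy
import torch

def inner_repair(model, task_loader, lr=1e-3, steps=5):
    """Fine-tune a copy of the model on one corruption (one repair 'experience')."""
    repaired = copy.deepcopy(model)
    opt = torch.optim.SGD(repaired.parameters(), lr=lr)
    crit = torch.nn.CrossEntropyLoss()
    it = iter(task_loader)
    for _ in range(steps):
        x, y = next(it)
        opt.zero_grad()
        crit(repaired(x), y).backward()
        opt.step()
    return repaired

def meta_repair(model, corruption_loaders, meta_lr=0.1, rounds=100):
    """Reptile-style update: move meta-parameters toward each task's repaired parameters."""
    for _ in range(rounds):
        for loader in corruption_loaders:
            repaired = inner_repair(model, loader)
            with torch.no_grad():
                for p, q in zip(model.parameters(), repaired.parameters()):
                    p.add_(meta_lr * (q - p))
    return model

# Toy usage with synthetic "corruption" loaders.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
def toy_loader():
    while True:
        yield torch.randn(16, 3, 8, 8), torch.randint(0, 10, (16,))
meta_repair(model, [toy_loader(), toy_loader()], rounds=2)
```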



Paperid:898 Poster
Authors:Hong Chen,Xin Wang,Yipeng Zhang,Yuwei Zhou,Zeyang Zhang,Siao Tang,Wenwu Zhu
Abstract:
Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject, suffering from subject-missing and attribute-binding problems when applied to multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle these problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the pretrained model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate that our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.



Paperid:899 Poster
Authors:Jingyu Lin,Guiqin Zhao,Jing Xu,Guoli Wang,Zejin Wang,Antitza Dantcheva,Lan Du,Cunjian Chen
Abstract:
The thermal-to-visible (T2V) face translation task is essential for enabling face verification in low-light or dark conditions by converting thermal infrared faces into their visible counterparts. However, this task faces two primary challenges. First, the inherent differences between the modalities hinder the effective use of thermal information to guide RGB face reconstruction. Second, translated RGB faces often lack the identity details of the corresponding visible faces, such as skin color. To tackle these challenges, we introduce DiffTV, the first Latent Diffusion Model (LDM) specifically designed for T2V facial image translation with a focus on preserving identity. Our approach proposes a novel heterogeneous feature alignment strategy that bridges the modal gap and extracts both coarse- and fine-grained identity features consistent with visible images. Furthermore, a dual-stage condition injection strategy introduces control information to guide identity-preserved translation. Experimental results demonstrate the superior performance of DiffTV, particularly in scenarios where maintaining identity integrity is critical.



Paperid:900 Poster
Authors:Zibo Ma,Bo Zhang,Zheng Zhang,Wu Liu,Wufan Wang,Hui Gao,Wendong Wang
Abstract:
Multi-planar and multi-slice magnetic resonance imaging (MRI) can provide more comprehensive 3D structural information for disease diagnosis. However, compared to multi-source MRI, multi-planar MRI uses almost the same scanning parameters but scans different internal structures. This atypical domain difference may lead to poor performance of traditional domain generalization methods in handling multi-planar MRI, especially when MRI from different planes also comes from different sources. In this paper, we introduce ADDG, an Adaptive Domain Generalization Framework tailored for accurate cross-plane MRI segmentation. ADDG significantly mitigates the impact of information loss caused by slice spacing by incorporating 3D shape constraints of the segmentation target, and better clarifies the feature differences between different planes of data through an adaptive data partitioning strategy. Specifically, we propose a mesh deformation-based organ segmentation network to simultaneously delineate the 2D boundary and 3D mask of the prostate, as well as to guide more accurate mesh deformation. We also develop an organ-specific mesh template and employ Loop subdivision for unpooling new vertices to a triangular mesh to guide the mesh deformation task, resulting in smoother organ shapes. Furthermore, we design a flexible meta-learning paradigm that adaptively partitions data domains based on invariant learning, which can learn domain-invariant features from multi-source training sets to further enhance the generalization ability of the model. Experimental results show that our approach outperforms several medical image segmentation, single-planar-based 3D shape reconstruction, and domain generalization methods.



Paperid:901 Poster
Authors:Yiluo Wei,Gareth Tyson
Abstract:
In the last two years, Artificial Intelligence Generated Content (AIGC) has received significant attention, leading to an anecdotal rise in the amount of AIGC being shared via social media platforms. The impact of AIGC and its implications are of key importance to social platforms, e.g., regarding the implementation of policies, community formation, and algorithmic design. Yet, to date, we know little about how the arrival of AIGC has impacted the social media ecosystem. To fill this gap, we present a comprehensive study of Pixiv, an online community for artists who wish to share and receive feedback on their illustrations. Pixiv hosts over 100 million artistic submissions and receives more than 1 billion page views per month (as of 2023). Importantly, it allows both human and AI generated content to be uploaded. Exploiting this, we perform the first analysis of the impact that AIGC has had on the social media ecosystem, through the lens of Pixiv. Based on a dataset of 15.2 million posts (including 2.4 million AI-generated images), we measure the impact of AIGC on the Pixiv community, as well as the differences between AIGC and human-generated content in terms of content creation and consumption patterns. Our results offer key insight to how AIGC is changing the dynamics of social media platforms like Pixiv.



Paperid:902 Poster
Authors:Kunyu Peng,David Schneider,Alina Roitberg,Kailun Yang,Jiaming Zhang,Chen Deng,Kaiyu Zhang,M. Saquib Sarfraz,Rainer Stiefelhagen
Abstract:
In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE), aiming at identifying active muscle regions during physical activity in the wild. To this end, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens the vistas to multiple video-based applications in sports and rehabilitation medicine under flexible environment constraints. The proposed MuscleMap dataset is constructed from YouTube videos, specifically targeting High-Intensity Interval Training (HIIT) physical exercise in the wild. To make the AMGE model applicable in real-life situations, it is crucial to ensure that the model can generalize well to numerous types of physical activities not present during training and involving new combinations of activated muscles. To achieve this, our benchmark also covers an evaluation setting where the model is exposed to activity types excluded from the training set. Our experiments reveal that the generalizability of existing architectures adapted for the AMGE task remains a challenge. Therefore, we also propose a new approach, TransM3E, which employs a multi-modality feature fusion mechanism between the video transformer model and the skeleton-based graph convolution model with novel cross-modal knowledge distillation executed on multi-classification tokens. The proposed method surpasses all popular video classification models when dealing with both previously seen and new types of physical activities. The contributed dataset and code will be publicly available.



Paperid:903 Poster
Authors:Yuzhihuang,Chenxin Li,ZiXu Lin,Hengyu Liu,haote xu,Yifan Liu,Yue Huang,Xinghao Ding,Xiaotong Tu,Yixuan Yuan
Abstract:
The ability to predict multiple potential outputs for a single input can significantly address visual ambiguity, such as diverse semantic segmentation annotations for a medical image provided by different experts. Existing methods employ various advanced probabilistic modeling techniques to model the ambiguous prediction, while they often struggle to fit the underlying distribution for multiple outputs when only a limited number of ambiguously labeled data is available, which is usually the case in real-world applications. To overcome these challenges, we propose a framework, termed P²SAM, that leverages the prior knowledge of foundation models when segmenting ambiguous objects. We delve into an inherent disadvantage of SAM, i.e., the sensitivity of the output to prompts, and turn it into an advantage for ambiguous segmentation by introducing a prompt generation module. Experimental results demonstrate that by utilizing only a small number of doctor-annotated ambiguous samples, our strategy significantly enhances the precision and diversity of medical segmentation. In rigorous benchmarking experiments against cutting-edge methods, our method achieves increased segmentation precision and diversified outputs with even fewer training data (5.5% sample, +12% $D_{max}$). P²SAM signifies a steady step towards the practical deployment of probabilistic models in real-world data-limited scenarios.



Paperid:904 Poster
Authors:Zhongnian Li,Meng Wei,Peng Ying,Tongfeng Sun,Xinzheng Xu
Abstract:
Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage, as shown in Figure 1, by specifying a none option and some randomly sampled insensitive labels as the concealed label set used to annotate sensitive data. In this paper, an unbiased estimator can be established from concealed data under mild assumptions, and the learned multi-class classifier can not only accurately classify instances from insensitive labels but also recognize instances from sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels on synthetic and real-world datasets.



Paperid:905 Poster
Authors:Chencan Fu,Yabiao Wang,Jiangning Zhang,Zhengkai Jiang,Xiaofeng Mao,Jiafu Wu,Weijian Cao,Chengjie Wang,Yanhao Ge,Yong Liu
Abstract:
Co-speech gesture generation is an essential task for producing synchronized and realistic human gestures that accompany speech, playing a vital role in the animation of lifelike avatars for virtual environments. While diffusion models have shown impressive co-speech gesture generative capabilities, current approaches often fail to consider a wide range of modalities and do not provide an in-depth analysis of their interactions, which may result in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework that integrates a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block leverages the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. On the other hand, SEAD adeptly combines audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates a significant improvement in Fréchet Gesture Distance (FGD) and a marked enhancement in diversity scores and beat alignment, achieving state-of-the-art performance in co-speech gesture generation.



Paperid:906 Poster
Authors:Hu Gao,Bowen Ma,Ying Zhang,Jingfan Yang,Jing Yang,Depeng Dang
Abstract:
Image deblurring aims to restore a high-quality image from its blurred counterpart. The emergence of CNNs and Transformers has enabled significant progress. However, these methods often face a dilemma between eliminating long-range degradation perturbations and maintaining computational efficiency. While the selective state space model (SSM) shows promise in modeling long-range dependencies with linear complexity, it also encounters challenges such as local pixel forgetting and channel redundancy. To address this issue, we propose an efficient image deblurring network that leverages the selective state space model to aggregate enriched and accurate features. Specifically, we introduce an aggregate local and global information block (ALGBlock) designed to effectively capture and integrate both local invariant properties and non-local information. The ALGBlock comprises two primary modules: a module for capturing local and global features (CLGF), and a feature aggregation module (FA). The CLGF module is composed of two branches: the global branch captures long-range dependency features via a selective state space model, while the local branch employs simplified channel attention to model local connectivity, thereby reducing local pixel forgetting and channel redundancy. In addition, we design the FA module to accentuate the local part by recalibrating the weights during the aggregation of the two branches for restoration. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches on widely used benchmarks.
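
The local branch's simplified channel attention can be pictured as global average pooling followed by a 1x1 convolution that produces per-channel rescaling weights, as in the minimal sketch below. The exact layer layout is an assumption; the global selective-state-space branch and the FA aggregation module are not shown.

```python
# Minimal simplified channel attention: per-channel statistics from global average
# pooling are mapped by a 1x1 convolution to channel weights that rescale the features.
import torch
import torch.nn as nn

class SimplifiedChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # (B, C, 1, 1) channel statistics
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.proj(self.pool(x))                  # channel-wise recalibration

x = torch.randn(2, 64, 128, 128)
print(SimplifiedChannelAttention(64)(x).shape)              # torch.Size([2, 64, 128, 128])
```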



Paperid:907 Poster
Authors:Xinpeng Li,Teng Wang,Shuyi Mao,Jinbao Wang,Jian Zhao,Xiaojiang Peng,Feng Zheng,Xuelong Li
Abstract:
Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects by off-the-shelf detectors, then perform emotion classification through the late fusion of subject and context features. However, the complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject-context elements. To address the challenge, we present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT), for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a ``decouple-then-fuse'' manner. The decoupled query tokens—subject queries and context queries—gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement and a 6.46% average precision gain on the CAER-S and EMOTIC datasets, respectively.



Paperid:908 Poster
Authors:mengmeng Ge,Xu Jia,Takashi Isobe,Xiaomin Li,Qinghe Wang,Jing Mu,Dong Zhou,liwang Amd,Huchuan Lu,Lu Tian,Ashish Sirasao,Emad Barsoum
Abstract:
Subject-driven image generation, aimed at customizing user-specified subjects, has experienced rapid progress. However, most of them focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task, that is, capturing the interaction between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. Intrinsically, the interaction between subjects is diverse and is difficult to describe in only a few words. In addition, typical exemplar images are about the interaction between humans, which further intensifies the challenge of interaction-driven image generation with various categories of subjects. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework begins by learning a pseudo-word for a single pose of each subject in the interaction. This is then employed to promote the learning of the concept for the interaction. In addition, language prior and cross-attention loss are incorporated into the optimization process to encourage the modeling of interaction. Extensive experiments demonstrate that the proposed methods are able to effectively invert the interactive pose from exemplar images and apply it to the customized generation with user-specified interaction.



Paperid:909 Poster
Authors:Zihan Huang,Xinyu Shi,Zecheng Hao,Tong Bu,Jianhao Ding,Zhaofei Yu,Tiejun Huang
Abstract:
Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. There are two main approaches to constructing SNNs. Direct training methods require substantial memory, while conversion methods offer a simpler and more efficient option. However, current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. Converting Transformers to SNNs is challenging because of the presence of non-linear modules. In this paper, we propose an Expectation Compensation Module to preserve accuracy during conversion. The core idea is to use information from the previous T time-steps to calculate the expected output at time-step T. We also propose a Multi-Threshold Neuron and the corresponding Parallel Parameter normalization to address the challenge of the large number of time steps needed for high accuracy, aiming to reduce network latency and power consumption. Our experimental results demonstrate that our approach achieves state-of-the-art performance. For example, we achieve a top-1 accuracy of 88.60% with only a 1% loss in accuracy using 4 time steps while consuming only 35% of the original power of the Transformer. To our knowledge, this is the first successful ANN-to-SNN conversion for Spiking Transformers that achieves high accuracy, low latency, and low power consumption on complex datasets.
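
The stated core idea, using the accumulated inputs of the first T time-steps to form an expectation before applying a non-linear module, can be sketched with a wrapper that feeds the running mean of its spiking inputs to the wrapped ANN module and emits only the increment of the expected cumulative output. This is a toy interpretation of that one sentence; the paper's Expectation Compensation Module, multi-threshold neurons, and parallel parameter normalization are not reproduced here.

```python
# Toy expectation-based wrapper for a non-linear module during ANN-to-SNN conversion:
# at step t the wrapped module sees the running mean of the spiking inputs, and the
# wrapper emits increments so that the outputs summed over T steps equal the module
# applied to the empirical spike rate.
import torch

class ExpectationWrapper(torch.nn.Module):
    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module
        self.register_buffer("input_sum", torch.tensor(0.0))
        self.register_buffer("output_sum", torch.tensor(0.0))
        self.t = 0

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        self.t += 1
        self.input_sum = self.input_sum + x_t
        expected_in = self.input_sum / self.t             # running estimate of the spike rate
        expected_out = self.module(expected_in) * self.t  # expected cumulative output
        out_t = expected_out - self.output_sum            # emit only the increment
        self.output_sum = expected_out
        return out_t

# Toy usage: after T steps, the summed outputs equal GELU applied to the empirical rate.
ecm = ExpectationWrapper(torch.nn.GELU())
x = torch.rand(4, 16)
spikes = (torch.rand(8, 4, 16) < x).float()      # rate-coded input with rate x
total = sum(ecm(s) for s in spikes)               # cumulative output over T = 8 steps
print((total / 8 - torch.nn.GELU()(x)).abs().mean().item())  # gap shrinks as T grows
```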



Paperid:910 Poster
Authors:Yuzheng Wang,Zhaoyu Chen,Jie Zhang,Dingkang Yang,Zuhao Ge,Yang Liu,Siao Liu,Yunquan Sun,Wenqiang Zhang,Lizhe Qi
Abstract:
Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the pre-trained teacher network, without the original training data. Most existing DFKD methods rely heavily on additional generation modules to synthesize substitution data, resulting in high computational costs, and ignore the massive amounts of easily accessible, low-cost, unlabeled open-world data. Meanwhile, existing methods ignore the domain shift issue between the substitution data and the original data, so knowledge from teachers is not always trustworthy and structured knowledge from data becomes a crucial supplement. To tackle these issues, we propose a novel Open-world Data Sampling Distillation (ODSD) method for the DFKD task without the redundant generation process. First, we try to sample open-world data close to the original data's distribution with an adaptive sampling module and introduce a low-noise representation to alleviate the domain shift issue. Then, we build structured relationships among multiple data examples to exploit data knowledge through the student model itself and the teacher's structured representation. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance with lower FLOPs and fewer parameters. In particular, we improve accuracy by 1.50%-9.59% on the ImageNet dataset and avoid training a separate generator for each class.



Paperid:911 Poster
Authors:Yi Lei,Huilin Zhu,Jingling Yuan,Guangli Xiang,Xian Zhong,Shengfeng He
Abstract:
Drone-based crowd tracking faces difficulties in accurately identifying and monitoring objects from an aerial perspective, largely due to their small size and close proximity to each other, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It specifically addresses the problem of cross-frame motion to enhance tracking accuracy and dependability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from the visual language model, integrating appearance with motion cues. The framework utilizes the Hungarian algorithm to ensure the accurate matching of individuals across frames. Demonstrated on the DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in scenarios captured by drones.
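
The cross-frame assignment step with the Hungarian algorithm can be sketched as building a cost matrix from position and appearance distances and solving the optimal matching; the cost weighting below is an illustrative choice, not the paper's exact formulation.

```python
# Hungarian matching of individuals across frames: combine a position cost (Euclidean
# distance between anchor locations) with an appearance cost (cosine distance between
# features), then solve the assignment with scipy's linear_sum_assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_across_frames(prev_pos, prev_feat, curr_pos, curr_feat, w_pos=1.0, w_app=1.0):
    """prev_pos/curr_pos: (N, 2)/(M, 2) locations; prev_feat/curr_feat: (N, D)/(M, D) features."""
    pos_cost = np.linalg.norm(prev_pos[:, None, :] - curr_pos[None, :, :], axis=-1)
    pn = prev_feat / np.linalg.norm(prev_feat, axis=1, keepdims=True)
    cn = curr_feat / np.linalg.norm(curr_feat, axis=1, keepdims=True)
    app_cost = 1.0 - pn @ cn.T                       # cosine distance
    cost = w_pos * pos_cost + w_app * app_cost
    rows, cols = linear_sum_assignment(cost)         # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: five points with small motion between frames.
prev_pos = np.random.rand(5, 2) * 100
curr_pos = prev_pos + np.random.randn(5, 2)
prev_feat, curr_feat = np.random.rand(5, 16), np.random.rand(5, 16)
print(match_across_frames(prev_pos, prev_feat, curr_pos, curr_feat))
```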



Paperid:912 Poster
Authors:Liu Shuyuan,Jiawei Chen,Shouwei Ruan,Hang Su,ZHAOXIA YIN
Abstract:
Embodied intelligence empowers agents with a profound sense of perception, enabling them to respond in a manner closely aligned with real-world situations. Large Language Models (LLMs) delve into language instructions with depth, serving a crucial role in generating plans for intricate tasks. Thus, LLM-based embodied models further enhance the agent's capacity to comprehend and process information. However, this amalgamation also ushers in new challenges in the pursuit of heightened intelligence. Specifically, attackers can manipulate LLMs to produce irrelevant or even malicious outputs by altering their prompts. Confronted with this challenge, we observe a notable absence of multi-modal datasets essential for comprehensively evaluating the robustness of LLM-based embodied models. Consequently, we construct the Embodied Intelligent Robot Attack Dataset (EIRAD), tailored specifically for robustness evaluation. Additionally, two attack strategies are devised, including untargeted attacks and targeted attacks, to effectively simulate a range of diverse attack scenarios. Moreover, to more accurately ascertain whether our method succeeds in attacking the LLM-based embodied model, we devise a new attack success evaluation method utilizing the BLIP2 model. Recognizing the time- and cost-intensive nature of the GCG algorithm in attacks, we devise a scheme for prompt suffix initialization based on various target tasks, thus expediting the convergence process. Experimental results demonstrate that our method exhibits a superior attack success rate when targeting LLM-based embodied models, indicating a lower level of decision-level robustness in these models.



Paperid:913 Poster
Authors:Ruoxi Deng,Bin Yu,Jinxuan Lu,Caixia Zhou,Zhao-Min Chen,Jie Hu
Abstract:
Semantic edge detection (SED) is pivotal for the precise demarcation of object boundaries, yet it faces ongoing challenges due to the prevalence of low-quality labels in current methods. In this paper, we present a novel solution to bolster SED through the encoding of both language and image data. Distinct from antecedent language-driven techniques, which predominantly utilize static elements such as dataset labels, our method taps into the dynamic language content that details the objects in each image and their interrelations. By encoding this varied input, we generate integrated features that utilize semantic insights to refine the high-level image features and the ultimate mask representations. This advancement markedly betters the quality of these features and elevates SED performance. Experimental evaluation on benchmark datasets, including SBD and Cityscape, showcases the efficacy of our method, achieving leading ODS F-scores of 79.0 and 76.0, respectively. Our approach signifies a notable advancement in SED technology by seamlessly integrating multimodal textual information, embracing both static and dynamic aspects.



Paperid:914 Poster
Authors:Yuxin Mao,Xuyang Shen,Jing Zhang,Zhen Qin,Jinxing Zhou,Mochu Xiang,Yiran Zhong,Yuchao Dai
Abstract:
The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics.



Paperid:915 Poster
Authors:Jiacheng Zhang,Jie Wu,Huafeng Kuang,Haiming Zhang,Yuxi Ren,Weifeng Chen,Manlin Zhang,Xuefeng Xiao,Rui Wang,Shilei Wen,Guanbin Li
Abstract:
Recently, there has been significant progress in leveraging human feedback to enhance image generation, leading to the emergence of a rapidly evolving research area. However, current work faces several critical challenges: i) insufficient data quantity; and ii) coarse feedback learning. To tackle these challenges, we present TreeReward, a novel multi-dimensional, fine-grained, and adaptive feedback learning framework that aims to improve both the semantic and aesthetic aspects of diffusion models. Specifically, to address the limitation of fine-grained feedback data, we first design an efficient feedback data construction pipeline in an "AI + Expert" fashion, yielding a high-quality feedback dataset of about 2.2M entries encompassing six fine-grained dimensions. Built upon this, we introduce a tree-structured reward model to exploit the fine-grained feedback data efficiently and provide tailored optimization during feedback learning. Extensive experiments on both Stable Diffusion v1.5 (SD1.5) and Stable Diffusion XL (SDXL) demonstrate the effectiveness of our method in enhancing general and fine-grained generation performance and its generalizability to downstream tasks.



Paperid:916 Poster
Authors:Mingjin Zhang,Chi Zhang,Qiming Zhang,Yunsong Li,Xinbo Gao,Jing Zhang
Abstract:
Recent advancements in deep learning have greatly advanced the field of infrared small object detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve comparable performance to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model's performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released.



Paperid:917 Poster
Authors:Shiwei Li,Yingyi Cheng,Haozhao Wang,Xing Tang,Shijie Xu,weihongluo,Yuhua Li,Dugang Liu,xiuqiang He,Ruixuan Li
Abstract:
Federated learning is a promising distributed machine learning paradigm that can effectively protect data privacy. However, it may involve significant communication overhead, thereby potentially impairing training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters within predefined random noise. For this purpose, we propose Federated Masked Random Noise (FedMRN), a novel framework that enables clients to learn a 1-bit mask for each model parameter and apply masked random noise (i.e., the Hadamard product of random noise and masks) to represent model updates. To make FedMRN feasible, we propose an advanced mask training strategy, called progressive stochastic masking (PSM). After local training, clients only transmit local masks and a random seed to the server. Additionally, we provide theoretical guarantees for the convergence of FedMRN under both strongly convex and non-convex assumptions. Extensive experiments are conducted on four popular datasets. The results show that FedMRN exhibits superior convergence speed and test accuracy compared to relevant baselines, while attaining a similar level of accuracy as FedAvg.
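As a rough illustration of the masked-random-noise idea described above (not the authors' released code), the sketch below represents a model update as the Hadamard product of a 1-bit mask and seeded random noise; the greedy sign-matching rule and the function name are placeholders for the learned progressive stochastic masking.

```python
import torch

def masked_random_noise_update(delta, seed):
    """Represent a dense model update `delta` as masked random noise.

    The server and client share the random seed, so only the 1-bit mask
    (plus the seed) needs to be transmitted. A greedy sign-matching rule
    stands in here for the mask actually learned during local training.
    """
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(delta.shape, generator=g)                     # predefined random noise
    mask = (torch.sign(noise) == torch.sign(delta)).to(delta.dtype)   # 1-bit mask per parameter
    return mask, mask * noise                                         # transmitted mask, reconstructed update


# Usage: the server regenerates the same noise from the seed and applies mask * noise.
delta = torch.randn(10)                     # a client's true model update
mask, approx_update = masked_random_noise_update(delta, seed=42)
```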



Paperid:918 Poster
Authors:Jie Huang,Zhao-Min Chen,Xiaoqin Zhang,YisuGe,Lusi Ye,Guodao Zhang,Huiling Chen
Abstract:
Deep learning has made significant advancements and breakthroughs in medical image recognition. However, the clinical reality is complex and multifaceted, with patients often suffering from multiple intertwined diseases, not all of which are equally common, leading to medical datasets that are frequently characterized by multi-labels and a long-tailed distribution. In this paper, we propose a method involving label decoupling and reconstruction (LDRNet) to address these two specific challenges. The label decoupling utilizes the fusion of semantic information from both categories and images to capture the class-aware features across different labels. This process not only integrates semantic information from labels and images to improve the model's ability to recognize diseases, but also captures comprehensive features across various labels to facilitate a deeper understanding of disease characteristics within the dataset. Following this, our label reconstruction method uses the class-aware features to reconstruct the label distribution. This step generates a diverse array of virtual features for tail categories, promoting unbiased learning for the classifier and significantly enhancing the model’s generalization ability and robustness. Extensive experiments conducted on three multi-label long-tailed medical image datasets, including the Axial Spondyloarthritis Dataset, NIH Chest X-ray 14 Dataset, and ODIR-5K Dataset, have demonstrated that our approach achieves state-of-the-art performance, showcasing its effectiveness in handling the complexities associated with multi-label and long-tailed distributions in medical image recognition.



Paperid:919 Poster
Authors:Chengyi Yang,Mingda Dong,Xiaoyue Zhang,Jiayin Qi,Aimin Zhou
Abstract:
Continual learning aims to learn new knowledge from a sequence of tasks without forgetting. Recent studies have found that projecting gradients onto the orthogonal direction of task-specific features is effective. However, these methods mainly focus on mitigating catastrophic forgetting by adopting old features to construct projection spaces, neglecting the potential to enhance plasticity and the valuable information contained in previous gradients. To enhance plasticity and effectively utilize the gradients from old tasks, we propose Gradient Projection in Common Null Space (GPCNS), which projects current gradients into the common null space of final gradients under all preceding tasks. Moreover, to integrate both feature and gradient information, we propose a collaborative framework that allows GPCNS to be utilized in conjunction with existing gradient projection methods as a plugin that provides gradient information and better plasticity. Experimental evaluations conducted on three benchmarks demonstrate that GPCNS exhibits superior plasticity compared to conventional gradient projection methods. More importantly, GPCNS can effectively improve the backward transfer and average accuracy for existing gradient projection methods when applied as a plugin, which outperforms all the gradient projection methods without increasing learnable parameters and customized objective functions.
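The core projection can be sketched in a few lines of NumPy, assuming one stored final gradient per previous task; the function name and the SVD-based null-space construction are illustrative choices, not the paper's exact plugin interface.

```python
import numpy as np

def project_to_common_null_space(grad, old_grads, eps=1e-5):
    """Project the current flattened gradient onto the common null space of
    the final gradients stored from all previous tasks, so the update does
    not interfere with the directions those gradients span.

    grad:      shape (d,)   current-task gradient
    old_grads: shape (t, d) one stored gradient per previous task
    """
    _, s, vt = np.linalg.svd(old_grads, full_matrices=True)
    rank = int((s > eps).sum())
    null_basis = vt[rank:]                      # rows spanning the common null space
    return null_basis.T @ (null_basis @ grad)   # component of grad lying in that space


# Toy check: the projected gradient is orthogonal to every stored gradient.
old = np.random.randn(3, 8)
g = np.random.randn(8)
g_proj = project_to_common_null_space(g, old)
assert np.allclose(old @ g_proj, 0.0, atol=1e-8)
```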



Paperid:920 Poster
Authors:Siru Zhong,Xixuan Hao,Yibo Yan,Ying Zhang,Yangqiu Song,Yuxuan Liang
Abstract:
Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap. Our code is publicly accessible, and the dataset will be made available at https://anonymous.4open.science/r/UrbanCross/.



Paperid:921 Poster
Authors:Zhengwei Yin,Mingze MA,Guixu Lin,Yinqiang Zheng
Abstract:
Amidst the prevailing trend of escalating demands for data and computational resources, the efficiency of data utilization emerges as a critical lever for enhancing the performance of deep learning models, especially in the realm of image restoration tasks. This investigation delves into the intricacies of data efficiency in the context of image restoration, with Gaussian image denoising serving as a case study. We postulate a strong correlation between the model's performance and the content information encapsulated in the training images. This hypothesis is rigorously tested through experiments conducted on synthetically blurred datasets. Building on this premise, we delve into the data efficiency within training datasets and introduce an effective and stabilized method for quantifying content information, thereby enabling the ranking of training images based on their influence. Our in-depth analysis sheds light on the impact of various subset selection strategies, informed by this ranking, on model performance. Furthermore, we examine the transferability of these efficient subsets across disparate network architectures. The findings underscore the potential to achieve comparable, if not superior, performance with a fraction of the data—highlighting instances where training IRCNN and Restormer models with only 3.89% and 2.30% of the data resulted in a negligible drop and, in some cases, a slight improvement in PSNR. This investigation offers valuable insights and methodologies to address data efficiency challenges in Gaussian denoising. Similarly, our method yields comparable conclusions in other restoration tasks. We believe this will be beneficial for future research. Codes will be available at [URL].



Paperid:922 Poster
Authors:Xihong Yang,Erxue Min,KE LIANG,Yue Liu,Siwei Wang,sihang zhou,Huijun Wu,Xinwang Liu,En Zhu
Abstract:
Contrastive deep graph clustering (CDGC) leverages the power of contrastive learning to group nodes into different clusters. The quality of contrastive samples is crucial for achieving better performance, making augmentation techniques a key factor in the process. However, the augmentation samples in existing methods are always predefined by human experience and agnostic to the downstream clustering task, thus leading to high human resource costs and poor performance. To overcome these limitations, we propose a Graph Node Clustering with Fully Learnable Augmentation, termed GraphLearner. It introduces learnable augmentors to generate high-quality and task-specific augmented samples for CDGC. GraphLearner incorporates two learnable augmentors specifically designed for capturing attribute and structural information. Moreover, we introduce two refinement matrices, including the high-confidence pseudo-label matrix and the cross-view sample similarity matrix, to enhance the reliability of the learned affinity matrix. During the training procedure, we notice the distinct optimization goals for training learnable augmentors and contrastive learning networks. In other words, we should guarantee both the consistency of the embeddings and the diversity of the augmented samples. To address this challenge, we propose an adversarial learning mechanism within our method. Besides, we leverage a two-stage training strategy to refine the high-confidence matrices. Extensive experimental results on six benchmark datasets validate the effectiveness of GraphLearner.



Paperid:923 Poster
Authors:Wenhao Li,Qiangchang Wang,peng zhao,Yilong Yin
Abstract:
Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL due to the confusion of numerous irrelevant tokens during interaction. To address these aforementioned issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA only selects the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as the context prompt to provide the global context in three cascaded stages. As a result, irrelevant tokens can be progressively suppressed. Secondly, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, augmented visual features and class-aware prompts are interacted via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representation in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate significant gains over the state-of-the-art methods, especially for the 1-shot task with 2.28% improvement on average due to semantically enhanced visual representations.
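A simplified, single-head sketch of the K-NN attention with a mean-token context prompt is given below; the learned query/key/value projections, the multi-head structure, and the cascaded stages of KCA are omitted, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def knn_context_attention(x, k):
    """K-NN attention with a mean-token context prompt (simplified sketch):
    every query attends only to its K highest-scoring keys, and the mean of
    all tokens is appended as an extra key/value to supply global context.

    x: (B, N, D) token features; returns (B, N, D) updated tokens.
    """
    B, N, D = x.shape
    ctx = x.mean(dim=1, keepdim=True)               # (B, 1, D) context prompt
    kv = torch.cat([x, ctx], dim=1)                 # (B, N+1, D) keys/values
    scores = x @ kv.transpose(1, 2) / D ** 0.5      # (B, N, N+1)
    kth = scores.topk(k, dim=-1).values[..., -1:]   # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float('-inf'))
    attn = F.softmax(scores, dim=-1)                # irrelevant tokens suppressed
    return attn @ kv


tokens = torch.randn(2, 16, 32)
out = knn_context_attention(tokens, k=4)            # (2, 16, 32)
```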



Paperid:924 Poster
Authors:Teng Hu,Jiangning Zhang,Ran Yi,Yating Wang,Jieyu Weng,Hongrui Huang,Yabiao Wang,Lizhuang Ma
Abstract:
The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video, image-to-video generation, video editing, and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computation resources due to the large amount of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control, preventing the realization of some specific camera controls, such as various camera movements in films. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving objects region based on the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions together, enabling our model to achieve more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.



Paperid:925 Poster
Authors:Tianqi Wei,Zhi Chen,Zi Huang,Xin Yu
Abstract:
Existing plant disease classification models have achieved remarkable performance in recognizing in-laboratory diseased images. However, their performance often significantly degrades in classifying in-the-wild images. Furthermore, we observed that in-the-wild plant images may exhibit similar appearances across various diseases (i.e., small inter-class discrepancy) while the same diseases may look quite different (i.e., large intra-class variance). Motivated by this observation, we propose an in-the-wild multimodal plant disease recognition dataset that contains the largest number of disease classes but also text-based descriptions for each disease. Particularly, the newly provided text descriptions are introduced to provide rich information in textual modality and facilitate in-the-wild disease classification with small inter-class discrepancy and large intra-class variance issues. Therefore, our proposed dataset can be regarded as an ideal testbed for evaluating disease recognition methods in the real world. In addition, we further present a strong yet versatile baseline that models text descriptions and visual data through multiple prototypes for a given class. By fusing the contributions of multimodal prototypes in classification, our baseline can effectively address the small inter-class discrepancy and large intra-class variance issues. Remarkably, our baseline model can not only classify diseases but also recognize diseases in few-shot or training-free scenarios. Extensive benchmarking results demonstrate that our proposed in-the-wild multimodal dataset sets many new challenges to the plant disease recognition task and there is a large space to improve for future works.



Paperid:926 Poster
Authors:Jianjun Xiang,Yuanjie Dang,Peng Chen,Ronghua Liang,Ruohong Huan,Nan Gao
Abstract:
Current state-of-the-art video quality assessment (VQA) models typically integrate various perceptual features to comprehensively represent video quality degradation. These models either directly concatenate features or fuse different perceptual scores while ignoring the domain gaps between cross-aware features, thus failing to adequately learn the correlations and interactions between different perceptual features. To this end, we analyze the independent effects and information gaps of quality- and semantic-aware features on video quality. Based on an analysis of the spatial and temporal differences between the two aware features, we propose a Semantic-Aware and Quality-Aware Interaction Network (A$^2$INet) for blind VQA (BVQA). For spatial gaps, we introduce a cross-aware guided interaction module to enhance the interaction between semantic- and quality-aware features in a local-to-global manner. Considering temporal discrepancies, we design a cross-aware temporal modeling module to further perceive temporal content variation and quality saliency information, and perceptual features are regressed into a quality score by a temporal network and temporal pooling. Extensive experiments on six benchmark VQA datasets show that our model achieves state-of-the-art performance, and ablation studies further validate the effectiveness of each module. We also present a simple video sampling strategy to balance the effectiveness and efficiency of the model. The code for the proposed method will be released.



Paperid:927 Poster
Authors:Weizhi Liu,Yue Li,Dongdong Lin,Hui Tian,Haizhou Li
Abstract:
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.



Paperid:928 Poster
Authors:Chaomin Shen,Yaomin Huang,HaoKun Zhu,Jinsong Fan,Guixu Zhang
Abstract:
Knowledge distillation has become widely recognized for its ability to transfer knowledge from a large teacher network to a compact and more streamlined student network. Traditional knowledge distillation methods primarily follow a teacher-oriented paradigm that imposes the task of learning the teacher's complex knowledge onto the student network. However, significant disparities in model capacity and architectural design hinder students' comprehension of the complex knowledge imparted by the teacher, resulting in sub-optimal learning results. This paper introduces a novel approach that emphasizes a student-oriented perspective and refines the teacher's knowledge to better align with the student's needs, thereby improving knowledge transfer effectiveness. Specifically, we present Student-Oriented Knowledge Distillation (SoKD), which incorporates a learnable feature augmentation strategy during training to dynamically refine the teacher's knowledge for the student. Furthermore, we deploy the Distinctive Area Detection Module (DAM) to identify areas of mutual interest between the teacher and student, concentrating knowledge transfer within these critical areas to avoid spreading irrelevant information. This targeted approach ensures a more focused and effective knowledge distillation process. Our approach, functioning as a plug-in, could be integrated with various knowledge distillation methods. Extensive experimental results demonstrate the efficacy and generalizability of our method.
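A minimal sketch in the spirit of the student-oriented idea is shown below, under the assumption that a small learnable module refines the teacher's features into a student-friendly target; the Distinctive Area Detection Module is omitted and the class name is illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentOrientedFeatureKD(nn.Module):
    """A small learnable module first refines the (detached) teacher feature
    map, and the student is trained to match this refined target rather than
    the raw teacher feature.
    """
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(                 # learnable feature augmentation
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, teacher_feat, student_feat):
        target = self.refine(teacher_feat.detach())  # student-oriented target
        return F.mse_loss(student_feat, target)


kd = StudentOrientedFeatureKD(channels=64)
loss = kd(torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8))
```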



Paperid:929 Poster
Authors:Bingyan Liu,Chengyu Wang,Jun Huang,Kui Jia
Abstract:
Building on recent breakthroughs in diffusion-based text-to-image synthesis (TIS), training-free text-guided image editing (TIE) has become an indispensable aspect of modern image editing practices. It involves modifying the features in attention layers to alter objects or their attributes within images during the generation process. Yet, current image editing algorithms still present certain difficulties and challenges when it comes to editing multiple objects within an image. In this paper, we propose VICTORIA, a novel approach that enhances TIE by incorporating linguistic knowledge when manipulating attention maps during image generation. VICTORIA leverages components within self-attention layers to maintain spatial consistency between source and target images. Additionally, a novel loss function is designed to refine cross-attention maps, ensuring their alignment with linguistic constraints and enhancing the editing of multiple target entities. We also introduce a linguistic mask blending technique to improve the retention of information in areas exempt from modification. Experimental results across seven diverse datasets demonstrate that VICTORIA achieves substantial enhancements over state-of-the-art methods. This work highlights the critical role and effectiveness of linguistic analysis in boosting the performance of TIE.



Paperid:930 Poster
Authors:Zhen-Xiang Ma,Zhen-Duo Chen,Li-Jun Zhao,Zi-Chao Zhang,Tai Zheng,Xin Luo,Xin-Shun Xu
Abstract:
In recent years, the Few-Shot Fine-Grained Image Classification (FS-FGIC) problem has gained widespread attention. A number of effective methods have been proposed that focus on extracting discriminative information within high-level features in a single episode/task. However, this is insufficient for addressing the cross-task challenges of FS-FGIC, which is represented in two aspects. On the one hand, from the perspective of the Fine-Grained Image Classification (FGIC) task, there is a need to supplement the model with mid-level features containing rich fine-grained information. On the other hand, from the perspective of the Few-Shot Learning (FSL) task, explicit modeling of cross-task general knowledge is required. In this paper, we propose a novel Bi-directional Task-Guided Network (BTG-Net) to tackle these issues. Specifically, from the FGIC task perspective, we design the Semantic-Guided Noise Filtering (SGNF) module to filter noise on mid-level features rich in detailed information. Further, from the FSL task perspective, the General Knowledge Prompt Modeling (GKPM) module is proposed to retain the cross-task general knowledge by utilizing the prompting mechanism, thereby enhancing the model’s generalization performance on novel classes. We have conducted extensive experiments on five few-shot fine-grained benchmark datasets, and the results demonstrate that BTG-Net outperforms state-of-the-art methods comprehensively.



Paperid:931 Poster
Authors:Guoqing Yang,Zhiming Luo,Jianzhe Gao,Yingxin Lai,Kun Yang,Yifan He,Shaozi Li
Abstract:
Human behavior anomaly detection aims to identify unusual human actions, playing a crucial role in intelligent surveillance and other areas. The current mainstream methods still adopt reconstruction or future frame prediction techniques. However, reconstructing or predicting low-level pixel features easily enables the network to achieve overly strong generalization ability, allowing anomalies to be reconstructed or predicted as effectively as normal data. Different from their methods, inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network (MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network. Specifically, we first utilize the Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore latent motion features. Then, the RGB encoder guides the mask encoder, which takes masked RGB frames as input, to explore the latent appearance feature. Additionally, we design a Behavior-Scene Matching Module (BSMM) to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on ShanghaiTech and UBnormal datasets, with AUC of 86.9 % and 74.3 %, respectively. The code is available on GitHub.
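A toy sketch of the scoring step is given below, assuming clip-level features from the guidance and exploration networks are available; the cosine-distance choice and the function name are illustrative, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def guidance_exploration_anomaly_score(guidance_feat, exploration_feat):
    """Score clips by the gap between the guidance network's and the
    exploration network's high-level representations: agreement is expected
    for normal behaviour seen during training, so a large gap signals an
    anomaly. Both inputs: (B, D) clip-level features.
    """
    g = F.normalize(guidance_feat, dim=1)
    e = F.normalize(exploration_feat, dim=1)
    return 1.0 - (g * e).sum(dim=1)          # cosine distance per clip


scores = guidance_exploration_anomaly_score(torch.randn(4, 128), torch.randn(4, 128))
```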



Paperid:932 Poster
Authors:Feifei Zhang,Sijia Qu,Fan Shi,Changsheng Xu
Abstract:
This work tackles the persistent challenge of image-text retrieval, a key problem at the intersection of computer vision and natural language processing. Despite significant advancements facilitated by large-scale Contrastive Language-Image Pretraining (CLIP) models, we found that existing methods fall short in bridging the fine-grained semantic gap between visual and textual representations, particularly in capturing the nuanced interplay of local visual details and the textual descriptions. To address the above challenges, we propose a general framework called Local and Generative-driven Modality Gap Correction (LG-MGC), which is devoted to simultaneously enhancing representation learning and alleviating the modality gap in cross-modal retrieval. Specifically, the proposed model consists of two main components: a local-driven semantic completion module, which complements specific local context information that is overlooked by traditional models within global features, and a generative-driven semantic translation module, which leverages generated features as a bridge to mitigate the modality gap. This framework not only tackles the granularity of semantic correspondence and improves the performance of existing methods without requiring additional trainable parameters, but is also designed to be plug-and-play, allowing for easy integration into existing retrieval models without altering their architectures. Extensive qualitative and quantitative experiments demonstrate the effectiveness of LG-MGC by achieving consistent state-of-the-art performance over strong baselines. The code is included in the supplementary material.



Paperid:933 Poster
Authors:Wenxiao Zhang,Hossein Rahmani,Xun Yang,Jun Liu
Abstract:
Unpaired point cloud completion involves filling in missing parts of a point cloud without requiring partial-complete correspondence. Meanwhile, since point cloud completion is an ill-posed problem, there are multiple ways to generate the missing parts. Existing GAN-based methods transform partial shape encoding into a complete one in the low-dimensional latent feature space. However, “mode collapse” often occurs, where only a subset of the shapes is represented in the low-dimensional space, reducing the diversity of the generated shapes. In this paper, we propose a novel unpaired multimodal shape completion approach that directly operates on point coordinate space. We achieve unpaired completion via an unconditional diffusion model trained on complete data by “hijacking” the generative process. We further augment the diffusion model by introducing two guidance mechanisms to help map the partial point cloud to the complete one while preserving its original structure. We conduct extensive evaluations of our approach, which show that our method generates shapes that are more diverse and better preserve the original structures compared to alternative methods.



Paperid:934 Poster
Authors:Lu Chen,Qiangchang Wang,Zhaohui Li,Yilong Yin
Abstract:
Fine-grained Visual Recognition (FGVR) aims to distinguish objects within similar subcategories. Humans adeptly perform this challenging task by leveraging both intra-category distinctiveness and inter-category similarity. However, previous methods failed to combine these two complementary dimensions and mine the intrinsic interrelationship among various semantic features. To address the above limitations, we propose HI2R, a Hypergraph-guided Intra- and Inter-category Relation Modeling approach, which simultaneously extracts the intra-category structural information and inter-category relation information for more precise reasoning. Specifically, we exploit a Hypergraph-guided Structure Learning (HSL) module, which employs hypergraphs to capture high-order structural relations, transcending traditional graph-based methods that are limited to pairwise linkages. This advancement allows the model to adapt to significant intra-category variations. Additionally, we propose an Inter-category Relation Perception (IRP) module to improve feature discrimination across categories by extracting and analyzing semantic relations among them. Our objective is to alleviate the robustness issue associated with exclusive reliance on intra-category discriminative features. Furthermore, a random semantic consistency loss is introduced to direct the model's attention to commonly overlooked yet distinctive regions, which indirectly enhances the representation ability of both HSL and IRP modules. Both qualitative and quantitative results demonstrate the effectiveness and usefulness of our proposed HI2R model.



Paperid:935 Poster
Authors:Yuran Wang,Zhijing Wan,Yansheng Qiu,Zheng Wang
Abstract:
In the realm of medical image analysis, self-supervised learning techniques (SSL) have emerged to alleviate labeling demands, while still facing the challenge of training data scarcity owing to escalating resource requirements and privacy constraints. Numerous efforts employ generative models to generate high-fidelity, unlabeled 3D volumes across diverse modalities and anatomical regions. However, the intricate and indistinguishable anatomical structures within the abdomen pose a unique challenge to abdominal CT volume generation compared to other anatomical regions. To address the overlooked challenge, we introduce the Locality-Aware Diffusion (Lad), a novel method tailored for exquisite 3D abdominal CT volume generation. We design a locality loss to refine crucial anatomical regions and devise a condition extractor to integrate abdominal priors into the generation process, thereby enabling the generation of large quantities of high-quality abdominal CT volumes essential for SSL tasks without the need for additional data such as labels or radiology reports. Volumes generated through our method demonstrate remarkable fidelity in reproducing abdominal structures, achieving a decrease in FID score from 0.0034 to 0.0002 on the AbdomenCT-1K dataset, closely mirroring authentic data and surpassing current methods. Extensive experiments demonstrate the effectiveness of our method in self-supervised organ segmentation tasks, resulting in effective improvements in mean Dice scores on two abdominal datasets. These results underscore the potential of synthetic data to advance self-supervised learning in medical image analysis.



Paperid:936 Poster
Authors:Chang Wu,Guancheng Quan,Gang He,Xin-Quan Lai,Yunsong Li,Wenxin Yu,Xianmeng Lin,Cheng Yang
Abstract:
In this paper, we propose a neural representation for videos that enables real-time quality-scalable decoding, called QS-NeRV. QS-NeRV comprises a Self-Learning Distribution Mapping Network (SDMN) and Extensible Enhancement Networks (EENs). Firstly, SDMN functions as the base layer (BL) for scalable video coding, focusing on encoding videos of lower quality. Within SDMN, we employ a methodology that minimizes the bitstream overhead to achieve efficient information exchange between the encoder and decoder instead of direct transmission. Specifically, we utilize an invertible network to map the multi-scale information obtained from the encoder to a specific distribution. Subsequently, during the decoding process, this information is recovered from a randomly sampled latent variable to assist the decoder in achieving improved reconstruction performance. Secondly, EENs serve as the enhancement layers (ELs) and are trained in an overfitting manner to obtain robust restoration capability. By integrating the fixed BL bitstream with the parameters of EEN as an extension pack, the decoder can produce higher-quality enhanced videos. Furthermore, the scalability of the method allows for adjusting the number of combined packs to accommodate diverse quality requirements. Experimental results demonstrate our proposed QS-NeRV outperforms the state-of-the-art real-time decoding INR-based methods on various datasets for video compression and interpolation tasks.



Paperid:937 Poster
Authors:Yang Ding,Yi Dai,Xin Wang,Ling Feng,Lei Cao,Huijun Zhang
Abstract:
Stress has rapidly emerged as a significant public health concern in contemporary society, necessitating prompt identification and effective intervention strategies. Video-based stress detection offers a non-invasive, low-cost, and mass-reaching approach for identifying stress. In this paper, we propose a three-level content-semantic-world knowledge framework, addressing three particular issues for video-based stress detection. (1) How to abstract and encode video semantics with frame contents into visual representation? (2) How to leverage general-purpose LMMs to augment task-specific visual representation? (3) To what extent could general-purpose LMMs contribute to video-based stress detection? We design a Slow-Emotion-Fast-Action scheme to encode fast temporal changes of body actions revealed from video frames, as well as subtle details of emotions per video segment, into visual representation. We augment task-specific visual representation with linguistic facial expression descriptions by prompting general-purpose Large Multimodal Models (LMMs). A knowledge retriever is built to evaluate and select the most proper deliverable of LMMs. Experimental results on two datasets show that 1) our proposed three-level framework achieves a 90.89% F1-score on the UVSD dataset and an 80.79% F1-score on the RSL dataset, outperforming the state-of-the-art; 2) leveraging LMMs helps to improve the F1-score by 2.25% on UVSD and 3.55% on RSL, compared to using the traditional Facial Action Coding System; 3) purely relying on general-purpose LMMs is insufficient, with an 88.73% F1-score on the UVSD dataset and a 77.48% F1-score on the RSL dataset, demonstrating the necessity of combining task-specific dedicated solutions with the world knowledge given by LMMs.



Paperid:938 Poster
Authors:Yao Wu,Mingwei Xing,Yachao Zhang,Yuan Xie,Yanyun Qu
Abstract:
Multi-modal Unsupervised Domain Adaptation (MM-UDA) for large-scale 3D semantic segmentation involves adapting 2D and 3D models to a target domain without labels, which significantly reduces the labor-intensive annotations. Existing MM-UDA methods have often attempted to mitigate the domain discrepancy by aligning features between the source and target data. However, this implementation falls short when applied to image perception due to the susceptibility of images to environmental changes compared to point clouds. To mitigate this limitation, in this work, we explore the potential of an off-the-shelf Contrastive Language-Image Pre-training (CLIP) model with rich yet heterogeneous knowledge. To make CLIP task-specific, we propose a top-performing method, dubbed CLIP2UDA, which makes frozen CLIP reward unsupervised domain adaptation in 3D semantic segmentation. Specifically, CLIP2UDA alternates between two steps during adaptation: (a) learning task-specific prompts, where 2D feature responses from the visual encoder are employed to initiate the learning of an adaptive text prompt for each domain, and (b) learning multi-modal domain-invariant representations, which interact hierarchically in the shared decoder to obtain unified 2D visual predictions. This enhancement allows for effective alignment between the modality-specific 3D and unified feature space via cross-modal mutual learning. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several widely-recognized adaptation scenarios. Code is available at: https://github.com/Barcaaaa/CLIP2UDA.



Paperid:939 Poster
Authors:Pinxue Guo,Wanyun Li,Hao Huang,Lingyi Hong,Xinyu Zhou,Zhaoyu Chen,Jinglun Li,Kaixun Jiang,Wei Zhang,Wenqiang Zhang
Abstract:
Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this approach not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality as a prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which allows prompting the foundation model with various modalities to segment objects precisely. We further propose the Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising the generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Codes will be available.



Paperid:940 Poster
Authors:Shenglin Yin,kelu Yao,Zhen Xiao,Jieyi Long
Abstract:
Existing defense methods against adversarial examples are static, meaning that they remain unchanged once trained, regardless of changes in the attack. Consequently, static defense methods are highly vulnerable to adaptive attacks. We contend that in order to defend against more powerful attacks, the model should continuously adapt to cope with various attack methods. We propose a novel dynamic defense approach that optimizes the input by generating pseudo-labels. Subsequently, it utilizes information maximization and enhanced average prediction as optimization objectives, followed by hierarchical optimization methods to effectively counteract adversarial examples through model parameter optimization. Importantly, our approach is implemented during the inference phase and does not necessitate model retraining. It can be readily applied to existing adversarially trained models, significantly enhancing the robustness of various models against white-box, black-box, and adaptive attacks across diverse datasets. We have conducted extensive experiments to validate the state-of-the-art performance of our proposed method.
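A minimal sketch of one plausible information-maximization objective for inference-time adaptation is shown below (confident per-sample predictions, diverse batch-level predictions); the exact objectives, pseudo-label generation, and hierarchical optimization of the paper are not reproduced here, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def information_maximization_loss(logits):
    """Drive each prediction toward confidence (low per-sample entropy) while
    keeping the batch-level class distribution diverse (high entropy of the
    mean prediction), which discourages collapse onto a single pseudo-label.
    """
    p = F.softmax(logits, dim=1)
    per_sample_entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    p_mean = p.mean(dim=0)
    batch_diversity = -(p_mean * torch.log(p_mean + 1e-8)).sum()
    return per_sample_entropy - batch_diversity


# One adaptation step on a batch of (possibly adversarial) inputs `x`:
# loss = information_maximization_loss(model(x)); loss.backward(); optimizer.step()
```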



Paperid:941 Poster
Authors:Jinglun Li,Xinyu Zhou,Kaixun Jiang,Lingyi Hong,Pinxue Guo,Zhaoyu Chen,Weifeng Ge,Wenqiang Zhang
Abstract:
Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose \textbf{TagOOD}, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of IND object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks. Code will be available.
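The distance-based scoring step can be sketched as follows, assuming object features and learned class centers are already available; the function name is illustrative, not the released TagOOD code.

```python
import torch

def distance_based_ood_score(features, class_centers):
    """Distance-based OOD score: the farther a test sample's (object) feature
    is from its nearest learned in-distribution class center, the more likely
    it is out-of-distribution.

    features:      (B, D) test features
    class_centers: (C, D) one representative center per in-distribution class
    """
    dists = torch.cdist(features, class_centers)   # (B, C) pairwise distances
    return dists.min(dim=1).values                 # higher score -> more likely OOD


scores = distance_based_ood_score(torch.randn(5, 256), torch.randn(10, 256))
```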



Paperid:942 Poster
Authors:Kang Zeng,Hao Shi,Jiacheng Lin,Siyu Li,Jintao Cheng,Kaiwei Wang,Zhiyong Li,Kailun Yang
Abstract:
LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences, significantly modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code of this work will be made publicly available.



Paperid:943 Poster
Authors:Jinkai Zheng,Xinchen Liu,Boyue Zhang,Chenggang Yan,Jiyong Zhang,Wu Liu,Yongdong Zhang
Abstract:
Existing studies for gait recognition primarily utilized sequences of either binary silhouette or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate due to the complex environments. To discover the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, the XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Finally, comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracies of 81.0% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions like occlusions and cloth changes.



Paperid:944 Poster
Authors:Qi Xu,Xuanye Fang,Yaxin Li,Jiangrong Shen,De Ma,Yi Xu,Gang Pan
Abstract:
Spiking Neural Networks (SNNs) have great advantages in discrete event data processing because of their binary digital computation form. However, due to the limitation of the current structures of SNNs, the original event data needs to be preprocessed to reduce the time calculation steps and information redundancy. The traditional methods of dividing data into frames lead to the loss of a large amount of time information. In this paper, we propose an efficient Recurrent Spiking Neural Network (RSNN) to reduce the time-domain information loss of original slice samples, using spiking-based neural dynamics to process dynamic spatial-temporal information. In the Recurrent Spiking Neural Network model, the recurrent structure is used to preprocess slices before they are further input into the spiking structure, enhancing the temporal correlation between slices. In addition, to efficiently match the two-dimensional spatial structure of data sample frames, this paper adopts a variant of the recurrent neural network, named Convolution LSTM (CONLSTM). Through experiments on event-based datasets such as DVS128-Gesture and CIFAR10-DVS, we find that the proposed model not only behaves better than some other spiking-based models but also saves energy and power consumption, which paves the way for practical applications of neuromorphic hardware.
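A minimal ConvLSTM-style cell, sketched below, shows how such recurrent preprocessing can keep the 2D spatial layout of event-frame slices while accumulating temporal correlation; the channel sizes and the class name are illustrative, and the spiking backbone that follows is not shown.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """The four LSTM gates are produced by one convolution over the
    concatenated input slice and previous hidden state, so spatial structure
    is preserved while temporal information accumulates across slices."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


# Preprocess a sequence of event-frame slices (T, B, C, H, W); two channels
# are assumed here for event polarity.
cell = ConvLSTMCell(in_ch=2, hid_ch=16)
slices = torch.randn(5, 1, 2, 32, 32)
h = torch.zeros(1, 16, 32, 32)
c = torch.zeros(1, 16, 32, 32)
for t in range(slices.shape[0]):
    h, c = cell(slices[t], (h, c))   # h carries temporally-correlated features
```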



Paperid:945 Poster
Authors:Wenlong Liao,Sunyuan Qiang,Xianfei Li,Xiaolei Chen,Haoyu Wang,Yanyan Liang,Junchi Yan,Tao He,Pai Peng
Abstract:
Camera calibration consists of determining the intrinsic and extrinsic parameters of an imaging system, which forms the fundamental basis for various computer vision tasks and applications, e.g., robotics and autonomous driving (AD). However, prevailing camera calibration models rely on a time-consuming and labor-intensive off-board process, particularly in mass-production settings, while simultaneously lacking exploration of real-world autonomous driving scenarios. To this end, in this paper, inspired by recent advancements in bird's-eye-view (BEV) perception models, we propose a novel automatic multi-camera Calibration method via Reversed BEV representations for autonomous driving, termed CalibRBEV. Specifically, the proposed CalibRBEV model primarily comprises two stages. Initially, we innovatively reverse the BEV perception pipeline, reconstructing bounding boxes through an attention auto-encoder module to fully extract the latent reversed BEV representations. Subsequently, the representations obtained from the encoder interact with the surrounding multi-view image features for further refinement and calibration parameter prediction. Extensive experimental results on the nuScenes and Waymo datasets validate the effectiveness of our proposed model.



Paperid:946 Poster
Authors:Xiaorui Huang,Gen Luo,Chaoyang Zhu,Bo Tong,Yiyi Zhou,Xiaoshuai Sun,Rongrong Ji
Abstract:
Recently, the Segment Anything Model (SAM) has become a research hotspot in the fields of multimedia and computer vision, exhibiting powerful yet versatile capabilities on various (un)conditional image segmentation tasks. Although SAM can support different types of segmentation prompts, we note that, compared to point- and box-guided segmentations, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in its default light-weight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM, one is end-to-end and the other is layer-wise. With minimal modifications, DITs can directly transform the image encoder of SAM into a stand-alone vision-language learner, in contrast to building another deep fusion branch, maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive benchmark datasets of RIS show that a simple end-to-end DIT can improve SAM by a large margin, while the layer-wise DIT can further boost the performance to state-of-the-art with much less data and training expenditures. Our code is anonymously released at: https://anonymous.4open.science/r/ACMMM-DIT-1075/.



Paperid:947 Poster
Authors:Jian-Jun Qiao,Meng-Yu Duan,Xiao Wu,Wei Li
Abstract:
Cartoon animal parsing aims to segment the body parts such as heads, arms, legs and tails from cartoon animals. Different from previous parsing tasks, cartoon animal parsing faces new challenges, including irregular body structures, abstract drawing styles and diverse animal categories. Existing methods have difficulties when addressing these challenges caused by the spatial and structural characteristics of cartoon animals. To address these challenges, a novel spatial learning and structural modeling network, named CAPNet, is proposed for cartoon animal parsing. It aims to address the critical problems of spatial perception, structure modeling and spatial-structural consistency learning. A spatial-aware learning module integrates deformable convolutions to learn spatial features of the irregular shapes of cartoon animals. The multi-task edge and center point predictions are incorporated to capture intricate spatial patterns. A structural modeling method is proposed to model the intricate structural representations of cartoon animals, which integrates a graph neural network with a shape-aware relation learning module. To mitigate the significant differences among animals, a spatial and structural consistency learning mechanism is proposed to capture and learn feature correlation across different animal categories. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed approach, which outperforms state-of-the-art methods.



Paperid:948 Poster
Authors:Yunshan Qi,Lin Zhu,Yifan Zhao,Nan Bao,Jia Li
Abstract:
Neural Radiance Fields (NeRF) achieve impressive 3D representation learning and novel view synthesis results with high-quality multi-view images as input. However, motion blur in images often occurs in low-light and high-speed motion scenes, which significantly degrades the reconstruction quality of NeRF. Previous deblurring NeRF methods struggle to estimate information during the exposure time and are thus unable to accurately model the motion blur. In contrast, bio-inspired event cameras, which measure intensity changes with high temporal resolution, make up for this information deficiency. In this paper, we propose Event-driven Bundle Adjustment for Deblurring Neural Radiance Fields (EBAD-NeRF) to jointly optimize the learnable poses and NeRF parameters by leveraging hybrid event-RGB data. An intensity-change-metric event loss and a photometric blur loss are introduced to strengthen the explicit modeling of camera motion blur. Experimental results on both synthetic data and real captured data demonstrate that EBAD-NeRF can obtain accurate camera poses during the exposure time and learn sharper 3D representations compared to prior works.
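One common way to express such a photometric blur loss, sketched under the assumption that the blurry frame is approximated by averaging sharp renders at poses sampled across the exposure, is shown below; the function name, the MSE choice, and the dummy renderer are illustrative, and the event loss is not shown.

```python
import torch
import torch.nn.functional as F

def photometric_blur_loss(render_fn, exposure_poses, blurry_image):
    """Explicit motion-blur modeling: render sharp frames at several learnable
    poses spread across the exposure window, average them into a synthetic
    blurry frame, and compare it with the captured blurry image.

    render_fn:      callable pose -> (H, W, 3) rendered image
    exposure_poses: iterable of camera poses within the exposure time
    blurry_image:   (H, W, 3) observed blurry frame
    """
    renders = torch.stack([render_fn(p) for p in exposure_poses], dim=0)
    synthetic_blur = renders.mean(dim=0)            # average of sharp renders
    return F.mse_loss(synthetic_blur, blurry_image)


# Toy check with a dummy renderer that barely depends on the pose.
dummy_render = lambda pose: torch.full((4, 4, 3), 0.5) + 0.01 * pose
loss = photometric_blur_loss(dummy_render,
                             [torch.tensor(0.0), torch.tensor(1.0)],
                             torch.full((4, 4, 3), 0.5))
```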



Paperid:949 Poster
Authors:Qi Zang,Shuang Wang,Dong Zhao,Yang HU,Dou Quan,Jinlong Li,Nicu Sebe,Zhun Zhong
Abstract:
Unanticipated domain shifts can severely degrade model performance, prompting the need for model adaptation techniques (i.e., Source-free Domain Adaptation (SFDA)) to adapt a model to new domains without accessing source data. However, existing SFDA methods often sacrifice source domain performance to improve adaptation on the target, limiting overall model capability. In this paper, we focus on a more challenging paradigm in semantic segmentation, Generalized SFDA (G-SFDA), aiming to achieve robust performance on both source and target domains. To achieve this, we propose a novel G-SFDA framework, Reliable Knowledge Propagation (RKP), for semantic segmentation tasks, which leverages a text-to-image diffusion model to propagate reliable semantic knowledge from the segmentation model. The key of RKP lies in aggregating the predicted reliable but scattered segments into a complete semantic layout and using them to activate the diffusion model for conditional generation. Subsequently, diverse images with multiple domain factors can be synthesized to retrain the segmentation model. This enables the segmentation model to learn domain-invariant knowledge across multiple domains, improving its adaptability to the target domain, maintaining discriminability on the source domain, and even handling unseen domains. Our model-agnostic RKP framework establishes a new state-of-the-art across current SFDA segmentation benchmarks, significantly advancing various SFDA methods. The code will be open source.



Paperid:950 Poster
Authors:Hongye Hou,Xuehao Gao,Zhan Liu,Yang Yang
Abstract:
Recovering the complete shape of a 3D object from limited viewpoints plays an important role in 3D vision. Encouraged by the effectiveness of feature extraction using deep neural networks, recent point cloud completion methods prefer an encoding-decoding architecture for generating the global structure and local geometry from a set of input point proxies. In this paper, we introduce an innovative completion method aimed at uncovering structural details from input point clouds and maximizing their utility. Specifically, we improve both Encoding and Decoding for this task: (1) Key Context Fusion Encoding extracts and aggregates homologous key context by adaptively increasing the sampling bias towards salient structure and special contour points that are more representative of object structure information. (2) Semantic-based Decoding introduces a semantic EdgeConv module to prompt next Transformer decoder, which effectively learns and generates local geometry with semantic correlations from non-nearest neighbors. The experiments are evaluated on several 3D point cloud and 2.5D depth image datasets. Both qualitative and quantitative evaluations demonstrate that our method outperforms previous state-of-the-art methods.



Paperid:951 Poster
Authors:Hanchi Sun,Xiaohong Liu,XINYANG JIANG,Yifei Shen,Dongsheng Li,Xiongkuo Min,Guangtao Zhai
Abstract:
This paper focuses on the task of quality enhancement for compressed videos. Although deep network-based video restorers achieve impressive progress, most of the existing methods lack a structured design to optimally leverage the priors within compression codecs. Since the quality degradation of the video is primarily induced by the compression algorithm, a new paradigm is urgently needed for a more "conscious" process of quality enhancement. As a result, we propose the Compression-Realize Deep Structural Network (CRDS), introducing three inductive biases aligned with the three primary processes in the classic compression codec, merging the strengths of classical encoder architecture with deep network capabilities. Inspired by the residual extraction and domain transformation process in the codec, a pre-trained Latent Degradation Residual Auto-Encoder is proposed to transform video frames into a latent feature space, and the mutual neighborhood attention mechanism is integrated for precise motion estimation and residual extraction. Furthermore, drawing inspiration from the quantization noise distribution of the codec, CRDS proposes a novel Progressive Denoising framework with intermediate supervision that decomposes the quality enhancement into a series of simpler denoising sub-tasks. Experimental results on datasets like LDV 2.0 and MFQE 2.0 indicate our approach surpasses state-of-the-art models.



Paperid:952 Poster
Authors:Jiaxuan Wu,Wu Zhengxian,Xue yiming,Juan Wen,Wanli Peng
Abstract:
Recent advances in large language models (LLMs) have blurred the boundary of high-quality text generation between humans and machines, which is favorable for generative text steganography. However, current advanced steganographic mappings are not suitable for LLMs since most users are restricted to accessing only the black-box API or user interface of the LLMs, thereby lacking access to the training vocabulary and its sampling probabilities. In this paper, we explore a black-box generative text steganographic method based on the user interfaces of large language models, which is called LLM-Stega. The main goal of LLM-Stega is to conduct secure covert communication between Alice (sender) and Bob (receiver) using only the user interfaces of LLMs. Specifically, we first construct a keyword set and design a new encrypted steganographic mapping to embed secret messages. Furthermore, to guarantee accurate extraction of secret messages and rich semantics of generated stego texts, an optimization mechanism based on reject sampling is proposed. Comprehensive experiments demonstrate that the proposed LLM-Stega outperforms current state-of-the-art methods.
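As a toy illustration of the keyword-mapping idea sketched above, the following Python snippet encodes secret bits by selecting keywords from a fixed set, asks a black-box generator to produce text containing them, and uses reject sampling to retry until the bits can be recovered. The keyword set, the 3-bit packing, and the stand-in toy_generator are illustrative assumptions, not LLM-Stega's actual encrypted mapping.

```python
# Hedged toy sketch of keyword-based steganographic mapping with reject sampling.
KEYWORDS = ["river", "lantern", "orchid", "granite", "falcon", "violin", "amber", "compass"]
BITS_PER_KEYWORD = 3  # 8 keywords encode 3 bits each


def extract(text: str, n_chunks: int) -> str:
    """Recover bits from keyword occurrence order in the stego text."""
    found = [w for w in text.lower().split() if w.strip(".,") in KEYWORDS]
    return "".join(format(KEYWORDS.index(w.strip(".,")), f"0{BITS_PER_KEYWORD}b")
                   for w in found[:n_chunks])


def embed(bits: str, generate, max_tries: int = 20) -> str:
    """Split bits into 3-bit chunks, pick one keyword per chunk, and ask the
    (black-box) generator for text containing them; resample until extraction succeeds."""
    chunks = [bits[i:i + BITS_PER_KEYWORD] for i in range(0, len(bits), BITS_PER_KEYWORD)]
    chosen = [KEYWORDS[int(c, 2)] for c in chunks]
    for _ in range(max_tries):
        text = generate(chosen)
        if extract(text, len(chunks)) == bits:   # reject samples that lose information
            return text
    raise RuntimeError("generator never produced an extractable stego text")


if __name__ == "__main__":
    def toy_generator(words):                    # stand-in for a black-box LLM interface
        return "A story about the " + " and the ".join(words) + "."
    stego = embed("101001", toy_generator)
    print(stego, "->", extract(stego, 2))
```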



Paperid:953 Poster
Authors:Miao Cao,Lishun Wang,Huan Wang,Guoqing Wang,Xin Yuan
Abstract:
Video Snapshot Compressive Imaging (SCI) uses a low-speed 2D camera to capture high-speed scenes as snapshot compressed measurements, followed by a reconstruction algorithm to retrieve the high-speed video frames. The fast evolving mobile devices and existing high-performance video SCI reconstruction algorithms motivate us to develop mobile reconstruction methods for real-world applications. Yet, it is still challenging to deploy previous reconstruction algorithms on mobile devices due to the complex inference process, let alone real-time mobile reconstruction. To the best of our knowledge, there is no video SCI reconstruction model designed to run on the mobile devices. Towards this end, in this paper, we present an effective approach for video SCI reconstruction, dubbed MobileSCI, which can run at real-time speed on mobile devices for the first time. Specifically, we first build a U-shaped 2D convolution-based architecture, which is much more efficient and mobile-friendly than previous state-of-the-art reconstruction methods. Besides, an efficient feature mixing block, based on the channel splitting and shuffling mechanisms, is introduced as a novel bottleneck block of our proposed MobileSCI to alleviate the computational burden. Finally, a customized knowledge distillation strategy is utilized to further improve the reconstruction quality. Extensive results on both simulated and real data show that our proposed MobileSCI can achieve superior reconstruction quality with high efficiency on the mobile devices. Particularly, we can reconstruct a 256 × 256 × 8 snapshot compressed measurement with real-time performance (about 35 FPS) on an iPhone 15. Code of this paper will be released.
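The channel-splitting-and-shuffling bottleneck mentioned above can be illustrated with a minimal PyTorch sketch; the layer widths, kernel sizes, and placement of the shuffle are assumptions rather than the authors' MobileSCI block.

```python
# Hedged sketch: a split-process-shuffle mixing block in the spirit of the described bottleneck.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (ShuffleNet-style)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class FeatureMixingBlock(nn.Module):
    """Split channels, process one half cheaply, then shuffle to mix information."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity, active = x.chunk(2, dim=1)      # channel splitting
        active = self.branch(active)
        out = torch.cat([identity, active], dim=1)
        return channel_shuffle(out, groups=2)     # channel shuffling


if __name__ == "__main__":
    block = FeatureMixingBlock(64)
    print(block(torch.randn(1, 64, 128, 128)).shape)
```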



Paperid:954 Poster
Authors:Wendong Huang,Jinwu Hu,Xiuli Bi,Bin Xiao
Abstract:
Few-shot semantic segmentation has considerable potential for low-data scenarios, especially for medical images that require expert-level dense annotations. Existing few-shot medical image segmentation methods strive to deal with the task by means of prototype learning. However, this scheme relies on support prototypes to guide the segmentation of query images, ignoring the rich anatomical prior knowledge in medical images, which hinders effective feature enhancement for medical images. In this paper, we propose an anatomical prior guided spatial contrastive learning, called APSCL, which exploits anatomical prior knowledge derived from medical images to construct contrastive learning from a spatial perspective for few-shot medical image segmentation. The new framework forces the model to learn the features in line with the embedded anatomical representations. Besides, to fully exploit the guidance information of the support samples, we design a mutual guidance decoder to predict the label of each pixel in the query image. Furthermore, our APSCL can be trained end-to-end in the form of episodic training. Comprehensive experiments on three challenging medical image datasets, i.e., CHAOS-T2, MS-CMRSeg, and Synapse, prove that our method significantly surpasses state-of-the-art few-shot medical segmentation methods, with a mean improvement of 3.61%, 2.30%, and 6.38% on the Dice score, respectively.



Paperid:955 Poster
Authors:Dongshuo Yin,Xueting Han,Bin Li,Hao Feng,Jing Bai
Abstract:
Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters. Despite their success, the existing PETL methods in CV can be computationally expensive and require large amounts of memory and time cost during training, which limits low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter (E3VA) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters which eliminates the need for expensive backpropagation through the frozen pre-trained model, resulting in substantial savings of training memory and training time. Furthermore, we optimise the E3VA structure for CV tasks to promote model performance. Extensive experiments on COCO, ADE20K, and Pascal VOC benchmarks show that E3VA can save up to 62.2% training memory and 26.2% training time on average, while achieving comparable performance to full fine-tuning and better performance than most PETL methods. Note that we can even train the Swin-Large-based Cascade Mask RCNN on GTX 1080Ti GPUs with less than 1.5% trainable parameters. We will release the code in the future.
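A minimal sketch of the idea behind a low-rank adapter whose gradients bypass the frozen backbone is given below; the parallel placement, rank, and zero initialization are assumptions, not the E3VA implementation.

```python
# Hedged sketch: frozen layer evaluated without a graph, so gradients reach the
# trainable parameters only through the cheap low-rank adapter path.
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity-preserving residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class FrozenLayerWithAdapter(nn.Module):
    def __init__(self, frozen_layer: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.frozen_layer = frozen_layer.eval()
        for p in self.frozen_layer.parameters():
            p.requires_grad_(False)
        self.adapter = LowRankAdapter(dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                          # no autograd graph through the big model
            frozen_out = self.frozen_layer(x)
        return frozen_out.detach() + self.adapter(x)   # gradients flow via the adapter path only


if __name__ == "__main__":
    layer = FrozenLayerWithAdapter(nn.Linear(256, 256), dim=256)
    layer(torch.randn(4, 256)).sum().backward()        # only the adapter receives gradients
```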



Paperid:956 Poster
Authors:Haoning Wu,Xiele Wu,Chunyi Li,Zicheng Zhang,Chaofeng Chen,Xiaohong Liu,Guangtao Zhai,Weisi Lin
Abstract:
Text-to-image (T2I) generation is a pivotal and core interest within the realm of AI content generation. Amid the swift advancements of both open-source (such as Stable Diffusion) and proprietary (for example, DALLE, MidJourney) T2I models, there is a notable absence of a comprehensive and robust quantitative framework for evaluating their output quality. Traditional methods of quality assessment overlook the textual prompts when judging images; meanwhile, the advent of large multi-modal models (LMMs) introduces the capability to incorporate text prompts in evaluations, yet the challenge of fine-tuning these models for precise T2I quality assessment remains unresolved. In our study, we introduce the T2I-Scorer, a novel two-stage training methodology aimed at fine-tuning LMMs for T2I evaluation. For the first stage, we collect 397K GPT-4V-labeled question-answer pairs related to T2I evaluation. Termed as T2I-ITD, the pseudo-labeled dataset is analyzed and examined by humans, and used for instruction tuning to improve the LMM's low-level quality perception. The first-stage model, T2I-Scorer-IT, achieves higher accuracy on T2I evaluation than all kinds of existing T2I metrics under zero-shot settings. For the second stage, we define an explicit multi-task training scheme to further align the LMM with human opinion scores, and the fine-tuned T2I-Scorer can reach state-of-the-art accuracy on both image quality and image-text alignment perspectives with significant improvements. We anticipate that the proposed metrics can serve as reliable measures to gauge the ability of T2I generation models in the future. We will make code, data, and weights publicly available.



Paperid:957 Poster
Authors:Xiao Teng,Xingyu Shen,Kele Xu,Long Lan
Abstract:
Unsupervised visible-infrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to significant modality discrepancy and lack of annotations. Many existing approaches utilize variants of bipartite graph global matching algorithms to address this issue, aiming to establish cross-modality correspondences. However, these methods may encounter mismatches due to significant modality gaps and limited model representation. To mitigate this, we propose a simple yet effective framework for USL-VI-ReID, which gradually establishes associations between different modalities. To measure the confidence whether samples from different modalities belong to the same identity, we introduce a bidirectional-consistency criterion, which not only considers direct relationships between samples from different modalities but also incorporates potential hard negative samples from the same modality. Additionally, we propose a cross-modality correlation preserving module to enhance the semantic representation of the model by maintaining consistency in correlations across modalities. Extensive experiments conducted on the public SYSU-MM01 and RegDB datasets demonstrate the superiority of our method over existing USL-VI-ReID approaches across various settings, despite its simplicity. Our code will be released.
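One plausible reading of the bidirectional-consistency criterion is a mutual-nearest-neighbour check across modalities, sketched below with NumPy; the paper's actual scoring additionally involves same-modality hard negatives, which are omitted here.

```python
# Hedged sketch: keep only cross-modal pairs that are each other's nearest neighbour.
import numpy as np


def bidirectional_consistent_pairs(vis_feats: np.ndarray, ir_feats: np.ndarray):
    """Return (i, j) index pairs that are mutual nearest cross-modal neighbours."""
    vis = vis_feats / np.linalg.norm(vis_feats, axis=1, keepdims=True)
    ir = ir_feats / np.linalg.norm(ir_feats, axis=1, keepdims=True)
    sim = vis @ ir.T                       # cosine similarity, shape (Nv, Ni)
    vis_to_ir = sim.argmax(axis=1)         # best infrared match per visible sample
    ir_to_vis = sim.argmax(axis=0)         # best visible match per infrared sample
    return [(i, j) for i, j in enumerate(vis_to_ir) if ir_to_vis[j] == i]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(bidirectional_consistent_pairs(rng.normal(size=(5, 16)),
                                         rng.normal(size=(6, 16))))
```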



Paperid:958 Poster
Authors:Panjun Liu,Jiacheng Li,Lizhi Wang,Zheng-Jun Zha,Zhiwei Xiong
Abstract:
The advent of High Dynamic Range/Wide Color Gamut (HDR/WCG) display technology has made significant progress in providing exceptional richness and vibrancy for the human visual experience. However, the widespread adoption of HDR/WCG images is hindered by their substantial storage requirements, imposing significant bandwidth challenges during distribution. Besides, HDR/WCG images are often tone-mapped into Standard Dynamic Range (SDR) versions for compatibility, necessitating the usage of inverse Tone Mapping (iTM) techniques to reconstruct their original representation. In this work, we propose a meta-transfer learning framework for practical HDR/WCG media transmission by embedding image-wise metadata into their SDR counterparts for later iTM reconstruction. Specifically, we devise a meta-learning strategy to pre-train a lightweight multilayer perceptron (MLP) model that maps SDR pixels to HDR/WCG ones on an external dataset, resulting in a domain-wise iTM model. Subsequently, for the transfer learning process of each HDR/WCG image, we present a spatial-aware online mining mechanism to select challenging training pairs to adapt the meta-trained model to an image-wise iTM model. Finally, the adapted MLP, embedded as metadata, is transmitted alongside the SDR image, facilitating the reconstruction of the original image on HDR/WCG displays. We conduct extensive experiments and evaluate the proposed framework with diverse metrics. Compared with existing solutions, our framework shows superior performance in fidelity (up to 3dB gain in perceptual-uniform PSNR), minimal latency (1.2s for adaptation and 2ms for reconstruction of a 4K image), and negligible overhead (40KB).
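The metadata idea can be illustrated with a tiny per-image MLP that maps SDR pixels to HDR/WCG pixels and whose fine-tuned weights travel alongside the SDR image; the network size, loss, and random pixel sampling below are assumptions, and the meta-training and spatial-aware online mining steps are not shown.

```python
# Hedged sketch: per-image adaptation of a lightweight SDR->HDR pixel mapping.
import torch
import torch.nn as nn


class PixelITM(nn.Module):
    """Lightweight SDR(RGB) -> HDR(RGB) per-pixel mapping."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, sdr_pixels: torch.Tensor) -> torch.Tensor:
        return self.net(sdr_pixels)


def adapt_to_image(model: PixelITM, sdr: torch.Tensor, hdr: torch.Tensor, steps: int = 200):
    """Fine-tune the (meta-trained) MLP on one SDR/HDR pair; its weights become the metadata."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    pixels_sdr, pixels_hdr = sdr.reshape(-1, 3), hdr.reshape(-1, 3)
    for _ in range(steps):
        idx = torch.randint(0, pixels_sdr.shape[0], (4096,))   # random mini-batch of pixels
        loss = nn.functional.l1_loss(model(pixels_sdr[idx]), pixels_hdr[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.state_dict()   # a few tens of KB shipped alongside the SDR image


if __name__ == "__main__":
    meta = adapt_to_image(PixelITM(), torch.rand(64, 64, 3), torch.rand(64, 64, 3), steps=10)
    print(sum(v.numel() for v in meta.values()), "parameters embedded as metadata")
```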



Paperid:959 Poster
Authors:Jian-Yu Jiang-Lin,Kang-Yang Huang,Ling Lo,Yi-Ning Huang,Terence Lin,Jhih-Ciang Wu,Hong-Han Shuai,Wen-Huang Cheng
Abstract:
Diffusion models revolutionize image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions (HOIs), especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module to delicately refine the output image for more precise HOI generation. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated images while efficiently reducing computational requirements. We conduct comprehensive experiments on three benchmarks to demonstrate the significant progress in solving text-to-image generation tasks, showcasing ReCorD's ability to render complex interactions accurately by outperforming existing methods in HOI classification score, as well as FID and Verb CLIP-Score.




Paperid:960 Poster
Authors:Jiahao Cui,Wei Jiang,Zhan Peng,Zhiyu Pan,Zhiguo Cao
Abstract:
High dynamic range (HDR) video rendering from low dynamic range (LDR) videos whose frames are captured with alternating exposures faces significant challenges, because exposure information changes over time and is partially absent at each time stamp. This exposure change and absence cause existing methods to generate flickering HDR results. In this paper, we propose a novel paradigm to render HDR frames by completing the absent exposure information, so that the exposure information is complete and consistent. Our approach involves interpolating neighboring LDR frames in the time dimension to reconstruct LDR frames for the absent exposures. Combining the interpolated and given LDR frames, the complete set of exposure information is available at each time stamp. This benefits the fusion process for HDR results, reducing noise and ghosting artifacts and therefore improving temporal consistency. Extensive experimental evaluations on standard benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting the importance of completing absent exposures in HDR video rendering. The code will be made publicly available upon the acceptance of this paper.



Paperid:961 Poster
Authors:Shengyang Sun,Jiashen Hua,Junyi Feng,Dongxu Wei,Baisheng Lai,Xiaojin Gong
Abstract:
Video anomaly detection has garnered widespread attention in industry and academia in recent years due to its significant role in public security. However, many existing methods overlook the influence of scenes on anomaly detection. These methods simply label the occurrence of certain actions or objects as anomalous. In reality, scene context plays a crucial role in determining anomalies. For example, running on a highway is anomalous, while running on a playground is normal. Therefore, understanding the scene is essential for effective anomaly detection. In this work, we aim to address the challenge of scene-dependent weakly supervised video anomaly detection by decoupling scenes. Specifically, we propose a novel text-driven scene-decoupled (TDSD) framework, consisting of a TDSD module (TDSDM) and fine-grained visual augmentation (FVA) modules. The scene-decoupled module extracts semantic information from scenes, while the FVA module assists in fine-grained visual enhancement. We validate the effectiveness of our approach by constructing two scene-dependent datasets and achieve state-of-the-art results on scene-agnostic datasets as well. Code is available at https://github.com/shengyangsun/TDSD.



Paperid:962 Poster
Authors:Zheng Han,Xiaobin Zhu,Chun Yang,Hongyang Zhou,Jingyan Qin,Xu-Cheng Yin
Abstract:
Existing few-shot learning methods generally focus on designing exquisite structures of meta-learners for learning task-specific prior to improve the discriminative ability of global embeddings. However, they often ignore the importance of learning stability in meta-training, making it difficult to obtain a relatively optimal model. From this key observation, we propose an innovative generic differentiable Reinforcement Learning (RL) strategy for few-shot classification. It aims to explore stable meta-optimization patterns in meta-training by learning generalizable optimizations for producing task-adaptive embeddings. Accordingly, our differentiable RL strategy models the embedding procedure of feature transformation layers in meta-learner to optimize the gradient flow implicitly. Also, we propose a memory module to associate historical and current task states and actions for exploring inter-task similarity. Notably, our RL-based strategy can be easily extended to various backbones. In addition, we propose a novel task state encoder to encode task representation, which fully explores inner-task similarities between support set and query set. Extensive experiments verify that our approach can improve the performance of different backbones and achieve promising results against state-of-the-art methods in few-shot classification. Our code is available at an anonymous site: https://anonymous.4open.science/r/db8f0c012/.



Paperid:963 Poster
Authors:Wulin Xie,Xiaohuan Lu,Yadong Liu,Jiang Long,Bob Zhang,Shuping Zhao,Jie Wen
Abstract:
Multi-view multi-label classification has recently received extensive attention due to its wide-ranging applications across various fields, such as medical imaging and bioinformatics. However, views and labels are usually incomplete in practical scenarios, attributed to the uncertainties in data collection and manual labeling. To cope with this issue, we propose an uncertainty-aware pseudo-labeling and dual graph driven network (UPDGD-Net), which can fully leverage the supervised information of the available labels and feature information of available views. Different from the existing works, we leverage the label matrix to impose dual graph constraints on the embedded features of both view-level and label-level, which enables the method to maintain the inherent structure of the real data during the feature extraction stage. Furthermore, our network incorporates an uncertainty-aware pseudo-labeling strategy to fill the missing labels, which not only addresses the learning issue of incomplete multi-labels but also enables the method to explore more supervised information to guide the network training. Extensive experiments on five datasets demonstrate that our method outperforms other state-of-the-art methods.
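One way to read the dual graph constraints is as a Laplacian-style regularizer that pulls together embeddings of samples with similar label vectors, as in the hedged sketch below; the edge weighting and the handling of missing labels are assumptions, not UPDGD-Net's exact formulation.

```python
# Hedged sketch: label-matrix-derived graph regularizer on embedded features.
import torch


def label_graph_regularizer(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """features: (N, D) embedded features of one view.
    labels:   (N, C) binary multi-label matrix (missing labels treated as 0 here)."""
    lab = labels.float()
    deg = lab.sum(dim=1).clamp(min=1)                            # labels per sample
    sim = (lab @ lab.T) / torch.sqrt(deg[:, None] * deg[None, :])  # cosine-like edge weights
    dist = torch.cdist(features, features) ** 2                  # pairwise squared distances
    return (sim * dist).sum() / (features.shape[0] ** 2)         # small when similar labels -> close


if __name__ == "__main__":
    f = torch.randn(8, 16, requires_grad=True)
    y = (torch.rand(8, 5) > 0.5).long()
    loss = label_graph_regularizer(f, y)
    loss.backward()
    print(float(loss))
```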



Paperid:964 Poster
Authors:Zening Lin,Jiapeng Wang,Teng Li,Wenhui Liao,DAYI HUANG,Longfei Xiong,Lianwen Jin
Abstract:
Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, simply concatenating SER and RE serially can lead to severe error propagation, and it fails to handle cases like multi-line entities in real scenarios. To address these issues, this paper introduces a novel framework, PEneo (Pair Extraction new decoder option), which performs document pair extraction in a unified pipeline, incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking. This approach alleviates the error accumulation problem and can handle the case of multi-line entities. Furthermore, to better evaluate the model's performance and to facilitate future research on pair extraction, we introduce RFUND, a re-annotated version of the commonly used FUNSD and XFUND datasets, to make them more accurate and cover realistic situations. Experiments on various benchmarks demonstrate PEneo's superiority over previous pipelines, boosting the performance by a large margin (e.g., 19.89%-22.91% F1 score on RFUND-EN) when combined with various backbones like LiLT and LayoutLMv3, showing its effectiveness and generality. Codes and the new annotations will be open to the public.



Paperid:965 Poster
Authors:Yachun Mi,Yan Shu,Yu Li,Chen Hui,Puchao Zhou,Shaohui Liu
Abstract:
Video Quality Assessment (VQA) aims to simulate the process of perceiving video quality by the human visual system (HVS). Although subjective studies have shown that the judgments of HVS are strongly influenced by human feelings, it remains unclear how video content relates to human feelings. The recent rapid development of Multimodal Large Language Models (MLLMs) has established a solid link between language and vision, and human feelings can be accurately described by language, which means that MLLMs can extract information related to human feelings from visual content with linguistic prompts. In this paper, we propose CLiF-VQA, which innovatively utilizes the visual linguistic capabilities of MLLMs to introduce human feelings features based on traditional spatio-temporal features to more accurately simulate the perceptual process of HVS. In order to efficiently extract features related to human feelings from videos, we pioneer the exploration of the consistency between Contrastive Language-Image Pre-training (CLIP) and human feelings in video perception. In addition, we design effective prompts, i.e., a variety of objective and subjective descriptions closely related to human feelings. Extensive experiments show that the proposed CLiF-VQA exhibits excellent performance on several VQA datasets. The results show that introducing human feelings features on top of spatio-temporal features is an effective way to obtain better performance.
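A hedged sketch of prompt-based feeling scoring follows: frames and feeling prompts are compared by cosine similarity in a shared vision-language embedding space (e.g., CLIP) and averaged over frames. The prompt wording, the aggregation, and the assumption that embeddings are precomputed are all illustrative choices; the fusion with spatio-temporal features is not shown.

```python
# Hedged sketch: score video frames against subjective "feeling" prompts.
import torch

FEELING_PROMPTS = [
    "a video that feels pleasant and comfortable to watch",
    "a video that feels annoying and unpleasant to watch",
]


def feeling_scores(frame_embs: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
    """frame_embs: (T, D) image embeddings; prompt_embs: (P, D) text embeddings.
    Returns a (P,) vector of per-prompt scores in [0, 1]."""
    f = torch.nn.functional.normalize(frame_embs, dim=1)
    p = torch.nn.functional.normalize(prompt_embs, dim=1)
    sims = f @ p.T                                  # (T, P) frame-prompt similarities
    return sims.softmax(dim=1).mean(dim=0)          # average soft assignment over frames


if __name__ == "__main__":
    scores = feeling_scores(torch.randn(16, 512), torch.randn(len(FEELING_PROMPTS), 512))
    print(dict(zip(FEELING_PROMPTS, scores.tolist())))
```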



Paperid:966 Poster
Authors:Wenxu Shi,Bochuan Zheng
Abstract:
Numerous domain adaptive object detection (DAOD) methods leverage domain adversarial training to align the features to mitigate domain gap, where a feature extractor is trained to fool a domain classifier in order to have aligned feature distributions. The discrimination capability of the domain classifier is easy to fall into the local optimum due to the equilibrium challenge, thus cannot effectively further drive the training of feature extractor. In this work, we propose an efficient optimization strategy called Virtual-label Fooled Domain Discrimination (VFDD), which revitalizes the domain classifier during training using virtual sample labels. Such virtual sample label makes the separable distributions less separable, and thus leads to a more easily confused domain classifier, which in turn further drives feature alignment. Particularly, we introduce a novel concept of virtual label for the unaligned samples and propose the Virtual-$\mathcal{H}$-divergence to overcome the problem of falling into local optimum due to the equilibrium challenge. The proposed VFDD is orthogonal to most existing DAOD methods and can be used as a plug-and-play module to facilitate existing DAOD models. Theoretical insights and experimental analyses demonstrate that VFDD improves many popular baselines and also outperforms the recent unsupervised domain adaptive object detection models.



Paperid:967 Poster
Authors:Guojin Zhong,YIHU GUO,Jin Yuan,Qianjun Zhang,Weili Guan,Long Chen
Abstract:
Exemplar-based image translation has garnered significant interest from researchers due to its broad applications in multimedia/multimodal processing. Existing methods primarily employ Euclidean-based losses to implicitly establish cross-domain correspondences between exemplar and conditional images, aiming to produce high-fidelity images. However, these methods often suffer from two challenges: 1) Insufficient excavation of domain-invariant features leads to low-quality cross-domain correspondences, and 2) Inaccurate correspondences result in errors propagated during the translation process due to a lack of reliable prior guidance. To tackle these issues, we propose a novel prior-guided diffusion model with global-local contrastive learning (PROMOTE), which is trained in a self-supervised manner. Technically, global-local contrastive learning is designed to align two cross-domain images within hyperbolic space and reduce the gap between their semantic correlation distributions using the Fisher-Rao metric, allowing the visual encoders to extract domain-invariant features more effectively. Moreover, a prior-guided diffusion model is developed that propagates the structural prior to all timesteps in the diffusion process. It is optimized by a novel prior denoising loss, mathematically derived from the transitions modified by prior information in a self-supervised manner, successfully alleviating the impact of inaccurate correspondences on image translation. Extensive experiments conducted across seven datasets demonstrate that our proposed PROMOTE significantly exceeds state-of-the-art performance in diverse exemplar-based image translation tasks.



Paperid:968 Poster
Authors:Yixin Guo,Yu Liu,Jianghao Li,Weimin Wang,Qi Jia
Abstract:
A zero-shot human-object interaction (HOI) detector is capable of generalizing to HOI categories not encountered during training. Inspired by the impressive zero-shot capabilities offered by CLIP, latest methods strive to leverage CLIP embeddings for improving zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, inevitably resulting in seen-unseen confusion of the model during testing. Besides, we find that using prompt-tuning and adapters further increases the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation instead of feature extraction only. To achieve it, we develop a CLIP-injected feature generator in accordance with the generation of human, object and union features. Then, we extract realistic features of seen samples and mix them with synthetic features together, allowing the model to train seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch, and a multi-knowledge prototype bank in an image-wise HOI recognition branch, respectively. Extensive experiments on HICO-DET benchmark demonstrate our HOIGen achieves superior performance for both seen and unseen classes under various zero-shot settings, compared with other top-performing methods.



Paperid:969 Poster
Authors:Shiyu Tang,Zhaofan Luo,Yifan Wang,Lijun Wang,Huchuan Lu,Weibo Su,Libo Liu
Abstract:
Existing open-vocabulary object detectors require an accurate and compact vocabulary pre-defined during inference. Their performance is largely degraded in real scenarios where the underlying vocabulary may be indeterminate and often exponentially large. To have a more comprehensive understanding of this phenomenon, we propose a new setting called Large-and-Open Vocabulary object Detection, which simulates real scenarios by testing detectors with large vocabularies containing thousands of unseen categories. The vast unseen categories inevitably lead to an increase in category distractors, severely impeding the recognition process and leading to unsatisfactory detection results. To address this challenge, we propose a Large and Open Vocabulary Detector (LOVD) with two core components, termed the Image-to-Region Filtering (IRF) module and Cross-View Verification (CV$^2$) scheme. To relieve the category distractors of the given large vocabularies, IRF performs image-level recognition to build a compact vocabulary relevant to the image scene out of the large input vocabulary, followed by region-level classification upon the compact vocabulary. CV$^2$ further enhances the IRF by conducting image-to-region filtering in both global and local views and produces the final detection categories through a multi-branch voting mechanism. Compared to the prior works, our LOVD is more scalable and robust to large input vocabularies, and can be seamlessly integrated with predominant detection methods to improve their open-vocabulary performance. Source code will be made publicly available.
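The image-to-region filtering step can be sketched as an image-level top-k pruning of the large vocabulary followed by region classification against the compact subset; the embedding sources and the top-k rule are assumptions, and the cross-view verification and voting are omitted.

```python
# Hedged sketch: prune a huge vocabulary with an image-level score, then classify regions.
import torch


def filter_vocabulary(image_emb: torch.Tensor, text_embs: torch.Tensor, keep: int = 50):
    """Return indices of the `keep` classes most similar to the whole image."""
    sims = torch.nn.functional.cosine_similarity(image_emb[None, :], text_embs, dim=1)
    return sims.topk(keep).indices


def classify_regions(region_embs: torch.Tensor, text_embs: torch.Tensor, kept_idx: torch.Tensor):
    """Classify each region proposal only against the compact, scene-relevant vocabulary."""
    compact = torch.nn.functional.normalize(text_embs[kept_idx], dim=1)
    regions = torch.nn.functional.normalize(region_embs, dim=1)
    scores = regions @ compact.T
    return kept_idx[scores.argmax(dim=1)]           # map back to original class ids


if __name__ == "__main__":
    vocab = torch.randn(5000, 512)                  # e.g. thousands of category name embeddings
    image, regions = torch.randn(512), torch.randn(10, 512)
    kept = filter_vocabulary(image, vocab, keep=50)
    print(classify_regions(regions, vocab, kept))
```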



Paperid:970 Poster
Authors:Mingkai Lin,Wenzhong Li,Xiaobin Hong,Sanglu Lu
Abstract:
Graph Neural Networks (GNNs) have been shown as powerful tools in various scenarios, such as multimodal and multimedia. A fundamental approach, pre-training on available graphs and subsequently transferring the acquired knowledge to optimize downstream tasks with limited labels, was widely exploited to mitigate the demand for extensive labeled training data. However, previous works commonly assumed that pre-training and fine-tuning occur in the same or closely related domains that share similar feature/label spaces and graph distributions. A limitation is that for each individual graph without accessible pre-training data, a GNN must be trained from scratch, imposing high training overhead and hindering the ability of generalization. In this paper, we address the GNN multi-domain pre-training problem, which intends to pre-train a transferable GNN model from heterogeneous multi-source graph domains and then apply it in an unseen one with minor fine-tuning costs. To this end, we propose a scaLAble Multi-source Pre-training (LAMP) method. For pre-training, LAMP presents a graph dual-distillation approach to distill massive knowledge from various graph domains to form synthetic homogeneous graphs. Simultaneously, high-level meta-knowledge from the synthetic graphs is extracted to train the GNN model, whose capability can be adjusted according to target graph contexts through a co-training modulation architecture. For fine-tuning, LAMP respectively aligns the target graph distribution, graph context, and graph task with the pretext so that the downstream task in the unseen domain can be reshaped to leverage the transferable knowledge efficiently. Extensive experiments on four real-world graph domain datasets demonstrate the superiority of LAMP, showcasing notable improvements in various downstream graph learning tasks. Our codes are publicly available on GitHub.



Paperid:971 Poster
Authors:Yuxuan Lu,Jiahao Nie,Zhiwei He,Hongjie Gu,Xudong Lv
Abstract:
Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation network. Despite demonstrated success, such networks suffer from some fundamental problems: 1) It contains pooling operation to cope with inherently disordered point clouds, hindering the capture of 3D spatial information that is useful for tracking, a regression task. 2) The adopted set abstraction operation hardly handles density-inconsistent point clouds, also preventing 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with cross-iterative feature fusion module to further explore fine-grained 3D spatial information for tracking. Benefiting from accurate 3D spatial information being modeled, our VoxelTrack simplifies tracking pipeline with a single regression loss. Extensive experiments are conducted on three widely-adopted datasets including KITTI, NuScenes and Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4% and 63.6% mean precision on the three datasets, respectively), and outperforms the existing trackers with a real-time speed of 36 Fps on a single TITAN RTX GPU. The source code and model will be released.
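The voxelization step that replaces point-based set abstraction can be illustrated as follows: points are binned into a grid of occupied cells whose features are averaged, producing the sparse input a sparse-convolution backbone would consume. The voxel size and mean pooling are assumptions, not VoxelTrack's exact configuration.

```python
# Hedged sketch: voxelize an unordered point cloud into occupied cells with mean features.
import numpy as np


def voxelize(points: np.ndarray, voxel_size: float = 0.1):
    """points: (N, 3+) array of xyz (+ extra features). Returns voxel coords and mean features."""
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    feats = np.zeros((uniq.shape[0], points.shape[1]))
    counts = np.bincount(inverse, minlength=uniq.shape[0])[:, None]
    np.add.at(feats, inverse, points)                # unbuffered scatter-add per voxel
    return uniq, feats / counts                      # integer voxel indices, averaged features


if __name__ == "__main__":
    pts = np.random.rand(1000, 4)                    # xyz + intensity
    vox_coords, vox_feats = voxelize(pts, voxel_size=0.2)
    print(vox_coords.shape, vox_feats.shape)
```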



Paperid:972 Poster
Authors:Guoqing Zhu,Honghu Pan,Qiang Wang,Chao Tian,Chao Yang,Zhenyu He
Abstract:
In challenging low-light and adverse weather conditions, thermal vision algorithms, especially object detection, have exhibited remarkable potential, contrasting with the frequent struggles encountered by visible vision algorithms. Nevertheless, the efficacy of thermal vision algorithms driven by deep learning models remains constrained by the paucity of available training data samples. To this end, this paper introduces a novel approach termed the edge-guided conditional diffusion model (ECDM). This framework aims to produce meticulously aligned pseudo thermal images at the pixel level, leveraging edge information extracted from visible images. By utilizing edges as contextual cues from the visible domain, the diffusion model achieves meticulous control over the delineation of objects within the generated images. To alleviate the impacts of those visible-specific edge information that should not appear in the thermal domain, a two-stage modality adversarial training (TMAT) strategy is proposed to filter them out from the generated images by differentiating the visible and thermal modality. Extensive experiments on LLVIP demonstrate ECDM’s superiority over existing state-of-the-art approaches in terms of image generation quality. The pseudo thermal images generated by ECDM also help to boost the performance of various thermal object detectors by up to 7.1 mAP.



Paperid:973 Poster
Authors:Jianzhi Lu,Ruian He,Shili Zhou,Weimin Tan,Bo Yan
Abstract:
Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of it. However, the scarcity of datasets and a modern baseline hinders the progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, excelling in accurately estimating and decomposing facial flow into head and expression components. Comprehensive experiments demonstrate that FFN significantly enhances the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in Endpoint Error (EPE) (from 3.91 to 3.48). Moreover, DecFlow, when coupled with FFN, outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis. The decomposed expression flow achieves a substantial accuracy improvement of 18% (from 69.1% to 82.1%) in micro-expression recognition. These contributions represent a significant advancement in facial motion analysis and optical flow estimation. Codes and datasets will be available to the public.



Paperid:974 Poster
Authors:Longfei Lu,Huachen Gao,Tao Dai,Yaohua Zha,Zhi Hou,Junta Wu,Shu-Tao Xia
Abstract:
Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map a 2D image to 3D Gaussian parameters, yet regressing a 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model for image-to-3D generation, which takes as input the initial point cloud produced by a large 3D diffusion model conditioned on the 2D image and generates the Gaussian parameters. The point cloud provides an initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D generation. Moreover, we present the Attention mechanism, Projection mechanism, and Point feature extractor, dubbed the APP block, for fusing the image features with point cloud features. The qualitative and quantitative experiments extensively demonstrate the effectiveness of the proposed approach on the GSO and Objaverse datasets, and show that the proposed method achieves state-of-the-art performance.



Paperid:975 Poster
Authors:Hongze Zhu,Guoyang Xie,Chengbin Hou,Tao Dai,Can GAO,Jinbao Wang,Linlin Shen
Abstract:
High-resolution point clouds (HRPCD) anomaly detection (AD) plays a critical role in precision machining and high-end equipment manufacturing. Although considerable 3D-AD methods have been proposed recently, they still cannot meet the requirements of the HRPCD-AD task. There are several challenges: i) It is difficult to directly capture HRPCD information due to large amounts of points at the sample level; ii) The advanced transformer-based methods usually obtain anisotropic features, leading to degradation of the representation; iii) The proportion of abnormal areas is very small, which makes it difficult to characterize. To address these challenges, we propose a novel group-level feature-based network, called Group3AD, which has a significantly efficient representation ability. First, we design an Intercluster Uniformity Network (IUN) to present the mapping of different groups in the feature space as several clusters, and obtain a more uniform distribution between clusters representing different parts of the point clouds in the feature space. Then, an Intracluster Alignment Network (IAN) is designed to encourage groups within the cluster to be distributed tightly in the feature space. In addition, we propose an Adaptive Group-Center Selection (AGCS) based on geometric information to improve the pixel density of potential anomalous regions during inference. The experimental results verify the effectiveness of our proposed Group3AD, which surpasses Reg3D-AD by the margin of 5% in terms of object-level AUROC on Real3D-AD.



Paperid:976 Poster
Authors:Kaijiang Li,Hao Li,Haining Li,Peisen Wang,Chunyi Guo,Wenfeng Jiang
Abstract:
Researchers have applied 3D Lookup Tables (LUTs) in cameras, offering new possibilities for enhancing image quality and achieving various tonal effects. However, these approaches often overlook the non-uniformity of color distribution in the original images, which limits the performance of learnable LUTs. To address this issue, we introduce a lightweight end-to-end image enhancement method called Simulated Infrared Fusion Guided Image-adaptive 3D Lookup Tables (SIRLUT). SIRLUT enhances the adaptability of 3D LUTs by reorganizing the color distribution of images through the integration of simulated infrared imagery. Specifically, SIRLUT consists of an efficient Simulated Infrared Fusion (SIF) module and a Simulated Infrared Guided (SIG) refinement module. The SIF module leverages a cross-modal channel attention mechanism to perceive global information and generate dynamic 3D LUTs, while the SIG refinement module blends simulated infrared images to match image consistency features from both structural and color aspects, achieving local feature fusion. Experimental results demonstrate that SIRLUT outperforms state-of-the-art methods on different tasks by up to 0.88–2.25 dB while reducing the number of parameters. Code is available at https://github.com/riversky2025/SIRLUT.
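Applying a 3D LUT to an image with trilinear interpolation is the core lookup that such methods learn to adapt; a minimal PyTorch sketch is below. The LUT resolution is an assumption, and how SIRLUT predicts and fuses LUT weights from the simulated infrared branch is not shown.

```python
# Hedged sketch: trilinear 3D LUT lookup via grid_sample on a 5D volume.
import torch
import torch.nn.functional as F


def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W) in [0, 1]; lut: (3, S, S, S) indexed as (out_channel, R, G, B)."""
    b, _, h, w = image.shape
    rgb = image.permute(0, 2, 3, 1)                       # (B, H, W, 3) in (R, G, B) order
    grid = rgb[..., [2, 1, 0]] * 2.0 - 1.0                # grid_sample wants (x, y, z) = (B, G, R)
    grid = grid.view(b, 1, h, w, 3)                       # (B, D_out=1, H_out, W_out, 3)
    volume = lut.unsqueeze(0).expand(b, -1, -1, -1, -1)   # (B, 3, S, S, S)
    out = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)  # trilinear lookup
    return out.view(b, 3, h, w)


if __name__ == "__main__":
    s = 17
    axis = torch.linspace(0, 1, s)
    # Identity LUT: the output (R, G, B) equals the input coordinates.
    identity_lut = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=0)
    img = torch.rand(1, 3, 64, 64)
    print(torch.allclose(apply_3d_lut(img, identity_lut), img, atol=1e-4))
```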



Paperid:977 Poster
Authors:Jingfan Tan,Hyunhee Park,Ying Zhang,Tao Wang,Kaihao Zhang,Xiangyu Kong,Pengwen Dai,Zikun Liu,Wenhan Luo
Abstract:
Within the domain of blind face restoration (BFR), approaches lacking facial priors frequently result in excessively smoothed visual outputs. Existing BFR methods predominantly utilize generative facial priors to achieve realistic and authentic details. However, these methods, primarily designed for images, encounter challenges in maintaining temporal consistency when applied to face video restoration. To tackle this issue, we introduce StableBFVR, an innovative Blind Face Video Restoration method based on Stable Diffusion that incorporates temporal information into the generative prior. This is achieved through the introduction of temporal layers in the diffusion process. These temporal layers consider both long-term and short-term information aggregation. Moreover, to improve generalizability, BFR methods employ complex, large-scale degradation during training, but this often sacrifices accuracy. Addressing this, StableBFVR features a novel mixed-degradation-aware prompt module, capable of encoding specific degradation information to dynamically steer the restoration process. Comprehensive experiments demonstrate that our proposed StableBFVR outperforms state-of-the-art methods.



Paperid:978 Poster
Authors:Fangdi Wang,Siwei Wang,Jiaqi Jin,Zhibin Dong,Xihong Yang,Yu Feng,Xinzhong Zhu,Tianrui Liu,Xinwang Liu,En Zhu
Abstract:
Multi-view clustering, a pivotal technology in multimedia research, aims to leverage complementary information from diverse perspectives to enhance clustering performance. The current multi-view clustering methods normally enforce the reduction of distances between any pair of views, overlooking the heterogeneity between views, thereby sacrificing the diverse and valuable insights inherent in multi-view data. In this paper, we propose a Tree-Based View-Gap Maintaining Multi-View Clustering (TGM-MVC) method. Our approach introduces a novel conceptualization of multiple views as a graph structure. In this structure, each view corresponds to a node, with the view gap, calculated by the cosine distance between views, acting as the edge. Through graph pruning, we derive the minimum spanning tree of the views, reflecting the neighbouring relationships among them. Specifically, we apply a share-specific learning framework and generate view trees for both view-shared and view-specific information. Concerning shared information, we only narrow the distance between adjacent views, while for specific information, we maintain the view gap between neighboring views. Theoretical analysis highlights the risks of eliminating the view gap, and comprehensive experiments validate the efficacy of our proposed TGM-MVC method.
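The view-tree construction can be sketched directly with SciPy: compute pairwise cosine distances between pooled view representations and prune the resulting graph to a minimum spanning tree. How the view representations are obtained and how the gaps are then maintained during training are assumptions not covered here.

```python
# Hedged sketch: view-gap graph from cosine distances, pruned to a minimum spanning tree.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def view_spanning_tree(view_reprs: np.ndarray):
    """view_reprs: (V, D), one pooled representation per view. Returns MST edges (i, j, gap)."""
    dist = squareform(pdist(view_reprs, metric="cosine"))   # view-gap matrix, positive off-diagonal
    mst = minimum_spanning_tree(dist).toarray()
    return [(i, j, mst[i, j])
            for i in range(len(dist)) for j in range(len(dist)) if mst[i, j] > 0]


if __name__ == "__main__":
    reprs = np.random.rand(5, 32)                           # e.g. 5 views
    for i, j, gap in view_spanning_tree(reprs):
        print(f"view {i} -- view {j}: gap {gap:.3f}")
```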



Paperid:979 Poster
Authors:Yifeng Xie,Zhihong Zhu,Xin Chen,Zhanpeng Chen,Zhiqi Huang
Abstract:
In the field of multi-modal learning, model parameters are typically large, necessitating the use of parameter-efficient fine-tuning (PEFT) techniques. These methods have been pivotal in enhancing training efficiency for downstream tasks in almost all situations. However, directly applying PEFT methods struggles to fully address the intricate demands of multi-modal tasks, such as multi-modal sarcasm detection (MSD), which demands the extraction and comparison of cues from different modalities. MSD, particularly when reliant on textual and visual modalities, faces challenges in identifying sarcasm's incongruity. This issue often arises from the lack of intermodality interaction during tuning, resulting in a disconnect between textual and visual information. In this paper, we introduce a novel approach called Bi-directional Adapter (BA), designated as MoBA. This approach is designed to minimize training parameters while enhancing the model's ability to interpret sarcasm across modalities. By facilitating an exchange between textual and visual information through a low-rank representation, our method adeptly captures the nuances of sarcastic expressions with a reduced number of training parameters. Our empirical studies, carried out on two publicly accessible and emerging datasets, demonstrate that our model substantially improves sarcasm detection accuracy. These findings indicate that our approach provides a more reliable and efficient solution to address the complexities of MSD.



Paperid:980 Poster
Authors:Kaixiang Wang,Xiaojian Ding,Fan Yang
Abstract:
Insufficient labeled training samples pose a critical challenge in multi-label classification, potentially leading to overfitting of the model. This paper delineates a criterion for establishing a common domain among different datasets, whereby datasets sharing analogous object descriptions and label structures are considered part of the same field. Integrating samples from disparate datasets within this shared field for training purposes effectively mitigates overfitting and enhances model accuracy. Motivated by this approach, we introduce a novel method for multi-label classification termed Non-Overlapped Multi-View Weak-Label Learning Guided by Multiple Correlations (NOMWM). Our method strategically amalgamates samples from diverse datasets within the shared field to enrich the training dataset. Furthermore, we project samples from various datasets onto a unified subspace to facilitate learning in a consistent latent space. Additionally, we address the challenge of weak labels stemming from incomplete label overlaps across datasets. Leveraging weak-label indicator matrices and label correlation mining techniques, we effectively mitigate the impact of weak labels. Extensive experimentation on multiple benchmark datasets validates the efficacy of our method, demonstrating clear improvements over existing state-of-the-art approaches.



Paperid:981 Poster
Authors:Tianyi Zheng,Cong Geng,Peng-Tao Jiang,Ben Wan,Hao Zhang,Jinwei Chen,Jia Wang,Bo Li
Abstract:
Diffusion models have garnered significant success in generative tasks, emerging as the predominant model in this domain. Despite their success, the substantial computational resources required for training diffusion models restrict their practical applications. In this paper, we resort to optimal transport theory to accelerate the training of diffusion models, providing an in-depth analysis of the forward diffusion process. It shows that the upper bound on the Wasserstein distance between the distributions at any two timesteps of the diffusion process decreases exponentially from the initial distance as the gap between the timesteps grows. This finding suggests that the state distribution of the diffusion model has a non-uniform rate of change at different points in time, thus highlighting the different importance of the diffusion timestep. To this end, we propose a novel non-uniform timestep sampling method based on the Bernoulli distribution, which favors more frequent sampling in significant timestep intervals. The key idea is to make the model focus on timesteps with larger differences, thus accelerating the training of the diffusion model. Experiments on benchmark datasets reveal that the proposed method significantly reduces the computational overhead while improving the quality of the generated images.
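A hedged sketch of Bernoulli-gated non-uniform timestep sampling is shown below: each training step either draws a timestep uniformly or from an importance distribution concentrated on an assumed "significant" interval. The gate probability and the decay-shaped weights are illustrative choices, not the paper's exact scheme.

```python
# Hedged sketch: mix uniform timestep draws with importance-weighted draws via a Bernoulli gate.
import torch


def sample_timesteps(batch: int, T: int = 1000, p_important: float = 0.7) -> torch.Tensor:
    """Return `batch` timesteps, biased toward an (assumed) important early interval."""
    weights = torch.exp(-torch.arange(T, dtype=torch.float32) / (0.2 * T))  # favour small t
    important = torch.multinomial(weights, batch, replacement=True)
    uniform = torch.randint(0, T, (batch,))
    gate = torch.bernoulli(torch.full((batch,), p_important)).bool()
    return torch.where(gate, important, uniform)


if __name__ == "__main__":
    print(sample_timesteps(8))
```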



Paperid:982 Poster
Authors:Xinyue Zhang,Tingjin Luo,liuyueying,Chenping Hou
Abstract:
Multi-instance multi-label learning (MIML), which deals with objects with complex structures and multiple semantics, plays a crucial role in various fields. In practice, the naturally skewed label distribution and label dependence contribute to the issue of label imbalance in MIML, which is crucial but rarely studied. Most existing MIML methods often produce biased models due to the ignorance of inter-class variations in imbalanced data. To address this issue, we propose a novel imbalanced multi-instance multi-label learning method named IMIMLC, based on the error-correcting coding ensemble and an adaptive threshold strategy. Specifically, we design a feature embedding method to extract the structural information of each object via Fisher vectors and eliminate inexact supervision. Subsequently, to alleviate the disturbance caused by the imbalanced distribution, a novel ensemble model is constructed by concatenating the error-correcting codes of randomly selected subtasks. Meanwhile, IMIMLC trains binary base classifiers on small-scale data blocks partitioned by our codes to enhance their diversity and then learns more reliable results to improve model robustness for the imbalance issue. Furthermore, IMIMLC adaptively learns thresholds for each individual label by margin maximization, preventing inaccurate predictions caused by the semantic discrepancy across many labels and their unbalanced ratios. Finally, extensive experimental results on various datasets validate the effectiveness of IMIMLC against state-of-the-art approaches.



Paperid:983 Poster
Authors:Chang'an Yi,Haotian Chen,Yifan Zhang,Yonghui Xu,Yan Zhou,Lizhen Cui
Abstract:
Test-time adaptation (TTA) aims to adapt a model, initially trained on training data, to test data with potential distribution shifts. Most existing TTA methods focus on classification problems. The pronounced success of classification might lead numerous newcomers and engineers to assume that classic TTA techniques can be directly applied to the more challenging task of semantic segmentation. However, this belief is still an open question. In this paper, we investigate the applicability of existing classic TTA strategies in semantic segmentation. Our comprehensive results have led to three key observations. First, the classic normalization updating strategy only brings slight performance improvement, and in some cases it might even adversely affect the results. Even with the application of advanced distribution estimation techniques like batch renormalization, the problem remains unresolved. Second, although the teacher-student scheme does enhance the training stability for segmentation TTA in the presence of noisy pseudo-labels and temporal correlation, it cannot directly result in performance improvement compared to the original model without TTA under complex data distribution. Third, segmentation TTA suffers a severe long-tailed class-imbalance problem, which is substantially more complex than that in TTA for classification. This long-tailed challenge negatively affects segmentation TTA performance, even when the accuracy of pseudo-labels is high. Besides those observations, we find that visual prompt tuning (VisPT) is promising in segmentation TTA. Further, we propose a novel benchmark named TTAP based on the above findings and VisPT. The outstanding performance of TTAP has also been verified. We hope the community can give more attention to this challenging, yet important, segmentation TTA task in the future. The source code will be publicly available.



Paperid:984 Poster
Authors:Yongsen Zheng,Guohua Wang,Yang Liu,Liang Lin
Abstract:
Diversity plays a crucial role in Recommender Systems (RSs) as it ensures a wide range of recommended items, providing users with access to new and varied options. Without diversity, users often encounter repetitive content, limiting their exposure to novel choices. While significant efforts have been dedicated to enhancing recommendation diversification in static offline scenarios, relatively less attention has been given to online Conversational Recommender Systems (CRSs). However, the lack of recommendation diversity in CRSs will increasingly exacerbate over time due to the dynamic user-system feedback loop, resulting in challenges such as the Matthew effect, filter bubbles, and echo chambers. To address these issues, we propose an innovative end-to-end CRS paradigm called User-Centric Multi-Interest Learning for Conversational Movie Recommendation (CoMoRec), which aims to learn user interests from multiple perspectives to enhance result diversity as users engage in natural language conversations for movie recommendations. Firstly, CoMoRec automatically models various facets of user interests, including context-based, graph-based, and review-based interests, to explore a wide range of user intentions and preferences. Then, it leverages these multi-aspect user interests to accurately predict personalized and diverse movie recommendations and generate fluent and informative responses during conversations. Through extensive experiments conducted on two publicly available CRS-based movie datasets, our proposed CoMoRec achieves a new state-of-the-art performance and outperforms all the compared baselines in terms of improving recommendation diversity in the CRS.



Paperid:985 Poster
Authors:Daheng Yin,Jianxin Shi,Miao Zhang,Zhaowu Huang,Jiangchuan Liu,Fang Dong
Abstract:
Full-scene volumetric video streaming, an emerging technology providing immersive viewing experiences via the Internet, is receiving increasing attention from both the academic and industrial communities. Considering the vast amount of full-scene volumetric data to be streamed and the limited bandwidth on the internet, achieving adaptive full-scene volumetric video streaming over the internet presents a significant challenge. Inspired by the advantages offered by neural fields, especially the feature grid method, we propose FSVFG, a novel full-scene volumetric video streaming system that integrates feature grids as the representation of volumetric content. FSVFG employs an incremental training approach for feature grids and stores the features and residuals between adjacent grids as frames. To support adaptive streaming, we delve into the data structure and rendering processes of feature grids and propose bandwidth adaptation mechanisms. The mechanisms involve a coarse ray-marching for the selection of features and residuals to be sent, and achieve variable bitrate streaming by Level-of-Detail (LoD) and residual filtering. Based on these mechanisms, FSVFG achieves adaptive streaming by adaptively balancing the transmission of feature and residual according to the available bandwidth. Our preliminary results demonstrate the effectiveness of FSVFG, showing its ability to improve visual quality and reduce the bandwidth requirements of full-scene volumetric video streaming.



Paperid:986 Poster
Authors:Xiaoheng Tan,Jiabin Zhang,Yuhui Quan,Jing Li,Yajing Wu,Zilin Bian
Abstract:
Deep Video Quality Assessment (VQA) methods have shown impressive high-performance capabilities. Notably, no-reference (NR) VQA methods play a vital role in situations where obtaining reference videos is restricted or not feasible. Nevertheless, as more streaming videos are being created in ultra-high definition (e.g., 4K) to enrich viewers' experiences, the current deep VQA methods face unacceptable computational costs. Furthermore, the resizing, cropping, and local sampling techniques employed in these methods can compromise the details and content of original 4K videos, thereby negatively impacting quality assessment. In this paper, we propose a highly efficient and novel NR 4K VQA technology. Specifically, first, a novel data sampling and training strategy is proposed to tackle the problem of excessive resolution. This strategy allows the VQA Swin Transformer-based model to effectively train and make inferences using the full data of 4K videos on standard consumer-grade GPUs without compromising content or details. Second, a weighting and scoring scheme is developed to mimic the human subjective perception mode, which is achieved by considering the distinct impact of each sub-region within a 4K frame on the overall perception. Third, we incorporate the frequency domain information of video frames to better capture the details that affect video quality, consequently further improving the model's generalizability. To our knowledge, this is the first technology for the NR 4K VQA task. Thorough empirical studies demonstrate it not only significantly outperforms existing methods on a specialized 4K VQA dataset but also achieves state-of-the-art performance across multiple open-source NR video quality datasets.



Paperid:987 Poster
Authors:Jingjun Yi,Qi Bi,Hao Zheng,Haolan Zhan,Wei Ji,Yawen Huang,Yuexiang Li,Yefeng Zheng
Abstract:
The rapid development of Vision Foundation Models (VFMs) brings superior out-of-domain generalization for a variety of downstream tasks. Among them, domain generalized semantic segmentation (DGSS) holds unique challenges as the cross-domain images share common pixel-wise content information (i.e., semantics) but vary greatly in terms of style (e.g., urban landscape, environment dependencies). How to effectively fine-tune the VFM for DGSS has recently become an open research topic for the vision community. In this paper, we present a novel Spectral-decomposited Tokens (SET) learning framework to push the frontier. Delving further than the existing fine-tuning-token & frozen-backbone paradigm, the proposed SET especially focuses on how to learn style-invariant features from these learnable tokens. Specifically, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space, where the phase / amplitude component reflects more on the content / style, respectively. Then, learnable tokens are adapted to learn the content and style, respectively. As the cross-domain differences mainly rest in the style from the amplitude component, such information is decoupled from the tokens. Consequently, the refined feature maps are more stable in representing the pixel-wise content despite the style variation. Extensive cross-domain experiments under a variety of backbones and VFMs show the state-of-the-art performance. We will make the source code publicly available.
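The spectral split itself is a standard FFT decomposition into amplitude and phase, sketched below; how the learnable tokens consume each component in SET is not shown here.

```python
# Hedged sketch: lossless amplitude/phase decomposition of a feature map via 2D FFT.
import torch


def spectral_decompose(feat: torch.Tensor):
    """feat: (B, C, H, W). Returns (amplitude, phase) of the same spatial size."""
    spec = torch.fft.fft2(feat, norm="ortho")
    return spec.abs(), spec.angle()


def spectral_recompose(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Inverse operation; useful for checking that the decomposition is lossless."""
    spec = torch.polar(amplitude, phase)
    return torch.fft.ifft2(spec, norm="ortho").real


if __name__ == "__main__":
    x = torch.randn(1, 8, 32, 32)
    amp, pha = spectral_decompose(x)
    print(torch.allclose(spectral_recompose(amp, pha), x, atol=1e-5))
```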



Paperid:988 Poster
Authors:Zihan Cao,Xiao Wu,Liang-Jian Deng,Yu Zhong
Abstract:
In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods to explore better ways of fusing them while preserving their respective characteristics. Mamba, as a state space model, has emerged in the field of natural language processing. Recently, many studies have attempted to extend Mamba to vision tasks. However, because the nature of images differs from that of causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, the sequence modeling ability of Mamba only covers spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block can improve local information perception of the network and simultaneously learn local and global spatial information. Furthermore, we propose the state sharing technique to enhance spatial details and integrate spatial and spectral information. Finally, the overall network is a multi-scale structure based on vision Mamba, called LE-Mamba. Extensive experiments show that the proposed methods achieve state-of-the-art results on multispectral pansharpening and multispectral and hyperspectral image fusion datasets, and demonstrate the effectiveness of the proposed approach. Code will be made available.



Paperid:989 Poster
Authors:Shuxun Wang,Yunfei Lei,Ziqi Zhang,Wei Liu,Haowei Liu,Li Yang,Bing Li,Wenjuan Li,Jin Gao,Weiming Hu
Abstract:
With the rise of the "Metaverse" and "Web 3.0", Non-Fungible Tokens (NFTs) have emerged as a pivotal kind of digital asset, garnering significant attention. By the end of March 2024, more than 1.7 billion NFTs had been minted across various blockchain platforms. To effectively locate a desired NFT token, conducting searches within this huge number of NFTs is essential. The challenge in NFT retrieval is heightened by the high degree of similarity among different NFTs in both regional and semantic aspects. In this paper, we introduce a dataset named “NFT Top1000 Visual-Text Dataset” (NFT1000), containing 7.56 million image-text pairs collected from the 1000 most famous PFP NFT collections by sales volume on the Ethereum blockchain. Based on this dataset and building upon the CLIP series of pre-trained models, we propose a dynamic masking fine-grained contrastive learning fine-tuning approach, which enables us to fine-tune a more performant model using only 13% of the total training data (0.79 million vs. 6.1 million), resulting in a 7.2% improvement in the top-1 accuracy rate. We also propose a robust metric, the Comprehensive Variance Index (CVI), to assess the similarity and retrieval difficulty of visual-text pair data. Please try our retrieval demo at https://876p9s4054.vicp.fun/



Paperid:990 Poster
Authors:Xiaopei Zhu,Peiyang Xu,Guanning Zeng,Yinpeng Dong,Xiaolin Hu
Abstract:
Research of adversarial attacks is important for AI security because it shows the vulnerability of deep learning models and helps to build more robust models. Adversarial attacks on images are most widely studied, which includes noise-based attacks, image editing-based attacks, and latent space-based attacks. However, the adversarial examples crafted by these methods often lack sufficient semantic information, making it challenging for humans to understand the failure modes of deep learning models under natural conditions. To address this limitation, we propose a natural language induced adversarial image attack method. The core idea is to leverage a text-to-image model to generate adversarial images given input prompts, which are maliciously constructed to lead to misclassification for a target model. To adopt commercial text-to-image models for synthesizing more natural adversarial images, we propose an adaptive genetic algorithm (GA) for optimizing discrete adversarial prompts without requiring gradients and an adaptive word space reduction method for improving the query efficiency. We further used CLIP to maintain the semantic consistency of the generated images. In our experiments, we found that some high-frequency semantic information such as "foggy'', "humid'', "stretching'', etc. can easily cause classifier errors. These adversarial semantic information exist not only in generated images, but also in photos captured in the real world. We also found that some adversarial semantic information can be transferred to unknown classification tasks. Furthermore, our attack method can transfer to different text-to-image models (e.g., Midjourney, DALL·E 3, etc.) and image classifiers.
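A toy sketch of the genetic-algorithm search described above; the fitness function, word space, and hyperparameters are placeholders of our own, and the paper's adaptive GA and word-space reduction are more elaborate than this plain GA.

```python
import random
from typing import Callable, List

def genetic_prompt_search(word_space: List[str],
                          fitness: Callable[[List[str]], float],
                          prompt_len: int = 8,
                          pop_size: int = 20,
                          generations: int = 50,
                          mutation_rate: float = 0.2) -> List[str]:
    """Evolve discrete prompts to maximize a black-box fitness score
    (e.g., the target model's misclassification confidence on the generated image)."""
    population = [[random.choice(word_space) for _ in range(prompt_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]                  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, prompt_len)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [random.choice(word_space) if random.random() < mutation_rate else w
                     for w in child]                       # per-word mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```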



Paperid:991 Poster
Authors:Yuanchen Wu,Xiaoqiang Li,Jide Li,KequanYang,Pinpin Zhu,Shaohua Zhang
Abstract:
Weakly supervised semantic segmentation (WSSS) using image-level labels is a challenging task that relies on Class Activation Maps (CAMs) to derive segmentation supervision. Although many efficient single-stage solutions have been proposed, their performance is hindered by the inherent ambiguity of CAM. This paper introduces a new approach, dubbed ECA, to Exploit the self-supervised Vision Transformer, DINO, inducing the Class-aware semantic Affinity to overcome this limitation. Specifically, we introduce a Semantic Affinity Exploitation module (SAE). It establishes a class-agnostic affinity graph through the self-attention of DINO. Using the highly activated patches on CAMs as “seeds”, we propagate them across the affinity graph and yield the Class-aware Affinity Region Map (CARM) as supplementary semantic guidance. Moreover, the selection of reliable “seeds” is crucial to CARM generation. Inspired by the observed CAM inconsistency between the global and local views, we develop a CAM Correspondence Enhancement module (CCE) to encourage dense local-to-global CAM correspondences, advancing high-fidelity CAMs for seed selection in SAE. Our experimental results demonstrate that ECA effectively improves the model's object pattern understanding. Remarkably, it outperforms state-of-the-art alternatives on the PASCAL VOC 2012 and MS COCO 2014 datasets, achieving 90.1% of the upper-bound performance of its fully supervised counterpart.
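A rough sketch of the seed-propagation step as we read it: DINO's self-attention is summarized as a given patch-affinity matrix, highly activated CAM patches act as seeds, and their scores are diffused over the graph. Shapes, the seed threshold, and the number of propagation steps are assumptions.

```python
import torch

def propagate_seeds(affinity: torch.Tensor, cam: torch.Tensor,
                    seed_thresh: float = 0.7, steps: int = 3) -> torch.Tensor:
    """affinity: (N, N) class-agnostic patch affinity (e.g., from DINO attention),
    cam: (C, N) class activation scores per patch.
    Returns a class-aware affinity region map of shape (C, N)."""
    # row-normalize the affinity graph into a random-walk transition matrix
    trans = affinity / affinity.sum(dim=1, keepdim=True).clamp_min(1e-6)
    # highly activated patches act as seeds; others start at zero
    seeds = (cam >= seed_thresh * cam.amax(dim=1, keepdim=True)).float() * cam
    carm = seeds
    for _ in range(steps):
        carm = carm @ trans                  # diffuse seed scores along the graph
    return carm / carm.amax(dim=1, keepdim=True).clamp_min(1e-6)
```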



Paperid:992 Poster
Authors:Yiming Cui,Liang Li,Jiehua Zhang,Chenggang Yan,Hongkui Wang,Shuai Wang,Jin Heng,Wu Li
Abstract:
Domain Adaptive Object Detection (DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain. Recent advances leverage a self-training framework that enables a student model to learn target-domain knowledge from pseudo labels generated by a teacher model. Despite great success, such category-level consistency supervision suffers from the poor quality of pseudo labels. To mitigate this problem, we propose a stochastic context consistency reasoning (SOCCER) network within the self-training framework. Firstly, we introduce a stochastic complementary masking module (SCM) to generate complementary masked images, thus preventing the network from over-relying on specific visual clues. Secondly, we design an inter-changeable context consistency reasoning module (Inter-CCR), which constructs an inter-context consistency paradigm to capture the texture and contour details in the target domain by aligning the predictions of the student model for complementary masked images. Meanwhile, we develop an intra-changeable context consistency reasoning module (Intra-CCR), which constructs an intra-context consistency paradigm to strengthen the utilization of context relations by using pseudo labels to supervise the predictions of the student model. Experimental results on three DAOD benchmarks demonstrate that our method outperforms current state-of-the-art methods by a large margin. Code is released in the supplementary materials.



Paperid:993 Poster
Authors:Zhenhong Sun,Junyan Wang,Zhiyu Tan,Daoyi Dong,Hailan Ma,Hao Li,Dong Gong
Abstract:
Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when tasked with interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often contain inconsistent multi-entity representation (IMR), reflected as inaccurate presentations of the multiple entities and their attributes. Although providing spatial layout guidance improves the multi-entity generation quality in existing works, it is still challenging to handle the leakage attributes and avoid unnatural characteristics. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that the IMR challenges largely stem from the process of cross-attention mechanisms. According to the analyses, we introduce the entity guidance generation mechanism, which maintains the integrity of the original diffusion model parameters by integrating plug-in networks. Our work advances the stable diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module is integrated to progressively reduce the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Our comprehensive experiments demonstrate that this entity guidance generation enhances existing text-to-image models in generating detailed, multi-entity images.



Paperid:994 Poster
Authors:Pengyue Lin,Ruifan Li,Yuzhe Ji,Zhihan Yu,Fangxiang Feng,Zhanyu Ma,Xiaojie Wang
Abstract:
Phrase Grounding (PG) aims to locate objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding from seen categories to unseen ones) have been proposed. However, for real-world applications these two approaches are limited by the scarce annotations and the small number of categories available during training. In this paper, we propose a framework for zero-shot PG under weak supervision. Specifically, our PG framework is built on triple alignment strategies. Firstly, we propose a region-text alignment (RTA) strategy to build region-level attribute associations via CLIP. Secondly, we propose a domain alignment (DomA) strategy that minimizes the difference between the distributions of seen classes in training and those in pre-training. Thirdly, we propose a category alignment (CatA) strategy that considers both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and achieves competitive performance compared with existing weakly-supervised methods. The code and data will be publicly available on GitHub after the double-blind phase.



Paperid:995 Poster
Authors:Qihe Pan,Zhen Zhao,Zicheng Wang,Sifan Long,Yiming Wu,Wei Ji,Haoran Liang,Ronghua Liang
Abstract:
A plethora of text-guided image editing methods has recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models, especially Stable Diffusion. Despite the success of diffusion models in producing high-quality images, their application to small object generation has been limited due to difficulties in aligning cross-modal attention maps between text and these objects. Our approach offers a training-free method that significantly mitigates this alignment issue with local and global attention guidance, enhancing the model's ability to accurately render small objects in accordance with textual descriptions. We detail the methodology of our approach, emphasizing its divergence from traditional generation techniques and highlighting its advantages. More importantly, we also provide~\textit{SOEBench} (Small Object Editing), a standardized benchmark for quantitatively evaluating text-based small object generation, collected from \textit{MSCOCO}~\cite{lin2014microsoft} and \textit{OpenImage}~\cite{kuznetsova2020open}. Preliminary results demonstrate the effectiveness of our method, showing marked improvements in the fidelity and accuracy of small object generation compared to existing models. This advancement not only contributes to the field of AI and computer vision but also opens up new possibilities for applications in various industries where precise image generation is critical.



Paperid:996 Poster
Authors:Jiahe Tian,Cai Yu,Peng Chen,Zihao Xiao,Xi Wang,Jizhong Han,Yesheng Chai
Abstract:
The rapid advancement of deepfake technology poses significant threats to social trust. Although recent deepfake detectors have exhibited promising results on deepfakes of the same type as those present in training, their effectiveness degrades significantly on novel deepfakes crafted by unseen algorithms due to the gap in forgery patterns. Some studies have enhanced detectors by adapting to continuously emerging deepfakes through incremental learning. Despite this progress, they overlook the scarcity of novel samples, which can easily lead to insufficient learning of forgery patterns. To mitigate this issue, we introduce the Dynamic Mixed-Prototype (DMP) model, which dynamically increases prototypes to adapt to novel deepfakes efficiently. Specifically, the DMP model adopts multiple prototypes to represent both real and fake classes, enabling it to learn novel patterns by expanding prototypes while jointly retaining knowledge learned in previous prototypes. Furthermore, we propose the Prototype-Guided Replay strategy and the Prototype Representation Distillation loss, both of which effectively prevent forgetting learned knowledge based on the prototypical representation of samples. Our method surpasses existing incremental deepfake detectors across four datasets and exhibits superior generalizability to novel deepfakes by learning from limited deepfake samples.



Paperid:997 Poster
Authors:Zeng Weili,Yichao Yan,Qi Zhu,Zhuo Chen,Pengzhi Chu,Weiming Zhao,Xiaokang Yang
Abstract:
Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge, \textbf{concept overfitting}. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which confines customization to limited modalities, \ie, backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further introduce two metrics, \ie, the Latent Fisher divergence and the Wasserstein metric, to measure the distribution changes of non-customized and customized concepts, respectively. Drawing from this analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts without being constrained by limited training modalities, while preserving non-customized knowledge. Remarkably, Infusion achieves this feat with remarkable efficiency, requiring a mere \textbf{11KB} of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single- and multi-concept customized generation.



Paperid:998 Poster
Authors:Yijia Guo,Yuanxi Bai,Liwen Hu,Guo Ziyi,Mianzhi Liu,Yu Cai,Tiejun Huang,Lei Ma
Abstract:
We propose Precomputed Radiance Transfer of Gaussian Splats (PRTGS), a real-time, high-quality relighting method for Gaussian splats in low-frequency lighting environments that captures soft shadows and interreflections by precomputing the radiance transfer of 3D Gaussian splats. Existing studies have demonstrated that 3D Gaussian splatting (3DGS) outperforms neural fields in efficiency for dynamic lighting scenarios. However, current relighting methods based on 3DGS still struggle to compute high-quality shadows and indirect illumination in real time for dynamic light, leading to unrealistic rendering results. We solve this problem by precomputing the expensive transport simulations required for complex transfer functions like shadowing; the resulting transfer functions are represented as dense sets of vectors or matrices for every Gaussian splat. We introduce distinct precomputing methods tailored for the training and rendering stages, along with unique ray tracing and indirect lighting precomputation techniques for 3D Gaussian splats, to accelerate training and compute accurate indirect lighting related to environment light. Experimental analyses demonstrate that our approach achieves state-of-the-art visual quality while maintaining competitive training times and, importantly, allows high-quality real-time (30+ fps) relighting for dynamic light and relatively complex scenes at 1080p resolution.



Paperid:999 Poster
Authors:Fengqi Liu,Hexiang Wang,Jingyu Gong,Ran Yi,Qianyu Zhou,Xuequan Lu,Jiangbo Lu,Lizhuang Ma
Abstract:
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association of different modalities and failing to deal with salient gestures. In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. Specifically, we first learn a joint manifold space for the individual representation of audio and body pose to exploit the inherent semantic association between the two modalities, and propose to enforce semantic consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify salient postures, and reweighting the consistency loss to focus more on learning the correspondence between salient postures and the high-level semantics of speech content. In addition, we propose to extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experiments and visualization results demonstrate the superiority of our method over the state-of-the-art approaches.



Paperid:1000 Poster
Authors:Yuzhen Niu,Lifen Yang,Rui Xu,Yuezhou Li,Yuzhong Chen
Abstract:
Existing weakly-supervised camouflaged object detection (WSCOD) methods have much difficulty in detecting accurate object boundaries due to insufficient and imprecise boundary supervision in scribble annotations. Drawing inspiration from human perception that discerns camouflaged objects by incorporating both object region and boundary information, we propose a novel Mutual Interaction Network (MiNet) for scribble-based WSCOD to alleviate the detection difficulty caused by insufficient scribbles. The proposed MiNet facilitates mutual reinforcement between region and edge cues, thereby integrating more robust priors to enhance detection accuracy. In this paper, we first construct an edge cue refinement net, featuring a core region-aware guidance module (RGM) aimed at leveraging the extracted region feature as a prior to generate the discriminative edge map. By considering both object semantic and positional relationships between edge feature and region feature, RGM highlights the areas associated with the object in the edge feature. Subsequently, to tackle the inherent similarity between camouflaged objects and the surroundings, we devise a region-boundary refinement net. This net incorporates a core edge-aware guidance module (EGM), which uses the enhanced edge map from the edge cue refinement net as guidance to refine the object boundaries in an iterative and multi-level manner. Experiments on CAMO, CHAMELEON, COD10K, and NC4K datasets demonstrate that the proposed MiNet outperforms the state-of-the-art methods.



Paperid:1001 Poster
Authors:Weichen Xu,Jian Cao,Tianhao Fu,Ruilong Ren,Zicong Hu,Xixin Cao,Xing Zhang
Abstract:
This paper revisits the development of generative self-supervised learning for 2D images and 3D point clouds in autonomous driving. For 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, through exploratory model analysis, we find that the gap in weight distribution between self-supervised and supervised learning is substantial when employing only low-level features as the pretext task for 3D point clouds: low-level features represented by PoInt Cloud reconsTruction are insUfficient to learn 3D REpresentations (dubbed PICTURE). To advance the development of pretext tasks, we propose a unified generative self-supervised framework. Firstly, high-level features represented by the Seal features are demonstrated to exhibit semantic consistency with downstream tasks. We utilize the Seal voxel features as an additional pretext task to enhance the understanding of semantic information during pre-training. Next, we propose inter-class and intra-class discrimination-guided masking (I$^2$Mask) based on the attributes of the Seal voxel features, adaptively setting the masking ratio for each superclass. On the Waymo and nuScenes datasets, we achieve 75.13% mAP and 72.69% mAPH for 3D object detection, 79.4% mIoU for 3D semantic segmentation, and 18.4% mIoU for occupancy prediction. Extensive experiments have demonstrated the effectiveness and necessity of high-level features. The project page is available at https://anonymous-picture.github.io/.



Paperid:1002 Poster
Authors:Mingzhen Sun,Weining Wang,Yanyuan Qiao,Jiahui Sun,Zihan Qin,Longteng Guo,Xinxin Zhu,Jing Liu
Abstract:
Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into a single or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space serves to bridge the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant quality and efficiency gains. Specifically, our method achieves a comprehensive improvement on all evaluation metrics and a faster training and sampling speed on Landscape and AIST++ datasets. Moreover, we explore its performance on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks for a comprehensive evaluation, where our MM-LDM demonstrates exciting adaptability and generalization ability.



Paperid:1003 Poster
Authors:Xintian Mao,Jiansheng Wang,Xingran Xie,Qingli Li,Yan Wang
Abstract:
Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. Code will be released.



Paperid:1004 Poster
Authors:Mingjin Zhang,Shilong Liu,Yuanjun Ouyang,Jie Guo,Zhihong Tang,Yunsong Li
Abstract:
Moving infrared small target detection, crucial in contexts like traffic management and maritime rescue, encounters challenges from factors such as complex backgrounds, target occlusion, camera shake, and motion blur. Existing algorithms fall short of comprehensively addressing these issues through mathematical modeling, impeding generalization to complex and dynamic motion scenes. In this paper, we propose a method for modeling moving infrared small target detection via smoothed-particle hydrodynamics (SPH) and Markov decision processes (MDP). SPH can simulate the motion trajectories of targets and background scenes, while MDP can optimize the detection system's strategy for optimal action selection based on contexts and target states. Specifically, we develop an SPH-inspired image-level enhancement algorithm which models the image sequence of an infrared video as a 3D spatiotemporal graph in SPH. In addition, we design an MDP-guided temporal feature perception module. This module selects reference frames and aggregates features from both the reference frames and the current frame; the previous and current frames are modeled as an MDP tailored for multi-frame infrared small target detection, aiding detection of the current frame. In extensive experiments on two public datasets, DAUB and DATR, the proposed STME-Net surpasses the state-of-the-art methods in terms of objective metrics and visual quality.



Paperid:1005 Poster
Authors:Yabing Wang,Le Wang,Qiang Zhou,zhibin wang,Hao Li,Gang Hua,Wei Tang
Abstract:
Cross-lingual cross-modal retrieval aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations poses challenges due to the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and leverage them to interact with the visual features. By doing so, we enhance the semantic information within the visual features, narrowing the semantic gap between modalities and generating local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance. This approach provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on two cross-lingual image-text retrieval benchmarks, Multi30K and MSCOCO, as well as two cross-lingual video-text retrieval benchmarks, VATEX and MSR-VTT-CN, demonstrate the effectiveness of our proposed method.
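A minimal sketch, under our own shape assumptions, of the "semantic slots interact with visual features" step: MLLM-generated description embeddings are pooled into a few slots, and visual tokens attend to them via cross-attention. This is an illustration, not the released LECCR code.

```python
import torch
import torch.nn as nn

class SlotVisualInteraction(nn.Module):
    """Aggregate MLLM description embeddings into K semantic slots and let
    visual tokens attend to them (a sketch of the idea, not the paper's exact design)."""
    def __init__(self, dim: int, num_slots: int = 4, num_heads: int = 8):
        super().__init__()
        self.slot_queries = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.slot_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, caption_tokens):
        # caption_tokens: (B, L, D) embeddings of MLLM-generated descriptions
        # visual_tokens:  (B, N, D) patch/frame features
        B = caption_tokens.size(0)
        queries = self.slot_queries.unsqueeze(0).expand(B, -1, -1)
        slots, _ = self.slot_attn(queries, caption_tokens, caption_tokens)   # (B, K, D) multi-view slots
        enhanced, _ = self.visual_attn(visual_tokens, slots, slots)          # visual features attend to slots
        return self.norm(visual_tokens + enhanced)
```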



Paperid:1006 Poster
Authors:Jie Hu,Jie Li,Yue Ma,Liujuan Cao,Songan Zhang,Wei Zhang,GUANNAN JIANG,Rongrong Ji
Abstract:
Foundational segmentation models, predominantly trained on scenes typical of natural environments, struggle to generalize across varied image domains. Traditional "training-to-adapt" methods rely heavily on extensive data retraining and model architecture modifications, which significantly limits the models' generalization capabilities and deployment efficiency. In this study, we propose a novel adaptation paradigm, termed "prompting-to-adapt", to tackle the above issue by introducing an innovative image prompter. This prompter generates domain-specific prompts through few-shot image-mask pairs, incorporating diverse image processing techniques to enhance adaptability. To tackle the inherent non-differentiability of image prompts, we further devise an information-estimation-based gradient descent strategy that leverages the information entropy of image processing combinations to optimize the prompter, ensuring effective adaptation. Through extensive experiments across nine datasets spanning seven image domains (\emph{i.e.}, depth, thermal, camouflage, endoscopic, ultrasound, grayscale, and natural) and four scenarios (\emph{i.e.}, common scenes, camouflaged objects, medical images, and industrial data), we demonstrate that our approach significantly improves the foundational models' adaptation capabilities. Moreover, the interpretability of the generated prompts provides insightful revelations into their image processing mechanisms. Our source code will be publicly available to foster further innovation and exploration in this field.



Paperid:1007 Poster
Authors:Jiangbin Zheng,Han Zhang,Qianqing Xu,An-Ping Zeng,Stan Z. Li
Abstract:
Enzyme design plays a crucial role in both industrial production and biology. However, this field faces challenges due to the lack of comprehensive benchmarks and the complexity of enzyme design tasks, leading to a dearth of systematic research. Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. We begin by employing a cross-modal structure-to-sequence transformation architecture, as the feature-driven starting point to obtain initial robust protein representation. Subsequently, we leverage domain adaptive techniques to generalize specific enzyme design tasks under low-resource conditions. MetaEnzyme focuses on three fundamental low-resource enzyme redesign tasks: functional design (FuncDesign), mutation design (MutDesign), and sequence generation design (SeqDesign). Through novel unified paradigm and enhanced representation capabilities, MetaEnzyme demonstrates adaptability to diverse enzyme design tasks, yielding outstanding results. Wet lab experiments further validate these findings, reinforcing the efficacy of the redesign process.



Paperid:1008 Poster
Authors:Wei Lou,Guanbin Li,Xiang Wan,Haofeng Li
Abstract:
Whole-slide image (WSI) classification methods play a crucial role in tumor diagnosis. Most of them use hematoxylin and eosin (H&E) stained images, while immunohistochemistry (IHC) staining provides molecular markers and protein expression information that highlights cancer regions. However, obtaining IHC-stained images incurs higher costs in practice. In this work, we propose a multi-modal denoising diffusion pre-training framework that harnesses the advantages of IHC staining to learn visual representations. The framework is trained with an H&E-to-IHC re-staining task and an IHC-stained image reconstruction task, which helps capture the structural similarity and staining difference between the two image modalities. The trained model can then provide IHC-guided features while taking only H&E-stained images as inputs. Besides, we build a new class-constrained contrastive loss to enforce semantic consistency between the dual-modal features from our pre-training framework. To integrate with WSI classifiers based on multi-instance learning, we further propose a bag feature augmentation strategy to extend bags with the features extracted by our pre-trained model. Experimental results on three datasets show that our pre-training framework effectively improves WSI classification and surpasses state-of-the-art pre-training approaches.



Paperid:1009 Poster
Authors:Tao Tang,Hong Liu,Yingxuan You,Ti Wang,Wenhao Li
Abstract:
Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various noises (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we discover that compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semi-Analytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS, which effectively leverages disentangled information in skeletons. Specifically, a skeleton estimation and disentanglement module is proposed to estimate the 3D skeletons from a video and decouple them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate initial pose parameters and BSF leverages bone length to regress bone-aligned shape parameters. Finally, MCR combines human motion representation with image features to refine the initial parameters of the human model and enhance temporal consistency. Extensive experiments demonstrate that our ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M.



Paperid:1010 Poster
Authors:Jinyan Zhang,Mengyuan Liu,Hong Liu,Guoquan Wang,Wenhao Li
Abstract:
Current advancements in 3D human pose estimation have attained notable success by converting 2D poses into their 3D counterparts. However, this approach is inherently influenced by the errors introduced by 2D pose detectors and overlooks the intrinsic spatial information embedded within RGB images. To address these challenges, we introduce a versatile module called Adaptive Pose Pooling (APP), compatible with many existing 2D-to-3D lifting models. The APP module includes three novel sub-modules: Pose-Aware Offsets Generation (PAOG), Pose-Aware Sampling (PAS), and Spatial Temporal Information Fusion (STIF). First, we extract the latent features of the multi-frame lifting model. Then, a 2D pose detector is utilized to extract multi-level feature maps from the image. After that, PAOG generates offsets according to the feature maps, and PAS uses these offsets to sample the feature maps. Then, STIF fuses the sampled features from PAS with the latent features. This innovative design allows the APP module to simultaneously capture spatial and temporal information. We conduct comprehensive experiments on two widely used datasets: Human3.6M and MPI-INF-3DHP. Meanwhile, we employ various lifting models to demonstrate the efficacy of the APP module. Our results show that the proposed APP module consistently enhances the performance of lifting models, achieving state-of-the-art results. Significantly, our module achieves these performance boosts without necessitating alterations to the architecture of the lifting model.
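A simplified sketch of the offset-then-sample idea behind PAOG and PAS (the temporal fusion in STIF is omitted, and all names below are our placeholders): offsets are predicted from the lifting model's latent features and used to resample the detector's feature map around each 2D keypoint with grid_sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseAwareSampling(nn.Module):
    """Sketch: predict per-joint sampling offsets around 2D keypoints and
    gather image features at the offset locations (bilinear grid_sample)."""
    def __init__(self, feat_dim: int, num_joints: int = 17, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(feat_dim, num_joints * num_points * 2)

    def forward(self, feat_map, joints_2d, latent):
        # feat_map: (B, C, H, W), joints_2d: (B, J, 2) normalized to [-1, 1], latent: (B, feat_dim)
        B, C, H, W = feat_map.shape
        J = joints_2d.size(1)
        offsets = self.offset_head(latent).view(B, J, self.num_points, 2)
        offsets = 0.1 * torch.tanh(offsets)                           # keep the offsets local
        grid = joints_2d.unsqueeze(2) + offsets                       # (B, J, P, 2) sampling grid
        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, J, P)
        return sampled.mean(dim=-1).transpose(1, 2)                   # (B, J, C) per-joint features
```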



Paperid:1011 Poster
Authors:Jian-Jun Qiao,Meng-Yu Duan,Xiao Wu,Yu-Pei Song
Abstract:
Cartoon characters, with their complex appearances, abstract drawing styles, and irregular structures, pose great challenges for cartoon parsing. In this paper, a novel approach, named CartoonNet, is proposed for cartoon parsing that aims to recognize and segment various parts of cartoon characters. Semantic consistency and structure correlation, acting as two key factors, are integrated to address the visual diversity and structure complexity for cartoon images. A memory-based semantic consistency module is designed to encode and learn the diverse appearances exhibited by cartoon characters. It conducts consistent learning among the body parts that belong to the same class but from different characters, by storing, selecting and correlating different cartoon images. To recognize the intricate and irregular structures present in cartoon images, a structure correlation module is proposed. Leveraging graph attention networks and a main body-aware mechanism, the proposed approach enables structural learning and correlating, allowing it to recognize and parse cartoon images with significant complexity. Experiments conducted on cartoon parsing datasets demonstrate the effectiveness of the proposed method. Moreover, the proposed method also achieves competitive performance on human parsing dataset, proving its superiority.



Paperid:1012 Poster
Authors:Andong Lu,Jiacong Zhao,Chenglong Li,Yun Xiao,Bin Luo
Abstract:
The modality gap between RGB and thermal infrared (TIR) images is a crucial but often overlooked issue in existing RGBT tracking methods. It can be observed that the modality gap mainly lies in the difference in image style. In this work, we propose a novel Coupled Knowledge Distillation framework called CKD, which pursues common styles of different modalities to break the modality gap, for high-performance RGBT tracking. In particular, we introduce two student networks and employ a style distillation loss to make their style features as consistent as possible. By alleviating the style difference between the two student networks, we can effectively break the modality gap. However, the distillation of style features might harm the content representations of the two modalities in the student networks. To handle this issue, we take the original RGB and TIR networks as the teachers, and distill their content knowledge into the two student networks respectively via a style-content orthogonal feature decoupling scheme. We couple the above two distillation processes in an online optimization framework to form new feature representations of the RGB and thermal modalities without a modality gap. In addition, we design a masked modeling strategy and a multi-modal candidate token elimination strategy in CKD to improve tracking robustness and efficiency, respectively. Extensive experiments on five standard RGBT tracking datasets validate the effectiveness of the proposed method against state-of-the-art methods while achieving the fastest tracking speed of 96.4 FPS. We will release the code.
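A sketch of the two coupled losses, using channel-wise mean/std as a stand-in style statistic (the paper does not state this exact choice; shapes and loss weights are our assumptions): the two students' style features are pulled together while each student's content features are distilled from its frozen teacher.

```python
import torch
import torch.nn.functional as F

def style_stats(feat):
    # feat: (B, C, H, W); channel-wise mean/std as a simple style descriptor
    mu = feat.mean(dim=(2, 3))
    sigma = feat.std(dim=(2, 3))
    return torch.cat([mu, sigma], dim=1)

def coupled_distillation_loss(style_rgb, style_tir,
                              content_rgb, content_tir,
                              teacher_rgb, teacher_tir,
                              lambda_style=1.0, lambda_content=1.0):
    """style_*: style features of the two student branches;
    content_* / teacher_*: content features of the students and the frozen teachers."""
    # pull the two students' styles together to close the modality gap
    l_style = F.mse_loss(style_stats(style_rgb), style_stats(style_tir))
    # keep each student's content close to its own teacher
    l_content = F.mse_loss(content_rgb, teacher_rgb.detach()) + \
                F.mse_loss(content_tir, teacher_tir.detach())
    return lambda_style * l_style + lambda_content * l_content
```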



Paperid:1013 Poster
Authors:Gangyan Zeng,Yuan Zhang,Jin Wei,Dongbao Yang,peng zhang,Yiwen Gao,Xugong Qin,Yu Zhou
Abstract:
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to text area and probing the hidden text knowledge, and then divides the query text into content word and function word for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art method by 4.37% with a 4 times faster speed. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text.



Paperid:1014 Poster
Authors:Yuhui Wu,Guoqing Wang,Zhiwen Wang,Yang Yang,Tianyu Li,Malu Zhang,Chongyi Li,Heng Tao Shen
Abstract:
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models. Despite the success of some conditional methods, previous methods may neglect the importance of a sufficiently formulated task-specific condition strategy, resulting in suboptimal visual outcomes. In this study, we propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as additional pre-processing conditions to regulate the generating capabilities of the diffusion model. We first leverage a pre-trained decomposition network to generate the Retinex prior, which is updated with better quality by an adjustment network and integrated into a refinement network to implement Retinex-based conditional generation at both the feature and image levels. Moreover, the semantic prior is extracted from the input image with an off-the-shelf semantic segmentation model and incorporated through semantic attention layers. By treating Retinex- and semantic-based priors as the condition, JoReS-Diff presents a unique perspective for establishing a diffusion model for LLIE and similar image enhancement tasks. Extensive experiments validate the rationality and superiority of our approach.



Paperid:1015 Poster
Authors:Du Chen,Zhengqiang ZHANG,Jie Liang,Lei Zhang
Abstract:
Generative adversarial networks (GAN) and generative diffusion models (DM) have been widely used in real-world image super-resolution (Real-ISR) to enhance the image perceptual quality. However, these generative models are prone to generating visual artifacts and false image structures, resulting in unnatural Real-ISR results. Based on the fact that natural images exhibit high self-similarities, i.e., a local patch can have many similar patches to it in the whole image, in this work we propose a simple yet effective self-similarity loss (SSL) to improve the performance of generative Real-ISR models, enhancing the hallucination of structural and textural details while reducing the unpleasant visual artifacts. Specifically, we compute a self-similarity graph (SSG) of the ground-truth image, and enforce the SSG of Real-ISR output to be close to it. To reduce the training cost and focus on edge areas, we generate an edge mask from the ground-truth image, and compute the SSG only on the masked pixels. The proposed SSL serves as a general plug-and-play penalty, which could be easily applied to the off-the-shelf Real-ISR models. Our experiments demonstrate that, by coupling with SSL, the performance of many state-of-the-art Real-ISR models, including those GAN and DM based ones, can be largely improved, reproducing more perceptually realistic image details and eliminating many false reconstructions and visual artifacts. Codes of SSL will be released.
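A compact sketch of the self-similarity loss as described: build a self-similarity graph over edge-masked locations of the ground truth and of the SR output, then penalize their difference. The patch size, subsampling, and the L1 penalty are our own choices for illustration.

```python
import torch
import torch.nn.functional as F

def self_similarity_graph(img, coords, patch=7):
    """img: (1, C, H, W); coords: (K, 2) pixel locations on the edge mask.
    Returns a (K, K) cosine-similarity graph between local patches."""
    pad = patch // 2
    padded = F.pad(img, (pad, pad, pad, pad), mode="reflect")
    patches = torch.stack([padded[0, :, y:y + patch, x:x + patch].flatten()
                           for y, x in coords.tolist()])          # (K, C*patch*patch)
    patches = F.normalize(patches, dim=1)
    return patches @ patches.T

def self_similarity_loss(sr, gt, edge_mask, max_pixels=1024):
    """sr, gt: (1, C, H, W); edge_mask: (H, W) boolean mask from the ground truth."""
    coords = edge_mask.nonzero()
    if coords.size(0) > max_pixels:                               # subsample to limit cost
        coords = coords[torch.randperm(coords.size(0))[:max_pixels]]
    return F.l1_loss(self_similarity_graph(sr, coords),
                     self_similarity_graph(gt, coords))
```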



Paperid:1016 Poster
Authors:Ming Tao,Bingkun BAO,Hao Tang,Yaowei Wang,Changsheng Xu
Abstract:
Story visualization aims to generate realistic and coherent images based on multi-sentence stories. However, current methods face challenges in achieving high-quality image generation while maintaining lightweight models and a fast generation speed. The main issue lies in the two existing frameworks. The independent framework prioritizes speed but sacrifices image quality with the non-collaborative image generation process and basic GAN-based learning. The autoregressive framework modifies the large pretrained text-to-image model in an auto-regressive manner with additional history modules, leading to large model size, resource-intensive requirements, and slow generation speed. To address these issues, we propose a lightweight and effective framework, namely CoIn. Specifically, we introduce a Context-aware Story Generator to predict shared context semantics for each image generator. Additionally, we propose an Intra-Story Interchange module that allows each image generator to exchange visual information with other image generators. Furthermore, we incorporate DINOv2 into the story and image discriminators to assess the story image quality more accurately. Extensive experiments show that our CoIn keeps the model size and generation speed of the independent framework, while achieving promising story image quality.



Paperid:1017 Poster
Authors:Peng Yin,Xiaosu Zhu,Jingkuan Song,Lianli Gao,Heng Tao Shen
Abstract:
Binarized Vision Transformers (BiViTs) aim to facilitate the efficient and lightweight utilization of Vision Transformers (ViTs) on devices with limited computational resources. Yet, the current approach to binarizing ViT leads to a substantial performance decrease compared to the full-precision model, posing obstacles to practical deployment. By empirical study, we reveal that spatial interaction (SI) is a critical factor that impacts performance due to lack of token-level correlation, but previous work ignores this factor. To this end, we design a ViT binarization approach dubbed SI-BiViT to incorporate spatial interaction in the binarization process. Specifically, an SI module is placed alongside the Multi-Layer Perceptron (MLP) module to formulate the dual-branch structure. This structure not only leverages knowledge from pre-trained ViTs by distilling over the original MLP, but also enhances spatial interaction via the introduced SI module. Correspondingly, we design a decoupled training strategy to train these two branches more effectively. Importantly, our SI-BiViT is orthogonal to existing Binarized ViTs approaches and can be directly plugged. Extensive experiments demonstrate the strong flexibility and effectiveness of SI-BiViT by plugging our method into four classic ViT backbones in supporting three downstream tasks, including classification, detection, and segmentation. In particular, SI-BiViT enhances the classification performance of binarized ViTs by an average of 10.52% in Top-1 accuracy compared to the previous state-of-the-art. The code will be made publicly available.
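A small sketch of a binarized linear layer with a straight-through estimator plus a dual-branch block; the SI branch is reduced to a simple token-mixing layer purely for illustration, and the distillation from the pre-trained full-precision MLP is omitted.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign()

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()     # straight-through, clipped

class BinaryLinear(nn.Linear):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        scale = self.weight.abs().mean()             # per-layer scaling factor
        return nn.functional.linear(x, w_bin * scale, self.bias)

class DualBranchBlock(nn.Module):
    """Binarized MLP branch + a simple spatial-interaction (token-mixing) branch."""
    def __init__(self, dim, num_tokens, hidden=4):
        super().__init__()
        self.mlp = nn.Sequential(BinaryLinear(dim, dim * hidden), nn.GELU(),
                                 BinaryLinear(dim * hidden, dim))
        self.token_mix = BinaryLinear(num_tokens, num_tokens)     # mixes information across tokens

    def forward(self, x):                                         # x: (B, N, D)
        si = self.token_mix(x.transpose(1, 2)).transpose(1, 2)    # spatial interaction branch
        return x + self.mlp(x) + si
```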



Paperid:1018 Poster
Authors:Menghao Zhang,Jingyu Wang,Qi Qi,Pengfei Ren,Haifeng Sun,Zirui Zhuang,Huazheng Wang,Lei Zhang,Jianxin Liao
Abstract:
Learning multiple proxy tasks is a popular training strategy in semi-supervised video anomaly detection (VAD). However, the traditional approach of learning multiple proxy tasks simultaneously is prone to suboptimal solutions, and simply executing multiple proxy tasks sequentially cannot ensure continuous performance improvement. In this paper, we thoroughly investigate the impact of task composition and training order on performance enhancement. We find that ensuring continuous performance improvement in multi-task learning requires different but continuous optimization objectives in different training phases. To this end, a training strategy based on progressive learning is proposed to enhance the efficacy of multi-task learning in VAD. The learning objectives of the model in earlier phases contribute to the training in subsequent phases. Specifically, we decompose video anomaly detection into three phases: perception, comprehension, and inference, continuously refining the learning objectives to enhance model performance. In the three phases, we perform the visual task, the semantic task, and the open-set task in turn to train the model. The model learns different levels of features and focuses on different types of anomalies in different phases. Additionally, we design a simple yet effective semantic task that leverages the semantic consistency of context. Extensive experiments demonstrate the effectiveness of our method, highlighting that the benefits derived from progressive learning transcend specific proxy tasks.
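A skeleton of the progressive schedule as we understand it; the three phase-specific losses and data loaders are placeholders, and the actual proxy tasks are only named, not implemented.

```python
def progressive_training(model, optimizer, loaders, losses, epochs_per_phase=(10, 10, 10)):
    """loaders/losses: dicts keyed by phase name; each phase refines the previous one.
    Placeholder sketch of the perception -> comprehension -> inference schedule."""
    phases = ["perception", "comprehension", "inference"]   # visual, semantic, open-set proxy tasks
    for phase, n_epochs in zip(phases, epochs_per_phase):
        for _ in range(n_epochs):
            for batch in loaders[phase]:
                optimizer.zero_grad()
                loss = losses[phase](model, batch)           # phase-specific objective
                loss.backward()
                optimizer.step()
    return model
```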



Paperid:1019 Poster
Authors:Hongzhi Wang,Xiubo Liang,Tao Zhang,Gu Yue,Weidong Geng
Abstract:
Spiking Neural Networks (SNNs) have indeed shown remarkable promise in the field of computer vision, emerging as a low-energy alternative to traditional Artificial Neural Networks (ANNs). However, SNNs also face several challenges: i) existing SNNs are not purely additive and involve a substantial amount of floating-point computations, which contradicts the original design intention of adapting to neuromorphic chips; ii) the incorrect positioning of convolutional and pooling layers relative to spiking layers leads to reduced accuracy; iii) Leaky Integrate-and-Fire (LIF) neurons have limited capability in representing local information, which is disadvantageous for downstream visual tasks like semantic segmentation. To address the challenges in SNNs, i) we introduce Pure Sparse Self Attention (PSSA) and Dynamic Spiking Membrane Shortcut (DSMS), combining them to tackle the issue of floating-point computations; ii) the Spiking Precise Gradient downsampling (SPG-down) method is proposed for accurate gradient transmission; iii) the Group-LIF neuron concept is introduced to ensure LIF neurons' capability in representing local information both horizontally and vertically, enhancing their applicability in semantic segmentation tasks. Ultimately, these three solutions are integrated into the Powerful Sparse-Spike-Driven Transformer (PSSD-Transformer), effectively handling semantic segmentation tasks and addressing the challenges inherent in Spiking Neural Networks. The experimental results demonstrate that our model outperforms previous results on standard classification datasets and also shows commendable performance on semantic segmentation datasets. The code will be made publicly available after the paper is accepted for publication.



Paperid:1020 Poster
Authors:Tianyi Wang,Mengxiao Huang,Harry Cheng,Xiao Zhang,Zhiqi Shen
Abstract:
The Deepfake face manipulation technique has garnered significant public attention because it both enhances human experiences and poses security and privacy threats. Although numerous passive Deepfake detection algorithms have been proposed to thwart malicious Deepfake attacks, they mostly struggle with generalizability when confronted with today's hyper-realistic synthetic facial images. To tackle the problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. Firstly, we analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e. facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that robustly and imperceptibly embeds and extracts watermarks for the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect Deepfake image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.



Paperid:1021 Poster
Authors:Hao Wu,Likun Zhang,Shucheng Li,Fengyuan Xu,Sheng Zhong
Abstract:
In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant has a higher contribution to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. The research work in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of DL training. In this work, we propose CoAst, a practical method to assess the FL participants' contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to only count the most important part of model parameters through a weights quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that the assessment reliability of CoAst is comparable to existing validation-based methods and outperforms existing validation-free methods. We believe that CoAst will inspire the community to study a new FL paradigm with an inherent contribution assessment.
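A simplified sketch of the two ingredients described above: keep only the sign of the largest local update entries (a crude stand-in for the weights quantization), then score each participant by the similarity between its quantized update at round t and the global updates of the next few rounds. The tensor layout, top-k ratio, and window length are our assumptions.

```python
import torch
import torch.nn.functional as F

def quantize_update(update: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep the sign of the largest-magnitude entries, zero out the rest."""
    k = max(1, int(update.numel() * keep_ratio))
    thresh = update.abs().flatten().kthvalue(update.numel() - k + 1).values
    return torch.sign(update) * (update.abs() >= thresh).float()

def coast_scores(local_updates, global_updates, window: int = 3):
    """local_updates[t][i]: flattened update of participant i at round t;
    global_updates[t]: flattened global model update at round t.
    Returns a per-participant contribution score."""
    num_clients = len(local_updates[0])
    scores = torch.zeros(num_clients)
    for t in range(len(local_updates) - window):
        for i in range(num_clients):
            q = quantize_update(local_updates[t][i])
            for dt in range(1, window + 1):                 # cross-round valuation
                scores[i] += F.cosine_similarity(q, global_updates[t + dt], dim=0)
    return scores
```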



Paperid:1022 Poster
Authors:Xin Zhang,Sheng-hua Zhong,Jianmin Jiang
Abstract:
Explaining which parts of the input images primarily contributed to the classification results predicted by deep models has been widely researched over the years, and many effective methods have been reported in the literature, for which deep Taylor decomposition (DTD) served as the primary foundation due to the theoretical grounding brought in by Taylor expansion and approximation. Recent research, however, has shown that the root of Taylor decomposition could extend beyond local linearity, thus causing DTD to fail to deliver the expected performance. In this paper, we propose a universal root inference method to overcome this shortfall and strengthen the role of DTD in the explainability and interpretability of deep classifications. In comparison with existing approaches, our proposed method features: (i) theoretical establishment of the relationship between ideal roots and the propagated relevances; (ii) exploitation of gradient descent in learning a universal root inference; and (iii) constrained optimization of its final root selection. Extensive experiments, both quantitative and qualitative, validate that our proposed root inference is not only effective but also delivers significantly improved performance in explaining a range of deep classifiers.



Paperid:1023 Poster
Authors:Jiaxin Zhang,Yiqi Wang,Xihong Yang,Siwei Wang,Yu Feng,Yu Shi,Ren ruichao,En Zhu,Xinwang Liu
Abstract:
Graph Neural Networks (GNNs) have demonstrated great success in various fields of multimedia. However, the distribution shift between training and test data challenges the effectiveness of GNNs. To mitigate this challenge, Test-Time Training (TTT) has been proposed as a promising approach. Traditional TTT methods require a demanding unsupervised training strategy to capture information from the test data that benefits the main task. Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance test-time training on graphs with LLMs as annotators. In this paper, we design a novel Test-Time Training pipeline, LLMTTT, which conducts test-time adaptation under LLM annotations on a carefully selected node set. Specifically, LLMTTT introduces a hybrid active node selection strategy that considers not only node diversity and representativeness, but also prediction signals from the pre-trained model. Given annotations from LLMs, a two-stage training strategy is designed to tailor the test-time model with the limited and noisy labels. A theoretical analysis ensures the validity of our method, and extensive experiments demonstrate that the proposed LLMTTT can achieve a significant performance improvement compared to existing Out-of-Distribution (OOD) generalization methods.
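A toy sketch of a hybrid node-selection score combining prediction uncertainty, representativeness (distance to a cluster center), and diversity (spreading picks over clusters); the weighting and the crude k-means step are our own choices, not necessarily the paper's exact strategy.

```python
import torch

def hybrid_node_selection(logits, embeddings, budget, num_clusters=16, alpha=1.0):
    """logits: (N, C) predictions of the pre-trained GNN on test nodes,
    embeddings: (N, D) node features. Returns indices of `budget` nodes to send to the LLM annotator."""
    probs = torch.softmax(logits, dim=1)
    uncertainty = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1)         # prediction entropy

    # crude k-means to measure representativeness and support diversity
    centers = embeddings[torch.randperm(embeddings.size(0))[:num_clusters]]
    for _ in range(10):
        assign = torch.cdist(embeddings, centers).argmin(dim=1)
        for c in range(num_clusters):
            members = embeddings[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    representativeness = -torch.cdist(embeddings, centers).min(dim=1).values  # close to some center

    score = uncertainty + alpha * representativeness
    selected, used_clusters = [], set()
    for idx in score.argsort(descending=True).tolist():                       # diversity: spread over clusters
        if assign[idx].item() not in used_clusters or len(used_clusters) == num_clusters:
            selected.append(idx)
            used_clusters.add(assign[idx].item())
        if len(selected) == budget:
            break
    return torch.tensor(selected)
```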



Paperid:1024 Poster
Authors:Jiaxu Zhang,Xin Chen,Gang Yu,Zhigang Tu
Abstract:
Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. Our key insight is to embed motion style into a cross-modality latent space and perceive the cross-structure skeleton topologies, allowing for motion stylization within a canonical motion space. Specifically, the large-scale Contrastive-Language-Image-Pre-training (CLIP) model is leveraged to construct the cross-modality latent space, enabling flexible style representation within it. Additionally, two topology-encoded tokens are learned to capture the canonical and specific skeleton topologies, facilitating cross-structure topology shifting. Subsequently, the topology-shifted stylization diffusion is designed to generate motion content for the particular skeleton and stylize it in the shifted canonical motion space using multi-modality style descriptions. Through an extensive set of examples, we demonstrate the flexibility and generalizability of our pipeline across various characters and style descriptions. Qualitative and quantitative comparisons show the superiority of our pipeline over state-of-the-art methods, consistently delivering high-quality stylized motion across a broad spectrum of skeletal structures.



Paperid:1025 Poster
Authors:Shouyu Chen,Tangwei Ye,Lai Zhong Yuan,Qi Zhang,KE LIU,Usman Naseem,Ke Sun,Nengjun Zhu,Liang Hu
Abstract:
Interpretable and robust medical diagnoses are essential traits for practicing clinicians. Most computer-augmented diagnostic systems suffer from three major problems: non-interpretability, limited modality analysis, and narrow focus. Existing frameworks either deal with multimodality to some extent but suffer from non-interpretability, or are partially interpretable but offer limited modality and multifaceted capabilities. Our work aims to integrate all these aspects in one complete framework to fully utilize the spectrum of information offered by multiple modalities and facets. We propose our solution via our novel architecture VR-DiagNet, consisting of a planner and a classifier, optimized iteratively and cohesively. VR-DiagNet simulates the perceptual process of clinicians via the use of volumetric imaging information integrated with the radiomic feature modality; at the same time, it recreates human thought processes via a customized Monte Carlo Tree Search (MCTS) which constructs a volume-tailored experience tree to identify slices of interest (SoIs) in our multi-slice perception space. We conducted extensive experiments across two diagnostic tasks comprising six public medical volumetric benchmark datasets. Our findings showcase superior performance, as evidenced by heightened accuracy and area under the curve (AUC) metrics, reduced computational overhead, and expedited convergence, while conclusively illustrating the immense value of integrating volumetric and radiomic modalities for our current problem setup.
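
For intuition only, the sketch below replaces the volume-tailored tree search described above with a plain UCB1 bandit over slices; reward_fn is a hypothetical callback (e.g., classifier confidence on a slice), and the simplification is ours, not the authors'.

import math

def ucb_select_slices(num_slices, reward_fn, budget=200, c=1.4):
    # Repeatedly pick the slice with the best upper confidence bound, query a reward,
    # and return slices ranked by their running mean reward (candidate slices of interest).
    counts = [0] * num_slices
    values = [0.0] * num_slices
    for t in range(1, budget + 1):
        ucb = [
            float("inf") if counts[i] == 0
            else values[i] + c * math.sqrt(math.log(t) / counts[i])
            for i in range(num_slices)
        ]
        i = max(range(num_slices), key=lambda k: ucb[k])
        r = reward_fn(i)                          # e.g., classifier confidence on slice i
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]  # incremental mean update
    return sorted(range(num_slices), key=lambda k: -values[k])

# soi = ucb_select_slices(64, reward_fn=lambda i: 1.0 if i == 12 else 0.0)[:8]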



Paperid:1026 Poster
Authors:Zhen Zou,Hu Yu,Jie Huang,Feng Zhao
Abstract:
Images corrupted by rain streaks often lose frequency information that is vital for perception, and image deraining aims to solve this issue, which relies on global and local degradation modeling. Recent studies have witnessed the effectiveness and efficiency of Mamba for perceiving global and local information by exploiting local correlations among patches; however, few attempts have been made to extend it with frequency analysis for image deraining, limiting its ability to perceive global degradation that is relevant to frequency modeling (e.g. Fourier transform). In this paper, we propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining. The core of our method lies in extending Mamba with frequency analysis from two perspectives: extending it with frequency bands for exploiting frequency correlation, and connecting it with the Fourier transform for global degradation modeling. Specifically, FreqMamba introduces complementary triple interaction structures including spatial Mamba, frequency band Mamba, and Fourier global modeling. Frequency band Mamba decomposes the image into sub-bands of different frequencies to allow 2D scanning from the frequency dimension. Furthermore, leveraging Mamba's unique data-dependent properties, we use rainy images at different scales to provide degradation priors to the network, thereby facilitating efficient training. Extensive experiments show that our method outperforms state-of-the-art methods both visually and quantitatively.
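
A minimal sketch (assumptions ours, not the paper's exact design) of splitting an image batch into low/mid/high frequency sub-bands with radial masks in the Fourier domain, the kind of decomposition that frequency-band scanning would operate on.

import torch

def frequency_band_split(x, cutoffs=(0.15, 0.4)):
    # x: (B, C, H, W). Returns [low, mid, high] sub-bands, each (B, C, H, W).
    _, _, H, W = x.shape
    Xf = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    fy = torch.linspace(-0.5, 0.5, H, device=x.device).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W, device=x.device).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)        # normalized radial frequency, (H, W)
    bands, prev = [], torch.zeros_like(radius, dtype=torch.bool)
    for c in list(cutoffs) + [1.0]:
        mask = (radius <= c) & ~prev              # annular mask for this band
        prev = prev | mask
        band = torch.fft.ifft2(torch.fft.ifftshift(Xf * mask.float(), dim=(-2, -1))).real
        bands.append(band)
    return bands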



Paperid:1027 Poster
Authors:Weijie Wang,Jichao Zhang,Chang Liu,Xia Li,Xingqian Xu,Humphrey Shi,Nicu Sebe,Bruno Lepri
Abstract:
Recently, diffusion models have made significant strides in synthesizing realistic 2D human images based on provided text prompts. Building upon this, researchers have extended 2D text-to-image diffusion models into the 3D domain for generating human textures (UV Maps). However, several important problems with UV map generative models remain unsolved, i.e., how to generate personalized texture maps for any given face image, and how to define and evaluate the quality of these generated texture maps. To solve the above problems, we introduce a novel method, UVMap-ID, which is a controllable and personalized UV map generative model. Unlike traditional large-scale training methods in 2D, we propose to fine-tune a pre-trained text-to-image diffusion model which is integrated with a face fusion module for achieving ID-driven customized generation. To support the fine-tuning strategy, we introduce a small-scale attribute-balanced training dataset, including high-quality textures with labeled text and Face ID. Additionally, we introduce several metrics to evaluate multiple aspects of the textures. Finally, both quantitative and qualitative analyses demonstrate the effectiveness of our method in controllable and personalized UV map generation.



Paperid:1028 Poster
Authors:Yue Duan,Zhangxuan Gu,Zhenzhe Ying,Lei Qi,Changhua Meng,Yinghuan Shi
Abstract:
In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC$^2$'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of the pseudo-classification is leveraged to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques, with the highest gain exceeding 11.5% in terms of the sum metric for retrieval on NoW. The contributed dataset, source code and model weights will be released upon acceptance of the paper.



Paperid:1029 Poster
Authors:Chaofan Gan,Yuanpeng Tu,Yuxi Li,Weiyao Lin
Abstract:
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content, leading to performance degradation. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity to the threshold value, implying their poor robustness in real-world scenarios. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within the multimodal loss distribution. Then, in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance the semantic compactness and meanwhile alleviate over-fitting to noisy labels, where a self-correction strategy is introduced to improve the quality of representation by mining the valuable supervisory signals from multimodal predictions as well. Moreover, to evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, where DAC outperforms state-of-the-art models by a large margin (i.e., with +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).



Paperid:1030 Poster
Authors:Guan Luo,Tian-Xing Xu,Ying-Tian Liu,Xiaoxiong Fan,Fang-Lue Zhang,Song-Hai Zhang
Abstract:
The modeling and manipulation of 3D scenes captured from the real world are pivotal in various applications, attracting growing research interest. While previous works on editing have achieved interesting results through manipulating 3D meshes, they often require accurately reconstructed meshes to perform editing, which limits their application in 3D content generation. To address this gap, we introduce a novel single-image-driven 3D scene editing approach based on 3D Gaussian Splatting, enabling intuitive manipulation via directly editing the content on a 2D image plane. Our method learns to optimize the 3D Gaussians to align with an edited version of the image rendered from a user-specified viewpoint of the original scene. To capture long-range object deformation, we introduce positional loss into the optimization process of 3D Gaussian Splatting and enable gradient propagation through reparameterization. To handle occluded 3D Gaussians when rendering from the specified viewpoint, we build an anchor-based structure and employ a coarse-to-fine optimization strategy capable of handling long-range deformation while maintaining structural stability. Furthermore, we design a novel masking strategy that adaptively identifies non-rigid deformation regions for fine-scale modeling. Extensive experiments show the effectiveness of our method in handling geometric details, long-range, and non-rigid deformation, demonstrating superior editing flexibility and quality compared to previous approaches.



Paperid:1031 Poster
Authors:Jiabao Guo,Huan Liu,Yizhi Luo,Xueli Hu,Hang Zou,Yuan Zhang,Hui Liu,Bo Zhao
Abstract:
Face anti-spoofing (FAS) based on domain generalization (DG) has garnered increasing attention from researchers. The poor generalization is attributed to the model being overfitted to salient liveness-irrelevant signals. Previous methods addressed this issue by either mapping images from multiple domains into a common feature space or promoting the separation of image features from domain-specific and task-related features. However, direct manipulation of image features inevitably disrupts the semantic structure. Utilizing the text features of vision-language pre-trained (VLP) models, such as CLIP, to dynamically adjust image features offers the potential for better generalization, exploring a broader feature space while preserving semantic information. Specifically, we propose a FAS method called style-conditional prompt token learning (S-CTPL), which aims to generate generalized text features by training introduced prompt tokens to encode visual styles. These tokens are then utilized as weights for classifiers, enhancing the model's generalization. Unlike inherently static prompt tokens, our dynamic prompt tokens adaptively capture liveness-irrelevant signals from instance-specific styles, increasing their diversity through mixed feature statistics to further mitigate model overfitting. Thorough experimental analysis demonstrates that S-CTPL outperforms current top-performing methods across four distinct cross-dataset benchmarks.



Paperid:1032 Poster
Authors:Ajian Liu,Ma Hui,Junze Zheng,Haocheng Yuan,Xiaoyuan Yu,Yanyan Liang,Sergio Escalera,Jun Wan,Zhen Lei
Abstract:
Flexible modal Face Anti-spoofing (FAS) aims to aggregate all the available training modalities’ data to train a model, and enables flexible testing of any given modal samples. Previous works introduce shared cross-modal transformers (attentions) to facilitate the learning of modality-agnostic features, which inevitably leads to the distortion of feature structures and achieves limited performance. In this work, borrowing a solution from the large-scale vision-language models (VLMs) instead of directly removing modality-specific signals from visual features, we propose a novel Flexible Modal CLIP (\textbf{FM-CLIP}) for flexible modal FAS, that can utilize text features to dynamically adjust visual features to be modality independent. In the visual branch, considering the huge visual differences of the same attack in different modalities, which makes it difficult for classifiers to flexibly identify subtle spoofing clues in different test modalities, we propose Cross-Modal Spoofing Enhancer (\textbf{CMS-Enhancer}). It includes a Frequency Extractor (\textbf{FE}) and Cross-Modal Interactor (\textbf{CMI}), aiming to map different modal attacks in a shared frequency space to reduce interference from modality-specific signals and enhance spoofing clues by leveraging cross modal learning from the shared frequency space. In the text branch, we introduce a Language-Guided Patch Alignment (\textbf{LGPA}) based on the prompt learning, which further guides the image encoder to focus on patch level spoofing representations through dynamic weighting by text features. Thus, our FM-CLIP can flexibly test different modal samples by identifying and enhancing modality-agnostic spoofing cues. Finally, extensive experiments show that FM-CLIP is effective and outperforms state-of-the-art methods on multiple multi-modal datasets.



Paperid:1033 Poster
Authors:Yuzhen Du,Teng Hu,Ran Yi,Lizhuang Ma
Abstract:
Blind Face Restoration (BFR) aims to restore high-quality face images from low-quality images with unknown degradation. Previous GAN-based or ViT-based methods have shown promising results, but suffer from the loss of identity details once degradation is severe, while recent diffusion-based methods operate at the image level and require considerable inference time. To restore images under arbitrary degradation types with high quality and less time, we propose LD-BFR, a novel BFR framework that integrates the strengths of both vector quantization and latent diffusion. First, we employ a Dual Cross-Attention vector quantization to restore the degraded image in a global manner. Then we utilize the restored high-quality quantized feature as the guidance in our latent diffusion model to generate high-quality restored images with rich details. With the help of the proposed high-quality feature injection module, our LD-BFR effectively injects the high-quality feature as a condition to guide the generation of our latent diffusion model. Extensive experiments demonstrate the superior performance of our model over the state-of-the-art BFR methods.



Paperid:1034 Poster
Authors:Yansong Qu,Shaohui Dai,Xinyang Li,Jianghang Lin,Liujuan Cao,Shengchuan Zhang,Rongrong Ji
Abstract:
3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods.



Paperid:1035 Poster
Authors:Xiaochao Pan,Jiawei Yao,Hongrui Kou,Tong Wu,Canran Xiao
Abstract:
In the realm of autonomous driving, achieving precise 3D reconstruction of the driving environment is critical for ensuring safety and effective navigation. Neural Radiance Fields (NeRF) have shown promise in creating highly detailed and accurate models of complex environments. However, the application of NeRF in autonomous driving scenarios encounters several challenges, primarily due to the sparsity of viewpoints inherent in camera trajectories and the constraints on data collection in unbounded outdoor scenes, which typically occur along predetermined paths. This limitation not only reduces the available scene information but also poses significant challenges for NeRF training, as the sparse and path-distributed observational data lead to under-representation of the scene's geometry. In this paper, we introduce HarmonicNeRF, a novel approach for outdoor self-supervised monocular scene reconstruction. HarmonicNeRF capitalizes on the strengths of NeRF and enhances surface reconstruction accuracy by augmenting the input space with geometry-informed synthetic views. This is achieved through the application of spherical harmonics to generate novel radiance values, taking into careful consideration the color observations from the limited available real-world views. Additionally, our method incorporates proxy geometry to effectively manage occlusion, generating radiance pseudo-labels that circumvent the limitations of traditional image-warping techniques, which often fail in sparse data conditions typical of autonomous driving environments. Extensive experiments conducted on the KITTI, Argoverse, and NuScenes datasets demonstrate our approach establishes new benchmarks in synthesizing novel depth views and reconstructing scenes, significantly outperforming existing methods.



Paperid:1036 Poster
Authors:Pengfei Yue,Jianghang Lin,Shengchuan Zhang,Jie Hu,Yilin Lu,Hongwei Niu,Haixin Ding,Yan Zhang,GUANNAN JIANG,Liujuan Cao,Rongrong Ji
Abstract:
Referring image segmentation (RIS) aims to segment a particular region based on a specific expression. Existing one-stage methods have explored various fusion strategies, yet they encounter two significant issues. Primarily, most methods rely on manually selected visual features from the visual encoder layers, lacking the flexibility to selectively focus on language-preferred visual features. Moreover, the direct fusion of word-level features into coarsely aligned features disrupts the established vision-language alignment, resulting in suboptimal performance. In this paper, we introduce an innovative framework for RIS that seeks to overcome these challenges with adaptive alignment of vision and language features, termed the Adaptive Selection with Dual Alignment (ASDA). ASDA innovates in two aspects. Firstly, we design an Adaptive Feature Selection and Fusion (AFSF) module to dynamically select visual features focusing on different regions related to various descriptions. AFSF is equipped with a scale-wise feature aggregator to provide hierarchical coarse features that preserve crucial low-level details and provide robust features for the subsequent dual alignment. Secondly, a Word Guided Dual-Branch Aligner (WGDA) is leveraged to integrate coarse features with linguistic cues by word-guided attention, which effectively addresses the common issue of vision-language misalignment by ensuring that linguistic descriptors directly interact with mask prediction. This guides the model to focus on relevant image regions and make robust predictions. Extensive experimental results demonstrate that our ASDA framework surpasses state-of-the-art methods on the RefCOCO, RefCOCO+ and G-Ref benchmarks. The improvement not only underscores the superiority of ASDA in capturing fine-grained visual details but also highlights its robustness and adaptability to diverse descriptions.



Paperid:1037 Poster
Authors:Jiancheng Huang,Mingfu Yan,Songyan Chen,Yi Huang,Shifeng Chen
Abstract:
Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation.



Paperid:1038 Poster
Authors:Yibin Wang,WEIZHONG ZHANG,Jianwei Zheng,Cheng Jin
Abstract:
Image composition involves seamlessly integrating given objects into a specific visual context. The current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion in synthesis and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only slows down inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related words to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.



Paperid:1039 Poster
Authors:Bohong Chen,Yumeng Li,Yao-Xiang Ding,Tianjia Shao,Kun Zhou
Abstract:
Current co-speech motion generation approaches usually focus on upper body gestures following speech contents only, while lacking support for the elaborate control of synergistic full-body motion based on text prompts, such as {\it talking while walking}. The major challenges lie in 1) the existing speech-to-motion datasets only involve highly limited full-body motions, making a wide range of common human activities out of the training distribution; 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes an off-the-shelf text-to-motion dataset as an auxiliary for supplementing the missing full-body motion and prompts. The core technical contributions are two-fold. One is the multi-stage training process which obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between speech-to-motion and text-to-motion datasets. Another is the diffusion-based conditional inference process, which utilizes a separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments are conducted to verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speeches and user prompts, which is beyond the ability of existing approaches. The code is released on (link will be published upon acceptance).



Paperid:1040 Poster
Authors:Zhilin Huang,Yijie Yu,Ling Yang,Chujun Qin,Bing Zheng,Xiawu Zheng,Zikun Zhou,Yaowei Wang,Wenming Yang
Abstract:
With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods always struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames with the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in generating both visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.



Paperid:1041 Poster
Authors:Zhenqiang Li,Jie LI,Yangjie Cao,Jiayi Wang,Runfeng Lv
Abstract:
Recent advancements in 3D generation have garnered considerable interest due to their potential applications. Despite these advancements, the field faces persistent challenges in multi-conditional control, primarily due to the lack of paired datasets and the inherent complexity of 3D structures. To address these challenges, we introduce ImageBind3D, a novel framework for controllable 3D generation that integrates text, hand-drawn sketches, and depth maps to enhance user controllability. Our innovative contribution is the adoption of an inversion-align strategy, facilitating controllable 3D generation without requiring paired datasets. Firstly, utilizing GET3D as a baseline, our method innovates a 3D inversion technique that synchronizes 2D images with 3D shapes within the latent space of 3D GAN. Subsequently, we leverage images as intermediaries to facilitate pseudo-pairing between the shapes and various modalities. Moreover, our multi-modal diffusion model design strategically aligns external control signals with the generative model's latent knowledge, enabling precise and controllable 3D generation. Extensive experiments validate that ImageBind3D surpasses existing state-of-the-art methods in both fidelity and controllability. Additionally, our approach can offer composable guidance for any feed-forward 3D generative models, significantly enhancing their controllability.



Paperid:1042 Poster
Authors:Yu-Pei Song,Yuan-Tong Liu,Xiao Wu,Qi He,Zhaoquan Yuan,Ao Luo
Abstract:
A 3D model can be formulated by regressing the pose and shape parameters of a digital model from an image. The reconstruction of 3D cartoon characters shows a distinct difference compared to human subjects, primarily due to their diverse visual representations and unrestricted postures. Direct application of human-related methods to cartoon data faces conflicts between pose and shape learning, hindering the network's ability to learn features related to cartoon identity effectively. To address this, a dual-branch structure method called MagicCartoon is introduced, which models pose and shape independently. To enhance the correlation between the features extracted by the backbone network and the specific task, a feature consistency loss is proposed to mitigate the interference caused by other attributes. To capture the local details of cartoon characters and distinguish different categories, a hybrid feature fusion technique is introduced, which integrates the global features of the original image with the corresponding local features of the puzzle image, thereby offering a comprehensive representation of the input image. To cope with the diversity in cartoon character poses, a geometric-guided feedback loop is proposed. This mechanism achieves semantic alignment between modeling outcomes and input images through iterative loops. Evaluation on the 3DBiCar dataset demonstrates that MagicCartoon outperforms the parameter regression baseline method in reconstruction error.



Paperid:1043 Poster
Authors:Jiacheng Ruan,Jingsheng Gao,Mingye Xie,Suncheng Xiang,Zefang Yu,Ting Liu,yuzhuo fu,Xiaoye Qu
Abstract:
Recently, the Parameter Efficient Fine-Tuning (PEFT) method, which adjusts or introduces fewer trainable parameters to calibrate pre-trained models on downstream tasks, has been a hot research topic. However, existing PEFT methods within the traditional fine-tuning framework have two main shortcomings: 1) They overlook the explicit association between trainable parameters and downstream knowledge. 2) They neglect the interaction between the intrinsic task-agnostic knowledge of pre-trained models and the task-specific knowledge of downstream tasks. These oversights lead to insufficient utilization of knowledge and suboptimal performance. To address these issues, we propose a novel fine-tuning framework, named GIST, that can be seamlessly integrated into the current PEFT methods in a plug-and-play manner. Specifically, our framework first introduces a trainable token, called the Gist token, when applying PEFT methods on downstream tasks. This token serves as an aggregator of the task-specific knowledge learned by the PEFT methods and builds an explicit association with downstream tasks. Furthermore, to facilitate explicit interaction between task-agnostic and task-specific knowledge, we introduce the concept of knowledge interaction via a Bidirectional Kullback-Leibler Divergence objective. As a result, PEFT methods within our framework can enable the pre-trained model to understand downstream tasks more comprehensively by fully leveraging both types of knowledge. Extensive experiments on the 35 datasets demonstrate the universality and scalability of our framework. Notably, the PEFT method within our GIST framework achieves up to a 2.25% increase on the VTAB-1K benchmark with an addition of just 0.8K parameters (0.009‰ of ViT-B/16). Code is in the supplementary materials.



Paperid:1044 Poster
Authors:Jinpeng Yu,Binbin Huang,Yuxuan Zhang,Huaxia Li,Xu Tang,Shenghua Gao
Abstract:
Point cloud completion aims to recover accurate global geometry and preserve fine-grained local details from partial point clouds. Conventional methods typically predict unseen points directly from 3D point cloud coordinates or use self-projected multi-view depth maps to ease this task. However, these gray-scale depth maps cannot reach multi-view consistency, consequently restricting the performance. In this paper, we introduce a GeoFormer that simultaneously enhances the global geometric structure of the points and improves the local details. Specifically, we design a CCM Feature Enhanced Point Generator to integrate image features from multi-view consistent canonical coordinate maps (CCMs) and align them with pure point features, thereby enhancing the global geometry feature. Additionally, we employ the Multi-scale Geometry-aware Upsampler module to progressively enhance local details. This is achieved through cross attention between the multi-scale features extracted from the partial input and the features derived from previously estimated points. Extensive experiments on the PCN, ShapeNet-55/34, and KITTI benchmarks demonstrate that our GeoFormer outperforms recent methods, achieving the state-of-the-art performance. The code is ready and will be released soon.



Paperid:1045 Poster
Authors:Haonan Zhang,Pengpeng Zeng,Lianli Gao,Jingkuan Song,Heng Tao Shen
Abstract:
Recently, significant advancements have been made in supporting text-video retrieval by transferring large-scale image-text pre-training models through model adaptation, i.e., full fine-tuning, or prompt tuning, a parameter-efficient fine-tuning strategy. While full fine-tuning involves high computational costs, particularly with increasing model size, prompt tuning offers greater flexibility and efficiency by adjusting only a few learnable parameters. However, current prompt tuning methods rely on coarse visual and textual cues for the text-video retrieval task, neglecting domain-specific features when performing the adaptation. This approach may lead to sub-optimal performance due to the incorporation of irrelevant and indiscriminate knowledge. To address such an issue, we present Multi-grained Prompt Tuning (MPT) for text-video retrieval, which designs a variety of specific prompts to effectively explore semantic interaction across different modalities with diverse granularity. Specifically, we devise a multi-grained video encoder that employs spatial, temporal, and global prompts to transfer the base-generic knowledge from the image-text pre-trained model while comprehensively excavating determinative video-specific characteristics. Meanwhile, we introduce a novel multi-grained text encoder aimed at capturing various levels of textual clues through the utilization of word and phrase prompts. Extensive experiments on four benchmark datasets, i.e., MSR-VTT, ActivityNet, DiDeMo, and LSMDC, demonstrate that MPT achieves outstanding performance, surpassing state-of-the-art methods with negligible computational cost. The codebase is publicly available at: https://anonymous.4open.science/r/MPT-565F.



Paperid:1046 Poster
Authors:JingJing Xie,Yuxin Zhang,Mingbao Lin,Liujuan Cao,Rongrong Ji
Abstract:
This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption.



Paperid:1047 Poster
Authors:Qian Qu,Xinhang Wan,Weixuan Liang,Jiyuan Liu,Yu Feng,Huiying Xu,Xinwang Liu,En Zhu
Abstract:
The rapid development of multimedia techniques boosts the emergence of multi-view data, and how to uncover its intrinsic structure and utilize it for subsequent downstream tasks is crucial in data analysis. Multi-view clustering is a representative approach for handling multi-view data. The anchor-based method has received widespread attention for its excellent performance and low time complexity. However, existing methods suffer from two drawbacks that limit their performance, i.e., the assumption that all views are available and the limited interaction in anchor generation among views. In some scenarios, views arrive sequentially, and storing them all is challenging owing to limited storage space or privacy considerations, so existing anchor-based MVC is unsuitable for this setting. Additionally, recent works fail to generate anchors with the guidance of other views, and it is tough to align the anchor graphs. To this end, we propose A Lightweight Anchor-Based Incremental Framework for Multi-view Clustering. Specifically, we first initialize an anchor graph with the assistance of $k$-means when a new view arrives. Then, the consensus anchor graph is updated with the newly collected view via a permutation matrix. Our proposed method is more capable of anchor alignment because, in incremental MVC, the anchor graphs of previous views can serve as a reference to guide the generation of anchor graphs for the coming view. Furthermore, we design a three-step iterative and convergent strategy to solve the resultant problem. Notably, the proposed algorithm shows outstanding effectiveness and time/space efficiency in extensive experiments.
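
A minimal sketch of the anchor-graph construction step when a new view arrives, using $k$-means anchors and a Gaussian-kernel affinity; the running-average consensus update in the comments is a simplified stand-in for the paper's permutation-matrix alignment, and all parameter choices are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def build_anchor_graph(X, n_anchors=32, sigma=1.0):
    # X: (n, d) features of the newly arrived view (kept only transiently).
    anchors = KMeans(n_clusters=n_anchors, n_init=10).fit(X).cluster_centers_
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # (n, m) squared distances
    Z = np.exp(-d2 / (2 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)                     # row-normalized anchor graph

# When view t arrives:
# Z_t = build_anchor_graph(X_t)
# consensus = Z_t if t == 1 else (1 - 1.0 / t) * consensus + (1.0 / t) * Z_t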



Paperid:1048 Poster
Authors:Zhengze Xu,Mengting Chen,Zhao Wang,Linyu XING,Zhonghua Zhai,Nong Sang,Jinsong Lan,Shuai Xiao,Changxin Gao
Abstract:
Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a ``focus tunnel'' in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
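
A minimal sketch of smoothing per-frame crop coordinates with a constant-velocity Kalman filter, the kind of filtering used to build smooth focus-tunnel crops; the noise levels and the per-coordinate treatment are assumptions, not the paper's exact settings.

import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=1e-1):
    # z: (T,) noisy observations of one box coordinate; returns the filtered positions.
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition (pos, vel)
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.array([z[0], 0.0]), np.eye(2)
    out = []
    for zt in z:
        x = F @ x                             # predict
        P = F @ P @ F.T + Q
        y = zt - H @ x                        # update with the new observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

# Smooth each coordinate of per-frame crop boxes (T, 4) independently:
# boxes_smooth = np.stack([kalman_smooth_1d(boxes[:, i]) for i in range(4)], axis=1)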



Paperid:1049 Poster
Authors:Mengmeng Sheng,Zeren Sun,Gensheng Pei,Tao Chen,Haonan Luo,Yazhou Yao
Abstract:
Label noise, an inevitable issue in various real-world datasets, tends to impair the performance of deep neural networks. A large body of literature focuses on symmetric co-training, aiming to enhance model robustness by exploiting interactions between models with distinct capabilities. However, the symmetric training processes employed in existing methods often culminate in model consensus, diminishing their efficacy in handling noisy labels. To this end, we propose an Asymmetric Co-Training (ACT) method to mitigate the detrimental effects of label noise. Specifically, we introduce an asymmetric training framework in which one model (i.e., RTM) is robustly trained with a selected subset of clean samples while the other (i.e., NTM) is conventionally trained using the entire training set. We propose two novel criteria based on agreement and discrepancy between models, establishing asymmetric sample selection and mining. Moreover, a metric, derived from the divergence between models, is devised to quantify label memorization, guiding our method in determining the optimal stopping point for sample mining. Finally, we propose to dynamically re-weight identified clean samples according to their reliability inferred from historical information. We additionally employ consistency regularization to achieve further performance improvement. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our method.
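
An illustrative (hypothetical) selection rule in the spirit of agreement/discrepancy-based asymmetric co-training: "clean" samples are those on which both models agree with the given label and the robust model's loss is small, while "mined" samples are those on which the models agree with each other but not with the label. The thresholds and exact criteria are not taken from the paper.

import torch
import torch.nn.functional as F

@torch.no_grad()
def select_clean_and_mined(logits_rtm, logits_ntm, labels, loss_quantile=0.5):
    # logits_rtm / logits_ntm: (B, C) outputs of the robustly / conventionally trained model.
    pred_r = logits_rtm.argmax(dim=1)
    pred_n = logits_ntm.argmax(dim=1)
    loss_r = F.cross_entropy(logits_rtm, labels, reduction="none")
    small_loss = loss_r <= loss_r.quantile(loss_quantile)
    clean = (pred_r == labels) & (pred_n == labels) & small_loss
    mined = (pred_r == pred_n) & (pred_r != labels)
    return clean, mined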



Paperid:1050 Poster
Authors:Jiawei Chen,Dingkang Yang,Yue Jiang,Mingcheng Li,Jinjie Wei,Xiaolu Hou,Lihua Zhang
Abstract:
In the realm of Medical Visual Language Models (Med-VLMs), the quest for universal efficient fine-tuning mechanisms remains paramount yet largely unexplored, especially given that researchers in interdisciplinary fields are often extremely short of training resources. Most current Parameter-Efficient Fine-Tuning (PEFT) methods have not been comprehensively evaluated on Med-VLMs and mostly focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked and remains understudied. In this paper, we endeavour to explore an alternative to traditional PEFT methods, especially the impact of fine-tuning LayerNorm and Attention layers on Med-VLMs. Our comprehensive study spans both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms across tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings reveal that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also retains the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments demonstrate LayerNorm fine-tuning's superior adaptability and scalability, particularly in the context of large-scale medical VLMs. We hope this work will contribute to the ongoing discourse on optimizing efficient fine-tuning strategies for medical VLMs.
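
A minimal PyTorch sketch of the LayerNorm-only fine-tuning paradigm examined above: freeze every parameter, then re-enable gradients only for the LayerNorm affine parameters (the model in the usage comment is hypothetical).

import torch
import torch.nn as nn

def freeze_all_but_layernorm(model: nn.Module):
    # Freeze everything, then unfreeze only LayerNorm weights and biases.
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.3f}%)")

# Usage (hypothetical Med-VLM):
# model = MyMedVLM()
# freeze_all_but_layernorm(model)
# optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)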



Paperid:1051 Poster
Authors:Xinjie Jiang,Chenxi Zheng,Xuemiao Xu,Bangzhen Liu,Weiying Zheng,Huaidong Zhang,Shengfeng He
Abstract:
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for getting a deeper insight into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, usually split the task into two parts: one for identifying what categories are present and another for figuring out their temporal boundaries. This split overlooks the natural connection between these elements. Addressing the need for recognizing entity independence and their interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) Module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on both the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales.



Paperid:1052 Poster
Authors:Xian Zhong,Shengwang Hu,Wenxuan Liu,Wenxin Huang,Jianhao Ding,Zhaofei Yu,Tiejun Huang
Abstract:
Spiking neural networks (SNNs) have garnered significant attention for their low power consumption and high biological interpretability. Their rich spatio-temporal information processing capability and event-driven nature make them ideally suited for neuromorphic datasets. However, current SNNs struggle to balance accuracy and latency in classifying these datasets. In this paper, we propose a Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. Our work disentangles the dependency between the number of event frames and the time steps of SNNs, utilizing more event frames during the training stage to improve performance, while using fewer event frames during the inference stage to reduce latency. Nevertheless, the average output of SNNs across all time steps is susceptible to individual time steps with abnormal outputs, particularly at extremely low time steps. To tackle this issue, we implement a Step-wise Knowledge Distillation (SKD) module that considers variations in the output distribution of SNNs at each time step. Empirical evidence demonstrates that our method yields competitive performance in classification tasks on neuromorphic datasets, especially at lower time steps.
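
A minimal sketch of a step-wise distillation objective that aligns the student SNN's per-time-step output distribution with a teacher distribution via temperature-scaled KL divergence; the teacher source and temperature are assumptions, not the paper's exact formulation.

import torch.nn.functional as F

def stepwise_distillation_loss(student_logits, teacher_logits, tau=2.0):
    # student_logits: (T, B, C) per-time-step SNN outputs; teacher_logits: (B, C).
    # Distill at every time step instead of only matching the time-averaged output.
    T = student_logits.shape[0]
    teacher = F.softmax(teacher_logits / tau, dim=-1)
    loss = 0.0
    for t in range(T):
        log_student = F.log_softmax(student_logits[t] / tau, dim=-1)
        loss = loss + F.kl_div(log_student, teacher, reduction="batchmean") * tau ** 2
    return loss / T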



Paperid:1053 Poster
Authors:Lintao Dong,Wei Zhai,Zheng-Jun Zha
Abstract:
Universal few-shot dense prediction requires a versatile model capable of learning any dense prediction task from limited labeled images, which necessitates the model to possess efficient adaptation abilities. Prevailing few-shot learning methods rely on efficient fine-tuning of model weights for few-shot adaptation, which carries the risk of disrupting the pre-trained knowledge and lacks the capability to extract task-specific knowledge contained in the pre-trained model. To overcome these limitations, our paper approaches universal few-shot dense prediction from a novel perspective. Unlike conventional fine-tuning techniques that directly use all parameters of the model and modify a specific set of weights for few-shot adaptation, our method focuses on selecting the task-relevant computation pathways of the pre-trained model while keeping the model weights frozen. Building upon this idea, we introduce a novel framework UniDense for universal few-shot dense prediction. First, we construct a versatile MoE architecture for dense prediction based on the Stable Diffusion model. We then utilize episodes-based meta-learning to train a set of routers for this MoE model, called Meta-Routers, which act as hyper-networks responsible for selecting computation blocks relevant to each task. We demonstrate that fine-tuning these meta-routers for novel tasks enables efficient adaptation of the entire model. Moreover, for each few-shot task, we leverage support samples to extract a task embedding, which serves as a conditioning factor for meta-routers. This strategy allows meta-routers to dynamically adapt themselves for different few-shot task, leading to improved adaptation performance. Experiments on a challenging variant of Taskonomy dataset with 10 dense prediction tasks demonstrate the superiority of our approach.



Paperid:1054 Poster
Authors:Qi Mao,Lan Chen,Yuchao Gu,Zhen Fang,Mike Zheng Shou
Abstract:
Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with one dominant object in simple compositions. However, localized editing in images containing multiple objects and intricate compositions has not been well-studied in the literature, despite its growing real-world demands. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region, causing noticeable discordance with their complex surroundings. Meanwhile, attention-based methods such as Prompt-to-Prompt (P2P) often exhibit editing leakage and misalignment in more complex compositions. In this work, we propose MAG-Edit, a plug-and-play, inference-stage optimization method, that empowers attention-based editing approaches, such as P2P, to enhance localized image editing in intricate scenarios. In particular, MAG-Edit optimizes the noise latent feature by encouraging two mask-based cross-attention ratios of the edit token, which in turn gradually enhances the local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.



Paperid:1055 Poster
Authors:Xiangyan Qu,Jing Yu,Keke Gai,Jiamin Zhuang,Yuanmin Tang,Gang Xiong,Gaopeng Gou,Qi Wu
Abstract:
Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial semantic association. The code is available at https://anonymous.4open.science/r/EmDepart.



Paperid:1056 Poster
Authors:Lijian Yang,Weisheng Li,Yucheng Shu,Jianxun Mi,Yuping Huang,Bin Xiao
Abstract:
Deformable image registration (DIR) is crucial for many medical image applications. In recent years, learning-based methods utilizing the convolutional neural network (CNN) or the Transformer have demonstrated their superiority in image registration, dominating a new era for DIR. However, very few of these methods can satisfy the demands of real-time applications due to the high spatial resolution of 3D volumes and the high complexity of 3D operators. To tackle this, we propose losslessly downsampling by shifting the strided convolution. A grouping strategy is then used to reduce redundant computations and support self-consistency learning. As an inherent regularizer of the network design, self-consistency learning improves the deformation quality and enables halving the proposed network after training. Furthermore, the proposed shifted connection converts the decoding operations into a lower-dimensional space, significantly reducing decoding overhead. Extensive experimental results on medical image registration demonstrate that our method is competitive with state-of-the-art methods in terms of registration performance, and additionally, it achieves over $3\times$ the speed of most of them.
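
For intuition, the 2D PyTorch sketch below shows one way shifting a strided convolution over all stride offsets keeps every spatial position (a lossless analogue of downsampling); the paper operates on 3D volumes with a grouping strategy, so this is an illustrative simplification.

import torch
import torch.nn as nn

def shifted_strided_conv2d(x, conv: nn.Conv2d):
    # Apply the same stride-2 convolution at all 2x2 shift offsets so that no
    # spatial position is dropped; responses are stacked along a group axis.
    assert conv.stride == (2, 2)
    outs = []
    for dy in range(2):
        for dx in range(2):
            shifted = torch.roll(x, shifts=(-dy, -dx), dims=(2, 3))
            outs.append(conv(shifted))
    return torch.stack(outs, dim=1)            # (B, 4, C_out, H/2, W/2)

# x = torch.randn(1, 8, 64, 64)
# conv = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)
# y = shifted_strided_conv2d(x, conv)          # (1, 4, 16, 32, 32)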



Paperid:1057 Poster
Authors:Haodong Chen,Haojian Huang,Junhao Dong,Mingzhe Zheng,Dian Shao
Abstract:
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the scarcity of high-quality data, the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic level), we propose to extract the face segmentation masks and landmarks based on each frame (middle semantic level) and utilize the Multi-modal Large Language Model (MLLM) to further generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we also adopt Parameter-Efficient Fine-Tuning (PEFT) to enable efficient adaptation of large pre-trained models (i.e., CLIP) for this task. Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Analysis and ablation studies further validate its effectiveness.



Paperid:1058 Poster
Authors:Haibo Wang,Chenghang Lai,Yixuan Sun,Weifeng Ge
Abstract:
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they deal with VideoQA insufficiently, by simply taking uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, there are no human annotations for question-critical timestamps in existing VideoQA datasets. In light of this, we propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we first fuse the question and answer pairs as event descriptions to find multiple keyframes as target moments and pseudo-labels, with the visual-language alignment capability of the CLIP models. With these pseudo-labeled keyframes as additionally weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and sample question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.
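
A minimal sketch of Gaussian-based frame weighting: a small set of learnable Gaussians over normalized time produces per-frame weights from which question-critical frames can be sampled; the parameterization and top-k sampling here are assumptions for illustration, not the paper's exact GCG module.

import torch

def gaussian_frame_weights(num_frames, centers, widths):
    # centers, widths: (K,) parameters in (0, 1); returns a (T,) weight per frame.
    t = torch.linspace(0, 1, num_frames)
    w = torch.exp(-0.5 * ((t[None, :] - centers[:, None]) / widths[:, None]) ** 2)
    return w.sum(dim=0)

# centers = torch.sigmoid(torch.nn.Parameter(torch.randn(3)))
# widths  = torch.nn.functional.softplus(torch.nn.Parameter(torch.randn(3))) + 1e-3
# weights = gaussian_frame_weights(32, centers, widths)
# keyframe_idx = weights.topk(4).indices       # frames fed to the LMM as positive moments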



Paperid:1059 Poster
Authors:Zhaojian Li,Bin Zhao,Yuan Yuan
Abstract:
Synthesizing binaural audio according to personalized requirements is crucial for building immersive artificial spaces. Previous methods employ visual modalities to guide the spatialization of audio because it can provide spatial information about objects. However, the paradigm is dependent on object visibility and strict audiovisual correspondence, which makes it tough to satisfy personalized requirements. In addition, the visual counterpart to the audio may be crippled or even non-existent, which greatly limits the development of the field. To this end, we advocate exploring a novel task known as Text-guided Audio Spatialization (TAS), in which the goal is to convert mono audio into spatial audio based on text prompts. This approach circumvents harsh audiovisual conditions and allows for more flexible individualization. To facilitate this research, we construct the first TASBench dataset. The dataset provides a dense frame-level description of the spatial location of sounding objects in audio, enabling fine-grained spatial control. Since text prompts contain multiple sounding objects and spatial locations, the core issue of TAS is to establish the mapping relationship between text semantic information and audio objects. To tackle this issue, we design a Semantic-Aware Fusion (SAF) module to capture text-aware audio features and propose a text-guided diffusion model to learn the spatialization of audio, which can generate spatial audio consistent with text prompts. Extensive experiments on TASBench compare the proposed method with several methods from related tasks, demonstrating that our method is promising to achieve the personalized generation of spatial sense of audio under text prompts.



Paperid:1060 Poster
Authors:Zhengwei Yin,Guixu Lin,Mengshun Hu,Hao Zhang,Yinqiang Zheng
Abstract:
The domain of image restoration encompasses a wide array of highly effective models (e.g., SwinIR, CODE, DnCNN), each exhibiting distinct advantages in either efficiency or performance. Selecting and deploying these models necessitates careful consideration of resource limitations. While some studies have explored dynamic restoration through the integration of an auxiliary network within a unified framework, these approaches often fall short in practical applications due to the complexities involved in training, retraining, and hyperparameter adjustment, as well as the limitation of being totally controlled by the auxiliary network and biased by the training data. To address these challenges, we introduce FlexIR: a flexible and manipulable framework for image restoration. FlexIR is distinguished by three components: a meticulously designed hierarchical branch network enabling dynamic output, an innovative progressive self-distillation process, and a channel-wise evaluation method to enhance knowledge distillation efficiency. Additionally, we propose two novel inference methodologies to fully leverage FlexIR, catering to diverse user needs and deployment contexts. Through this framework, FlexIR achieves unparalleled performance across all branches, allowing users to navigate the trade-offs between quality, cost, and efficiency during the inference phase. Crucially, FlexIR employs a dynamic mechanism powered by a non-learning metric independent of training data, ensuring that FlexIR is entirely under the direct control of the user. Comprehensive experimental evaluations validate FlexIR’s flexibility, manipulability, and cost-effectiveness, showcasing its potential for straightforward adjustments and quick adaptations across a range of scenarios. Codes will be available at [URL].



Paperid:1061 Poster
Authors:GuoBiao Li,Sheng Li,Zhenxing Qian,Xinpeng Zhang
Abstract:
Image steganography is the process of hiding secret data in a cover image by subtle perturbation. Recent studies show that it is feasible to use a fixed neural network for data embedding and extraction. Such Fixed Neural Network Steganography (FNNS) demonstrates favorable performance without the need for training networks, making it more practical for real-world applications. However, the stego-images generated by the existing FNNS methods exhibit high distortion, which is prone to be detected by steganalysis tools. To deal with this issue, we propose a Cover-separable Fixed Neural Network Steganography, namely Cs-FNNS. In Cs-FNNS, we propose a Steganographic Perturbation Search (SPS) algorithm to directly encode the secret data into an imperceptible perturbation, which is combined with an AI-generated cover image for transmission. By accessing the same deep generative models, the receiver can reproduce the cover image using a pre-agreed key and separate the perturbation from the stego-image for data decoding. Such an encoding/decoding strategy focuses on the secret data and eliminates the disturbance of the cover images, hence achieving better performance. We apply our Cs-FNNS to the task of hiding secret images within cover images. Through comprehensive experiments, we demonstrate the superior performance of the proposed method in terms of visual quality and undetectability. Moreover, we show the flexibility of our Cs-FNNS in terms of hiding multiple secret images for different receivers.
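
As a concrete illustration of the cover-separable protocol described above, the following minimal Python sketch (a simplification, not the authors' code) shows how a sender adds a searched perturbation to an AI-generated cover and how a receiver, sharing the generator weights and a pre-agreed key, reproduces the same cover to isolate that perturbation before decoding; ToyGenerator and all function names are hypothetical stand-ins.

import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Stand-in for the pre-agreed deep generative model (weights assumed shared by both parties)."""
    latent_dim = 16
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(self.latent_dim, 3 * 32 * 32)
    def forward(self, z):
        return torch.sigmoid(self.net(z)).view(-1, 3, 32, 32)

def generate_cover(generator, key: int):
    torch.manual_seed(key)                     # the pre-agreed key acts as the sampling seed
    z = torch.randn(1, generator.latent_dim)
    return generator(z)

def sender_embed(cover, perturbation):
    # Stego = cover + searched imperceptible perturbation (clamping may slightly distort it).
    return (cover + perturbation).clamp(0.0, 1.0)

def receiver_separate(stego, generator, key):
    cover = generate_cover(generator, key)     # receiver reproduces the identical cover
    return stego - cover                       # isolate the data-carrying perturbation for decoding

gen = ToyGenerator()
cover = generate_cover(gen, key=42)
delta = 0.01 * torch.randn_like(cover)
stego = sender_embed(cover, delta)
recovered = receiver_separate(stego, gen, key=42)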



Paperid:1062 Poster
Authors:Yubo Li,De Cheng,Chaowei Fang,Changzhe Jiao,Nannan Wang,Xinbo Gao
Abstract:
Cloth-Changing Person Re-Identification (CC-ReID) aims to accurately identify a target person in the more realistic surveillance scenario where the clothes of a pedestrian may change drastically, which is critical in public security systems for tracking down disguised criminal suspects. Existing methods mainly transform the CC-ReID problem into cross-modality feature alignment from the data-driven perspective, without meticulously modelling interference factors such as clothes and camera view changes. This may lead to over-consideration or under-consideration of the influence of these factors on the extraction of robust and discriminative identity features. This paper proposes a novel algorithm for thoroughly disentangling identity features from the interference factors brought by clothes and camera view changes while ensuring their robustness and discriminativeness. It adopts a dual-stream identity feature learning framework consisting of a raw image stream and a cloth-erasing stream, to explore discriminative and cloth-irrelevant identity feature representations. Specifically, an adaptive cloth-irrelevant contrastive objective is introduced to contrast features extracted by the two streams, aiming to suppress the fluctuation caused by clothes textures in the identity feature space. Moreover, we innovatively mitigate the influence of the interference factors through a generative adversarial interference factor decoupling network. This network is targeted at capturing identity-related information residing in the interference factors and disentangling the identity features from such information. Extensive experimental results demonstrate the effectiveness of the proposed method, achieving superior performance to state-of-the-art methods. Our source code is available in the supplementary materials.



Paperid:1063 Poster
Authors:Zitong Huang,Ze Chen,Yuanze Li,Bowen Dong,Erjin Zhou,Yong Liu,Rick Siow Mong Goh,Chun-Mei Feng,Wangmeng Zuo
Abstract:
Few-Shot Class-Incremental Learning has shown remarkable efficacy in efficiently learning new concepts with limited annotations. Nevertheless, the heuristic few-shot annotations may not always cover the most informative samples, which largely restricts the capability of the incremental learner. We aim to start from a pool of large-scale unlabeled data and then annotate the most informative samples for incremental learning. To this end, this paper introduces Active Class-Incremental Learning (ACIL). The objective of ACIL is to select the most informative samples from the unlabeled pool to effectively train an incremental learner, aiming to maximize the performance of the resulting model. Note that vanilla active learning algorithms suffer from a class-imbalanced distribution among annotated samples, which restricts the ability of incremental learning. To achieve both class balance and informativeness in the chosen samples, we propose a $\textbf{C}$lass-$\textbf{B}$alanced $\textbf{S}$election ($\textbf{CBS}$) strategy. Specifically, we first cluster the features of all unlabeled images into multiple groups. Then, for each cluster, we employ a greedy selection strategy to ensure that the Gaussian distribution of the sampled features closely matches the Gaussian distribution of all unlabeled features within the cluster. Our CBS can be plugged into CIL methods that are based on pretrained models with prompt tuning techniques. Extensive experiments under the ACIL protocol across five diverse datasets demonstrate that CBS outperforms both random selection and other SOTA active learning approaches.
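
To make the selection step more tangible, here is a minimal sketch (a simplification, not the released code) of a cluster-then-greedy procedure: features are grouped with k-means and, within each cluster, samples are added greedily so that the running mean of the selected set stays close to the cluster mean. Matching only the mean rather than the full Gaussian, and the class_balanced_selection interface itself, are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def class_balanced_selection(features: np.ndarray, n_clusters: int, per_cluster: int):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        target_mean = features[idx].mean(axis=0)
        chosen, running_sum = [], np.zeros(features.shape[1])
        for _ in range(min(per_cluster, len(idx))):
            remaining = [i for i in idx if i not in chosen]
            # Greedily add the sample that keeps the selected-set mean closest to the cluster mean.
            best = min(remaining, key=lambda i: np.linalg.norm(
                (running_sum + features[i]) / (len(chosen) + 1) - target_mean))
            chosen.append(best)
            running_sum += features[best]
        selected.extend(chosen)
    return selected

# Example: 500 random 64-d features, 5 clusters, 4 picks per cluster.
feats = np.random.randn(500, 64).astype(np.float32)
print(class_balanced_selection(feats, n_clusters=5, per_cluster=4))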



Paperid:1064 Poster
Authors:Ruiqi Zhang,Jie Chen
Abstract:
Real-time mesh reconstruction is highly demanded for integrating human avatars in modern computer graphics applications. Current methods typically use coordinate-based MLPs to represent a 3D scene as a Signed Distance Field (SDF) and optimize it through volumetric rendering, relying on Marching Cubes for mesh extraction. However, volumetric rendering lacks training and rendering efficiency, and the dependence on Marching Cubes significantly impacts mesh extraction efficiency. This study introduces a novel approach, Mesh-Centric Gaussian Splatting (MCGS), which introduces a unique representation, Mesh-Centric SDF, and optimizes it using high-efficiency Gaussian Splatting. The primary innovation is Mesh-Centric SDF, a thin layer of SDF enveloping the underlying mesh that can be efficiently derived from the mesh. This derivation of the SDF from the mesh allows for mesh optimization through the SDF, providing the mesh as the 0 iso-surface and eliminating the need for slow Marching Cubes. The secondary innovation focuses on optimizing Mesh-Centric SDF with high-efficiency Gaussian Splatting. By dispersing the underlying mesh of Mesh-Centric SDF into multiple layers and generating Mesh-Constrained Gaussians on them, we create Multi-Layer Gaussians. These Mesh-Constrained Gaussians confine Gaussians within a 2D surface space defined by the mesh, ensuring an accurate correspondence between Gaussian rendering and mesh geometry. The Multi-Layer Gaussians serve as sampling layers of Mesh-Centric SDF and can be optimized with Gaussian Splatting, which further optimizes Mesh-Centric SDF and its underlying mesh. As a result, our method can directly optimize the underlying mesh through Gaussian Splatting, providing fast training and rendering speeds derived from Gaussian Splatting, as well as precise surface learning of SDF. Experiments demonstrate that our method achieves dynamic mesh reconstruction at over 30 FPS. In contrast, SDF-based methods using Marching Cubes achieve less than 1 FPS, and concurrent 3D Gaussian Splatting-based methods cannot extract reasonable meshes.



Paperid:1065 Poster
Authors:Jialu ZHANG,Xinyi Wang,Chenglin Yao,Jianfeng Ren,Xudong Jiang
Abstract:
Acquiring commonsense knowledge about entity-pairs from images is crucial across diverse applications. Distantly supervised learning has made significant advancements by automatically retrieving images containing entity pairs and summarizing commonsense knowledge from the bag of images. However, the retrieved images may not always cover all possible relations, and the informative features across the bag of images are often overlooked. To address these challenges, a Multi-modal Cross-domain Feature Learning framework is proposed to incorporate the general domain knowledge from a large vision-text foundation model, ViT-GPT2, to handle unseen relations and exploit complementary information from multiple sources. Then, a Group Attention module is designed to exploit the attentive information from other instances of the same bag to boost the informative features of individual instances. Finally, a Gamma-corrected Gated Fusion is designed to select a subset of informative instances for a comprehensive summarization of commonsense entity relations. Extensive experimental results demonstrate the superiority of the proposed method over state-of-the-art models for extracting commonsense knowledge.



Paperid:1066 Poster
Authors:Wencheng Han,Chen Zhang,Yang Zhou,Wentao Liu,Chen Qian,Cheng-zhong Xu,Jianbing Shen
Abstract:
While RAW images are efficient for image editing and perception tasks, their large size can strain camera storage and bandwidth. Techniques exist to reconstruct RAW images from sRGB data, but these methods typically require additional metadata from the RAW image, which adds to camera processing demands. To address this problem, we propose using prior metadata as a reference to reconstruct the RAW data instead of relying on per-image metadata. Prior metadata is extracted offline from reference RAW images, which are usually part of the training dataset and have similar scenes and lighting conditions to the target image. With this prior metadata, the camera does not need to provide any extra processing other than the sRGB images, and our model can autonomously find the desired prior information. To achieve this, we design a three-step pipeline. First, we build a pixel searching network that can find the most similar pixels in the reference RAW images as prior information. Then, in the second step, we compress the large-scale reference images to about 0.02% of their original size to reduce the searching cost. Finally, in the last step, we develop a neural network reconstructor to reconstruct the high-fidelity RAW images. Our model achieves performance comparable to, and even better than, RAW reconstruction methods based on per-image metadata.



Paperid:1067 Poster
Authors:XuHan Zhu,Yifei Xing,Ruiping Wang,Yaowei Wang,Xiangyuan Lan
Abstract:
Miscalibrated models tend to be unreliable and insecure for downstream applications. In this work, we attempt to highlight and remedy miscalibration in current scene graph generation (SGG) models, which has been overlooked by previous works. We discover that obtaining well-calibrated models for SGG is more challenging than conventional calibration settings, as long-tailed SGG training data exacerbates miscalibration with overconfidence in head classes and underconfidence in tail classes. We further analyze which components are explicitly impacted by the long-tailed data during optimization, thereby exacerbating miscalibration and unbalanced learning, including: \textbf{biased parameters}, \textbf{deviated boundaries}, and \textbf{distorted target distribution}. To address the above issues, we propose the \textbf{C}ompositional \textbf{O}ptimization \textbf{C}alibration (\textbf{COC}) method, comprising three modules: i. A parameter calibration module that utilizes a hyperspherical classifier to eliminate the bias introduced by biased parameters. ii. A boundary calibration module that disperses features of majority classes to consolidate the decision boundaries of minority classes and mitigate deviated boundaries. iii. A target distribution calibration module that addresses distorted target distribution, leverages within-triplet prior to guide confidence-aware and label-aware target calibration, and applies curriculum regulation to constrain learning focus from easy to hard classes. Extensive evaluation on popular benchmarks demonstrates the effectiveness of our proposed method in improving model calibration and resolving unbalanced learning for long-tailed SGG. Finally, our proposed method performs best on model calibration compared to different types of calibration methods and achieves state-of-the-art trade-off performance on balanced learning for SGG. The source codes and models will be available upon acceptance.



Paperid:1068 Poster
Authors:Ruxue Yan,wenya guo,XuBo Liu,Xumeng Liu,Ying Zhang,Xiaojie Yuan
Abstract:
Referring video object segmentation (RVOS) is a cross-modal task that aims to segment the target object described by language expressions. A video typically consists of multiple frames, and existing works conduct segmentation at either the clip level or the frame level. Clip-level methods process a clip at once and segment the multiple frames in parallel, lacking explicit inter-frame interactions. In contrast, frame-level methods facilitate direct interactions between frames by processing videos frame by frame, but they are prone to error accumulation. In this paper, we propose a novel tracking-forced framework, introducing high-quality tracking information and forcing the model to achieve accurate segmentation. Concretely, we utilize the ground-truth segmentation of previous frames as accurate inter-frame interactions, providing high-quality tracking references for object segmentation in the next frame. This decouples the current input from the previous output, which enables our model to concentrate on accurately segmenting based solely on the given tracking information, improving training efficiency and preventing error accumulation. For the inference stage without ground-truth masks, we carefully select the beginning frame to construct tracking information, aiming to ensure accurate tracking-based frame-by-frame object segmentation. With these designs, our tracking-forced method significantly outperforms existing methods on 4 widely used benchmarks by at least 3%. In particular, our method achieves 88.3% P@0.5 accuracy and an 87.6 overall IoU score on the JHMDB-Sentences dataset, surpassing the previous best methods by 5.0% and 8.0, respectively. Our code will be released once this manuscript is accepted.
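
The decoupling idea resembles teacher forcing in sequence models; the sketch below is an illustration under assumed interfaces (a hypothetical model(frame, ref_mask, text_feat) signature and float masks), not the released code. It shows a training step in which the tracking reference for frame t is the ground-truth mask of frame t-1 rather than the model's own previous prediction.

import torch
import torch.nn.functional as F

def tracking_forced_step(model, frames, gt_masks, text_feat, optimizer):
    """frames: (T, C, H, W); gt_masks: (T, 1, H, W) float masks in [0, 1]; text_feat: (D,)."""
    optimizer.zero_grad()
    loss = 0.0
    for t in range(1, frames.shape[0]):
        ref_mask = gt_masks[t - 1]                      # ground truth, not the model's output
        logits = model(frames[t], ref_mask, text_feat)  # hypothetical signature
        loss = loss + F.binary_cross_entropy_with_logits(logits, gt_masks[t])
    loss.backward()
    optimizer.step()
    return loss.item()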



Paperid:1069 Poster
Authors:Hu Gao,Jing Yang,Ying Zhang,Jingfan Yang,Bowen Ma,Depeng Dang
Abstract:
Stereo image super-resolution (stereoSR) strives to improve the quality of super-resolution by leveraging the auxiliary information provided by another perspective. Most approaches concentrate on refining module designs and stacking massive network blocks to extract and integrate information. Although there have been advancements, the memory and computation costs are increasing as well. To tackle this issue, we propose a lattice structure that autonomously learns the optimal combination patterns of network blocks, which enables the efficient and precise acquisition of feature representations and ultimately achieves lightweight stereoSR. Specifically, we draw inspiration from the lattice phase equalizer and design the lattice stereo NAFBlock (LSNB) to bridge pairs of NAFBlocks using a re-weight block (RWBlock) through a coupled butterfly-style topological structure. RWBlock empowers LSNB with the capability to explore various combination patterns of pairwise NAFBlocks by adaptive re-weighting of features. Moreover, we propose a lattice stereo attention module (LSAM) to search and transfer the most relevant features from the other view. The resulting tightly interlinked architecture is named LSSR. Extensive experiments demonstrate that our method performs competitively with the state of the art.



Paperid:1070 Poster
Authors:Leilei Ma,Hongxing Xie,Lei Wang,Yanping Fu,Dengdi Sun,Haifeng Zhao
Abstract:
Recently, large-scale vision-language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantic gaps and missing labels in a multi-label image. To tackle this challenge, we propose \textbf{T}ext-\textbf{R}egion \textbf{M}atching for optimizing \textbf{M}ulti-\textbf{L}abel prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or pixels, which contributes to bridging the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we further introduce multimodal contrastive learning to narrow the semantic gap between textual and visual modalities and establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the current state-of-the-art methods by a significant margin.



Paperid:1071 Poster
Authors:Pengxiang Cai,Zhiwei Liu,Guibo Zhu,Yunfang Niu,Jinqiao Wang
Abstract:
Pixel-level fine-grained image editing remains an open challenge. Previous works fail to achieve an ideal trade-off between control granularity and inference speed: they either fail to achieve pixel-level fine-grained control, or their inference speed requires optimization. To address this, this paper for the first time employs a regression-based network to learn the variation patterns of StyleGAN latent codes during the image dragging process. This method enables pixel-level precision in drag editing with little time cost. Users can specify handle points and target points on any GAN-generated images, and our method will move each handle point to its corresponding target point. To achieve this, we decompose the entire movement process into multiple sub-processes. Specifically, we develop an encoder-decoder based network named 'Latent Predictor' to predict the latent code motion trajectories from handle points to target points in an autoregressive manner. Moreover, to enhance the prediction stability, we introduce a component named 'Latent Regularizer', aimed at constraining the latent code motion within the distribution of natural images. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) inference speed and image editing performance at the pixel-level granularity.
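
The autoregressive rolling of the latent code can be pictured with the short sketch below. It is a simplification under assumptions of our own: the latent_predictor interface, the linear interpolation of intermediate targets, and the number of sub-steps are hypothetical, not the paper's specification.

import torch

@torch.no_grad()
def drag_latent(latent_predictor, w, handle_pts, target_pts, n_steps=10):
    """w: StyleGAN latent code, e.g. (1, L, 512); handle_pts, target_pts: (P, 2) tensors."""
    for step in range(n_steps):
        # Interpolate an intermediate target so each sub-step is a small move.
        alpha = (step + 1) / n_steps
        sub_target = handle_pts + alpha * (target_pts - handle_pts)
        delta_w = latent_predictor(w, handle_pts, sub_target)   # predicted latent increment
        w = w + delta_w                                         # roll the prediction forward
        handle_pts = sub_target                                 # handles assumed to reach the sub-target
    return w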



Paperid:1072 Poster
Authors:Xiaogang Wang,Yuhang Cheng,Ziyang Fan,Kai Xu
Abstract:
Great progress has been made in rendering translucent materials in recent years, but automatically estimating parameters for heterogeneous materials such as jade and human skin remains a challenging task, often requiring specialized and expensive physical measurement devices. In this paper, we present a novel approach for estimating and transferring the parameters of heterogeneous translucent materials from a single 2D image to 3D models. Our method consists of four key steps: (1) an efficient viewpoint selection algorithm that minimizes redundancy and ensures comprehensive coverage of the model; (2) initializing a homogeneous translucent material to render initial images for the translucent dataset; (3) editing the rendered translucent images to update the translucent dataset; and (4) optimizing the edited translucent results into material parameters using inverse rendering techniques. Our approach offers a practical and accessible solution that overcomes the limitations of existing methods, which often rely on complex and costly specialized devices. We demonstrate the effectiveness and superiority of our proposed method through extensive experiments, showcasing its ability to transfer and edit high-quality heterogeneous translucent materials on 3D models, surpassing the results achieved by previous techniques in 3D scene editing.



Paperid:1073 Poster
Authors:Guozhen Peng,Yunhong Wang,Yuwei Zhao,Shaoxiong Zhang,Annan Li
Abstract:
Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing humans at a distance in a non-intrusive way without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If convolution blocks are directly replaced with vision transformer blocks, the model may lack a strong local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computation complexity compared with multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. Besides, it can also aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field. Furthermore, we also propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, $i.e.$, Gait3D and GREW.
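
The pairing of global temporal attention with a local temporal convolution can be sketched as follows. This is a minimal illustration in which standard multi-head self-attention stands in for PGTA (the paper's cheaper approximation), and the residual connection and hyperparameters are assumptions.

import torch
import torch.nn as nn

class GlobalLocalTemporal(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.local = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):               # x: (B, T, C) frame-level gait features
        g, _ = self.attn(x, x, x)       # global temporal mixing
        g = g.transpose(1, 2)           # (B, C, T) for the temporal convolution
        out = self.local(g)             # local temporal aggregation
        return out.transpose(1, 2) + x  # residual connection (a design assumption)

feats = torch.randn(2, 30, 128)         # 2 sequences, 30 frames, 128-d features
print(GlobalLocalTemporal(128)(feats).shape)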



Paperid:1074 Poster
Authors:Zhijian Wu,Jun Li,Yang Hu,Dingjiang Huang
Abstract:
Although deep learning-based methods have made significant advances in the field of image restoration (IR), they often suffer from excessive model parameters. To tackle this problem, this work proposes a compact Transformer (Compacter) for lightweight image restoration by making several key designs. We employ the concepts of projection sharing, adaptive interaction, and heterogeneous aggregation to develop a novel Compact Adaptive Self-Attention (CASA). Specifically, CASA utilizes shared projection to generate Query, Key, and Value to simultaneously model spatial and channel-wise self-attention. The adaptive interaction process is then used to propagate and integrate global information from two different dimensions, thus enabling omnidirectional relational interaction. Finally, a depth-wise convolution is incorporated on Value to complement heterogeneous local information, enabling global-local coupling. Moreover, we propose a Dual Selective Gated Network (DSGN) to dynamically encapsulate the globality into each pixel for context-adaptive aggregation. Extensive experiments demonstrate that our Compacter achieves state-of-the-art performance for a variety of lightweight IR tasks with approximately 400K parameters.
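
A rough sketch of the shared-projection idea is given below: a single projection yields Q, K, and V that are reused for both spatial and channel self-attention, and a depth-wise convolution on V supplies local detail. The exact normalization, the adaptive-interaction step, and the DSGN gating are omitted, so treat this as an assumption-laden illustration rather than the paper's CASA.

import torch
import torch.nn as nn

class SharedProjAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)   # shared projection for Q, K, V
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):                          # x: (B, N, C), N = h * w
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        spatial = (q @ k.transpose(1, 2) / c ** 0.5).softmax(-1) @ v      # spatial self-attention
        channel = v @ (q.transpose(1, 2) @ k / n ** 0.5).softmax(-1)      # channel-wise self-attention
        local = self.dwconv(v.transpose(1, 2).reshape(b, c, h, w))        # depth-wise conv on V
        local = local.reshape(b, c, n).transpose(1, 2)
        return self.proj(spatial + channel + local)                       # simple sum aggregation (assumed)

x = torch.randn(1, 16 * 16, 64)
print(SharedProjAttention(64)(x, 16, 16).shape)          # torch.Size([1, 256, 64])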



Paperid:1075 Poster
Authors:Jin Sun,Xiaoshuang Shi,Zhiyuan Wang,Kaidi Xu,Heng Tao Shen,Xiaofeng Zhu
Abstract:
Modeling in computer vision has evolved toward MLPs. Vision MLPs naturally lack local modeling capability, and the simplest remedy is to combine them with convolutional layers. Convolution, famous for its sliding-window scheme, however suffers from redundancy and limited parallel computation under that scheme. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and parallelizable method to exploit locality. To this end, we propose a new MLP module, namely Shifted-Pillars-Concatenation (SPC), that consists of two processing steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation on the maps to aggregate local features. Then, we build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in a hybrid model of sMLPNet. Caterpillar attains excellent scores on small-scale datasets without extra data, positioning it as an ideal tool for data-hungry tasks without reliance on transfer learning. On the ImageNet-1k benchmark, Caterpillar also exhibits competitive performance (\textit{e.g.,} Caterpillar-B, 83.7%). Additionally, the SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer.
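
Because the two steps are so concrete, a small sketch helps fix the idea: shift the feature map one pixel along four directions with zero padding (no wrap-around), concatenate the shifted maps, and mix them with a 1x1 linear transform. The one-pixel shift distance, channel counts, and the use of a 1x1 convolution as the linear transform are assumptions, not necessarily the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPC(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(channels * 4, channels, kernel_size=1)  # linear transform + aggregation

    @staticmethod
    def shift(x, dx, dy):
        # Zero-pad then crop so content moves by (dx, dy) without wrapping around.
        padded = F.pad(x, (1, 1, 1, 1))
        _, _, h, w = x.shape
        return padded[:, :, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]

    def forward(self, x):                       # x: (B, C, H, W)
        shifted = [self.shift(x, dx, dy) for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
        return self.mix(torch.cat(shifted, dim=1))

print(SPC(32)(torch.randn(2, 32, 14, 14)).shape)   # torch.Size([2, 32, 14, 14])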



Paperid:1076 Poster
Authors:Ziyin Zhou,Ke Sun,Zhongxi Chen,Huafeng Kuang,Xiaoshuai Sun,Rongrong Ji
Abstract:
The rapid progress in generative models has given rise to the critical task of AI-Generated Content Stealth (AIGC-S), which aims to create AI-generated images that can evade both forensic detectors and human inspection. This task is crucial for understanding the vulnerabilities of existing detection methods and developing more robust techniques. However, current adversarial attacks often introduce visible noise, have poor transferability, and fail to address spectral differences between AI-generated and genuine images. To address this, we propose StealthDiffusion, a framework based on stable diffusion that modifies AI-generated images into high-quality, imperceptible adversarial examples capable of evading state-of-the-art forensic detectors. StealthDiffusion comprises two main components: Latent Adversarial Optimization, which generates adversarial perturbations in the latent space of stable diffusion, and Control-VAE, a module that reduces spectral differences between the generated adversarial images and genuine images without affecting the original diffusion model's generation process. Extensive experiments demonstrate the effectiveness of StealthDiffusion in both white-box and black-box settings, transforming AI-generated images into higher-quality adversarial forgeries with frequency spectra resembling genuine images. These images are classified as genuine by state-of-the-art forensic classifiers and are difficult for humans to distinguish.



Paperid:1077 Poster
Authors:Yan Zhuang,Yanlu Cai,WEIZHONG ZHANG,Cheng Jin
Abstract:
Multi-person motion prediction remains a challenging problem due to the intricate motion dynamics and complex interpersonal interactions, where uncertainty escalates rapidly across the forecasting horizon. Existing approaches often overlook modeling the motion dynamics among the prediction frames to reduce this uncertainty, leaving it entirely to deep neural networks, which lack a dynamic inductive bias and thus yield suboptimal performance. This paper addresses this limitation by proposing an effective multi-person motion prediction method named Hybrid Supervision Transformer (HSFormer), which formulates the dynamic modeling within the prediction horizon as a novel hybrid supervision task. To be precise, our method performs a rolling prediction process equipped with a hybrid supervision mechanism, which forces the model to predict the pose of the next frames based on the (typically error-containing) earlier predictions. In addition to the standard supervision loss, two further mechanisms, self supervision and auxiliary supervision, are introduced: the former minimizes the distance between predictions made from error-containing inputs and predictions made from error-free (ground-truth) inputs, improving the robustness of our model to input deviation at inference, while the latter guides the model to make accurate predictions based on the ground truth, stabilizing the training process. Optimization techniques, such as stop-gradient, are also extended to our model to improve training efficiency.
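
The rolling scheme with its three loss terms can be sketched as below. This is an interpretation under assumptions of our own (a hypothetical predictor that maps a pose history to the next pose, L1 losses, equal weighting, and the stop-gradient placement), not the paper's exact objective.

import torch
import torch.nn.functional as F

def hybrid_supervision_loss(predictor, history, future):
    """history: (B, T_in, D) observed poses; future: (B, T_out, D) target poses."""
    ctx_pred, ctx_gt, loss = history, history, 0.0
    for t in range(future.shape[1]):
        y_from_pred = predictor(ctx_pred)              # conditioned on earlier predictions (rolling)
        y_from_gt = predictor(ctx_gt)                  # conditioned on ground truth (auxiliary branch)
        target = future[:, t]
        loss = loss + F.l1_loss(y_from_pred, target) + F.l1_loss(y_from_gt, target) \
                    + F.l1_loss(y_from_pred, y_from_gt.detach())   # self-supervision with stop-gradient
        ctx_pred = torch.cat([ctx_pred[:, 1:], y_from_pred.unsqueeze(1)], dim=1)
        ctx_gt = torch.cat([ctx_gt[:, 1:], target.unsqueeze(1)], dim=1)
    return loss / future.shape[1]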



Paperid:1078 Poster
Authors:Seonggwan Ko,Yeong Jun Koh,Donghyeon Cho
Abstract:
Burst super-resolution (BurstSR) utilizes signal information from multiple adjacent frames taken in succession to restore rich textures. However, due to hand tremors and other image degradation factors, even recent BurstSR methods struggle to reconstruct finely textured images. On the other hand, reference-based super-resolution (RefSR) leverages a high-fidelity reference (Ref) image to recover detailed contents. Nevertheless, if there is no correspondence between the Ref and the low-resolution (LR) image, the output is degraded. To overcome the limitations of existing BurstSR and RefSR methods, we newly introduce reference-based burst super-resolution (RefBSR), which utilizes burst frames and a high-resolution (HR) external Ref image. RefBSR can restore the HR image by properly fusing the benefits of burst frames and a Ref image. To this end, we propose the first RefBSR framework, which consists of Ref-burst feature matching and burst feature-aware Ref texture transfer (BRTT) modules. In addition, our method adaptively integrates the higher-quality features between Ref and burst features using Ref-burst adaptive feature fusion (RBAF). To train and evaluate our method, we provide a new dataset of Ref-burst pairs collected by commercial smartphones. The proposed method achieves state-of-the-art performance compared to both existing RefSR and BurstSR methods, and we demonstrate its effectiveness through comprehensive experiments. The source codes and the newly constructed dataset will be made publicly available for further research.



Paperid:1079 Poster
Authors:Yunwei Bai,Bill Cai,Ying Kiat Tan,Zangwei Zheng,Shiming Chen,Tsuhan Chen
Abstract:
Few-shot learning (FSL) usually trains models on data from one set of classes, but tests them on data from a different set of classes, providing a few labeled support samples of the unseen classes as a reference for the trained model. Due to the lack of training data relevant to the target, there is usually high generalization error with respect to the test classes. Some existing methods attempt to address this generalization issue through ensembling. However, current ensemble-based FSL methods are computationally expensive. In this work, we propose a novel ensemble method (namely QuickBoost), which is surprisingly efficient and effective for improving the generalization of FSL. Specifically, QuickBoost includes a one-vs-all binary classifier (namely FSL-Forest) based on the random forest algorithm and is ensembled with off-the-shelf FSL models via logit-level averaging. FSL-Forest makes predictions via a set of decision tree stumps, each of which compares input pairs based on their feature-level elementwise value differences. Extensive experiments on three benchmarks show that our method achieves state-of-the-art performance with good efficiency; e.g., QuickBoost obtains a 6% accuracy improvement on miniImagenet 5-way-5-shot classification tasks over the Prototypical Network with 8 seconds of training on a CPU. Codes are available at https://anonymous.4open.science/r/FSL-QuickBoost-CEBA.
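
To make the ensembling concrete, here is a toy sketch under assumptions of our own: a one-vs-all random forest is fit on element-wise |difference| features to each class prototype (a simplification of the paper's pairwise stump comparisons), and its scores are averaged with the base FSL model's softmaxed logits with equal weights. None of these choices are claimed to match FSL-Forest exactly.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_logits(support_feats, support_labels, query_feats, n_way):
    """One-vs-all forests scored on element-wise |difference| to each class prototype."""
    scores = np.zeros((len(query_feats), n_way))
    for c in range(n_way):
        proto = support_feats[support_labels == c].mean(axis=0)
        X_train = np.abs(support_feats - proto)          # difference-style features (simplified)
        y_train = (support_labels == c).astype(int)
        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
        scores[:, c] = clf.predict_proba(np.abs(query_feats - proto))[:, 1]
    return scores

def quickboost_ensemble(fsl_logits, support_feats, support_labels, query_feats):
    probs = np.exp(fsl_logits) / np.exp(fsl_logits).sum(axis=1, keepdims=True)
    rf = forest_logits(support_feats, support_labels, query_feats, fsl_logits.shape[1])
    return 0.5 * probs + 0.5 * rf                        # equal-weight averaging (assumed)

# 5-way 5-shot toy episode with 64-d features and 10 queries.
sup = np.random.randn(25, 64); lab = np.repeat(np.arange(5), 5)
qry = np.random.randn(10, 64); base = np.random.randn(10, 5)
print(quickboost_ensemble(base, sup, lab, qry).argmax(axis=1))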



Paperid:1080 Poster
Authors:Zhihao Sun,Haipeng Fang,Juan Cao,Xinying Zhao,Danding Wang
Abstract:
Considering that image editing and manipulation technologies pose significant threats to the authenticity and security of image content, research on image regional manipulation detection has always been a critical issue. The accelerated advancement of generative AI significantly enhances the viability and effectiveness of generative regional editing methods and has led to their gradual replacement of traditional image editing tools or algorithms. However, current research primarily focuses on traditional image tampering, and there remains a lack of a comprehensive dataset containing images edited with abundant and advanced generative regional editing methods. We endeavor to fill this vacancy by constructing the GRE dataset, a large-scale generative regional editing detection dataset with the following advantages: 1) Integration of a logical and simulated editing pipeline, leveraging multiple large models in various modalities. 2) Inclusion of various editing approaches with distinct characteristics. 3) Provision of a comprehensive benchmark and evaluation of SOTA methods across related domains. 4) Analysis of the GRE dataset from multiple dimensions, including necessity, rationality, and diversity. Extensive experiments and in-depth analysis demonstrate that this larger and more comprehensive dataset will significantly enhance the development of detection methods for generative editing.



Paperid:1081 Poster
Authors:Yang Yang,LiyuanCao,Haoyu Shi,Huaiwen Zhang
Abstract:
Text-motion retrieval (TMR) is a significant cross-modal task, aiming to retrieve motion sequences that are semantically similar to a given query text. Existing studies primarily focus on representing and aligning the text and motion sequence with single embeddings. However, in the real world, a motion sequence usually consists of multiple atomic motions with complicated semantics. This simple approach can hardly capture the complex relations and abundant semantics in the text and motion sequence. In addition, most atomic motions may co-occur and be coupled together, which further brings considerable challenges in modeling and aligning query and motion sequences. In this paper, we regard TMR as a multi-instance multi-label learning (MIML) problem, where the motion sequence is viewed as a bag of atomic motions and the text is a bag of corresponding phrase descriptions. To address the MIML problem, we propose a novel multi-granularity semantics interaction (MGSI) approach to capture and align the semantics of text and motion sequences at various scales. Specifically, MGSI first decomposes the query and motion sequences into three levels: events (bags), actions (instances), and entities. After that, we adopt graph neural networks (GNNs) to explicitly model their semantic correlation and perform semantic interaction at corresponding scales to align text and motion. In addition, we introduce a co-occurred motion mining approach that adopts the semantic consistency between atomic motions as a measurement to identify co-occurring atomic motions. These co-occurred atomic motions are fused and interacted with the corresponding text to achieve precise cross-modal alignment. We evaluate our method on the HumanML3D and KIT-ML datasets, achieving improvements in Rsum of 23.09% on HumanML3D and 21.84% on KIT-ML.



Paperid:1082 Poster
Authors:Xuri Ge,Junchen Fu,Fuhai Chen,Shan An,Nicu Sebe,Joemon M. Jose
Abstract:
Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation, i.e., focusing only on the accuracy of AU recognition and overlooking explanations of corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and distinguishable global face description when optimising FAU recognition. Through this, the global facial representation and its local AU representations will achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is utilised to improve AU individual attention-aware representation capabilities based on the multi-scale combined facial stem feature. Extensive experiments on DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over the state-of-the-art methods on most of the metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretability language descriptions with the AUs' predictions.



Paperid:1083 Poster
Authors:Peng Wu,Xuerong Zhou,Guansong Pang,Zhiwei Yang,Qingsen Yan,PENG WANG,Yanning Zhang
Abstract:
The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than entire video frames, which implies that existing works based on frame-level features may be misled by the dominant background information and lack interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.



Paperid:1084 Poster
Authors:Siyuan Xu,Guannan Li,Haofei Song,Jiansheng Wang,Yan Wang,Qingli Li
Abstract:
Immunohistochemistry (IHC) plays a crucial role in understanding disease mechanisms, diagnosing pathology and guiding treatment decisions. The precise analysis heavily depends on accurate nucleus segmentation. However, segmentation is challenging due to significant inter- and intra-nucleus variability in morphology and distribution, stemming from inherent characteristics, imaging techniques, tissue differences and other factors. While current deep learning-based methods have shown promising results, their generalization performance is limited, inevitably requiring specific training data. To address the problem, we propose a novel General framework for Nucleus Segmentation in IHC images (GeNSeg-Net). GeNSeg-Net effectively segments nuclei across diverse tissue types and imaging techniques with high variability using a small subset for training. It comprises an enhancement model and a segmentation model. Initially, all nuclei are enhanced to a uniform morphology with distinct features by the enhancement model through generation. The subsequent segmentation task is thereby simplified, leading to higher accuracy. We design a lightweight generator and discriminator to improve both enhancement quality and computational efficiency. Extensive experiments demonstrate the effectiveness of each component within GeNSeg-Net. Compared to existing methods, GeNSeg-Net achieves state-of-the-art (SOTA) segmentation accuracy and generalization performance on both private and public datasets, while maintaining highly competitive processing speed. Code will be available for research and clinical purposes.



Paperid:1085 Poster
Authors:Fengfan Zhou,Qianyu Zhou,Bangjie Yin,Hui Zheng,Xuequan Lu,Lizhuang Ma,Hefei Ling
Abstract:
Face Recognition (FR) systems can be easily deceived by adversarial examples that manipulate benign face images through imperceptible perturbations. Adversarial attacks on FR encompass two types: impersonation (targeted) attacks and dodging (untargeted) attacks. Previous methods often achieve a successful impersonation attack on FR; however, this does not necessarily guarantee a successful dodging attack on FR in the black-box setting. In this paper, our key insight is that the generation of adversarial examples should perform both impersonation and dodging attacks simultaneously. To this end, we propose a novel attack method termed Adversarial Pruning (Adv-Pruning), to fine-tune existing adversarial examples to enhance their dodging capabilities while preserving their impersonation capabilities. Adv-Pruning consists of Priming, Pruning, and Restoration stages. Concretely, we propose Adversarial Priority Quantification to measure the region-wise priority of original adversarial perturbations, identifying and releasing those with minimal impact on absolute model output variances. Then, Biased Gradient Adaptation is presented to adapt the adversarial examples to traverse the decision boundaries of both the attacker and victim by adding perturbations favoring dodging attacks on the vacated regions, preserving the prioritized features of the original perturbations while boosting dodging performance. As a result, we can maintain the impersonation capabilities of original adversarial examples while effectively enhancing dodging capabilities. Comprehensive experiments demonstrate the superiority of our method compared with state-of-the-art adversarial attacks.
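
To give a feel for the pruning stage alone, the sketch below uses a simple proxy for region priority: the change in the model's output when a patch of the perturbation is zeroed out, with the lowest-impact patches released. The 16-pixel patch grid, the keep ratio, and the proxy itself are assumptions and are not the paper's Adversarial Priority Quantification.

import torch

@torch.no_grad()
def prune_perturbation(model, x, delta, patch=16, keep_ratio=0.7):
    base = model(x + delta)
    b, c, h, w = delta.shape
    scores, cells = [], []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = delta.clone()
            masked[:, :, i:i + patch, j:j + patch] = 0          # remove this region's perturbation
            scores.append((base - model(x + masked)).abs().mean().item())
            cells.append((i, j))
    order = sorted(range(len(cells)), key=lambda k: scores[k])  # lowest impact first
    pruned = delta.clone()
    for k in order[: int(len(cells) * (1 - keep_ratio))]:
        i, j = cells[k]
        pruned[:, :, i:i + patch, j:j + patch] = 0              # release low-priority regions
    return pruned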



Paperid:1086 Poster
Authors:Hengfei Wang,Zhongqun Zhang,Yihua Cheng,Hyung Jin Chang
Abstract:
Generating face image with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting its application. In this paper, we present a novel gaze-controllable face generation task. Our approach inputs textual descriptions that describe human gaze and head behavior and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate face sketch from text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.



Paperid:1087 Poster
Authors:Peng Ding,Jingyu Wu,Jun Kuang,Dan Ma,Xuezhi Cao,Xunliang Cai,Shi Chen,Jiajun Chen,Shujian Huang
Abstract:
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable performance on various visual-language understanding and generation tasks. However, MLLMs occasionally generate content inconsistent with the given images, which is known as "hallucination". Prior works primarily center on evaluating hallucination using standard, unperturbed benchmarks, which overlook the prevalent occurrence of perturbed inputs in real-world scenarios—such as image cropping or blurring—that are critical for a comprehensive assessment of MLLMs' hallucination. In this paper, to bridge this gap, we propose Hallu-PI, the first benchmark designed to evaluate Hallucination in MLLMs within Perturbed Inputs. Specifically, Hallu-PI consists of five perturbed scenarios, containing 3,930 perturbed images from 11 object types. Each image is accompanied by detailed annotations, which include fine-grained hallucination types, such as existence, attribute, and relation. We equip these annotations with a rich set of questions, making Hallu-PI suitable for both discriminative and generative tasks. Extensive experiments on 12 mainstream MLLMs, such as GPT-4V and Gemini-Pro Vision, demonstrate that these models exhibit significant hallucinations on Hallu-PI, which is not observed in unperturbed scenarios. Furthermore, our research reveals a severe bias in MLLMs’ ability to handle different types of hallucinations. We also design two baselines specifically for perturbed scenarios, namely Perturbed-Reminder and Perturbed-ICL. We hope that our study will bring researchers’ attention to the limitations of MLLMs when dealing with perturbed inputs, and spur further investigations to address this issue.



Paperid:1088 Poster
Authors:Timin Gao,Peixian Chen,Mengdan Zhang,Chaoyou Fu,Yunhang Shen,Yan Zhang,Shengchuan Zhang,Xiawu Zheng,Xing Sun,Liujuan Cao,Rongrong Ji
Abstract:
With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, the visual reasoning problem is usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential ``determining hallucinations'' in decision-making due to insufficient visual information and the limitation of low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales.



Paperid:1089 Poster
Authors:Bolei Chen,Jiaxu Kang,Ping Zhong,Yixiong Liang,Yu Sheng,Jianxin Wang
Abstract:
Object Navigation (ObjectNav), which enables an agent to seek any instance of an object category specified by a semantic label, has shown great advances. However, current agents are built upon occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. Driven by our embodied exploration strategy, BA is modeled by predicting navigational actions based on multi-frame visual images, as behaviors that cause differences between adjacent visual sensations are crucial for learning correlations among continuous visions. GC is modeled as the alignment of behavior-aware visual stimuli with 3D semantic shapes by employing unsupervised contrastive learning. The aligned behavior-aware visual features and geometric invariance priors are injected into a modular ObjectNav framework to enhance object recognition and exploration capabilities. As expected, our ECL method performs well on object detection and instance segmentation tasks. Our ObjectNav strategy outperforms state-of-the-art methods on the MP3D and Gibson datasets, showing the potential of our ECL in embodied navigation. The experimental code is available as supplementary material.



Paperid:1090 Poster
Authors:Shaocong Long,Qianyu Zhou,Xiangtai Li,Xuequan Lu,Chenhao Ying,Yuan Luo,Lizhuang Ma,Shuicheng YAN
Abstract:
Domain generalization (DG) aims at solving distribution shift problems in various scenes. Existing approaches are based on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), which suffer from limited receptive fields or quadratic complexity issues. Mamba, as an emerging state space model (SSM), possesses superior linear complexity and global receptive fields. Despite this, it can hardly be applied to DG to address distribution shifts, due to hidden state issues and inappropriate scan mechanisms. In this paper, we propose a novel framework for DG, named DGMamba, that excels in strong generalizability toward unseen domains and meanwhile has the advantages of global receptive fields and efficient linear complexity. Our DGMamba comprises two core components: Hidden State Suppressing (HSS) and Semantic-aware Patch Refining (SPR). In particular, HSS is introduced to mitigate the influence of hidden states associated with domain-specific features during output prediction. SPR strives to encourage the model to concentrate more on objects rather than context, and consists of two designs: Prior-Free Scanning (PFS) and Domain Context Interchange (DCI). Concretely, PFS aims to shuffle the non-semantic patches within images, creating more flexible and effective sequences from images, and DCI is designed to regularize Mamba with a combination of mismatched non-semantic and semantic information by fusing patches among domains. Extensive experiments on four commonly used DG benchmarks demonstrate that the proposed DGMamba achieves remarkably superior results to state-of-the-art models. The code will be made publicly available.
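
The patch-shuffling step of PFS can be pictured with the toy sketch below, which permutes only the token positions whose semantic score falls below a threshold. How DGMamba actually identifies semantic patches is not specified here, so the saliency input and the threshold are placeholders of our own.

import torch

def prior_free_scan(tokens, saliency, threshold=0.5):
    """tokens: (B, N, C) patch tokens; saliency: (B, N) per-patch semantic score in [0, 1]."""
    out = tokens.clone()
    for b in range(tokens.shape[0]):
        idx = torch.where(saliency[b] < threshold)[0]        # non-semantic patch positions
        perm = idx[torch.randperm(len(idx))]
        out[b, idx] = tokens[b, perm]                        # shuffle only those positions
    return out

toks = torch.randn(2, 196, 64)
sal = torch.rand(2, 196)
print(prior_free_scan(toks, sal).shape)                      # torch.Size([2, 196, 64])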



Paperid:1091 Poster
Authors:Wentao He,Jianfeng Ren,Ruibin Bai,Xudong Jiang
Abstract:
Advances in computer vision research enable human-like high-dimensional perceptual induction over analogical visual reasoning problems, such as Raven's Progressive Matrices (RPMs). In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP$^2$AI), consisting of three major components that tackle key challenges of RPM problems. Firstly, in view of the limited receptive fields of shallow networks in most existing RPM solvers, a perceptual encoder is proposed, consisting of a series of hierarchically coupled Patch Attention and Local Context (PALC) blocks, which capture local attributes at early stages and the global panel layout at deep stages. Secondly, most methods seek object-level similarities to map the context images directly to the answer image, while failing to extract the underlying analogies. The proposed reasoning module, Predictive Analogy-Inference (PredAI), consists of a set of Analogy-Inference Blocks (AIBs) to model and exploit the inherent analogical reasoning rules instead of object similarity. Lastly, the Squeeze-and-Excitation Channel-wise Attention (SECA) in the proposed PredAI discriminates essential attributes and analogies from irrelevant ones. Extensive experiments over four benchmark RPM datasets show that the proposed HP$^2$AI achieves significant performance gains over all the state-of-the-art methods consistently on all four datasets.



Paperid:1092 Poster
Authors:Zhangli Hu,Ye Chen,Zhongyin Zhao,Jinfan Liu,Bilian Ke,Bingbing Ni
Abstract:
Mainstream painting agents based on stroke-based rendering (SBR) attempt to translate visual appearance into a sequence of vectorized painting-style strokes. Lacking a direct mapping (and consequently differentiability) between the pixel domain and the stroke-parameter search space, these methods often yield non-realistic or artist-incompatible stroke decompositions, hindering their further application in high-quality art generation. To explicitly address this issue, we propose a novel SBR-based image-to-painting framework that aligns with artistic oil painting behaviors and techniques. At its heart is a semantic content stratification module which decomposes images into hierarchical painting regions encapsulated with semantics, according to which a coarse-to-fine strategy is developed to first fill in the abstract structure of the painting with coarse brushstrokes and then depict the detailed texture portrayal with a parallel-run localized multi-scale stroke search. In the meantime, we also propose a novel method that integrates SBR frameworks into a simulation-based interactive painting system for stroke quality assessment. Extensive experimental results on a wide range of images show that our method not only achieves high fidelity and an artist-like painting rendering effect with a reduced number of strokes, but also exhibits greater stroke quality than prior methods.



Paperid:1093 Poster
Authors:Hao Zhang,Ee Yeo Keat,Basura Fernando
Abstract:
Vision foundational models (e.g., CLIP) show strong generalization on various downstream visual perception tasks. However, their ability to reason beyond mere perception is limited, as they are only pre-trained on image-text pairs that hold semantically equivalent meanings. To tackle this, we propose a simple yet effective \textit{Region Conditioned Adaptation} (RCA), a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer hypotheses from local visual cues. Specifically, the RCA contains two novel modules: a regional prompt generator and Adapter$^\textbf{+}$. The former encodes ''local hints'' and ''global contexts'' into visual prompts separately at fine- and coarse-grained levels. The latter enhances the vanilla adapters with a newly designed Map Adapter, which directly steers the focus of the attention map with trainable query and key projections. Finally, we train the RCA with a new Dual-Contrastive Loss to regress the visual feature simultaneously toward features of the literal description (a.k.a. clue text) and the plausible hypothesis (abductive inference text). The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that the RCA significantly outperforms previous SOTAs, ranking \nth{1} on the leaderboards (e.g., Human Acc: RCA 31.74 \textit{vs} CPT-CLIP 29.58, higher = better). We also validate that the RCA is generalizable to local perception benchmarks like RefCOCO. We will open-source our code for future research.
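
A dual-contrastive objective of this kind is commonly implemented as two InfoNCE terms that pull the visual feature toward the clue-text and hypothesis-text features; the sketch below follows that generic recipe with an assumed temperature and equal weighting, and is not claimed to be the paper's exact loss.

import torch
import torch.nn.functional as F

def info_nce(img, txt, tau=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(img.shape[0], device=img.device)
    return F.cross_entropy(logits, labels)       # matched pairs sit on the diagonal

def dual_contrastive_loss(region_feat, clue_feat, hypothesis_feat):
    return info_nce(region_feat, clue_feat) + info_nce(region_feat, hypothesis_feat)

v = torch.randn(8, 512); c = torch.randn(8, 512); h = torch.randn(8, 512)
print(dual_contrastive_loss(v, c, h).item())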



Paperid:1094 Poster
Authors:Yisu Liu,Jinyang An,Wanqian Zhang,Dayan Wu,JingziGU,Zheng Lin,Weiping Wang
Abstract:
With the development of diffusion-based customization methods like DreamBooth, individuals now have the ability to train models that can generate their personalized images. Despite the convenience, malicious users have misused these techniques to create fake images, thereby triggering a privacy and security crisis. In light of this, proactive adversarial attacks are proposed to protect users against customization. The adversarial examples are trained to distort the customization model's outputs and thus block the misuse. In this paper, we propose DisDiff (Disrupting Diffusion), a novel adversarial attack method to disrupt the diffusion model outputs. We first delve into the intrinsic image-text relationships, well known as cross-attention, and empirically find that the subject-identifier token plays an important role in guiding image generation. Thus, we propose the Cross-Attention Erasure module to explicitly "erase" the indicated attention maps and disrupt the text guidance. Besides, we analyze the influence of the sampling process of the diffusion model on the Projected Gradient Descent (PGD) attack and introduce a novel Merit Sampling Scheduler to adaptively modulate the perturbation updating amplitude in a step-aware manner. Our DisDiff outperforms the state-of-the-art methods by 12.75% in FDFR score and 7.25% in ISM score across two facial benchmarks and two commonly used prompts on average.
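
For orientation, a bare-bones PGD loop with a step-aware update amplitude looks like the sketch below; the schedule callable stands in for the Merit Sampling Scheduler and the generic loss_fn for the cross-attention erasure objective, both of which are placeholders of our own rather than the paper's definitions.

import torch

def pgd_protect(image, loss_fn, steps=50, epsilon=8 / 255, base_alpha=2 / 255, schedule=None):
    delta = torch.zeros_like(image, requires_grad=True)
    for t in range(steps):
        loss = loss_fn(image + delta)                             # e.g., a cross-attention erasure loss
        loss.backward()
        alpha = base_alpha * (schedule(t) if schedule else 1.0)   # step-aware update amplitude
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                    # gradient-ascent update
            delta.clamp_(-epsilon, epsilon)                       # L-infinity budget
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)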



Paperid:1095 Poster
Authors:Jinxiao Zhang,Runmin Dong,Juepeng Zheng,Mengxuan Chen,Lixian Zhang,Yi Zhao,Haohuan Fu
Abstract:
With the increasing spatial and temporal resolutions of obtained remote sensing (RS) images, effective compression becomes critical for storage, transmission, and large-scale in-memory processing. Although image compression methods have achieved a series of breakthroughs for everyday images, a straightforward application of these methods to the RS domain underutilizes the properties of RS images, such as content duplication, homogeneity, and temporal redundancy. This paper proposes a Spatial-Temporal Context model (STCM) for RS image compression, jointly leveraging context from a broader spatial scope and across different temporal images. Specifically, we propose a stacked diagonal masked module to expand the contextual reference scope, which is stackable and maintains its parallel capability. Furthermore, we propose spatial-temporal contextual adaptive coding to enable the entropy estimation to reference context across different temporal RS images at the same geographic location. Experiments show that our method outperforms previous state-of-the-art compression methods in rate-distortion (RD) performance. For downstream task validation, our method reduces the bitrate by a factor of 52 for single-temporal images in the scene classification task while maintaining accuracy.



Paperid:1096 Poster
Authors:Ruowei Wang,Jiaqi Li,Dan Zeng,Xueqi Ma,Xu Zixiang,Jianwei Zhang,Qijun Zhao
Abstract:
Generating high-quality meshes with complex structures and realistic surfaces is the primary goal of 3D generative models. Existing methods typically employ sequence data or deformable tetrahedral grids for mesh generation. However, sequence-based methods have difficulty producing complex structures with many faces due to memory limits. The deformable tetrahedral grid-based method MeshDiffusion fails to recover realistic surfaces due to the inherent ambiguity in deformable grids. We propose the novel GenUDC framework to address these challenges, leveraging the Unsigned Dual Contouring (UDC) as a better mesh representation. UDC discretizes a mesh in a regular grid and divides it into the face and vertex parts, recovering both complex structures and fine details. As a result, the one-to-one mapping between UDC and mesh resolves the ambiguity problem. In addition, GenUDC adopts a two-stage, coarse-to-fine generative process for 3D mesh generation. It first generates the face part as a rough shape and then the vertex part to craft a detailed shape. Extensive evaluations demonstrate the superiority of UDC as a mesh representation and the favorable performance of GenUDC in mesh generation. The code and trained models will be released upon publication.



Paperid:1097 Poster
Authors:Pengcheng Zhang,Xiaohan Yu,Xiao Bai,Jin Zheng,Xin Ning
Abstract:
The development of person search techniques has been greatly promoted in recent years for its superior practicality and challenging goals. Despite their significant progress, existing person search models still lack the ability to continually learn from increasing real-world data and adaptively process input from different domains. To this end, this work introduces the continual person search task that sequentially learns on multiple domains and then performs person search on all seen domains. This requires balancing the stability and plasticity of the model to continually learn new knowledge without catastrophic forgetting. For this, we propose a \textbf{P}rompt-based C\textbf{o}ntinual \textbf{P}erson \textbf{S}earch (PoPS) model in this paper. First, we design a compositional person search transformer to construct an effective pre-trained transformer without exhaustive pre-training from scratch on large-scale person search data. This serves as the foundation for prompt-based continual learning. On top of that, we design a domain incremental prompt pool with a diverse attribute matching module. For each domain, we independently learn a set of prompts to encode the domain-oriented knowledge. Meanwhile, we jointly learn a group of diverse attribute projection and prototype embeddings to capture discriminative domain attributes. By matching an input image with the learned attributes across domains, the learned prompts can be properly selected for model inference. Extensive experiments are conducted to validate the proposed method for continual person search. The source code will be made available upon publication.



Paperid:1098 Poster
Authors:Zeyu Xiao,Zhihe Lu,Michael Bi Mi,Zhiwei Xiong,Xinchao Wang
Abstract:
In real-world photography, local motion blur often arises from the interplay between moving objects and stationary backgrounds during exposure. Existing deblurring methods face challenges in addressing local motion deblurring due to (i) the presence of arbitrary localized blurs and uncertain blur extents; (ii) the limited ability to accurately identify specific blurs resulting from ambiguous motion boundaries. These limitations often lead to suboptimal solutions when estimating blur maps and generating final deblurred images. To that end, we propose a novel method named Motion-Uncertainty-Guided Network (MUGNet), which harnesses a probabilistic representational model to explicitly address the intricacies stemming from motion uncertainties. Specifically, MUGNet consists of two key components, i.e., a motion-uncertainty quantification (MUQ) module and a motion-masked separable attention (M2SA) module, which serve complementary purposes. Concretely, MUQ aims to learn a conditional distribution for accurate and reliable blur map estimation, while the M2SA module enhances the representation of regions influenced by local motion blur and the static background, which is achieved by promoting the establishment of extensive global interactions. We demonstrate the superiority of our MUGNet with extensive experiments.



Paperid:1099 Poster
Authors:Ziyu Yao,Xuxin Cheng,Zhiqi Huang
Abstract:
Talking head generation is a significant research topic that still faces numerous challenges. Most previous works adopt generative adversarial networks (GANs) or regression models, which are plagued by limited generation quality and the average facial shape problem. Although current diffusion models demonstrate impressive generative ability, their exploration in talking head generation remains unsatisfactory. This is because they either solely use the diffusion model to obtain an intermediate representation and then employ another pre-trained renderer, or they overlook the feature decoupling of complex facial details, such as expressions, head poses and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial information through multiple stages. Specifically, we separate facial details into motion and appearance. In the initial phase, we design the Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, making them easier for the network to learn compared to high-dimensional RGB images. Subsequently, in the second phase, we encode the reference image to capture appearance textures. The predicted facial and head motions and encoded appearance then serve as the conditions for the Diffusion UNet, guiding the frame generation. Benefiting from decoupling facial details and fully leveraging diffusion models, extensive experiments substantiate that our approach excels in enhancing image quality and generating more accurate and diverse results compared to previous state-of-the-art methods.



Paperid:1100 Poster
Authors:Longtao Jiang,Min Wang,Zecheng Li,Yao Fang,Wengang Zhou,Houqiang Li
Abstract:
Sign language retrieval, as an emerging visual-language task, has received widespread attention. Different from traditional video retrieval, it is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details being drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features with similar semantic information from intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Besides the offline RGB encoder, the whole framework only contains learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on How2Sign, PHOENIX-2014T, and CSL-Daily datasets.



Paperid:1101 Poster
Authors:Kunlun Xu,Haozhuo Zhang,Yu Li,Yuxin Peng,Jiahuan Zhou
Abstract:
Current lifelong person re-identification (LReID) methods focus on tackling a clean data stream with correct labels. When noisy data with wrong labels are given, their performance is severely degraded since the model inevitably and continually remembers erroneous knowledge induced by the noise. Moreover, the well-known catastrophic forgetting issue in LReID becomes even more challenging since the correct knowledge contained in the old model is disrupted by noisy labels. Such a practical noisy LReID task is important but challenging, and few works have attempted to handle it so far. In this paper, we present an initial investigation of noisy LReID by proposing a Continual Knowledge Purification (CKP) method to address the catastrophic remembering of erroneous knowledge and catastrophic forgetting of correct knowledge simultaneously. Specifically, a Cluster-aware Data Purification module (CDP) is designed to obtain a cleaner subset of the given noisy data for learning. To achieve this, the label confidence is estimated based on the intra-identity clustering result where the high-confidence data are maintained. Besides, an Iterative Label Rectification (ILR) pipeline is proposed to rectify wrong labels by fusing the prediction and label information throughout the training epochs. Therefore, the noisy data are rectified progressively to facilitate new model learning. To handle the catastrophic remembering and forgetting issues, an Erroneous Knowledge Filtering (EKF) algorithm is proposed to estimate the knowledge correctness of the old model, and a weighted knowledge distillation loss is designed to transfer the correct old knowledge to the new model while excluding the erroneous one. Finally, a Noisy LReID benchmark is constructed for performance evaluation and extensive experimental results demonstrate that our proposed CKP method achieves state-of-the-art performance.
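A weighted knowledge distillation term of the kind described above can be sketched as follows; the per-sample `weights` are assumed to come from some estimate of the old model's correctness (e.g., label confidence), and the function name and temperature are illustrative only.

```python
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_logits, teacher_logits, weights, T=2.0):
    """Knowledge distillation where each sample's contribution is scaled by an
    estimate of how trustworthy the old (teacher) model's knowledge is.

    weights: tensor of shape (B,) in [0, 1]; low values down-weight samples whose
    old-model predictions are suspected to encode erroneous (noisy-label) knowledge.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    per_sample_kl = F.kl_div(log_p_student, p_teacher, reduction='none').sum(dim=-1)
    return (weights * per_sample_kl).mean() * (T * T)
```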



Paperid:1102 Poster
Authors:Qiwei Li,Yuxin Peng,Jiahuan Zhou
Abstract:
Online Continual Learning (OCL) aims at learning a model through a sequence of single-pass data, usually encountering the challenges of catastrophic forgetting both between different learning stages and within a stage. Currently, existing OCL methods address these issues by replaying part of previous data but inevitably raise data privacy concerns and stand in contrast to the setting of online learning where data can only be accessed once. Moreover, their performance will dramatically drop without any replay buffer. In this paper, we propose a Non-Exemplar Online Continual Learning method named Progressive Prototype Evolving (PPE). The core of our PPE is to progressively learn class-specific prototypes during the online learning phase without reusing any previously seen data. Meanwhile, the progressive prototypes of the current learning stage, serving as the accumulated knowledge of different classes, are fed back to the model to mitigate intra-stage forgetting. Additionally, to resist inter-stage forgetting, we introduce the Prototype Similarity Preserving and Prototype-Guided Gradient Constraint modules which distill and leverage the historical knowledge conveyed by prototypes to regularize the one-way model learning. Consequently, extensive experiments on three widely used datasets demonstrate the superiority of the proposed PPE against the state-of-the-art exemplar-based OCL approaches. Our code will be released.
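The following sketch shows one common way to evolve class prototypes online with a momentum update and without storing any exemplars; the class name, momentum value, and normalization choices are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn.functional as F

class PrototypeBank:
    """Class prototypes evolved online from single-pass features, without storing
    any exemplars. Assumes features and the bank live on the same device.
    """
    def __init__(self, num_classes, dim, momentum=0.9):
        self.protos = torch.zeros(num_classes, dim)
        self.seen = torch.zeros(num_classes, dtype=torch.bool)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # feats: (B, dim), labels: (B,) integer class ids from the current stream batch
        feats = F.normalize(feats, dim=-1)
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            if self.seen[c]:
                self.protos[c] = self.m * self.protos[c] + (1 - self.m) * mean_c
            else:
                self.protos[c] = mean_c
                self.seen[c] = True
            self.protos[c] = F.normalize(self.protos[c], dim=0)
```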



Paperid:1103 Poster
Authors:Zichen Liu,Yuxin Peng,Jiahuan Zhou
Abstract:
Visual prompting is an efficient methodology for finetuning pretrained visual models by introducing a small number of learnable parameters while keeping the backbone frozen. However, most existing visual prompting methods learn a shared prompt for all samples, making it challenging to grasp distinct characteristics among diverse samples, thereby limiting the model's performance. While other methods partially address this issue through sample clustering and learning multiple prompts, they still struggle to capture nuanced differences among instances and incur significant parameter overhead. Therefore, to comprehensively and efficiently leverage discriminative characteristics of individual instances, we propose an Instance Visual Prompting method, called InsVP. Initially, the instance image prompt is introduced to extract both crucial and nuanced discriminative information from the original image itself and is overlaid onto the input image. Furthermore, the instance feature prompt is designed to capture both commonalities and characteristics among individual instances, fed into the model's intermediate layers to facilitate feature extraction. Consequently, the instance image and feature prompts complement each other, enhancing the adaptation ability of pretrained models to extract discriminative features from individual instances. Extensive experiments on various large-scale benchmarks show that our InsVP achieves superior performance exceeding the state-of-the-art methods at a lower parameter cost. Our code will be released.



Paperid:1104 Poster
Authors:Yang Xin,Yu Zhou,Jianmin Jiang
Abstract:
While margin-based deep face recognition models, such as ArcFace and AdaFace, have achieved remarkable success in recent years, they may suffer from degraded performance when encountering training sets corrupted with noise. This is often inevitable when massive, large-scale datasets need to be handled, yet it remains difficult to construct sufficiently clean face datasets under these circumstances. In this paper, we propose a robust deep face recognition model, RobustFace, by combining the advantages of margin-based learning models with the strength of mining-based approaches to effectively mitigate the impact of noise during training. Specifically, we introduce a noise-adaptive mining strategy to dynamically adjust the emphasis balance between hard and noise samples by monitoring the model's recognition performance at the batch level to provide optimization-oriented feedback, enabling direct training on noisy datasets without the requirement of pre-training. Extensive experiments validate that our proposed RobustFace achieves competitive performance in comparison with existing SoTA models when trained with clean datasets. When trained with both real-world and synthetic noisy datasets, RobustFace significantly outperforms the existing models, especially when the synthetic noisy datasets are corrupted with both close-set and open-set noise. While the existing baseline models suffer from an average performance drop of around 40% under these circumstances, our proposed model still delivers accuracy rates of more than 90%.



Paperid:1105 Poster
Authors:Siyang Wang,JingHao Zhang,Jie Huang,Feng Zhao
Abstract:
The constrained data scale in low-level vision often induces the hazard of overfitting for restoration networks, necessitating the adoption of the pre-training paradigm. Mirroring the success of high-level pre-training approaches, recent methods in the low-level community aim to derive general visual representations from extensive data with synthesized degradation. In this paper, we propose a new perspective beyond the data-driven image pre-training paradigm for low-level vision, building upon the following examination. First, unlike the semantic extraction prevalent in high-level vision tasks, low-level vision primarily focuses on continuous and content-agnostic pixel-level regression, indicating that the diversified image contents inherent in large-scale data are potentially unnecessary for low-level vision pre-training. Second, considering that low-level degradations are highly relevant to the frequency spectrum, we discern that the low-level pre-training paradigm can be implemented in the Fourier space with fostered degradation sensitivity. Therefore, we develop an Image-Free Pre-training (IFP) paradigm, a novel low-level pre-training approach that requires only a single randomly sampled Gaussian noise image, streamlining the complicated data collection and synthesis procedure. The principle of IFP involves reconstructing the original Gaussian noise from the randomly perturbed counterpart with a partially masked spectrum band, facilitating robust spectrum representation extraction in response to capricious downstream degradations. Extensive experiments demonstrate the significant improvements brought by the IFP paradigm to various downstream tasks, such as a 1.31dB performance boost in low-light enhancement for Restormer, and improvements of 1.2dB in deblurring and 2.42dB in deraining for Uformer.
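A rough sketch of the kind of Fourier-space pretext task described above is given below: a single Gaussian noise image is corrupted by masking part of its spectrum, and a network is then asked to reconstruct the original noise. The masking pattern (random frequency bins rather than structured bands) and the function names are assumptions, not the paper's exact recipe.

```python
import torch

def masked_spectrum_pair(noise_img, mask_ratio=0.3):
    """Build a pretraining pair from a single Gaussian noise image: mask part of
    its Fourier spectrum and return (corrupted input, reconstruction target).
    noise_img: (C, H, W) real tensor sampled from a standard normal.
    """
    spec = torch.fft.fft2(noise_img)                       # complex spectrum
    h, w = noise_img.shape[-2:]
    keep = (torch.rand(h, w, device=noise_img.device) > mask_ratio).float()
    corrupted = torch.fft.ifft2(spec * keep).real          # drop the masked bins
    return corrupted, noise_img

# hypothetical usage: corrupted, target = masked_spectrum_pair(torch.randn(3, 128, 128))
#                     loss = torch.nn.functional.l1_loss(model(corrupted[None]), target[None])
```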



Paperid:1106 Poster
Authors:Xuntao Liu,Yuzhou Yang,Haoyue Wang,Qichao Ying,Zhenxing Qian,Xinpeng Zhang,Sheng Li
Abstract:
Deceptive images can quickly spread via social networking services, posing significant risks. The rapid progress in Image Manipulation Localization (IML) seeks to address this issue. However, the scarcity of public training datasets in the IML task directly hampers the performance of models. To address the challenge, we propose a Prompt-IML framework, which leverages the rich prior knowledge of pre-trained visual models by employing tunable prompts. Specifically, sets of tunable prompts enable the frozen pre-trained model to extract multi-view features, including spatial and high-frequency features. This approach minimizes redundant architecture for feature extraction across different views, resulting in reduced training costs. In addition, we develop a plug-and-play Feature Alignment and Fusion (FAF) module that seamlessly integrates into the backbone without additional structural modifications. The proposed module reduces noise and uncertainty in features through interactive processing. The experimental results showcase that our proposed method attains superior performance across 6 test datasets, demonstrating exceptional robustness.



Paperid:1107 Poster
Authors:Rongyu Zhang,Zefan Cai,Huanrui Yang,Zidong Liu,Denis A Gudovskiy,Tomoyuki Okuno,Yohei Nakata,Kurt Keutzer,Baobao Chang,Yuan Du,Li Du,Shanghang Zhang
Abstract:
Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). With the emerging availability of labels and natural language annotations of images through web-scale crawling or controlled generation, VeCAF makes use of this information to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence to meet the performance goal. This process is assisted by the inherent semantic richness of the text embedding space, which we use to augment image features. Furthermore, the flexibility of text-domain augmentation allows VeCAF to handle out-of-distribution scenarios without external data. Extensive experiments show the leading performance and high computational efficiency of VeCAF, which is superior to baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF uses up to 3.3$\times$ fewer training batches to reach the target performance compared to full finetuning, and achieves an accuracy improvement of 2.8% over the state-of-the-art active finetuning method with the same number of batches.



Paperid:1108 Poster
Authors:Yaopeng Peng,Milan Sonka,Danny Chen
Abstract:
The Vision Transformer has attained remarkable success in various computer vision applications. However, the large computational costs and complex design limit its ability in handling large feature maps. Existing research predominantly focuses on constraining attention to small local regions, which reduces the number of tokens involved in the attention computation while overlooking the computational demands caused by the feed-forward layer in the Vision Transformer block. In this paper, we introduce Group Vision Transformer (GVT), a relatively simple and efficient variant of Vision Transformer, aiming to improve attention computation. The core idea of our model is to divide and group the entire Transformer layer, instead of only the attention part, into multiple independent branches. This approach offers two advantages: (1) it helps reduce parameters and computational complexity; (2) it enhances the diversity of the learned features. We conduct a comprehensive analysis of the impact of different numbers of groups on model performance, as well as their influence on parameters and computational complexity. Our proposed GVT demonstrates competitive performance in several common vision tasks. For example, our GVT-Tiny model achieves 84.8% top-1 accuracy on ImageNet-1K, 51.4% box mAP and 45.2% mask mAP on MS COCO object detection and instance segmentation, and 50.1% mIoU on ADE20K semantic segmentation, outperforming the CAFormer-S36 model by 0.3% in ImageNet-1K top-1 accuracy, 1.2% in box mAP, 1.0% in mask mAP on MS COCO object detection and instance segmentation, and 1.2% in mIoU on ADE20K semantic segmentation, with similar model parameters and computational complexity. Code is accessible at https://github.com/AnonymousAccount6688/GVT.
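As a hedged illustration of grouping an entire Transformer layer (attention plus feed-forward) into independent branches, the PyTorch module below splits the channel dimension into groups, runs a small attention/FFN pair per group, and concatenates the results; the layer sizes, head counts, and residual placement are assumptions rather than GVT's exact design.

```python
import torch
import torch.nn as nn

class GroupedTransformerLayer(nn.Module):
    """Split the channel dimension into independent groups; each group runs its
    own (smaller) attention + feed-forward branch, and the outputs are concatenated.
    """
    def __init__(self, dim, num_groups=4, num_heads=2, mlp_ratio=2.0):
        super().__init__()
        assert dim % num_groups == 0
        gdim = dim // num_groups
        assert gdim % num_heads == 0
        self.num_groups = num_groups
        self.norms = nn.ModuleList(nn.LayerNorm(gdim) for _ in range(num_groups))
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(gdim, num_heads, batch_first=True)
            for _ in range(num_groups))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(gdim),
                          nn.Linear(gdim, int(gdim * mlp_ratio)),
                          nn.GELU(),
                          nn.Linear(int(gdim * mlp_ratio), gdim))
            for _ in range(num_groups))

    def forward(self, x):                                   # x: (B, N, dim)
        outs = []
        for xg, norm, attn, ffn in zip(x.chunk(self.num_groups, dim=-1),
                                       self.norms, self.attns, self.ffns):
            yg, _ = attn(norm(xg), norm(xg), norm(xg))      # pre-norm self-attention
            hg = xg + yg
            outs.append(hg + ffn(hg))                       # FFN branch with residual
        return torch.cat(outs, dim=-1)
```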



Paperid:1109 Poster
Authors:Zehao Chen,Zhan Lu,De Ma,Huajin Tang,Xudong Jiang,Qian Zheng,Gang Pan
Abstract:
Intrinsic decomposition for 3D scenes from multi-view images is challenging, especially in adverse conditions. We propose a novel event-based intrinsic decomposition framework that leverages events and images for stable decomposition under extreme scenarios. Our method is based on two observations: event cameras maintain good imaging quality, and events from different viewpoints exhibit similarity in diffuse regions while varying in specular regions. We establish an event-based reflectance model and introduce an event-based warping method to extract specular clues. Our two-part framework constructs a radiance field and decomposes the scene into normal, material, and lighting. Experimental results demonstrate superior performance compared to state-of-the-art methods. Our contributions include an event-based reflectance model, event warping-based consistency learning, and a framework for event-based intrinsic decomposition.



Paperid:1110 Poster
Authors:zhentao he,Changqun Xia,Shengye Qiao,Jia Li
Abstract:
Camouflaged instance segmentation (CIS) aims to seamlessly detect and segment objects blending with their surroundings. Existing CIS methods rely heavily on fully-supervised training with massive, precisely annotated data, consuming considerable annotation effort yet still struggling to segment highly camouflaged objects accurately. Despite their visual similarity to the background, camouflaged objects differ semantically. Since text associated with images offers explicit semantic cues to underscore this difference, in this paper we propose a novel approach: the first \textbf{T}ext-\textbf{P}rompt based weakly-supervised camouflaged instance segmentation method, named TPNet, leveraging semantic distinctions for effective segmentation. Specifically, TPNet operates in two stages: initiating with the generation of pseudo masks followed by a self-training process. In the pseudo mask generation stage, we innovatively align text prompts with images using a pre-trained language-image model to obtain region proposals containing camouflaged instances and specific text prompts. Additionally, a Semantic-Spatial Iterative Fusion module is ingeniously designed to assimilate spatial information with semantic insights, iteratively refining the pseudo masks. In the following stage, we employ Graduated Camouflage Learning, a straightforward self-training optimization strategy that evaluates camouflage levels to sequence training from simple to complex images, facilitating an effective learning gradient. Through the collaboration of the dual phases, we conduct comprehensive experiments on two common benchmarks and demonstrate a significant advancement, delivering a novel solution that bridges the gap between weakly-supervised and highly camouflaged instance segmentation.



Paperid:1111 Poster
Authors:Li Zhang,Zean Han,Yan Zhong,Qiaojun Yu,Xingyu Wu,xue Wang,RujingWang
Abstract:
Articulated objects are common in our daily life. However, current category-level articulation pose works mostly focus on predicting 9D poses on statistical point cloud observations. In this paper, we deal with the problem of category-level online robust 9D pose tracking of articulated objects, where we propose VoCAPTER, a novel 3D Voting-based Category-level Articulated object Pose TrackER. Our VoCAPTER efficiently updates poses between adjacent frames by utilizing partial observations from the current frame and the estimated per-part 9D poses from the previous frame. Specifically, by incorporating prior knowledge of continuous motion relationships between frames, we begin by canonicalizing the input point cloud, casting the pose tracking task as an inter-frame pose increment estimation challenge. Subsequently, to obtain a robust pose-tracking algorithm, our main idea is to leverage SE(3)-invariant features during motion. This is achieved through a voting-based articulation tracking algorithm, which identifies keyframes as reference states for accurate pose updating throughout the entire video sequence. We evaluate the performance of VoCAPTER in the synthetic dataset and real-world scenarios, which demonstrates VoCAPTER's generalization ability to diverse and complicated scenes. Through these experiments, we provide evidence of VoCAPTER's superiority and robustness in multi-frame pose tracking of articulated objects. We believe that this work can facilitate the progress of various fields, including robotics, embodied intelligence, and augmented reality. All the codes will be made publicly available.



Paperid:1112 Poster
Authors:Zhen Ye,Zeqian Ju,Haohe Liu,Xu Tan,Jianyi Chen,Yiwen Lu,Peiwen Sun,Jiahao Pan,Bianweizhen,Shulin He,Wei Xue,Qifeng Liu,Yike Guo
Abstract:
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io.



Paperid:1113 Poster
Authors:Wenxi Li,Yuchen Guo,Jilai Zheng,Haozhe Lin,Chao Ma,LU FANG,Xiaokang Yang
Abstract:
Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, making existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.



Paperid:1114 Poster
Authors:Chaolei Tan,Zihang Lin,Junfu Pu,Zhongang Qi,Wei-Yi Pei,Zhi Qu,Yexin Wang,Ying Shan,Wei-Shi Zheng,Jian-Fang Hu
Abstract:
Video grounding is a fundamental problem in vision-language understanding, which aims to localize the natural language queries in an untrimmed video. However, current video grounding datasets merely focus on the multimodal understanding of simple events and are either limited to shorter videos or brief sentences, which hinders the model from evolving toward stronger multimodal understanding capabilities or being applied in some more complex downstream scenarios. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and are paired with accurately localized human-written synopses. Each paragraph in the synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated to each other and contain a wealth of abstract expressions summarizing video storylines and specific descriptions portraying event details, which enables the model to learn multimodal perception on more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs and iteratively conduct fine-grained cross-modal reasoning within and across the two levels of structures between the long videos and long paragraphs. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority in long-term multi-paragraph video grounding over prior state-of-the-arts. Dataset and code will be released to foster the research on video grounding and multimodal understanding.



Paperid:1115 Poster
Authors:Jialiang Li,Haoyue Wang,Sheng Li,Zhenxing Qian,Xinpeng Zhang,ATHANASIOS V. VASILAKOS
Abstract:
Recently, a vast number of image generation models have been proposed, which raises concerns regarding the misuse of these artificial intelligence (AI) techniques for generating fake images. To attribute the AI-generated images, existing schemes usually design and train deep neural networks (DNNs) to learn the model fingerprints, which usually requires a large amount of data for effective learning. In this paper, we aim to answer the following two questions for AI-generated image attribution: 1) is it possible to design useful handcrafted filters to facilitate the fingerprint learning? and 2) how could we reduce the amount of training data after we incorporate the handcrafted filters? We first propose a set of Multi-Directional High-Pass Filters (MHFs), which are capable of extracting the subtle fingerprints from various directions. Then, we propose a Directional Enhanced Feature Learning network (DEFL) to take both the MHFs and randomly-initialized filters into consideration. The output of the DEFL is fused with the semantic features to produce a compact fingerprint. To make the compact fingerprint discriminative among different models, we propose a Dual-Margin Contrastive (DMC) loss to tune our DEFL. Finally, we propose a reference-based fingerprint classification scheme for image attribution. Experimental results demonstrate that it is indeed helpful to use our MHFs for attributing the AI-generated images. The performance of our proposed method is significantly better than the state-of-the-art for both closed-set and open-set image attribution, where only a small amount of images are required for training.
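For illustration, a set of simple fixed 3x3 directional high-pass (first-order difference) kernels can be implemented as below; these are generic residual filters in the spirit of directional high-pass filtering, not necessarily the paper's exact MHFs.

```python
import torch
import torch.nn.functional as F

def directional_highpass_kernels():
    """A few 3x3 first-order difference kernels along different directions
    (horizontal, vertical, and the two diagonals). Each kernel sums to zero,
    so flat regions are suppressed and directional high-frequency residuals remain.
    """
    h  = [[0, 0, 0], [-1, 1, 0], [0, 0, 0]]
    v  = [[0, -1, 0], [0, 1, 0], [0, 0, 0]]
    d1 = [[-1, 0, 0], [0, 1, 0], [0, 0, 0]]
    d2 = [[0, 0, -1], [0, 1, 0], [0, 0, 0]]
    k = torch.tensor([h, v, d1, d2], dtype=torch.float32)   # (4, 3, 3)
    return k.unsqueeze(1)                                    # (4, 1, 3, 3)

def extract_residuals(gray):
    # gray: (B, 1, H, W) grayscale input -> (B, 4, H, W) directional residual maps
    return F.conv2d(gray, directional_highpass_kernels().to(gray.device), padding=1)
```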



Paperid:1116 Poster
Authors:Zhangchi Feng,Richong Zhang,Zhijie Nie
Abstract:
The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the number of negatives available to the model. To address the lack of positive examples, we propose a data generation method that leverages a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and are designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for the low-resource scenario. The code is released at https://anonymous.4open.science/r/45F4 and will be publicly available upon acceptance.



Paperid:1117 Poster
Authors:Jiawei Yao,Yingxin Lai,Hongrui Kou,Tong Wu,Ruixi Liu
Abstract:
3D object detection plays a pivotal role in autonomous driving and robotics, demanding precise interpretation of Bird’s Eye View (BEV) images. The dynamic nature of real-world environments necessitates the use of dynamic query mechanisms in 3D object detection to adaptively capture and process the complex spatio-temporal relationships present in these scenes. However, prior implementations of dynamic queries have often faced difficulties in effectively leveraging these relationships, particularly when it comes to integrating temporal information in a computationally efficient manner. Addressing this limitation, we introduce a framework utilizing a dynamic query evolution strategy, which harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. By dynamically segmenting the BEV space and prioritizing key features through Top-K attention, our model achieves a real-time, focused analysis of pertinent scene elements. Our extensive evaluation on the nuScenes and Waymo datasets showcases a marked improvement in detection accuracy, setting a new benchmark in the domain of query-based BEV object detection. Our dynamic query evolution strategy has the potential to push the boundaries of current BEV methods with enhanced adaptability and computational efficiency.
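A minimal sketch of Top-K attention, where each query attends only to its k highest-scoring keys, is shown below; the function name, default k, and tie-handling behavior are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, topk=32):
    """Sparse attention that keeps only the top-k keys per query.

    q: (B, Nq, D), k/v: (B, Nk, D). For each query, scores outside the top-k
    are masked to -inf before the softmax, so only the most relevant BEV
    features contribute to the query update.
    """
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5        # (B, Nq, Nk)
    topk = min(topk, scores.size(-1))
    thresh = scores.topk(topk, dim=-1).values[..., -1:]          # k-th largest score per query
    masked = scores.masked_fill(scores < thresh, float('-inf'))
    return F.softmax(masked, dim=-1) @ v
```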



Paperid:1118 Poster
Authors:Zhuoxiao Chen,Zixin Wang,Yadan Luo,Sen Wang,Zi Huang
Abstract:
LiDAR-based 3D object detection has seen impressive advances in recent times. However, deploying trained 3D detectors in the real world often yields unsatisfactory performance when the distribution of the test data significantly deviates from the training data due to different weather conditions, object sizes, etc. A key factor in this performance degradation is the diminished generalizability of pre-trained models, which creates a sharp loss landscape during training. Such sharpness, when encountered during testing, can precipitate significant performance declines, even with minor data variations. To address the aforementioned challenges, we propose dual-perturbation optimization (DPO) for Test-time Adaptation in 3D Object Detection (TTA-3OD). We minimize the sharpness to cultivate a flat loss landscape to ensure model resiliency to minor data variations, thereby enhancing the generalization of the adaptation process. To fully capture the inherent variability of the test point clouds, we further introduce adversarial perturbation to the input BEV features to better simulate the noisy test environment. As the dual perturbation strategy relies on trustworthy supervision signals, we utilize a reliable Hungarian matcher to filter out pseudo-labels sensitive to perturbations. Additionally, we introduce early Hungarian cutoff to avoid error accumulation from incorrect pseudo-labels by halting the adaptation process. Extensive experiments across three types of transfer tasks demonstrate that the proposed DPO significantly surpasses previous state-of-the-art approaches, specifically on Waymo $\rightarrow$ KITTI, outperforming the most competitive baseline by 57.72% in $\text{AP}_\text{3D}$ and reaching 91% of the fully supervised upper bound. Our code is available in the supplementary materials.
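The sketch below combines a SAM-style weight perturbation with an adversarial perturbation of the input features in one training step, as a hedged illustration of a dual-perturbation update; the radii, the use of the gradient sign for the input perturbation, and the function name are assumptions rather than the paper's exact DPO procedure.

```python
import torch

def dual_perturbation_step(model, loss_fn, batch, optimizer, rho=0.05, eps=0.01):
    """One illustrative 'dual perturbation' update: (1) an adversarial nudge on the
    input features, (2) a sharpness-aware (SAM-style) weight perturbation, then a
    descent step at the perturbed weights.
    """
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # (1) adversarial perturbation of the input (e.g., BEV features)
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

    # (2) ascent step on the weights toward the sharpest direction
    loss = loss_fn(model(x_adv), y)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    eps_w = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps_w):
            p.add_(e)

    # descent at the perturbed point, then undo the weight perturbation
    optimizer.zero_grad()
    loss_fn(model(x_adv), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps_w):
            p.sub_(e)
    optimizer.step()
```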



Paperid:1119 Poster
Authors:Xun Lin,Yi Yu,Zitong YU,Ruohan Meng,Jiale Zhou,Ajian Liu,Yizhong Liu,Shuai Wang,Wenzhong Tang,Zhen Lei,Alex Kot
Abstract:
Despite the advancements deep learning has brought to medical image analysis (MIA), protecting the privacy of images remains a challenge. In a client-server MIA framework, especially after the network deployment, patients' private medical images can be easily captured by attackers from the transmission channel or malicious third-party servers. Previous MIA privacy-enhancing methods, whether based on distortion or homomorphic encryption, expose the fact that the transmitted images are medical images or transform the images into semantic-lacking noise. This tends to alert attackers, thereby falling into a \textit{cat-and-mouse game} of theft and protection. To address this issue, we propose a covert MIA framework based on deep image hiding, namely HideMIA, which secures medical images by embedding them within natural cover images that are unlikely to raise suspicion. By directly analyzing the hidden medical images in the steganographic domain, HideMIA makes it difficult for attackers to notice the presence of medical images during transmission or on third-party servers. Specifically, we propose the Mixture-of-Difference-Convolutions (MoDC) and Asymmetric Wavelet Attention (AsyWA) to enable HideMIA to conduct fine-grained analysis on each wavelet sub-band within the steganographic domain, mining features that are irrelevant to the cover and specific to medical images. Moreover, to reduce the resource consumption of HideMIA on client devices, we design function-aligned knowledge distillation to obtain a lightweight image hiding network, namely LightIH. Extensive experiments on six medical datasets demonstrate that our resource-friendly HideMIA achieves superior MIA performance and protective imperceptibility on both covert medical image segmentation and classification tasks.



Paperid:1120 Poster
Authors:Zhenghao Chen,Luping Zhou,Zhihao Hu,Dong Xu
Abstract:
Content-adaptive compression is crucial for enhancing the adaptability of the pre-trained neural codec to various contents. Although these methods have been very practical in neural image compression (NIC), their application in neural video compression (NVC) is still limited due to two main aspects: 1) video compression relies heavily on temporal redundancy, so updating just one or a few frames can lead to significant errors accumulating over time; 2) NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address the previously mentioned challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves adopting a patch-based Group of Pictures (GoP) training strategy to segment a video into patch-based GoPs, which will be updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several lightweight adapters into each coding component of the encoding process in both serial and parallel configurations. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate our GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability on one medical image benchmark.



Paperid:1121 Poster
Authors:Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu
Abstract:
Deep learning-based all-in-one image restoration methods have garnered significant attention in recent years due to their ability to address multiple degradation tasks. These methods focus on extracting task-oriented information to guide the unified model and have achieved promising results through elaborate architecture design. However, the proper optimization strategy for all-in-one tasks has been scarcely investigated, and an indiscriminately mixed training paradigm is typically adopted. This oversight neglects the intricate relationships and potential conflicts among various restoration tasks, consequently leading to inconsistent optimization rhythms and generalization bottlenecks. In this paper, we unveil this overlooked aspect of inconsistent convergence in multi-task learning dynamics and endeavor to alleviate this challenge from the perspective of active optimization. Specifically, we extend and redefine the conventional all-in-one image restoration task as a multi-task learning problem and propose a straightforward yet effective active-reweighting strategy, dubbed $\textbf{Art}$, to harmonize the optimization of multiple degradation tasks. Art is a plug-and-play optimization strategy designed to refine optimization dynamics and mitigate hidden conflicts among multi-task optimization processes. Through extensive experiments on a diverse range of all-in-one image restoration settings, Art has been demonstrated to substantially enhance the performance of existing methods. When incorporated into the AirNet and TransWeather models, it achieves average improvements of $\textbf{1.16}$ dB and $\textbf{1.24}$ dB in PSNR, respectively. We hope this work will provide a principled framework for collaborating multiple tasks in all-in-one image restoration and pave the way for more efficient and effective restoration models, ultimately advancing the state-of-the-art in this critical research domain. Code and pre-trained models are available at our project page.
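One simple way to actively re-weight per-task losses based on their relative convergence speed, in the spirit of the strategy described above, is sketched below; the softmax-over-loss-ratios rule, the EMA bookkeeping, and the temperature are assumptions and not the paper's exact Art formulation.

```python
import torch

def active_reweight(losses, ema_losses, beta=0.9, temperature=1.0):
    """Re-weight per-task losses according to their relative convergence rate:
    tasks whose loss decays slowly (ratio close to 1) receive larger weights.

    losses: dict task_name -> current scalar loss tensor
    ema_losses: dict task_name -> running average of past losses (floats), updated in place
    """
    names = list(losses.keys())
    ratios = torch.tensor([losses[n].item() / (ema_losses[n] + 1e-8) for n in names])
    weights = torch.softmax(ratios / temperature, dim=0) * len(names)

    total = sum(w * losses[n] for w, n in zip(weights, names))
    for n in names:  # update the running averages for the next step
        ema_losses[n] = beta * ema_losses[n] + (1 - beta) * losses[n].item()
    return total
```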



Paperid:1122 Poster
Authors:Jingjing Hu,Dan Guo,Kun Li,Zhan Si,Xun Yang,Meng Wang
Abstract:
Video Moment Retrieval (MR) tasks involve predicting the moment described by a given natural language or spoken language query in an untrimmed video. In this paper, we propose a novel Maskable Retentive Network (MRNet) to address two key challenges in MR tasks: cross-modal guidance and video sequence modeling. Our approach introduces a new retention mechanism into the multimodal Transformer architecture, incorporating modality-specific attention modes. Specifically, we employ the Unlimited Attention for language-related attention regions to maximize cross-modal mutual guidance. Then, we introduce the Maskable Retention for the video-only attention region to enhance video sequence modeling, recognizing two crucial characteristics of video sequences: 1) bidirectional, decaying, and non-linear temporal associations between video clips, and 2) sparse associations of key information semantically related to the query. We propose a bidirectional decay retention mask to explicitly model temporal-distant context dependencies of video sequences, along with a learnable sparse retention mask to adaptively capture strong associations relevant to the target event. Extensive experiments conducted on five popular MR benchmarks, ActivityNet Captions, TACoS, Charades-STA, ActivityNet Speech, and QVHighlights, demonstrate the significant improvements achieved by our method over existing approaches. Code is available at https://github.com/xian-sh/MRNet.
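A bidirectional decay retention mask of the kind mentioned above can be built in a few lines; the exponential decay with temporal distance is illustrative, and the decay rate and the way the mask is combined with attention scores are assumptions.

```python
import torch

def bidirectional_decay_mask(num_clips, gamma=0.95):
    """Retention-style mask whose entries decay with the temporal distance
    between two video clips, in both directions: M[i, j] = gamma ** |i - j|.
    Distant clips still interact, but with exponentially smaller weight.
    """
    idx = torch.arange(num_clips)
    dist = (idx[:, None] - idx[None, :]).abs()
    return gamma ** dist                      # (num_clips, num_clips)

# one way to apply it: attn = softmax(scores + bidirectional_decay_mask(T).log())
# adds a non-positive bias that shrinks with temporal distance.
```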



Paperid:1123 Poster
Authors:Xiaomin Li,Xu Jia,Qinghe Wang,Haiwen Diao,mengmeng Ge,Pengxiang Li,You He,Huchuan Lu
Abstract:
Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance, and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from singular or multiple reference videos, performing favorably against existing methods in customized video generation.



Paperid:1124 Poster
Authors:Hao Wang,Shangwei Guo,Jialing He,Kangjie Chen,Shudong Zhang,Tianwei Zhang,Tao Xiang
Abstract:
Text-to-image (T2I) diffusion models enjoy great popularity and many individuals and companies build their applications based on publicly released T2I diffusion models. Previous studies have demonstrated that backdoor attacks can elicit T2I diffusion models to generate unsafe target images through textual triggers. However, existing backdoor attacks typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance of T2I diffusion models. To address these issues, we propose EvilEdit, a training-free and data-free backdoor attack against T2I diffusion models. EvilEdit directly edits the projection matrices in the cross-attention layers to achieve projection alignment between a trigger and the corresponding backdoor target. We preserve the functionality of the backdoored model using a protected whitelist to ensure the semantic of non-trigger words is not accidentally altered by the backdoor. We also propose a visual target attack EvilEdit$_{VTA}$, enabling adversaries to use specific images as backdoor targets. We conduct empirical experiments on Stable Diffusion and the results demonstrate that the EvilEdit can backdoor T2I diffusion models within one second with up to 100% success rate. Furthermore, our EvilEdit modifies only 2.2% of the parameters and maintains the model’s performance on benign prompts. Our code is available at https://github.com/haowang-cqu/EvilEdit.
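As a hypothetical illustration of "projection alignment" between a trigger embedding and a target embedding, the rank-one update below modifies a projection matrix so that the trigger is projected exactly as the target would be, while vectors orthogonal to the trigger are untouched; this sketches the general idea only and is not claimed to be the paper's editing rule.

```python
import torch

def rank_one_projection_edit(W, c_trigger, c_target):
    """Return W_new such that W_new @ c_trigger == W @ c_target, while
    W_new @ x == W @ x for any x orthogonal to c_trigger.

    W: (out_dim, in_dim) projection matrix (e.g., a cross-attention key/value weight)
    c_trigger, c_target: (in_dim,) text embeddings of the trigger and the target concept
    """
    u = W @ (c_target - c_trigger)                        # required change of the trigger's projection
    return W + torch.outer(u, c_trigger) / (c_trigger @ c_trigger)
```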



Paperid:1125 Poster
Authors:Yuyan Bu,Qiang Sheng,Juan Cao,Peng Qi,Danding Wang,Jintao Li
Abstract:
As short-form video-sharing platforms become a significant channel for news consumption, fake news in short videos has emerged as a serious threat in the online information ecosystem, making developing detection methods for this new scenario an urgent need. Compared with that in text and image formats, fake news on short video platforms contains rich but heterogeneous information in various modalities, posing a challenge to effective feature utilization. Unlike existing works mostly focusing on analyzing what is presented, we introduce a novel perspective that considers how it might be created. Through the lens of the creative process behind news video production, our empirical analysis uncovers the unique characteristics of fake news videos in material selection and editing. Based on the obtained insights, we design FakingRecipe, a creative process-aware model for detecting fake news short videos. It captures the fake news preferences in material selection from sentimental and semantic aspects and considers the traits of material editing from spatial and temporal aspects. To improve evaluation comprehensiveness, we first construct FakeTT, an English dataset for this task, and conduct experiments on both FakeTT and the existing Chinese FakeSV dataset. The results show FakingRecipe's superiority in detecting fake news on short video platforms.



Paperid:1126 Poster
Authors:Bo Yuan,Danpei Zhao,Zhuoran Liu,Wentao Li,Tian Li
Abstract:
Continual learning (CL) breaks away from the one-way training paradigm and enables a model to adapt to new data, semantics, and tasks continuously. However, current CL methods mainly focus on single tasks. Besides, CL models are plagued by catastrophic forgetting and semantic drift due to the lack of old data, which often occurs in remote-sensing interpretation due to the intricate fine-grained semantics. In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception for universal interpretation in remote sensing images. Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously. To inherit the knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting. Furthermore, we also propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception. Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model, and also prove that joint optimization can boost sub-task CL efficiency with over 13% relative improvement on PQ.



Paperid:1127 Poster
Authors:Haizhuang Liu,Junbao Zhuo,Chen Liang,Jiansheng Chen,Huimin Ma
Abstract:
Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point level. Previous methods mainly transfer excellent zero-shot generalization capabilities from images to point clouds. However, directly transferring knowledge from images to point clouds faces two ambiguous problems. On the one hand, 2D models will generate wrong predictions when the image changes. On the other hand, directly mapping 3D points to 2D pixels by perspective projection fails to consider the visibility of 3D points in the camera view. The wrong geometric alignment of 3D points and 2D pixels causes semantic ambiguity. To tackle these two problems, we propose a framework named Affinity3D that intends to empower 3D semantic segmentation models to perceive novel samples. Our framework aggregates instances in 3D and recognizes them in 2D, leveraging the excellent geometric separation in 3D and the zero-shot capabilities of 2D models. Affinity3D involves an affinity module that rectifies the wrong predictions by comparing them with similar instances and a visibility module preventing knowledge transfer from visible 2D pixels to invisible 3D points. Extensive experiments have been conducted on the SemanticKITTI dataset. Our framework achieves state-of-the-art performance in two settings.



Paperid:1128 Poster
Authors:Yuan Sun,Kaiming Liu,Yongxiang Li,Zhenwen Ren,Jian Dai,Dezhong Peng
Abstract:
With the massive emergence of multi-modal data, cross-modal retrieval (CMR) has become one of the hot topics. Thanks to fast retrieval and efficient storage, cross-modal hashing (CMH) provides a feasible solution for large-scale multi-modal data. Previous CMH methods always directly learn common hash codes to fuse different modalities. Although they have obtained some success, there are still some limitations: 1) These approaches often prioritize reducing the heterogeneity in multi-modal data by learning consensus hash codes, yet they may sacrifice specific information unique to each modality. 2) They frequently utilize pairwise similarities to guide hashing learning, neglecting class distribution correlations, which do not notably contribute to reducing differences among various modalities. To overcome these two issues, we propose a novel Distribution Consistency Guided Hashing (DCGH) framework. Specifically, we first learn the modality-specific representation to extract the private discriminative information. Further, we learn consensus hash codes from the private representation by consensus hashing learning, thereby merging modality-specific information with consistency. Finally, we propose distribution consistency learning to guide hash codes to follow a similar class distribution principle between multi-modal data, thereby exploring more consistent information. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of our DCGH on both fully paired and partially paired CMR tasks.



Paperid:1129 Poster
Authors:Chunli Peng,Xuan Dong,Tiantian Cao,Zhengqing Li,Kun Dong,Weixin Li
Abstract:
The fusion of images from dual camera systems featuring a wide-angle and a telephoto camera has become a hotspot problem recently. By integrating simultaneously captured wide-angle and telephoto images from these systems, the resulting fused image achieves a wide field of view (FOV) coupled with high-definition quality. Existing approaches are mostly deep learning methods and predominantly rely on supervised learning, where the training dataset plays a pivotal role. However, current datasets typically adopt a data synthesis approach to generate input pairs of wide-angle and telephoto images alongside ground-truth images. Notably, the wide-angle inputs are synthesized rather than captured using real wide-angle cameras, and the ground-truth image is captured by a wide-angle camera whose quality is substantially lower than that of the input telephoto images captured by telephoto cameras. To address these limitations, we introduce a novel hardware setup utilizing a beam splitter to simultaneously capture three images, i.e., input pairs and ground-truth images, from two authentic cellphones equipped with wide-angle and telephoto dual cameras. Specifically, the wide-angle and telephoto images captured by cellphone 2 serve as the input pair, while the telephoto image captured by cellphone 1, which is calibrated to match the optical path of the wide-angle image from cellphone 2, serves as the ground-truth image, maintaining quality on par with the input telephoto image. Experiments validate that our newly introduced dataset, named ReWiTe, significantly enhances the performance of various existing methods on real-world wide-angle and telephoto dual image fusion tasks.



Paperid:1130 Poster
Authors:Haozhe Jia,Yan Li,Hengfei Cui,Di Xu,Yuwang Wang,Tao Yu
Abstract:
In this work, we focus on exploring explicit fine-grained control of generative facial image editing, while generating faithful facial appearances and consistent semantic details, which is quite challenging and has not been extensively explored, especially under the one-shot scenario. We identify the key challenge as the exploration of disentangled conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework, named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE to generate spatial-wise explicit control conditions based on estimated 3DMM parameters. Different from current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability and accordingly propose a random semantic masking (RSM) strategy to effectively achieve an independent training of Exp-FaceNet. This setting endows the model with disentangled face control while reducing semantic information shift in editing. Our model can be trained using 2D in-the-wild portrait images without requiring 3D or video data and performs robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace can generate realistic facial images with better editing accuracy and identity preservation over state-of-the-art methods.



Paperid:1131 Poster
Authors:Yulin He,Siqi Wang,Wei Chen,Tianci Xun,Yusong Tan
Abstract:
Autonomous driving (AD) is a typical application that requires effectively exploiting multimedia information. For AD, it is critical to ensure safety by detecting unknown objects in an open world, driving the demand for open world object detection (OWOD). However, existing OWOD methods treat generic objects beyond known classes in the training set as unknown objects and prioritize recall in evaluation. This encourages excessive false positives and endangers the safety of AD. To address this issue, we restrict the definition of unknown objects to threatening objects in AD, and introduce a new evaluation protocol, built upon a new metric named U-ARecall, to alleviate the biased evaluation caused by neglecting false positives. Under the new evaluation protocol, we re-evaluate existing OWOD methods and discover that they typically perform poorly in AD. Then, we propose a novel OWOD paradigm for AD based on fine-tuning foundational open-vocabulary models (OVMs), as they can exploit rich linguistic and visual prior knowledge for OWOD. Following this new paradigm, we propose a brand-new OWOD solution, which effectively addresses two core challenges of fine-tuning OVMs via two novel techniques: 1) the maintenance of open-world generic knowledge by a dual-branch architecture; 2) the acquisition of scenario-specific knowledge by a visual-oriented contrastive learning scheme. Besides, a dual-branch prediction fusion module is proposed to avoid post-processing and hand-crafted heuristics. Extensive experiments show that our proposed method not only surpasses classic OWOD methods in unknown object detection by a large margin ($\sim$3$\times$ U-ARecall), but also notably outperforms OVMs without fine-tuning in known object detection ($\sim$ 20% K-mAP). Our code is available in the supplementary materials.



Paperid:1132 Poster
Authors:Zheqi Lv,Shaoxuan He,Tianyu Zhan,Shengyu Zhang,Wenqiao Zhang,Jingyuan Chen,Zhou Zhao,Fei Wu
Abstract:
Dynamic Sequential Recommendation (DSR) systems, which adapt to users' changing preferences by adjusting their parameters in real time, have emerged as a significant advancement over traditional static models. Despite their advantages, DSR systems face challenges including large item parameter spaces and heterogeneous user-item interactions, which can destabilize the recommendation process. To address these issues, we introduce the Semantic Codebook Learning for Dynamic Recommendation Models (SOLID) framework. SOLID compresses the parameter generation model's search space and utilizes homogeneity within the recommendation system more effectively. This is achieved by transforming item sequences into semantic sequences and employing a dual parameter model, which combines semantic and item-based cues to tailor recommendation parameters. Our innovative approach also includes the creation of a semantic codebook, which stores disentangled item representations to ensure stability and accuracy in parameter generation. Through extensive testing, SOLID has been shown to surpass traditional DSR systems, providing more precise, stable, and dynamically adaptable recommendations.~\footnote{Our source code can be referred to \url{https://anonymous.4open.science/r/SOLID-0324}}.
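As an illustration of the semantic codebook idea described above, the sketch below shows a minimal VQ-style lookup in which continuous item representations are snapped to their nearest learnable code vectors with a straight-through estimator. The class name, sizes, and quantization rule are our assumptions, not SOLID's released implementation.

```python
# Hypothetical sketch of a VQ-style semantic codebook (assumed design, not SOLID's code).
import torch
import torch.nn as nn

class SemanticCodebook(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)  # learnable code vectors

    def forward(self, item_repr: torch.Tensor) -> torch.Tensor:
        # item_repr: (batch, dim) continuous item representations
        dists = torch.cdist(item_repr, self.codes.weight)  # (batch, num_codes)
        idx = dists.argmin(dim=-1)                          # nearest code per item
        quantized = self.codes(idx)                         # (batch, dim)
        # Straight-through estimator so gradients still reach item_repr.
        return item_repr + (quantized - item_repr).detach()

codebook = SemanticCodebook()
stabilized = codebook(torch.randn(8, 64))  # quantized semantic representations
```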



Paperid:1133 Poster
Authors:Longan Wang,Yang Qin,Yuan Sun,Dezhong Peng,Xi Peng,Peng Hu
Abstract:
Cross-modal hashing has emerged as a promising technique for retrieving relevant information across distinct media types thanks to its low storage cost and high retrieval efficiency. However, the success of most existing methods heavily relies on large-scale well-annotated datasets, which are costly and scarce in the real world due to ubiquitous labeling noise. To tackle this problem, in this paper, we propose a novel framework, termed Noise Resistance Cross-modal Hashing (NRCH), to learn hashing with noisy labels by overcoming two key challenges, i.e., noise overfitting and error accumulation. Specifically, i) to mitigate the overfitting issue caused by noisy labels, we present a novel Robust Contrastive Hashing loss (RCH) that targets homologous pairs instead of noisy positive pairs, thus avoiding overemphasizing noise. In other words, RCH enforces the model to focus on more reliable positives instead of unreliable ones constructed by noisy labels, thereby enhancing the robustness of the model against noise; ii) to circumvent error accumulation, a Dynamic Noise Separator (DNS) is proposed to dynamically and accurately separate clean and noisy samples by adaptively fitting the loss distribution, thus alleviating the adverse influence of noise on iterative training. Finally, we conduct extensive experiments on four widely used benchmarks to demonstrate the robustness and superiority of our NRCH against noisy labels for cross-modal retrieval.
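The abstract does not detail how the Dynamic Noise Separator fits the loss distribution; a common choice in noisy-label learning is a two-component Gaussian mixture over per-sample losses, sketched below purely as an illustration (function name, threshold, and hyper-parameters are ours).

```python
# Illustrative clean/noisy separation via a two-component GMM on per-sample losses.
# This is an assumed mechanism for exposition, not necessarily the DNS in NRCH.
import numpy as np
from sklearn.mixture import GaussianMixture

def separate_clean_noisy(per_sample_loss: np.ndarray, threshold: float = 0.5):
    losses = per_sample_loss.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))   # low-loss component ~ clean
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean > threshold                        # boolean mask of clean samples

mask = separate_clean_noisy(np.random.rand(1000))
```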



Paperid:1134 Poster
Authors:Kepeng Xu,Zijia Ma,Li Xu,Gang He,Yunsong Li,Wenxin Yu,Taichu Han,Cheng Yang
Abstract:
Recent advances in neural camera imaging pipelines have demonstrated notable progress. Nevertheless, the real-world imaging pipeline still faces challenges including the lack of joint optimization across system components, computational redundancies, and optical distortions such as lens shading. In light of this, we propose an end-to-end camera imaging pipeline (RealCamNet) to enhance real-world camera imaging performance. Our methodology diverges from conventional, fragmented multi-stage image signal processing towards an end-to-end architecture. This architecture facilitates joint optimization across the full pipeline and the restoration of coordinate-biased distortions. RealCamNet is designed for high-quality conversion from RAW to RGB and compact image compression. Specifically, we deeply analyze coordinate-dependent optical distortions, e.g., vignetting and dark shading, and design a novel Coordinate-Aware Distortion Restoration (CADR) module to restore coordinate-biased distortions. Furthermore, we propose a Coordinate-Independent Mapping Compression (CIMC) module to implement tone mapping and redundant information compression. Existing datasets suffer from misalignment and overly idealized conditions, making them inadequate for training real-world imaging pipelines. Therefore, we collected a real-world imaging dataset. Experimental results show that RealCamNet achieves the best rate-distortion performance with lower inference latency.
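The internals of the CADR module are not specified here; the sketch below shows one generic way to make a restoration block coordinate-aware, by appending normalized pixel coordinates as extra input channels so position-dependent effects such as vignetting can be modeled. This is an assumption for illustration, not the paper's architecture.

```python
# Coordinate-aware block sketch (CoordConv-style); names and design are assumed.
import torch
import torch.nn as nn

class CoordAwareBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Normalized (x, y) coordinate maps in [-1, 1], broadcast over the batch.
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

feat = CoordAwareBlock(4, 16)(torch.randn(2, 4, 64, 64))  # e.g., packed RAW input
```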



Paperid:1135 Poster
Authors:HuXiaowan,Yiyi Chen,Yan Li,Minquan Wang,Haoqian Wang,Quan Chen,Han Li,Peng Jiang
Abstract:
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity, whereby the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) the numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose multi-modal hard example mining, which assists the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code and models will be made public soon.
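As a rough illustration of the first component, the snippet below sketches a simple text-guided attention pooling in which an embedding of the salesperson's speech weights video region tokens; the actual SGMN mechanism is likely more elaborate, and all shapes and names here are assumed.

```python
# Minimal text-guided attention pooling sketch (assumed formulation).
import torch
import torch.nn.functional as F

def text_guided_attention(region_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # region_tokens: (batch, num_regions, dim); text_emb: (batch, dim)
    scores = torch.einsum("bnd,bd->bn", region_tokens, text_emb) / region_tokens.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)                        # attention over regions
    return torch.einsum("bn,bnd->bd", weights, region_tokens)  # text-focused video feature

pooled = text_guided_attention(torch.randn(2, 49, 256), torch.randn(2, 256))
```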



Paperid:1136 Poster
Authors:Guangyao Li,HenghuiDu,Di Hu
Abstract:
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively.
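A minimal sketch of the temporal perception step follows, assuming key segments are ranked by cosine similarity between segment features and the declarative-prompt embedding and the top-k are kept in temporal order; the real selection rule in TSPM may differ.

```python
# Assumed top-k segment selection by prompt similarity (illustrative only).
import torch
import torch.nn.functional as F

def select_key_segments(segment_feats: torch.Tensor, prompt_emb: torch.Tensor, k: int = 4):
    # segment_feats: (num_segments, dim); prompt_emb: (dim,)
    sims = F.cosine_similarity(segment_feats, prompt_emb.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices.sort().values   # keep temporal order of selected segments
    return segment_feats[topk], topk

feats, idx = select_key_segments(torch.randn(60, 512), torch.randn(512))
```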



Paperid:1137 Poster
Authors:Xingyi Li,Yizheng Wu,Jun CEN,Juewen Peng,Kewei Wang,Ke Xian,Zhe Wang,Zhiguo Cao,Guosheng Lin
Abstract:
3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system.



Paperid:1138 Poster
Authors:Xuanyu Zhang,Youmin Xu,Runyi Li,Jiwen Yu,Weiqi Li,Zhipei Xu,Jian Zhang
Abstract:
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to address the limitations of current video tampering forensics, such as poor generalizability, singular function, and single modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, crucial for the sustainable development of video editing in the AIGC video era.



Paperid:1139 Poster
Authors:Yuhang Li,Jincen Jiang,Xiaosong Yang,Youdong Ding,Jian Jun Zhang
Abstract:
Video harmonization aims to address the discrepancy in color and lighting between foreground and background elements within video compositions, thereby enhancing the innate coherence of composite video content. Nevertheless, existing methods struggle to effectively handle video composite tasks with excessively large-scale foregrounds. In this paper, we propose Video Harmonization Masked Autoencoders (VHMAE), a simple yet powerful end-to-end video harmonization method designed to tackle this challenge once and for all. Unlike other typical MAE-based methods employing random or tube masking strategies, we innovatively treat all foregrounds in each frame required for harmonization as prediction regions, which are designated as masked tokens and fed into our network to produce the final refined video. To this end, the network is optimized to prioritize the harmonization task, proficiently reconstructing the masked region despite the limited background information. Specifically, we introduce the Pattern Alignment Module (PAM) to extract content information from the extensive masked foreground region, aligning the latent semantic features of the masked foreground content with the background context while disregarding the impact of various colors or illumination. Moreover, we propose the Patch Balancing Loss, which effectively mitigates the undesirable grid-like artifacts commonly observed in MAE-based approaches for image generation, thereby ensuring consistency between the predicted foreground and the visible background. Additionally, we introduce a real-composited video harmonization dataset named RCVH, which serves as a valuable benchmark for assessing the efficacy of techniques aimed at video harmonization across different real video sources. Comprehensive experiments demonstrate that our VHMAE outperforms state-of-the-art techniques on both our RCVH and the publicly available HYouTube dataset.
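The sketch below illustrates the masking strategy described above under our own assumptions about shapes: the composite-foreground mask is pooled to the patch grid so that every patch touching the foreground becomes a masked token to be reconstructed.

```python
# Assumed foreground-to-token mask construction for an MAE-style harmonizer.
import torch
import torch.nn.functional as F

def foreground_token_mask(fg_mask: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # fg_mask: (batch, 1, H, W) binary foreground mask of the composite frame
    pooled = F.max_pool2d(fg_mask.float(), kernel_size=patch, stride=patch)
    return pooled.flatten(1) > 0   # (batch, num_patches); True = masked token

mask = foreground_token_mask(torch.zeros(1, 1, 224, 224))  # all-background example
```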



Paperid:1140 Poster
Authors:Lixiang Ru,Xin Guo,Lei Yu,Yingying Zhang,Jiangwei Lao,Jian Wang,Jingdong Chen,Yansheng Li,Ming Yang
Abstract:
Long-tailed recognition (LTR) aims to learn proficient models from extremely unbalanced training data. Recently, fine-tuning pretrained foundation models emerges as a promising research direction for LTR. However, we empirically observe that the fine-tuning process tends to degrade the intrinsic representation capability of pretrained models and lead to model bias towards certain classes, hindering the overall recognition performance. To unleash the intrinsic representation capability of pretrained foundation models, in this work, we propose a new Parameter-Efficient Complementary Expert Learning (PECEL) for LTR. Specifically, PECEL consists of multiple experts, where individual experts are trained by Parameter-Efficient Fine-Tuning (PEFT) and encouraged to learn different expertise on complementary sub-categories via the proposed sample-aware logit adjustment loss. By aggregating the predictions of different experts, PECEL effectively achieves a balanced performance on long-tailed classes. Nevertheless, learning multiple experts generally introduces extra trainable parameters. To ensure the parameter efficiency, we further propose a parameter sharing strategy which decomposes and shares the parameters in each expert. Extensive experiments on 4 LTR benchmarks show that the proposed PECEL can effectively learn multiple complementary experts without increasing the trainable parameters and achieve new state-of-the-art performance. The code will be publicly available.
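For context, the sample-aware logit adjustment loss builds on the standard logit adjustment idea of adding scaled log class priors to the logits before cross-entropy so tail classes are not suppressed. The sketch below shows only this baseline form; PECEL's sample-aware, per-expert modulation is not reproduced.

```python
# Standard logit-adjusted cross-entropy (baseline form, not PECEL's exact loss).
import torch
import torch.nn.functional as F

def logit_adjusted_ce(logits, targets, class_counts, tau: float = 1.0):
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * prior.log().unsqueeze(0)   # (batch, num_classes)
    return F.cross_entropy(adjusted, targets)

loss = logit_adjusted_ce(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                         torch.randint(1, 1000, (10,)))
```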



Paperid:1141 Poster
Authors:Junxiong Lin,Zeng Tao,Xuan Tong,Xinji Mai,Haoran Wang,Boyang Wang,Yan Wang,Qing Zhao,Jiawen Yu,Yuxuan Lin,Shaoqi Yan,Shuyong Gao,Wenqiang Zhang
Abstract:
The problem of blind image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images with unknown degradation modes. Most existing methods model the image degradation process using blur kernels. However, this explicit modeling approach struggles to cover the complex and varied degradation processes encountered in the real world, such as high-order combinations of JPEG compression, blur, and noise. Implicit modeling of the degradation process can effectively overcome this issue, but a key challenge of implicit modeling is the lack of accurate ground truth labels for the degradation process to conduct supervised training. To overcome the limitations inherent in implicit modeling, we propose an \textbf{U}ncertainty-based degradation representation for blind \textbf{S}uper-\textbf{R}esolution framework (\textbf{USR}). By suppressing the uncertainty of local degradation representations in images, USR facilitates self-supervised learning of degradation representations. The USR consists of two components: Adaptive Uncertainty-Aware Degradation Extraction (AUDE) and a feature extraction network composed of Variable Depth Dynamic Convolution (VDDC) blocks. To extract the Uncertainty-based Degradation Representation from LR images, the AUDE utilizes the Self-supervised Uncertainty Contrast module with an Uncertainty Suppression Loss to suppress the inherent model uncertainty of the Degradation Extractor. Furthermore, the VDDC block integrates degradation information through dynamic convolution. The VDDC block also employs an Adaptive Intensity Scaling operation that adaptively adjusts the degradation representation according to the network hierarchy, thereby facilitating the effective integration of degradation information. Quantitative and qualitative experiments affirm the superiority of our approach.



Paperid:1142 Poster
Authors:Yuchen Pan,Junjun Jiang,Kui Jiang,Xianming Liu
Abstract:
Depression recognition (DR) using facial images, audio signals, or language text recordings has achieved remarkable performance. Recently, multimodal DR has shown improved performance over single-modal methods by leveraging information from a combination of these modalities. However, collecting high-quality data containing all modalities poses a challenge. In particular, these methods often encounter performance degradation when certain modalities are either missing or degraded. To tackle this issue, we present a generalizable multimodal framework for DR by aggregating feature disentanglement and privileged knowledge distillation. In detail, our approach aims to disentangle homogeneous and heterogeneous features within multimodal signals while suppressing noise, thereby adaptively aggregating the most informative components for high-quality DR. Subsequently, we leverage knowledge distillation to transfer privileged knowledge from complete modalities to the observed input with limited information, thereby significantly improving tolerance and compatibility. These strategies form our novel Feature Disentanglement and Privileged knowledge Distillation Network for DR, dubbed Dis2DR. Experimental evaluations on the AVEC 2013, AVEC 2014, AVEC 2017, and AVEC 2019 datasets demonstrate the effectiveness of our Dis2DR method. Remarkably, Dis2DR achieves superior performance even when only a single modality is available, surpassing the existing state-of-the-art multimodal DR approach AVA-DepressNet by up to 9.8% on the AVEC 2013 dataset.
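A minimal sketch of the privileged distillation term follows, assuming a standard temperature-scaled KL divergence between a complete-modality teacher and a partial-modality student; the temperature and names are illustrative, not taken from the paper.

```python
# Generic privileged knowledge distillation loss (assumed form for illustration).
import torch
import torch.nn.functional as F

def privileged_kd_loss(student_logits, teacher_logits, T: float = 2.0):
    # Teacher sees the complete modalities; student sees the limited observed input.
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

loss = privileged_kd_loss(torch.randn(4, 5), torch.randn(4, 5))
```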



Paperid:1143 Poster
Authors:Jia-Hong Liu,Shao-Kui Zhang,Chuyue Zhang,Song-Hai Zhang
Abstract:
Landscapes, recognized for their indispensable role in large-scale scenes, are experiencing growing demand. However, the manual modeling of such content is labor-intensive and lacks efficiency. Procedural Content Generation (PCG) techniques enable the rapid generation of diverse landscape elements. Nevertheless, ordinary users may encounter difficulties controlling these methods for desired results. In this paper, we introduce a controllable framework for procedurally generating landscapes. We integrate state-of-the-art Large Language Models (LLMs) to enhance user accessibility and control. By converting plain text inputs into parameters through LLMs, our framework allows ordinary users to generate a batch of plausible landscapes tailored to their specifications. A parameter-controlled PCG procedure is designed to leverage optimization techniques and employ rule-based refinements. It achieves harmonious layering in terrains, zoning, and roads while enabling aesthetic arrangement of vegetation and artificial elements. Extensive experiments demonstrate our framework's effectiveness in generating landscapes comparable to those crafted by experienced architects. Our framework has the potential to enhance the productivity of landscape designers significantly.



Paperid:1144 Poster
Authors:Yunqiu Xu,Linchao Zhu,Yi Yang
Abstract:
Text-driven 3D avatar customization has attracted increasing attention in recent years, where precisely editing specific local parts of avatars with only text prompts is particularly challenging. Previous editing methods usually use segmentation or cross-attention masks as constraints for local editing. Although these masks tightly cover existing objects/parts, they may prevent editing methods from creating drastic geometry deformations beyond the covered contents. From a different perspective, this paper presents a GPT-guided local avatar editing framework, namely GG-Editor. Specifically, GG-Editor progressively mines more reasonable candidate editing regions by harnessing multimodal large language models, which already organically assimilate common-sense human knowledge. In order to improve the editing quality of the local areas, GG-Editor explicitly decouples the geometry/appearance optimization and adopts a global-local synergy editing strategy with GPT-generated local prompts. Moreover, to preserve concepts residing in source avatars, GG-Editor proposes an orthogonal denoising score that orthogonally decomposes editing directions and introduces an explicit term for preservation. Comprehensive experiments demonstrate that GG-Editor with only textual prompts achieves realistic and high-fidelity local editing results, significantly surpassing prior works. The project page will be made available.
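Our reading of the orthogonal decomposition is sketched below: the editing score is stripped of its component parallel to a preservation score, leaving only the orthogonal part. The exact scores GG-Editor decomposes are not specified here, so this is illustrative only.

```python
# Orthogonal decomposition of an editing score against a preservation score (assumed reading).
import torch

def orthogonalize(edit_score: torch.Tensor, preserve_score: torch.Tensor) -> torch.Tensor:
    e = edit_score.flatten()
    p = preserve_score.flatten()
    parallel = (e @ p) / (p @ p).clamp_min(1e-8) * p      # component along preservation
    return (e - parallel).view_as(edit_score)             # orthogonal editing component

ortho = orthogonalize(torch.randn(3, 64, 64), torch.randn(3, 64, 64))
```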



Paperid:1145 Poster
Authors:Shao-Kui Zhang,Junkai Huang,Liang Yue,Jia-Tong Zhang,Jia-Hong Liu,Yu-Kun Lai,Song-Hai Zhang
Abstract:
Scene synthesis has gained significant attention recently, and interactive scene synthesis focuses on yielding scenes according to user preferences. Existing literature either generates floor plans or generates scenes according to given floor plans. This paper generates scenes over floor plans in real time. Given an initial scene, the only interaction a user needs is changing the room shapes. Our framework splits/merges rooms and adds/rearranges/removes objects at each transient moment during interactions. A systematic pipeline realizes our framework by compressing objects' arrangements over the modified room shapes at each transient moment, thus enabling real-time performance. We also propose elastic boxes that indicate how objects should be arranged according to their continuously changing contexts, such as room shapes and other objects. Through a few interactions, a floor plan filled with object layouts is generated that reflects user preferences on both the floor plan and the object layouts within it. Experiments show that our framework is efficient during user interactions and synthesizes plausible 3D scenes.



Paperid:1146 Poster
Authors:Xiao He,Chang Tang,Xinwang Liu,Chuankun Li,Shan An,Zhenglai Li
Abstract:
Spatial transcriptomics provides revolutionary insights into cellular interactions and disease development mechanisms by combining high-throughput gene sequencing and spatially resolved imaging technologies to analyze spatially variable genes within tissues. However, existing methods typically map aggregated multi-view features into a unified representation, ignoring the heterogeneity and view independence of genes and spatial information. To this end, we construct a heterogeneous Graph guided Contrastive Learning framework (stGCL) for aggregating spatial transcriptomics data. The method is guided by the inherent heterogeneity of cellular molecules and dynamically coordinates triple-level node attributes through contrastive learning losses distributed across view domains, thus maintaining view independence during the aggregation process. In addition, we introduce a cross-view hierarchical feature alignment module employing a parallel approach to decouple the spatial and genetic views of molecular structures while aggregating multi-view features according to information theory, thereby enhancing the integrity of inter- and intra-views. Rigorous experiments demonstrate that stGCL outperforms existing methods in various tasks and related downstream applications.
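As a generic illustration of a cross-view contrastive objective (the triple-level formulation of stGCL is not reproduced), the sketch below computes an InfoNCE-style loss between spatial-view and gene-view embeddings of the same spots.

```python
# Generic cross-view InfoNCE sketch; names, temperature, and views are assumed.
import torch
import torch.nn.functional as F

def cross_view_infonce(z_spatial, z_gene, temperature: float = 0.2):
    z1 = F.normalize(z_spatial, dim=-1)
    z2 = F.normalize(z_gene, dim=-1)
    logits = z1 @ z2.t() / temperature               # (N, N) cross-view similarities
    labels = torch.arange(z1.shape[0], device=z1.device)  # matched spots are positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = cross_view_infonce(torch.randn(32, 128), torch.randn(32, 128))
```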



Paperid:1147 Poster
Authors:Haoyuan Jin,Xuesong Nie,Yunfeng Yan,Xi Chen,Zhihang Zhu,Donglian Qi
Abstract:
Multi-object tracking (MOT) is a pivotal task for media interpretation, where reliable motion and appearance cues are essential for cross-frame identity preservation. However, limited by the inherent perspective properties of 2D space, the crowd density and frequent occlusions in real-world scenes expose the fragility of these cues. We observe the natural advantage of objects being well-separated in high-dimensional space and propose a novel 2D MOT framework, "Detecting-Lifting-Tracking" (DLT). Initially, a pre-trained detector is employed to capture 2D object information. Secondly, we introduce a Mamba Distance Estimator to obtain the distances of objects to a monocular camera with temporal consistency, achieving object-level pseudo-3D lifting. Finally, we thoroughly explore distance-aware tracking via pseudo-3D information. Specifically, we introduce a Score-Distance Hierarchical Matching to achieve discrete and sequential matching. We also present a Short-Long Terms Association to optimize short-term associations between adjacent frames and enhance the long-term re-association capability. Even without appearance cues, DLT achieves state-of-the-art performance on MOT17, MOT20, and DanceTrack, demonstrating its potential to address occlusion challenges.
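Purely as an illustration of distance-aware association (not the paper's Score-Distance Hierarchical Matching), the sketch below gates detections by score and matches them to tracks with the Hungarian algorithm on a camera-distance cost; all thresholds and names are assumed.

```python
# Illustrative distance-based association step (assumed simplification).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_dist, det_dist, det_scores, score_thr=0.5, dist_gate=2.0):
    # track_dist: (T,) camera distances of tracks; det_dist, det_scores: (D,)
    keep = det_scores >= score_thr                          # high-score detections first
    cost = np.abs(track_dist[:, None] - det_dist[keep][None, :])
    rows, cols = linear_sum_assignment(cost)
    det_idx = np.flatnonzero(keep)
    return [(r, det_idx[c]) for r, c in zip(rows, cols) if cost[r, c] <= dist_gate]

pairs = associate(np.array([5.0, 9.0]), np.array([5.2, 12.0, 8.8]),
                  np.array([0.9, 0.4, 0.8]))
```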



Paperid:1148 Poster
Authors:Harry Cheng,Yangyang Guo,Tianyi Wang,Liqiang Nie,Mohan Kankanhalli
Abstract:
Detecting diffusion-generated images has recently grown into an emerging research area. Existing diffusion-based datasets predominantly focus on general image generation. However, facial forgeries, which pose severe social risks, have remained less explored thus far. To address this gap, this paper introduces DiFF, a comprehensive dataset dedicated to face-focused diffusion-generated images. DiFF comprises over 500,000 images that are synthesized using thirteen distinct generation methods under four conditions. In particular, this dataset utilizes 30,000 carefully collected textual and visual prompts, ensuring the synthesis of images with both high fidelity and semantic consistency. We conduct extensive experiments on the DiFF dataset via human subject tests and several representative forgery detection methods. The results demonstrate that the binary detection accuracies of both human observers and automated detectors often fall below 30%, offering insights into the challenges of detecting diffusion-generated facial forgeries. Moreover, our experiments demonstrate that DiFF, compared to previous facial forgery datasets, contains a more diverse and realistic range of forgeries, showcasing its potential to aid in the development of more generalized detectors. Finally, we propose an edge graph regularization approach to effectively enhance the generalization capability of existing detectors.



Paperid:1149 Poster
Authors:Yuhang Zhou,Yushu Zhang,Leo Yu Zhang,Zhongyun Hua
Abstract:
Computer vision models based on deep neural networks are proven to be vulnerable to adversarial attacks. Robustness distillation, as a countermeasure, takes both the robustness challenges and the efficiency challenges of edge models into consideration. However, most existing robustness distillation methods are data-driven and can hardly be deployed in data-privacy scenarios. Also, the trade-off between robustness and accuracy tends to transfer from the teacher to the student along with the robustness, and there has been no discussion yet on mitigating this trade-off in the data-free scenario. In this paper, we propose a Data-free Experts-guided Robustness Distillation (DERD) to extend robustness distillation to the data-free paradigm, which offers three advantages: (1) A dual-level adversarial learning strategy achieves robustness distillation without real data. (2) An expert-guided distillation strategy brings a better trade-off to the student model. (3) A novel stochastic gradient aggregation module reconciles the task conflicts of the multiple teachers from a consistency perspective. Extensive experiments demonstrate that the proposed DERD can even achieve comparable results to data-driven methods.
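The stochastic gradient aggregation module is not specified in detail here; as one hedged illustration of reconciling multi-teacher conflicts, the sketch below applies a PCGrad-style projection that removes the component of each teacher's gradient opposing another's before averaging. This is a stand-in for exposition, not DERD's actual module.

```python
# Illustrative conflict-aware gradient aggregation (PCGrad-style projection, assumed).
import torch

def aggregate_teacher_grads(grads):
    # grads: list of flattened gradient tensors, one per teacher objective
    projected = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = g @ h
            if dot < 0:  # conflict: remove the component opposing teacher j
                g = g - dot / (h @ h).clamp_min(1e-12) * h
        projected.append(g)
    return torch.stack(projected).mean(dim=0)

agg = aggregate_teacher_grads([torch.randn(100) for _ in range(3)])
```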