ACMMM2024

Abstract:
Radiance Fields (RFs) have emerged as a crucial technology for 3D scene representation, enabling the synthesis of novel views with remarkable realism. However, as RFs become more widely used, the need for effective editing techniques that maintain coherence across different perspectives becomes evident. Current methods primarily depend on per-frame 2D image inpainting, which often fails to maintain consistency across views, thus compromising the realism of edited RF scenes. In this work, we introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image. This image is then projected across multiple views using a depth-based approach, effectively reducing the inconsistencies observed with per-frame inpainting. However, projections typically assume photometric consistency across views, which is often impractical in real-world settings. To accommodate realistic variations in lighting and viewpoint, our pipeline adjusts the appearance of the projected views by generating multiple directional variants of the inpainted image, thereby adapting to different photometric conditions. Additionally, we present an effective and robust multi-view object segmentation approach as a valuable byproduct of our pipeline. Extensive experiments demonstrate that our method significantly surpasses existing frameworks in maintaining content consistency across views and enhancing visual quality. More results are available at https://vulab-ai.github.io/View-consistent_Object_Removal_in_Radiance_Fields/.
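The depth-based projection step can be illustrated with a minimal pinhole-camera sketch. This is not the authors' implementation: the intrinsics, the translation-only camera motion, and the function names (`unproject`, `project`, `warp_pixel`) are hypothetical, chosen only to show how a reference pixel with known depth lands in another view.

```python
# Hypothetical sketch: warp one reference pixel into a target view using depth.
# Assumes a pinhole camera and a target camera that differs from the reference
# camera by a pure translation t (no rotation), to keep the example short.

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the reference camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Project a 3D point given in a camera frame back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def warp_pixel(u, v, depth, t, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Map a reference pixel to the target view whose origin sits at t in the reference frame."""
    x, y, z = unproject(u, v, depth, fx, fy, cx, cy)
    # Express the point in the target camera frame (translation only).
    p_target = (x - t[0], y - t[1], z - t[2])
    return project(p_target, fx, fy, cx, cy)
```

With an identical target camera (`t = (0, 0, 0)`) every pixel maps to itself. A sideways shift moves nearby pixels more than distant ones, so the warp depends directly on the quality of the depth estimate.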

Abstract:
The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics. The dataset and code can be found on the project page https://npucvr.github.io/TAVGBench/ and on GitHub at https://github.com/OpenNLPLab/TAVGBench.
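The contrastive part of the audio-video alignment can be sketched as an InfoNCE-style loss over paired embeddings. This is a generic illustration, not TAVDiffusion's actual objective: the function names, the temperature value, and the use of cosine similarity are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(audio, video, temperature=0.1):
    """Average InfoNCE loss treating (audio[i], video[i]) as the positive pair."""
    n = len(audio)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio[i], video[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

The loss is small when each audio embedding is closest to its own video embedding and grows when the pairing is shuffled, which is the behaviour a contrastive alignment term is meant to enforce.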

Abstract:
Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with detailed high-resolution textures, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines single-image-to-multi-view generation as 3D-aware sequential image generation (i.e., orbital video generation). This methodology exploits the underlying temporal consistency knowledge in video diffusion models, which generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with a 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is then learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistent images with highly-detailed textures. Source code and data are available at https://github.com/yanghb22-fdu/Hi3D-Official.

Abstract:
Point cloud completion aims to recover accurate global geometry and preserve fine-grained local details from partial point clouds. Conventional methods typically predict unseen points directly from 3D point cloud coordinates or use self-projected multi-view depth maps to ease this task. However, these gray-scale depth maps cannot achieve multi-view consistency, consequently restricting the performance. In this paper, we introduce GeoFormer, which simultaneously enhances the global geometric structure of the points and improves the local details. Specifically, we design a CCM Feature Enhanced Point Generator to integrate image features from multi-view consistent canonical coordinate maps (CCMs) and align them with pure point features, thereby enhancing the global geometry feature. Additionally, we employ the Multi-scale Geometry-aware Upsampler module to progressively enhance local details. This is achieved through cross attention between the multi-scale features extracted from the partial input and the features derived from previously estimated points. Extensive experiments on the PCN, ShapeNet-55/34, and KITTI benchmarks demonstrate that our GeoFormer outperforms recent methods, achieving state-of-the-art performance. Our code is available at https://github.com/Jinpeng-Yu/GeoFormer.

Abstract:
This paper proposes GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a single 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial information of the head and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This method is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Overall, GaussianTalker offers a promising approach for real-time generation of high-quality pose-controllable talking heads. Specifically, GaussianTalker achieves a remarkable rendering speed of up to 120 FPS, surpassing previous benchmarks. Our demo video and code can be found at https://ku-cvlab.github.io/GaussianTalker/

Abstract:
With the increased attention to model efficiency, post-training sparsity (PTS) has become more and more prevalent because of its effectiveness and efficiency. However, there remain questions on better practice of PTS algorithms and the sparsification ability of models, which hinders the further development of this area. Therefore, a benchmark to comprehensively investigate the issues above is urgently needed. In this paper, we propose the first comprehensive post-training sparsity benchmark called PTSBench towards algorithms and models. We benchmark 10+ PTS general-pluggable fine-grained techniques on 3 typical tasks using over 40 off-the-shelf model architectures. Through extensive experiments and analyses, we obtain valuable conclusions and provide several insights from both algorithm and model aspects. Our PTSBench can provide (1) new observations for a better understanding of the PTS algorithms, (2) in-depth and comprehensive evaluations for the sparsification ability of models, and (3) a well-structured and easy-to-integrate open-source framework. We hope this work will provide illuminating conclusions and advice for future studies of post-training sparsity methods and sparsification-friendly model design. The code for our PTSBench is released at https://github.com/ModelTC/msbench.

Abstract:
Many recommender models have been proposed to investigate how to incorporate multimodal content information into the traditional collaborative filtering framework effectively. The use of multimodal information is expected to provide more comprehensive information and lead to superior performance. However, the integration of multiple modalities often encounters the modal imbalance problem: since the information in different modalities is unbalanced, optimizing the same objective across all modalities leads to under-optimization of the weak modalities, which show a slower convergence rate or lower performance. Even worse, we find that in multimodal recommendation models, all modalities suffer from the problem of insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation method that could solve the imbalance problem and make the best use of all modalities. Through modality-specific knowledge distillation, it could guide the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers. Additionally, to adaptively recalibrate the focus of the multimodal model towards weaker modalities during training, we estimate the causal effect of each modality on the training objective using counterfactual inference techniques, through which we could determine the weak modalities, quantify the imbalance degree and re-weight the distillation loss accordingly. Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that our proposed method can improve the performance by a large margin. The source code will be released at https://github.com/CRIPAC-DIG/Balanced-Multimodal-Rec

Abstract:
Current research in food analysis primarily concentrates on tasks such as food recognition, recipe retrieval and nutrition estimation from a single image. Nevertheless, there is a significant gap in exploring the impact of food intake on physiological indicators (e.g., weight) over time. This paper addresses this gap by introducing the DietDiary dataset, which encompasses daily dietary diaries and corresponding weight measurements of real users. Furthermore, we propose a novel task of weight prediction with a dietary diary that aims to leverage historical food intake and weight to predict future weights. To tackle this task, we propose a model-agnostic time series forecasting framework. Specifically, we introduce a Unified Meal Representation Learning (UMRL) module to extract representations for each meal. Additionally, we design a diet-aware loss function to associate food intake with weight variations. By conducting experiments on the DietDiary dataset with two state-of-the-art time series forecasting models, NLinear and iTransformer, we demonstrate that our proposed framework achieves superior performance compared to the original models. We make our dataset, code, and models publicly available at: https://yxg1005.github.io/weight-prediction.

Abstract:
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively.

Abstract:
Recently, text-to-3D generation has attracted significant attention, resulting in notable performance enhancements. Previous methods utilize end-to-end 3D generation models to initialize 3D Gaussians, multi-view diffusion models to enforce multi-view consistency, and text-to-image diffusion models to refine details with score distillation algorithms. However, these methods exhibit two limitations. Firstly, they encounter conflicts in generation directions since different models aim to produce diverse 3D assets. Secondly, the issue of over-saturation in score distillation has not been thoroughly investigated and solved. To address these limitations, we propose PlacidDreamer, a text-to-3D framework that harmonizes initialization, multi-view generation, and text-conditioned generation with a single multi-view diffusion model, while simultaneously employing a novel score distillation algorithm to achieve balanced saturation. To unify the generation direction, we introduce the Latent-Plane module, a training-friendly plug-in extension that enables multi-view diffusion models to provide fast geometry reconstruction for initialization and enhanced multi-view images to personalize the text-to-image diffusion model. To address the over-saturation problem, we propose to view score distillation as a multi-objective optimization problem and introduce the Balanced Score Distillation algorithm, which offers a Pareto Optimal solution that achieves both rich details and balanced saturation. Extensive experiments validate the outstanding capabilities of our PlacidDreamer. The code is available at https://github.com/HansenHuang0823/PlacidDreamer.

Abstract:
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.

Abstract:
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations poses challenges due to the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and leverage them to interact with the visual features. By doing so, we enhance the semantic information within the visual features, narrowing the semantic gap between modalities and generating local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance. This approach provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr.

Abstract:
With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, the visual reasoning problem is usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision generation due to insufficient visual information and the limitation of low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/.

Abstract:
Blind face restoration endeavors to restore a clear face image from a degraded counterpart. Recent approaches employing Generative Adversarial Networks (GANs) as priors have demonstrated remarkable success in this field. However, these methods encounter challenges in achieving a balance between realism and fidelity, particularly in complex degradation scenarios. To inherit the exceptional generative realism of the diffusion model while constraining it with identity-aware fidelity, we propose a novel diffusion-based framework that embeds 3D facial priors as structure and identity constraints into a denoising diffusion process. Specifically, in order to obtain more accurate 3D prior representations, the 3D facial image is reconstructed by a 3D Morphable Model (3DMM) using an initial restored face image that has been processed by a pretrained restoration network. A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process. In order to enhance the fusion of identity information into the noise estimation, we propose a Time-Aware Fusion Block (TAFB). This module offers a more efficient and adaptive fusion of weights for denoising, considering the dynamic nature of the denoising process in the diffusion model, which involves initial structure refinement followed by texture detail enhancement. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration.

Abstract:
Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification and detection). In this work, we take a novel perspective and investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive the high-level semantics and fine-grained details, pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms. Code and models will be released at https://github.com/KeiChiTse/QPT-V2.

Abstract:
Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We observed that Latent Consistency Models employing consistency distillation can effectively extract representative Consistency Features from noisy images. To blend the Consistency Features extracted from both content and style images, we introduce a Style Enhancement Attention Control technique that meticulously merges content and style features within the attention space of the target image. Moreover, we propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control. Extensive experiments have validated the effectiveness of our proposed framework in enhancing stylization efficiency and fidelity. The code is available at https://github.com/liujin112/ZePo.

Abstract:
This paper aims to introduce 3D Gaussians for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: 1) the unstructured nature of 3D Gaussians makes them incompatible with current generation pipelines; 2) the expressive animation of 3D Gaussians in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we propose a novel avatar generation method named E3 Gen, to effectively address these challenges. First, we propose a novel generative UV features plane representation that encodes unstructured 3D Gaussians onto a structured 2D UV space defined by the SMPL-X parametric model. This novel representation not only preserves the efficiency advantage of the original 3D Gaussians but also introduces a shared structure among subjects to enable generative learning of the diffusion model. To tackle the second challenge, we propose a part-aware deformation module to achieve robust and accurate full-body expressive pose control. Extensive experiments demonstrate that our method achieves superior performance in avatar generation and enables expressive full-body pose control and editing. Our project page is https://olivia23333.github.io/E3Gen.

Abstract:
World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. As for evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Conducting evaluations on WorldNet directly demonstrates WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential in serving as a world simulator, helping multimodal agents generalize to unfamiliar domains through efficiently synthesising multimodal instruction instances, which are shown to be as reliable as authentic data for fine-tuning purposes. The code and dataset are available at https://github.com/DCDmllm/WorldGPT

Abstract:
The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M.

Abstract:
Weakly-supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM, which addresses weakly-supervised object detection (WSOD) and segmentation by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses SAM's shortcomings of requiring human prompts and category unawareness in object detection and segmentation. Our results indicate that WeakSAM significantly surpasses previous state-of-the-art methods in WSOD and WSIS benchmarks with large margins, i.e., average improvements of 7.4% and 8.5%, respectively.

Abstract:
3D facial animation has attracted considerable attention due to its extensive applications in the multimedia field. Audio-driven 3D facial animation has been widely explored with promising results. However, multi-modal 3D facial animation, especially text-guided 3D facial animation, is rarely explored due to the lack of a multi-modal 3D facial animation dataset. To fill this gap, we first construct a large-scale multi-modal 3D facial animation dataset, MMHead, which consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations. Each text annotation contains abstract action and emotion descriptions, fine-grained facial and head movement (i.e., expression and head pose) descriptions, and three possible scenarios that may cause such emotion. Concretely, we integrate five public 2D portrait video datasets, and propose an automatic pipeline to 1) reconstruct 3D facial motion sequences from monocular videos; and 2) obtain hierarchical text annotations with the help of AU detection and ChatGPT. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation. Moreover, a simple but efficient VQ-VAE-based method named MM2Face is proposed to unify the multi-modal information and generate diverse and plausible 3D facial motions, which achieves competitive results on both benchmarks. Extensive experiments and comprehensive analysis demonstrate the significant potential of our dataset and benchmarks in promoting the development of multi-modal 3D facial animation. The dataset will be released at: https://wsj-sjtu.github.io/MMHead/.

Abstract:
3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge behind this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention to integrate language with geometrically coherent sub-clouds through cross-modal group-word attention, which effectively addresses the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance the vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and enhance the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and also 2D referring image segmentation. In particular, RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset. Code is available at https://github.com/heshuting555/RefMask3D.

Abstract:
Synthesizing camera movements from music and dance is highly challenging due to the contradicting requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements involve both continuous sequences of variable lengths and sudden drastic changes to simulate the switching of multiple cameras. However, in previous works, every camera frame is equally treated and this causes jittering and unavoidable smoothing in post-processing. To solve these problems, we propose to integrate animator dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework DanceCamAnimator, which imitates human animation procedures and shows powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines quantitatively and qualitatively. Code will be available at https://github.com/Carmenw1203/DanceCamAnimator-Official.
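The keyframe-plus-tween formulation can be illustrated with a minimal interpolation sketch. DanceCamAnimator predicts its tween functions; the fixed `smoothstep` easing below is only a stand-in, and the function names and the choice to interpolate 3D camera positions are assumptions made for illustration.

```python
def smoothstep(t):
    """Ease-in/ease-out tween: 0 at t=0, 1 at t=1, zero slope at both ends."""
    return t * t * (3.0 - 2.0 * t)

def tween(key_a, key_b, num_frames, ease=smoothstep):
    """Return num_frames camera positions from key_a to key_b inclusive (num_frames >= 2)."""
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # normalized time between the two keyframes
        w = ease(t)               # tween value: how far along the segment we are
        frames.append(tuple(a + w * (b - a) for a, b in zip(key_a, key_b)))
    return frames
```

Because in-between frames are derived from the two keyframes and one tween curve, motion inside a segment is smooth by construction, while an abrupt camera switch can be expressed simply by placing two different keyframes on consecutive frames.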

Abstract:
Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity. However, their widespread adoption is hindered by the intensive computation required during the iterative denoising process. Post-training quantization (PTQ) presents a solution to accelerate sampling, albeit at the expense of sample quality, especially in low-bit settings. Addressing this, our study introduces a unified Quantization Noise Correction Scheme (QNCD), aimed at diminishing quantization noise throughout the sampling process. We identify two primary quantization challenges: intra and inter quantization noise. Intra quantization noise, mainly exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. In addition, inter quantization noise stems from cumulative quantization deviations across the entire denoising process, altering data distributions step-by-step. QNCD combats these through embedding-derived feature smoothing for eliminating intra quantization noise and an effective runtime noise estimation module for dynamically filtering inter quantization noise. Extensive experiments demonstrate that our method outperforms previous quantization methods for diffusion models, achieving lossless results in W4A8 and W8A8 quantization settings on ImageNet (LDM-4). Code is available at: https://github.com/huanpengchu/QNCD.
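
As a rough illustration of why low-bit settings and wider activation ranges increase quantization noise, the sketch below applies generic uniform min-max quantization to a random tensor. It is not the QNCD scheme; the function name `fake_quantize`, the random data, and the injected outlier are assumptions made only for this example.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Uniform min-max quantization followed by dequantization (illustrative)."""
    levels = 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo

rng = np.random.default_rng(0)
acts = rng.normal(size=10_000)  # stand-in for activations

# Mean squared quantization noise at 8 and 4 bits.
noise_8bit = float(np.mean((acts - fake_quantize(acts, 8)) ** 2))
noise_4bit = float(np.mean((acts - fake_quantize(acts, 4)) ** 2))

# A single outlier widens the quantization range, enlarging the step size
# and hence the noise at the same bit width.
acts_wide = np.append(acts, 50.0)
noise_8bit_wide = float(np.mean((acts_wide - fake_quantize(acts_wide, 8)) ** 2))
```

Comparing `noise_4bit` with `noise_8bit`, and `noise_8bit_wide` with `noise_8bit`, shows both effects on this toy input.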

Abstract:
The 3D Gaussian Splatting (3D-GS) technique couples 3D Gaussian primitives with differentiable rasterization to achieve high-quality novel view synthesis results while providing advanced real-time rendering performance. However, due to a flaw in its adaptive density control strategy, 3D-GS frequently suffers from over-reconstruction in intricate scenes containing high-frequency details, leading to blurry rendered images. The underlying reason for this flaw remains under-explored. In this work, we present a comprehensive analysis of the cause of the aforementioned artifacts, namely gradient collision, which prevents large Gaussians in over-reconstructed regions from splitting. To address this issue, we propose the novel homodirectional view-space positional gradient as the criterion for densification. Our strategy efficiently identifies large Gaussians in over-reconstructed regions, and recovers fine details by splitting. We evaluate our proposed method on various challenging datasets. The experimental results indicate that our approach achieves the best rendering quality with reduced or similar memory consumption. Our method is easy to implement and can be incorporated into a wide variety of most recent Gaussian Splatting-based methods. The code is publicly available at https://ty424.github.io/AbsGS.github.io
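
The toy example below sketches what gradient collision could look like numerically. It assumes that the homodirectional gradient is formed by taking per-component absolute values of per-pixel gradients before summation; the abstract does not spell this out, so treat that choice, the gradient values, and the threshold as illustrative assumptions.

```python
import numpy as np

# Hypothetical per-pixel view-space positional gradients (dx, dy) for one
# large Gaussian: pixels on opposite sides pull in opposite directions.
pixel_grads = np.array([
    [0.8, 0.1],
    [-0.7, -0.2],
    [0.6, -0.1],
    [-0.7, 0.2],
])

# Plain summation: opposing directions cancel ("collide").
summed_norm = float(np.linalg.norm(pixel_grads.sum(axis=0)))

# Assumed homodirectional variant: absolute value per component before
# summing, so opposing contributions accumulate instead of cancelling.
homodirectional_norm = float(np.linalg.norm(np.abs(pixel_grads).sum(axis=0)))

threshold = 0.5  # illustrative densification threshold
split_with_plain_sum = summed_norm > threshold
split_with_homodirectional = homodirectional_norm > threshold
```

With these values the plain sum stays below the threshold while the homodirectional sum exceeds it, so only the latter criterion would mark the Gaussian for splitting.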

Abstract:
With the rise of "Metaverse" and "Web 3.0", the Non-Fungible Token (NFT) has emerged as a pivotal digital asset, garnering significant attention. By the end of March 2024, more than 1.7 billion NFTs had been minted across various blockchain platforms. To effectively locate a desired NFT, conducting searches within a vast array of NFTs is essential. The challenge in NFT retrieval is heightened due to the high degree of similarity among different NFTs, regarding regional and semantic aspects. In this paper, we introduce a benchmark dataset named "NFT Top1000 Visual-Text Dataset" (NFT1000), containing 7.56 million image-text pairs collected from the 1000 most famous PFP NFT collections by sales volume on the Ethereum blockchain. Based on this dataset and leveraging the CLIP series of pre-trained models as our foundation, we propose the dynamic masking fine-tuning scheme. This innovative approach results in a 7.4% improvement in the top-1 accuracy rate, while utilizing merely 13% of the total training data (0.79 million vs. 6.1 million). We also propose a robust metric, the Comprehensive Variance Index (CVI), to assess the similarity and retrieval difficulty of visual-text pair data. The dataset will be released as an open-source resource. For more details, please refer to: https://github.com/ShuxunoO/NFT-Net.git

Abstract:
Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.
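
For readers unfamiliar with low-rank adaptation, the sketch below shows a plain LoRA update on a single linear layer. It does not implement HiLoRA or its hierarchical schedule; the dimensions, rank, scaling factor, and the helper name `lora_forward` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 8, 2, 4.0

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B A x; only A and B would be trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
y_init = lora_forward(x, W, A, B, alpha, rank)

# Trainable parameter count versus the full weight matrix.
full_params = W.size
lora_params = A.size + B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `lora_params` values need to be updated during fine-tuning.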

Abstract:
Diffusion models revolutionize image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions (HOIs), especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module to delicately refine the output image for more precise HOI generation. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated images while efficiently reducing computational requirements. We conduct comprehensive experiments on three benchmarks to demonstrate the significant progress in solving text-to-image generation tasks, showcasing ReCorD's ability to render complex interactions accurately by outperforming existing methods in HOI classification score, as well as FID and Verb CLIP-Score. Project website is available at https://alberthkyhky.github.io/ReCorD/.

Abstract:
In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying active muscle regions during physical activity in the wild. To this end, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens up multiple video-based applications in sports and rehabilitation medicine under flexible environment constraints. The proposed MuscleMap dataset is constructed with YouTube videos, specifically targeting High-Intensity Interval Training (HIIT) physical exercise in the wild. To make the AMGE model applicable in real-life situations, it is crucial to ensure that the model can generalize well to numerous types of physical activities not present during training and involving new combinations of activated muscles. To achieve this, our benchmark also covers an evaluation setting where the model is exposed to activity types excluded from the training set. Our experiments reveal that the generalizability of existing architectures adapted for the AMGE task remains a challenge. Therefore, we also propose a new approach, TransM3E, which employs a multi-modality feature fusion mechanism between the video transformer model and the skeleton-based graph convolution model with novel cross-modal knowledge distillation executed on multi-classification tokens. The proposed method surpasses all popular video classification models when dealing with both previously seen and new types of physical activities. The database and code can be found at https://github.com/KPeng9510/MuscleMap.

Abstract:
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring feature representations that align well with images on real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. An Offline Randomized Perturbation (ORP) strategy is introduced. It enriches the diversity of text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings focusing on the character foreground within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at https://github.com/Topdu/OpenOCR.

Abstract:
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.

Abstract:
Current co-speech motion generation approaches usually focus on upper body gestures following speech contents only, and do not support elaborate control of synergistic full-body motion based on text prompts, such as talking while walking. The major challenges are that 1) the existing speech-to-motion datasets only involve highly limited full-body motions, leaving a wide range of common human activities outside the training distribution; and 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes the off-the-shelf text-to-motion dataset as an auxiliary for supplementing the missing full-body motion and prompts. The core technical contributions are two-fold. One is the multi-stage training process which obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between speech-to-motion and text-to-motion datasets. Another is the diffusion-based conditional inference process, which utilizes the separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments are conducted to verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speech and user prompts, which is beyond the ability of existing approaches. Our code, pre-trained models, and videos are available at https://bohongchen.github.io/SynTalker-Page/.

Abstract:
The neural radiance field (NeRF) has made significant strides in representing 3D scenes and synthesizing novel views. Despite its advancements, the high computational costs of NeRF have posed challenges for its deployment in resource-constrained environments and real-time applications. As an alternative to NeRF-like neural rendering methods, 3D Gaussian Splatting (3DGS) offers rapid rendering speeds while maintaining excellent image quality. However, as it represents objects and scenes using a myriad of Gaussians, it requires substantial storage to achieve high-quality representation. To mitigate the storage overhead, we propose Factorized 3D Gaussian Splatting (F-3DGS), a novel approach that drastically reduces storage requirements while preserving image quality. Inspired by classical matrix and tensor factorization techniques, our method represents and approximates dense clusters of Gaussians with significantly fewer Gaussians through efficient factorization. We aim to efficiently represent dense 3D Gaussians by approximating them with a limited amount of information for each axis and their combinations. This method allows us to encode a substantially large number of Gaussians along with their essential attributes (such as color, scale, and rotation) necessary for rendering, using a relatively small number of elements. Extensive experimental results demonstrate that F-3DGS achieves a significant reduction in storage costs while maintaining comparable quality in rendered images. Our project page is available at https://xiangyu1sun.github.io/Factorize-3DGS/.
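
The storage argument behind per-axis factorization can be seen in a small sketch. The example below only factorizes Gaussian center positions on a regular grid, which is an assumption made for illustration; how F-3DGS factorizes attributes such as color, scale, and rotation is not described in the abstract and is not modeled here.

```python
import numpy as np

n = 10  # number of stored coordinates per axis (illustrative)

# Stored: one 1D coordinate vector per axis, i.e. 3 * n numbers.
xs = np.linspace(0.0, 1.0, n)
ys = np.linspace(0.0, 1.0, n)
zs = np.linspace(0.0, 1.0, n)

# Reconstructed: every (x, y, z) combination, i.e. n**3 Gaussian centers.
grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)
centers = grid.reshape(-1, 3)

stored_values = xs.size + ys.size + zs.size  # values kept on disk
dense_values = centers.size                  # values needed without factorization
```

Here 30 stored numbers expand into 1000 centers (3000 coordinates), which is the kind of saving that combining per-axis information makes possible.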

Abstract:
Reducing atmospheric haze and enhancing image clarity are crucial for computer vision applications. The lack of real-life hazy ground truth images necessitates synthetic datasets, which often lack diverse haze types, impeding effective haze type classification and dehazing algorithm selection. This research introduces the HazeSpace2M dataset, a collection of over 2 million images designed to enhance dehazing through haze type classification. HazeSpace2M includes diverse scenes with 10 haze intensity levels, featuring Fog, Cloud, and Environmental Haze (EH). Using the dataset, we introduce a technique of haze type classification followed by specialized dehazers to clear hazy images. Unlike conventional methods, our approach classifies haze types before applying type-specific dehazing, improving clarity in real-life hazy images. Benchmarking with state-of-the-art (SOTA) models, ResNet50 and AlexNet achieve 92.75% and 92.50% accuracy, respectively, against existing synthetic datasets. However, these models achieve only 80% and 70% accuracy, respectively, against our Real Hazy Testset (RHT), highlighting the challenging nature of our HazeSpace2M dataset. Additional experiments show that haze type classification followed by specialized dehazing improves results by 2.41% in PSNR, 17.14% in SSIM, and 10.2% in MSE over general dehazers. Moreover, when testing with SOTA dehazing models, we found that applying our proposed framework significantly improves their performance. These results underscore the significance of HazeSpace2M and our proposed framework in addressing atmospheric haze in multimedia processing. The complete code and dataset are available on GitHub (https://github.com/tanvirnwu/HazeSpace2M).

Abstract:
Previous works have shown that reducing parameter overhead and computations for transformer-based single image super-resolution (SISR) models (e.g., SwinIR) usually leads to a reduction of performance. In this paper, we present GRFormer, an efficient and lightweight method, which not only reduces the parameter overhead and computations, but also greatly improves performance. The core of GRFormer is Grouped Residual Self-Attention (GRSA), which is specifically oriented towards two fundamental components. Firstly, it introduces a novel grouped residual layer (GRL) to replace the Query, Key, Value (QKV) linear layer in self-attention, aimed at efficiently reducing parameter overhead, computations, and performance loss at the same time. Secondly, it integrates a compact Exponential-Space Relative Position Bias (ES-RPB) as a substitute for the original relative position bias to improve the ability to represent position information while further minimizing the parameter count. Extensive experimental results demonstrate that GRFormer outperforms state-of-the-art transformer-based methods for ×2, ×3 and ×4 SISR tasks, notably outperforming SOTA by a maximum PSNR of 0.23dB when trained on the DIV2K dataset, while reducing the number of parameters and MACs by about 60% and 49%, respectively, in the self-attention module alone. We hope that our simple and effective method, which can be easily applied to SR models based on window-division self-attention, can serve as a useful tool for further research in image super-resolution. The code is available at https://github.com/sisrformer/GRFormer.

Abstract:
The massive generation of multimodal fake news involving both text and images exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training restricts the capability of classical detectors to obtain open-world facts. While Large Vision-Language Models (LVLMs) have encoded rich world knowledge, they are not inherently tailored for combating fake news and struggle to comprehend local forgery details. In this paper, we propose FKA-Owl, a novel framework that leverages forgery-specific knowledge to augment LVLMs, enabling them to reason about manipulations effectively. The augmented forgery-specific knowledge includes semantic correlation between text and images, and artifact trace in image manipulation. To inject these two kinds of knowledge into the LVLM, we design two specialized modules to establish their representations, respectively. The encoded knowledge embeddings are then incorporated into LVLMs. Extensive experiments on the public benchmark demonstrate that FKA-Owl achieves superior cross-domain performance compared to previous methods. Code is publicly available at https://liuxuannan.github.io/FKA_Owl.github.io/.

Abstract:
Incorporating a customized object into image generation presents an attractive feature in text-to-image (T2I) generation. Some methods finetune T2I models for each object individually at test-time, which tend to be overfitted and time-consuming. Others train an extra encoder to extract object visual information for customization efficiently but struggle to preserve the object's identity. To address these limitations, we present CustomNet, a unified encoder-based object customization framework that explicitly incorporates 3D novel view synthesis capabilities into the customization process. This integration facilitates the adjustment of spatial positions and viewpoints, producing diverse outputs while effectively preserving the object's identity. To train our model effectively, we propose a dataset construction pipeline to better handle real-world objects and complex backgrounds. Additionally, we introduce delicate designs that enable location control and flexible background control through textual descriptions or user-defined backgrounds. Our method allows for object customization without the need of test-time optimization, providing simultaneous control over viewpoints, location, and text. Experimental results show that our method outperforms other customization methods regarding identity preservation, diversity, and harmony. Codes are available at https://github.com/TencentARC/CustomNet.

Abstract:
Video Shadow Detection (VSD) aims to detect shadow masks in a frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadows. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can capture not only the temporal information but also the shadow property. Extensive experiments show that the performance of our approach surpasses the state-of-the-art methods, verifying the effectiveness of our components. We release the codes at https://github.com/haipengzhou856/TBGDiff.

Abstract:
Large language model (LLM) based knowledge graph completion (KGC) aims to predict the missing triples in the KGs with LLMs. However, research about LLM-based KGC fails to sufficiently harness LLMs' inference proficiencies, overlooking critical structural information integral to KGs. In this paper, we explore methods to incorporate structural information into the LLMs, with the overarching goal of facilitating structure-aware reasoning. We first discuss existing LLM paradigms such as in-context learning and instruction tuning, proposing basic structural information injection approaches. Then we propose a Knowledge Prefix Adapter (KoPA) to fulfill this stated goal. KoPA uses a structural pre-training phase to comprehend the intricate entities and relations within KGs, representing them as structural embeddings. Then KoPA communicates such cross-modal structural information understanding to the LLMs through a knowledge prefix adapter which projects the structural embeddings into the textual space and obtains virtual knowledge tokens positioned as a prefix of the input prompt. We conduct comprehensive experiments and provide incisive analysis. Our code and data are available at https://github.com/zjukg/KoPA.
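
The prefix mechanism described above can be sketched with plain arrays. This is a minimal illustration, assuming a single linear projection as the adapter and random placeholder embeddings; the dimensions and the names `W_proj`, `triple_emb`, and `prompt_emb` are hypothetical and not taken from the KoPA code.

```python
import numpy as np

rng = np.random.default_rng(0)
struct_dim, token_dim, prompt_len = 32, 64, 5

# Hypothetical structural embeddings for one triple (head, relation, tail).
triple_emb = rng.normal(size=(3, struct_dim))

# Assumed adapter: one linear projection from structural to textual space.
W_proj = rng.normal(size=(struct_dim, token_dim)) * 0.1
b_proj = np.zeros(token_dim)

# Virtual knowledge tokens, one per triple element.
virtual_tokens = triple_emb @ W_proj + b_proj

# Token embeddings of the textual prompt (random placeholders).
prompt_emb = rng.normal(size=(prompt_len, token_dim))

# The virtual tokens are placed in front of the prompt as a prefix.
llm_input = np.concatenate([virtual_tokens, prompt_emb], axis=0)
```

The resulting `llm_input` has three prefix rows followed by the prompt rows, all in the same token embedding dimension.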

Abstract:
Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of segmenting videos into meaningful temporal chunks, finds utility in various applications. This paper demonstrates that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed. We contribute to addressing this challenge by reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD base model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image backbone based GEBDs contain plenty of redundancy, motivating us to "modernize" each component for efficiency. We also show that the GEBDs using image backbones conducting spatial-then-temporal greedy feature learning can suffer from a distraction issue, which might be the inefficient villain for GEBD and can be effectively addressed by using a video-domain backbone. The outcome of our exploration, EfficientGEBD, significantly outperforms the previous SOTA methods by up to 1.7% performance gain and 280% speedup under the same backbone. The code is available at https://github.com/Ziwei-Zheng/EfficientGEBD.

Abstract:
Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information, thereby understanding and creating content of the physical world coherently like human beings. Previous work on cross-modal coherence modeling attempted to leverage the order information from another modality to assist the coherence recovering of the target modality. Despite its effectiveness, labeled associated coherency information is not always available and might be costly to acquire, making the cross-modal guidance hard to leverage. To tackle this challenge, this paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency, and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. More specifically, it leverages high-confidence predicted pairwise order in one modality as reference information to guide the coherence modeling in another. An iterative learning paradigm is further designed to jointly optimize the coherence modeling in two modalities with selected guidance from each other. The iterative cross-modal boosting also functions in inference to further enhance coherence prediction in each modality. Experimental results on two public datasets have demonstrated that the proposed method outperforms existing methods for cross-modal coherence modeling tasks. Major technical modules have been shown to be effective through ablation studies. Codes are available at: https://github.com/scvready123/IterWeGO.

Abstract:
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNN) or Transformers for deep feature extraction. However, these methods still deliver low performance and poor generalization in complex cases. Recently, Segment Anything Model (SAM) has been proposed as a visual fundamental model, which gives strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts of target objects, which are unavailable in SOD. Additionally, SAM lacks the utilization of multi-scale and multi-level information, as well as the incorporation of fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from the SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to incorporate SAM with fine-grained details. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization on other segmentation tasks. The source code is released at https://github.com/BellyBeauty/MDSAM

Abstract:
Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interests in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement levels, the introduced Dialogue-Aware Transformer considers both the participant's behavior and cues from their conversational partners. Our method was rigorously tested in the Multi-Domain Engagement Estimation Challenge held by MultiMediate'24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets. The source code will be available at https://github.com/MSA-LMC/DAT.
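
The CCC reported above is the concordance correlation coefficient, which penalizes both weak correlation and systematic offset between predictions and labels. The sketch below uses the standard population-statistics definition; the function name `concordance_cc` and the sample values are chosen only for illustration.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient using population statistics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    denom = y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2
    return float(2.0 * cov / denom)

# Identical predictions give perfect agreement.
perfect = concordance_cc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])

# A constant offset keeps correlation perfect but lowers the CCC.
shifted = concordance_cc([0.1, 0.5, 0.9], [0.3, 0.7, 1.1])
```

Unlike Pearson correlation, the `shifted` case scores below 1 even though the two series move together exactly.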

Abstract:
The audio-visual event localization task requires identifying concurrent visual and auditory events from unconstrained videos within a model, locating them, and classifying their category. The efficient extraction and integration of audio and visual modal information have always been challenging in this field. In this paper, we introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information. We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance between audio and visual clues, thus reducing inconsistencies between modalities. Moreover, we have observed that existing methods have difficulty distinguishing between similar background and event and lack the fine-grained features for event classification. Consequently, we employ background-event contrast enhancement to increase the discrimination of fused features and fine-tuned pre-trained model to extract more discernible features from complex multimodal inputs. Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task, proving the effectiveness of our proposed methods in handling complex multimodal learning and event localization in unconstrained videos. Code is available at https://github.com/Brain-Cog-Lab/CACE-Net.

Abstract:
LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences, significantly modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code is publicly available at https://github.com/Terminal-K/MambaMOS

Abstract:
Video try-on is challenging and has not been well tackled in previous works. The main obstacle lies in preserving the clothing details and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we leverage the Kalman filter to smooth the tunnel and inject its position embedding into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels. Equipped with these techniques, Tunnel Try-on keeps fine clothing details and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos. The project page is https://mengtingchen.github.io/tunnel-try-on-page/.
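
To give a sense of how a Kalman filter can steady a jittery crop region, the sketch below filters a one-dimensional sequence of box centers with a scalar constant-position model. The model, the noise variances, and the synthetic jitter are assumptions for illustration; the abstract does not specify the filter configuration used for the tunnel.

```python
def kalman_filter_1d(measurements, process_var=0.01, meas_var=25.0):
    """Scalar Kalman filter with a constant-position model (illustrative)."""
    x, p = float(measurements[0]), 1.0  # initial state and variance
    out = [x]
    for z in measurements[1:]:
        p = p + process_var      # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)   # Kalman gain
        x = x + k * (z - x)      # update with the new measurement
        p = (1.0 - k) * p
        out.append(x)
    return out

def max_jump(seq):
    """Largest frame-to-frame change in a sequence."""
    return max(abs(b - a) for a, b in zip(seq, seq[1:]))

# Synthetic box centers that jitter by +/- 5 pixels around 100.
raw_centers = [100.0 + (5.0 if i % 2 == 0 else -5.0) for i in range(20)]
smooth_centers = kalman_filter_1d(raw_centers)
```

On this input the raw centers jump by 10 pixels between frames, while the filtered centers move by well under one pixel per frame.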

Abstract:
A plethora of text-guided image editing methods has recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models, especially Stable Diffusion. Despite the success of diffusion models in producing high-quality images, their application to small object generation has been limited due to difficulties in aligning cross-modal attention maps between text and these objects. Our approach offers a training-free method that significantly mitigates this alignment issue with local and global attention guidance, enhancing the model's ability to accurately render small objects in accordance with textual descriptions. We detail the methodology in our approach, emphasizing its divergence from traditional generation techniques and highlighting its advantages. More importantly, we also provide SOEBench (Small Object Editing), a standardized benchmark for quantitatively evaluating text-based small object generation, collected from MSCOCO[22] and OpenImage[18]. Preliminary results demonstrate the effectiveness of our method, showing marked improvements in the fidelity and accuracy of small object generation compared to existing models. This advancement not only contributes to the field of AI and computer vision but also opens up new possibilities for applications in various industries where precise image generation is critical. We will release our dataset on our project page: https://soebench.github.io/

Abstract:
Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rain and snow. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the 'trinity' contrastive scheme. This scheme utilizes the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend noise-level trinity to encompass more generic feature and image levels, establishing a multi-level contrast to distribute the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions. Source code and data are available at https://github.com/wangjiyuan9/D4RD.

Abstract:
Cross-modal (e.g. image-text, video-text) retrieval is an important task in the fields of information retrieval and multimodal vision-language understanding. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in comprehensively assessing model abilities, especially in temporal understanding, so that large-scale image-text pre-trained models can already achieve zero-shot performance comparable to that of video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and write captions for qualified videos. We further adopt GPT-4 to generate additional captions based on the human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enhance the use of harder negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experimental analysis proves that RTime indeed poses new and higher challenges to video-text retrieval. We release our RTime dataset at https://github.com/qyr0403/Reversed-in-Time to further advance video-text retrieval and multimodal understanding research.
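
The reversal step can be pictured with a minimal sketch: a clip is a sequence of frames, and its temporally reversed copy serves as a hard negative for the original caption. The function names, the use of string frame identifiers, the dictionary layout, and the symmetry check are all hypothetical choices for illustration; in RTime, reversibility is judged by human annotators.

```python
from typing import Dict, List, Optional, Sequence


def reverse_video(frames: Sequence[str]) -> List[str]:
    """Return the frames in reversed temporal order (input is not modified)."""
    return list(reversed(frames))


def make_reversed_pair(frames: Sequence[str],
                       caption: str,
                       reversed_caption: str) -> Optional[Dict[str, object]]:
    """Pair a clip with its reversed copy as a hard negative.

    Returns None for temporally symmetric clips, where reversal would
    produce an identical video and therefore no useful negative.
    """
    reversed_frames = reverse_video(frames)
    if reversed_frames == list(frames):
        return None
    return {
        "video": list(frames),
        "caption": caption,
        "hard_negative_video": reversed_frames,
        "hard_negative_caption": reversed_caption,
    }
```

Both clips share the same static content, so only a model that attends to frame order can match each one to the right caption.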

Abstract:
Diffusion models have achieved remarkable success in generating realistic images but struggle to generate accurate human hands, often producing incorrect finger counts or irregular shapes. This difficulty arises from the complex task of learning the physical structure and pose of hands from training images, which involves extensive deformations and occlusions. For correct hand generation, our paper introduces a lightweight post-processing solution called HandRefiner. HandRefiner employs a conditional inpainting approach to rectify malformed hands while leaving other parts of the image untouched. We leverage a hand mesh reconstruction model that consistently adheres to the correct number of fingers and hand shape, while also being capable of fitting the desired hand pose in the generated image. Given an image whose generation failed due to malformed hands, we utilize ControlNet modules to re-inject such correct hand information. Additionally, we uncover a phase transition phenomenon within ControlNet as we vary the control strength. It enables us to take advantage of more readily available synthetic data without suffering from the domain gap between realistic and synthetic hands. Experiments demonstrate that HandRefiner can significantly improve the generation quality quantitatively and qualitatively. The code is available at https://github.com/wenquanlu/HandRefiner.

Abstract:
Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block combines the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. SEAD adeptly fuses audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates significant improvements in Fréchet Gesture Distance (FGD), diversity scores, and beat alignment, achieving state-of-the-art performance in co-speech gesture generation.

Abstract:
Image composition involves seamlessly integrating given objects into a specific visual context. Current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only impedes their swift implementation but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related tokens to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively. The code is available at https://github.com/CodeGoat24/PrimeComposer.

Abstract:
Generating high-quality meshes with complex structures and realistic surfaces is the primary goal of 3D generative models. Existing methods typically employ sequence data or deformable tetrahedral grids for mesh generation. However, sequence-based methods have difficulty producing complex structures with many faces due to memory limits. The deformable tetrahedral grid-based method MeshDiffusion fails to recover realistic surfaces due to the inherent ambiguity in deformable grids. We propose the GenUDC framework to address these challenges by leveraging the Unsigned Dual Contouring (UDC) as the mesh representation. UDC discretizes a mesh in a regular grid and divides it into the face and vertex parts, recovering both complex structures and fine details. As a result, the one-to-one mapping between UDC and mesh resolves the ambiguity problem. In addition, GenUDC adopts a two-stage, coarse-to-fine generative process for 3D mesh generation. It first generates the face part as a rough shape and then the vertex part to craft a detailed shape. Extensive evaluations demonstrate the superiority of UDC as a mesh representation and the favorable performance of GenUDC in mesh generation. The code and trained models are available at https://github.com/TrepangCat/GenUDC.

Abstract:
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from self-supervised learning (SSL) models and audio codecs. The toolkit offers significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. It features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.

Abstract:
Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model's overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework, FodFoM, that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2's image captioning capability, CLIP's vision-language knowledge, and Stable Diffusion's image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The code is available at https://github.com/Cverchen/ACMMM2024-FodFoM.

Abstract:
3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. The dataset, model, and code are available at https://quyans.github.io/GOI-Hyperplane/.
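
To make the contrast with fixed-threshold selection concrete, the sketch below keeps a feature when it falls on the positive side of a hyperplane `w·f + b > 0`, then fine-tunes `w` and `b` against binary labels. The logistic loss, the plain gradient-descent update, the learning rate, and the use of per-feature labels as a stand-in for RES-derived supervision are assumptions for illustration, not the training procedure described in the paper.

```python
import math
from typing import List, Sequence, Tuple


def hyperplane_score(feature: Sequence[float],
                     weight: Sequence[float],
                     bias: float) -> float:
    """Signed score of a feature relative to the hyperplane."""
    return sum(w * f for w, f in zip(weight, feature)) + bias


def select_features(features: Sequence[Sequence[float]],
                    weight: Sequence[float],
                    bias: float) -> List[int]:
    """Indices of features on the positive side of the hyperplane."""
    return [i for i, f in enumerate(features)
            if hyperplane_score(f, weight, bias) > 0.0]


def finetune_hyperplane(features: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        weight: Sequence[float],
                        bias: float,
                        lr: float = 0.5,
                        steps: int = 200) -> Tuple[List[float], float]:
    """Gradient descent on the mean logistic loss (assumed objective)."""
    w, b, n = list(weight), bias, len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-hyperplane_score(f, w, b)))
            err = p - y  # derivative of the logistic loss w.r.t. the score
            for j, fj in enumerate(f):
                grad_w[j] += err * fj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

In this toy setting the hyperplane starts from a query-like direction and shifts so that a borderline feature is excluded, which a single global distance threshold would have to trade off against recall.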

Abstract:
Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches such as training character classifiers, which require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.

Abstract:
Spiking neural networks (SNNs) have garnered significant attention for their low power consumption and high biological interpretability. Their rich spatio-temporal information processing capability and event-driven nature make them well-suited for neuromorphic datasets. However, current SNNs struggle to balance accuracy and latency in classifying these datasets. In this paper, we propose the Hybrid Step-wise Distillation (HSD) method, tailored for neuromorphic datasets, to mitigate the notable decline in performance at lower time steps. Our work disentangles the dependency between the number of event frames and the time steps of SNNs, utilizing more event frames during the training stage to improve performance, while using fewer event frames during the inference stage to reduce latency. Nevertheless, the average output of SNNs across all time steps is susceptible to individual time steps with abnormal outputs, particularly at extremely low time steps. To tackle this issue, we implement a Step-wise Knowledge Distillation (SKD) module that considers variations in the output distribution of SNNs at each time step. Empirical evidence demonstrates that our method yields competitive performance in classification tasks on neuromorphic datasets, especially at lower time steps. Our code will be available at: https://github.com/hsw0929/HSD.

Abstract:
Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various noises (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we discover that compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semi-Analytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS. Specifically, a skeleton estimation and disentanglement module is proposed to estimate the 3D skeletons from a video and decouple them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate initial pose parameters and BSF leverages bone length to regress bone-aligned shape parameters. Finally, MCR combines human motion representation with image features to refine the initial human model parameters. Extensive experiments demonstrate that our ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/TangTao-PKU/ARTS.

Abstract:
A zero-shot human-object interaction (HOI) detector is capable of generalizing to HOI categories not encountered during training. Inspired by the impressive zero-shot capabilities offered by CLIP, the latest methods strive to leverage CLIP embeddings for improving zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, inevitably resulting in seen-unseen confusion for the model during inference. Besides, we find that using prompt-tuning and adapters further increases the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation instead of feature extraction only. To achieve this, we develop a CLIP-injected feature generator in accordance with the generation of human, object and union features. Then, we extract realistic features of seen samples and mix them with synthetic features, allowing the model to train on seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch, and a multi-knowledge prototype bank in an image-wise HOI recognition branch. Extensive experiments on the HICO-DET benchmark demonstrate that our HOIGen achieves superior performance for both seen and unseen classes under various zero-shot settings, compared with other top-performing methods. Code is available at: https://github.com/soberguo/HOIGen.

Abstract:
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.

Abstract:
3D textured face reconstruction from sketches applicable in many scenarios such as animation, 3D avatars, artistic design, missing people search, etc., is a highly promising but underdeveloped research topic. On the one hand, the stylistic diversity of sketches leads to existing sketch-to-3D-face methods only being able to handle pose-limited and realistically shaded sketches. On the other hand, texture plays a vital role in representing facial appearance, yet sketches lack this information, necessitating additional texture control in the reconstruction process. This paper proposes a novel method for reconstructing controllable textured and detailed 3D faces from sketches, named S2TD-Face. S2TD-Face introduces a two-stage geometry reconstruction framework that directly reconstructs detailed geometry from the input sketch. To keep geometry consistent with the delicate strokes of the sketch, we propose a novel sketch-to-geometry loss that ensures the reconstruction accurately fits the input features like dimples and wrinkles. Our training strategies do not rely on hard-to-obtain 3D face scanning data or labor-intensive hand-drawn sketches. Furthermore, S2TD-Face introduces a texture control module utilizing text prompts to select the most suitable textures from a library and seamlessly integrate them into the geometry, resulting in a 3D detailed face with controllable texture. S2TD-Face surpasses existing state-of-the-art methods in extensive quantitative and qualitative experiments. Our project is available at https://github.com/wang-zidu/S2TD-Face.

Abstract:
To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods by up to 4.9%. The code is available at https://github.com/XinghaoWu/FedDecomp.
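
The additive decomposition can be sketched as follows: each client's effective weight is the sum of a shared matrix and a low-rank personalized matrix. Writing the low-rank part as a product `B @ A` and averaging only the shared matrices on the server are assumptions made for this example (a common way to realize such a scheme); the abstract does not give these details, and the alternating training strategy is not shown.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (m x r) and an (r x n) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def effective_weight(shared: Matrix, pers_b: Matrix, pers_a: Matrix) -> Matrix:
    """W = shared + B @ A, where B @ A is the low-rank personalized part."""
    return add(shared, matmul(pers_b, pers_a))


def aggregate_shared(shared_list: Sequence[Matrix]) -> Matrix:
    """Average the shared matrices across clients (assumed FedAvg-style).

    The personalized factors B and A never leave the client.
    """
    n = len(shared_list)
    rows, cols = len(shared_list[0]), len(shared_list[0][0])
    return [[sum(s[i][j] for s in shared_list) / n for j in range(cols)]
            for i in range(rows)]
```

With rank `r` much smaller than the matrix dimensions, the personalized factors add only `r * (m + n)` parameters per layer, which matches the observation that client-specific knowledge needs less capacity.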

Abstract:
In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods to explore better ways of fusing them while preserving their respective characteristics. Mamba, as a state space model, has emerged in the field of natural language processing. Recently, many studies have attempted to extend Mamba to vision tasks. However, because images differ in nature from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, the sequence modeling ability of Mamba only covers spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block can improve local information perception of the network and simultaneously learn local and global spatial information. Furthermore, we propose the state sharing technique to enhance spatial details and integrate spatial and spectral information. Finally, the overall network is a multi-scale structure based on vision Mamba, called LE-Mamba. Extensive experiments show that the proposed methods achieve state-of-the-art results on multispectral pansharpening and multispectral and hyperspectral image fusion datasets, and demonstrate the effectiveness of the proposed approach. Code can be accessed at https://github.com/294coder/Efficient-MIF.

Abstract:
Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI, allowing wonderful image generation with natural-language text prompts. However, the lack of controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts a pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without model training, fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify the superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. Our project is publicly available at: https://xianggao1102.github.io/FBSDiff_webpage/.
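
The idea of frequency band substitution can be illustrated on a 1D signal: transform both the generated and the reference features with a DCT, copy one band of coefficients from the reference, and transform back. The 1D setting, the orthonormal DCT-II convention, and the half-open band `[low, high)` are simplifications chosen for this sketch; the actual method operates on diffusion features and its exact layer design is not reproduced here.

```python
import math
from typing import List, Sequence


def _scale(k: int, n: int) -> float:
    """Normalization factor of the orthonormal DCT-II basis."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def idct(c: Sequence[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def substitute_band(generated: Sequence[float],
                    reference: Sequence[float],
                    low: int,
                    high: int) -> List[float]:
    """Replace DCT coefficients with index in [low, high) by the reference's."""
    if len(generated) != len(reference):
        raise ValueError("signals must have the same length")
    g, r = dct(generated), dct(reference)
    mixed = [r[k] if low <= k < high else g[k] for k in range(len(g))]
    return idct(mixed)
```

Substituting only the lowest band transfers coarse content such as the mean level, while a wider band passes more of the reference through; this mirrors the role of band type and bandwidth as control knobs.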

Abstract:
In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) aimed at handling zero-shot Camouflaged Object Detection (COD) by leveraging the powerful capabilities of Multimodal Large Language Models (MLLMs). Current COD methodologies predominantly rely on supervised learning models that demand extensive and accurately annotated datasets, resulting in weak generalization. Our research proposes a zero-shot MMCPF that circumvents these challenges. Although MLLMs hold significant potential for broad applications, their effectiveness in COD is limited, and they often misinterpret camouflaged objects. To address this challenge, we further propose a strategic enhancement called the Chain of Visual Perception (CoVP), which significantly improves the perceptual capabilities of MLLMs in camouflaged scenes by leveraging both linguistic and visual cues more effectively. We validate the effectiveness of MMCPF on five widely used COD datasets: CAMO, COD10K, NC4K, MoCA-Mask and OVCamo. Experiments show that MMCPF can outperform all existing state-of-the-art zero-shot COD methods, and achieve competitive performance compared to weakly-supervised and fully-supervised methods, which demonstrates the potential of MMCPF. The GitHub repository for this paper is https://github.com/luckybird1994/MMCPF.

Abstract:
Test-time adaptation (TTA) aims to adapt a model, initially trained on training data, to test data with potential distribution shifts. Most existing TTA methods focus on classification problems. The pronounced success of classification might lead numerous newcomers and engineers to assume that classic TTA techniques can be directly applied to the more challenging task of semantic segmentation. However, this belief is still an open question. In this paper, we investigate the applicability of existing classic TTA strategies in semantic segmentation. Our comprehensive results have led to three key observations. First, the classic normalization updating strategy only brings slight performance improvement, and in some cases, it might even adversely affect the results. Even with the application of advanced distribution estimation techniques like batch renormalization, the problem remains unresolved. Second, although the teacher-student scheme does enhance the training stability for segmentation TTA in the presence of noisy pseudo-labels and temporal correlation, it cannot directly result in performance improvement compared to the original model without TTA under complex data distribution. Third, segmentation TTA suffers a severe long-tailed class-imbalance problem, which is substantially more complex than that in TTA for classification. This long-tailed challenge negatively affects segmentation TTA performance, even when the accuracy of pseudo-labels is high. Besides those observations, we find that visual prompt tuning (VisPT) is promising in segmentation TTA and propose a novel method named TTAP. The outstanding performance of TTAP has also been verified. We hope the community can give more attention to this challenging, yet important, segmentation TTA task in the future. The source code is available at: https://github.com/ycarobot/TTAP.

Abstract:
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders. There is currently a lack of research in this area, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers with restrictive length; 2) the content and scenarios within the videos are excessively monotonous, transmitting allegories and emotions that are overly simplistic. To bridge the gap to real-world applications, we introduce a large-scale Video Subjective Multi-modal Evaluation dataset, namely Video-SME. Specifically, we collected real changes in Electroencephalographic (EEG) and eye-tracking regions from different demographics while they viewed identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different demographics, video elements, EEG and eye-tracking indicators. HMLLM could bridge semantic gaps across rich modalities and integrate information beyond different modalities to perform logical reasoning. Extensive experimental evaluations on Video-SME and other additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/mininglamp-MLLM/HMLLM

Abstract:
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that address noise removal through iterative processes, AudioLCM integrates Consistency Models (CMs) into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective. https://AudioLCM.github.io/. 
Code is available at https://github.com/Text-to-Audio/AudioLCM.

Abstract:
The detection and localization of deepfake content, particularly when small fake segments are seamlessly mixed with real videos, remains a significant challenge in the field of digital media security. Based on the recently released AV-Deepfake1M dataset, which contains more than 1 million manipulated videos across more than 2,000 subjects, we introduce the 1M-Deepfakes Detection Challenge. This challenge is designed to engage the research community in developing advanced methods for detecting and localizing deepfake manipulations within the large-scale high-realistic audio-visual dataset. The participants can access the AV-Deepfake1M dataset and are required to submit their inference results for evaluation across the metrics for detection or localization tasks. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection and localization systems. Evaluation scripts, baseline models, and accompanying code will be available on https://github.com/ControlNet/AV-Deepfake1M.

Abstract:
This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce the VLM to generate false but hard negative responses, which help the model learn to produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised data, it achieves a 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations is conducted to reveal the latent mechanism of the approach, which also indicates its potential for further scaling.
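
The preference-alignment step described above can be illustrated with the standard direct preference optimization (DPO) loss for a single preference pair. This is a minimal sketch, not the authors' code: it assumes the chosen response comes from the original image and the rejected response from the augmented image, and that the summed log-probabilities of each response under the trained policy and a frozen reference model are already available. The function name, argument names, and the value of `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments are summed token log-probabilities of a full response.
    beta is a hypothetical default; the abstract does not state a value.
    """
    # How much more the policy prefers the chosen response over the
    # rejected one, measured relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In a full training loop the same quantity would be averaged over a batch and back-propagated through the policy's log-probabilities only; the reference model stays fixed.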

Abstract:
With the prosperity of intelligent surveillance, multiple cameras have been applied to localize pedestrians more accurately. However, previous methods rely on laborious annotations of pedestrians in every frame and camera view. Therefore, we propose in this paper an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to learn an annotation-free detector via vision-language models and 2D-3D cross-modal mapping: 1) Firstly, Semantic-aware Iterative Segmentation (SIS) is proposed to extract unsupervised representations of multi-view images, which are converted into 2D masks as pseudo labels, via our proposed iterative PCA and zero-shot semantic classes from vision-language models; 2) Secondly, we propose a Geometry-aware Volume-based Detector (GVD) to encode multi-view 2D images end-to-end into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D rendering losses with SIS pseudo labels; 3) Thirdly, for better detection results, i.e., the 3D density projected on Bird's-Eye-View (BEV), we propose Vertical-aware BEV Regularization (VBR) to constrain pedestrians to be vertical, as in natural poses. Extensive experiments on the popular multi-view pedestrian detection benchmarks Wildtrack, Terrace, and MultiviewX show that our proposed UMPD, the first fully-unsupervised method to the best of our knowledge, performs competitively with the previous state-of-the-art supervised methods. Code is available at https://github.com/lmy98129/UMPD.

Abstract:
The vision transformer family has dominated the satellite pan-sharpening field, driven by the global spatial information modeling mechanism of its core self-attention ingredient. The standard modeling rule within these promising pan-sharpening methods is to roughly stack transformer variants in a cascaded manner. Despite the remarkable advancement, their success may come at a huge cost in model parameters and FLOPs, thus preventing their application on low-resource satellites. To address this challenge between favorable performance and expensive computation, we tailor an efficient linearly-evolved transformer variant and employ it to construct a lightweight pan-sharpening framework. In detail, we delve into the popular cascaded transformer modeling with cutting-edge methods and develop an alternative 1-order linearly-evolved transformer variant with a 1-dimensional linear convolution chain to achieve the same function. In this way, our proposed method is capable of benefiting from the cascaded modeling rule while achieving favorable performance in an efficient manner. Extensive experiments over multiple satellite datasets suggest that our proposed method achieves competitive performance against other state-of-the-art methods with fewer computational resources. Further, the consistently favorable performance has been verified over the hyper-spectral image fusion task. Our main focus is to provide an alternative global modeling framework with an efficient structure. The code is publicly available at https://github.com/coder-JMHou/LFormer.

Abstract:
The development of person search techniques has been greatly promoted in recent years for its superior practicality and challenging goals. Despite their significant progress, existing person search models still lack the ability to continually learn from increasing real-world data and adaptively process input from different domains. To this end, this work introduces the continual person search task that sequentially learns on multiple domains and then performs person search on all seen domains. This requires balancing the stability and plasticity of the model to continually learn new knowledge without catastrophic forgetting. For this, we propose a Prompt-based Continual Person Search (PoPS) model in this paper. First, we design a compositional person search transformer to construct an effective pre-trained transformer without exhaustive pre-training from scratch on large-scale person search data. This serves as the fundamental for prompt-based continual learning. On top of that, we design a domain incremental prompt pool with a diverse attribute matching module. For each domain, we independently learn a set of prompts to encode the domain-oriented knowledge. Meanwhile, we jointly learn a group of diverse attribute projections and prototype embeddings to capture discriminative domain attributes. By matching an input image with the learned attributes across domains, the learned prompts can be properly selected for model inference. Extensive experiments are conducted to validate the proposed method for continual person search. The source code is available at https://github.com/PatrickZad/PoPS.

Abstract:
Multi-relational graph clustering has demonstrated remarkable success in uncovering underlying patterns in complex networks. Representative methods manage to align different views motivated by advances in contrastive learning. Our empirical study finds the pervasive presence of imbalance in real-world graphs, which is in principle contradictory to the motivation of alignment. In this paper, we first propose a novel metric, the Aggregation Class Distance, to empirically quantify structural disparities among different graphs. To address the challenge of view imbalance, we propose Balanced Multi-Relational Graph Clustering (BMGC), comprising unsupervised dominant view mining and dual signals guided representation learning. It dynamically mines the dominant view throughout the training process, synergistically improving clustering performance with representation learning. Theoretical analysis ensures the effectiveness of dominant view mining. Extensive experiments and in-depth analysis on real-world and synthetic datasets showcase that BMGC achieves state-of-the-art performance, underscoring its superiority in addressing the view imbalance inherent in multi-relational graphs. The source code and datasets are available at https://github.com/zxlearningdeep/BMGC.

Abstract:
Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at https://github.com/whq-xxh/RVSD.

Abstract:
Detecting hand actions in videos is crucial for understanding video content and has diverse real-world applications. Existing approaches often focus on whole-body actions or coarse-grained action categories, lacking fine-grained hand-action localization information. To fill this gap, we introduce the FHA-Kitchens (Fine-Grained Hand Actions in Kitchen Scenes) dataset, providing both coarse- and fine-grained hand action categories along with localization annotations. This dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories. Evaluation of existing action detection methods on FHA-Kitchens reveals varying generalization capabilities across different granularities. To handle multi-granularity in hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. It incorporates two new designs: Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising. Extensive experiments demonstrate MG-HAD's effectiveness for multi-granularity hand action detection, highlighting the significance of FHA-Kitchens for future research and real-world applications. The dataset and source code are available at MG-HAD.

Abstract:
Detecting diffusion-generated images has recently developed as an emerging research area. Existing diffusion-based datasets predominantly focus on general image generation. However, facial forgeries, which pose severe social risks, have remained less explored thus far. To address this gap, this paper introduces DiFF, a comprehensive dataset dedicated to face-focused diffusion-generated images. DiFF comprises over 500,000 images that are synthesized using thirteen distinct generation methods under four conditions. In particular, this dataset utilizes 30,000 carefully collected textual and visual prompts, ensuring the synthesis of images with both high fidelity and semantic consistency. We conduct extensive experiments on the DiFF dataset via human subject tests and several representative forgery detection methods. The results demonstrate that the binary detection accuracies of both human observers and automated detectors often fall below 30%, revealing insights on the challenges in detecting diffusion-generated facial forgeries. Moreover, our experiments demonstrate that DiFF, compared to previous facial forgery datasets, contains a more diverse and realistic range of forgeries, showcasing its potential to aid in the development of more generalized detectors. Finally, we propose an edge graph regularization approach to effectively enhance the generalization capability of existing detectors.

Abstract:
Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and balancing the trade-off between accuracy and robustness. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance. Codes and models are released at https://github.com/sato-team/Stable-Text-to-Motion-Framework.

Abstract:
Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage: sensitive data are annotated with a concealed label set consisting of "none" and some randomly sampled insensitive labels. In this paper, we show that an unbiased estimator can be established from concealed data under mild assumptions, and that the learned multi-class classifier can not only classify instances from insensitive labels accurately but also recognize instances from the sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels on synthetic and real-world datasets. Source code is available at https://github.com/WilsonMqz/CLF

Abstract:
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.

Abstract:
In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when faced with scenarios where the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach by minimizing the intra-class distance for better exploiting these two features, thereby enhancing the discriminatory power of the extracted features for the fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings. Code is available at https://github.com/dai647/OIParts.

Abstract:
Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in the domain of dense Simultaneous Localization and Mapping (SLAM), also known as dense semantic SLAM. Yet a prerequisite for generating consistent and continuous semantic maps is the availability of dense, efficient, and scalable scene representations. To date, existing semantic SLAM systems based on explicit scene representations (points/meshes/surfels) are limited by their resolutions and their inability to predict unknown areas, thus failing to generate dense maps. In contrast, the few implicit scene representations (Neural Radiance Fields) that address these problems rely on time-consuming ray tracing-based volume rendering, which cannot meet the real-time rendering requirements of SLAM. Fortunately, the Gaussian Splatting scene representation has recently emerged, which inherits the efficiency and scalability of point/surfel representations while smoothly representing geometric structures in a continuous manner, showing promise in addressing the aforementioned challenges. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework, which takes multimodal data as input and can render consistent, continuous dense semantic maps in real-time. To fuse multimodal data, GS3LAM models the scene as a Semantic Gaussian Field (SG-Field), and jointly optimizes camera poses and the field by establishing error constraints between observed and predicted data. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is proposed to tackle the problem of misalignment between scale-invariant Gaussians and geometric surfaces within the SG-Field. To mitigate the forgetting phenomenon, we propose an effective Random Sampling-based Keyframe Mapping (RSKM) strategy, which exhibits notable superiority over local covisibility optimization strategies commonly utilized in 3DGS-based SLAM systems. Extensive experiments conducted on the benchmark datasets reveal that, compared with state-of-the-art competitors, GS3LAM demonstrates increased tracking robustness, superior real-time rendering quality, and enhanced semantic reconstruction precision. To make the results reproducible, the source code is available at https://github.com/lif314/GS3LAM.

Abstract:
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence of the agent's response, which includes both semantic and style knowledge. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, which includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/AI-S2-Lab/GPT-Talker.

Abstract:
Recently, Multimodal Learning (MML) has gained significant interest as it compensates for single-modality limitations through comprehensive complementary information within multimodal data. However, traditional MML methods generally use the joint learning framework with a uniform learning objective that can lead to the modality competition issue, where feedback predominantly comes from certain modalities, limiting the full potential of others. In response to this challenge, this paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities under the premise of avoiding modality competition. Specifically, DI-MML addresses competition by separately training each modality encoder with isolated learning objectives. It further encourages cross-modal interaction via a shared classifier that defines a common feature space and employing a dimension-decoupled unidirectional contrastive (DUC) loss to facilitate modality-level knowledge transfer. Additionally, to account for varying reliability in sample pairs, we devise a certainty-aware logit weighting strategy to effectively leverage complementary information at the instance level during inference. Extensive experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method. The code is released at https://github.com/fanyunfeng-bit/DI-MML.

Abstract:
Shadow, as a natural consequence of light interacting with objects, plays a crucial role in shaping the aesthetics of an image, yet it also impairs content visibility and overall visual quality. Recent shadow removal approaches employ the attention mechanism as a key component because of its effectiveness. However, they often suffer from two issues that hinder practical use: large model size and high computational complexity. To address these shortcomings, this work devises a lightweight yet accurate shadow removal framework. First, we analyze the characteristics of the shadow removal task to identify the key information required for reconstructing shadow regions, and design a novel regional attention mechanism to effectively capture such information. Then, we customize a Regional Attention Shadow Removal Model (RASM), which leverages non-shadow areas to assist in restoring shadow ones. Unlike existing attention-based models, our regional attention strategy allows each shadow region to interact more rationally with its surrounding non-shadow areas, seeking the regional contextual correlation between shadow and non-shadow areas. Extensive experiments demonstrate that our proposed method delivers superior performance over other state-of-the-art models in terms of accuracy and efficiency, making it appealing for practical applications. Our code can be found at https://github.com/CalcuLuUus/RASM.

Abstract:
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in https://flashspeech.github.io/

Abstract:
In the information retrieval scenario, query augmentation is an essential technique to refine semantically imprecise queries to align more closely with users' actual information needs. Traditional methods typically rely on extracting signals from user interactions such as browsing or clicking behaviors to augment the queries, which may not accurately reflect the actual user intent due to inherent noise and the dependency on initial user interactions. To overcome these limitations, we introduce Brain-Aug, a novel approach that decodes semantic information directly from brain signals of users to augment query representation. Brain-Aug builds on three techniques: (i) Structurally, an adapter network is utilized to project brain signals into the embedding space of a language model, allowing query augmentation conditioned on both the users' initial query and their brain signals. (ii) During training, we use a next token prediction task for query augmentation and adopt prompt tuning to efficiently train the brain adapter. (iii) At the inference stage, a ranking-oriented decoding strategy is implemented, enabling Brain-Aug to generate augmentations that improve ranking performance. We evaluate our approach on multiple functional magnetic resonance imaging (fMRI) datasets, demonstrating that Brain-Aug not only produces semantically richer queries but also significantly improves document ranking accuracy, particularly for ambiguous queries. These results validate the effectiveness of Brain-Aug, and reveal the potential of using internal cognitive states to understand and augment text-based queries. Supplementary materials and code are available at https://github.com/YeZiyi1998/Brain-Query-Augmentation.

Abstract:
Research on adversarial attacks is important for AI security because it shows the vulnerability of deep learning models and helps to build more robust models. Adversarial attacks on images are the most widely studied, and include noise-based attacks, image editing-based attacks, and latent space-based attacks. However, the adversarial examples crafted by these methods often lack sufficient semantic information, making it challenging for humans to understand the failure modes of deep learning models under natural conditions. To address this limitation, we propose a natural language induced adversarial image attack method. The core idea is to leverage a text-to-image model to generate adversarial images given input prompts, which are maliciously constructed to lead to misclassification by a target model. To adopt commercial text-to-image models for synthesizing more natural adversarial images, we propose an adaptive genetic algorithm (GA) for optimizing discrete adversarial prompts without requiring gradients and an adaptive word space reduction method for improving query efficiency. We further use CLIP to maintain the semantic consistency of the generated images. In our experiments, we found that some high-frequency semantic information such as "foggy", "humid", and "stretching" can easily cause classifier errors. This adversarial semantic information exists not only in generated images but also in photos captured in the real world. We also found that some adversarial semantic information can be transferred to unknown classification tasks. Furthermore, our attack method can transfer to different text-to-image models (e.g., Midjourney, DALL·E 3) and image classifiers. Our code is available at: https://github.com/zxp555/Natural-Language-Induced-Adversarial-Images.
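
Gradient-free search over discrete prompts, as mentioned above, can be sketched with a plain genetic algorithm. This is only an illustration of the general technique: it does not reproduce the adaptive GA or the word space reduction of the paper, and the `fitness` callable is a hypothetical stand-in for querying a text-to-image model and scoring the target classifier's error. All names and default values are assumptions.

```python
import random


def genetic_search(vocab, prompt_len, fitness, pop_size=20,
                   generations=30, mutation_rate=0.2, seed=0):
    """Plain genetic algorithm over fixed-length word sequences.

    fitness maps a list of words to a number (higher is better). In a real
    attack it would wrap model queries; here it is a hypothetical callable.
    """
    rng = random.Random(seed)
    population = [[rng.choice(vocab) for _ in range(prompt_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half of the population unchanged (elitism).
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # One-point crossover between two parents.
            cut = rng.randrange(1, prompt_len) if prompt_len > 1 else 0
            child = a[:cut] + b[cut:]
            # Mutation: resample each word with a small probability.
            child = [rng.choice(vocab) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness: count occurrences of one word. A real fitness would be far costlier.
vocab = ["foggy", "humid", "sunny", "clear"]
best = genetic_search(vocab, prompt_len=3, fitness=lambda p: p.count("foggy"))
```

Because every fitness call would correspond to a paid or rate-limited model query in practice, population size and generation count directly determine the query budget.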

Abstract:
Learning with Noisy Labels (LNL) poses a significant challenge for the Machine Learning community. Some of the most widely used approaches, which select as clean those samples for which the model itself (the in-training model) has high confidence, e.g., 'small loss', can suffer from the so-called 'self-confirmation' bias. This bias arises because the in-training model is at least partially trained on the noisy labels. Furthermore, in the classification case, an additional challenge arises because some of the label noise is between classes that are visually very similar ('hard noise'). This paper addresses these challenges by proposing a method (CLIPCleaner) that leverages CLIP, a powerful Vision-Language (VL) model, for constructing a zero-shot classifier for efficient, offline, clean sample selection. This has the advantage that the sample selection is decoupled from the in-training model and that the sample selection is aware of the semantic and visual similarities between the classes due to the way that CLIP is trained. We provide theoretical justifications and empirical evidence to demonstrate the advantages of CLIP for LNL compared to conventional pre-trained models. Compared to current methods that combine iterative sample selection with various techniques, CLIPCleaner offers a simple, single-step approach that achieves competitive or superior performance on benchmark datasets. To the best of our knowledge, this is the first time a VL model has been used for sample selection to address the problem of Learning with Noisy Labels (LNL), highlighting the potential of such models in the domain.
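
Offline sample selection with a zero-shot classifier can be sketched as follows. This is an illustrative sketch, not CLIPCleaner itself: the embeddings are placeholder arrays standing in for the outputs of CLIP's image and text encoders, and the selection rule (keep a sample when the zero-shot probability of its given label reaches a threshold) together with the names `zero_shot_probs`, `select_clean`, `temperature`, and `threshold` are assumptions.

```python
import numpy as np

def zero_shot_probs(image_emb, class_text_emb, temperature=0.01):
    """Softmax over cosine similarities between images and class prompts."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (num_images, num_classes)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def select_clean(image_emb, class_text_emb, noisy_labels, threshold=0.5):
    """Boolean mask: True where the given label agrees with the zero-shot view."""
    probs = zero_shot_probs(image_emb, class_text_emb)
    return probs[np.arange(len(noisy_labels)), noisy_labels] >= threshold

# Placeholder embeddings (a real pipeline would obtain these from CLIP).
class_text_emb = np.eye(2)
image_emb = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]])
noisy_labels = np.array([0, 0, 1])
clean_mask = select_clean(image_emb, class_text_emb, noisy_labels)
```

The mask is computed once, before training, and never consults the model being trained, which is the decoupling the abstract refers to.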

Abstract:
The domain generalization (DG) task aims to learn a robust model from source domains that can handle the out-of-distribution (OOD) issue. To improve the generalization ability of the model in unseen domains, increasing the diversity of training samples is an effective solution. However, existing augmentation approaches have some limitations. On the one hand, the augmentation in most DG methods is insufficient: because of its randomness, the model may not see perturbed features that approximate the worst case, so the transferability of features cannot be fully explored. On the other hand, the causality in discriminative features is not considered in these methods, which harms the generalization ability of the model due to spurious correlations. To address these issues, we propose a Dual-stream Feature Augmentation (DFA) method that constructs hard features from two perspectives. Firstly, to improve transferability, we construct targeted features with a domain-related augmentation. Under the guidance of uncertainty, hard cross-domain fictitious features are generated to simulate domain shift. Secondly, to take causality into consideration, the spuriously correlated non-causal information is disentangled by an adversarial mask, so that more discriminative features can be extracted from the hard causal-related information. Different from previous fixed synthesizing strategies, the two augmentations are integrated into a unified learnable feature disentanglement model. Based on these hard features, contrastive learning is employed to keep semantic consistency and improve the robustness of the model. Extensive experiments on several datasets demonstrate that our approach achieves state-of-the-art performance for domain generalization. Our code is available at: https://github.com/alusi123/DFA.

Abstract:
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at https://github.com/lucaspk512/vrdone.
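
To illustrate why a 1D binary mask over frames already carries a relation's temporal boundaries, the sketch below reads the active intervals off such a mask. It is only an illustration of the representation, not part of VrdONE; the helper name `mask_to_intervals` and the half-open interval convention are assumptions.

```python
def mask_to_intervals(mask):
    """Convert a per-frame binary mask into half-open [start, end) intervals."""
    intervals = []
    start = None
    for i, active in enumerate(mask):
        if active and start is None:
            start = i                        # a relation instance begins
        elif not active and start is not None:
            intervals.append((start, i))     # the instance ended at frame i
            start = None
    if start is not None:                    # instance runs to the last frame
        intervals.append((start, len(mask)))
    return intervals

# One predicate category, six frames: active in frames 1-2 and frame 5.
example_intervals = mask_to_intervals([0, 1, 1, 0, 0, 1])
```

Predicting one such mask per relation category yields both the category and its temporal extent in a single output, for short-lived and long-lasting relations alike.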

Abstract:
Exposure correction aims to enhance visual data suffering from improper exposures, which can greatly improve visual quality. However, previous methods mainly focus on the image modality, and the video counterpart is less explored in the literature. Directly applying prior image-based methods to videos results in temporal incoherence with low visual quality. Through thorough investigation, we find that progress in this area is limited by the absence of a benchmark dataset. Therefore, in this paper, we construct the first real-world paired video dataset, including both underexposed and overexposed dynamic scenes. To achieve spatial alignment, we utilize two DSLR cameras and a beam splitter to simultaneously capture improper and normal exposure videos. Additionally, we propose an end-to-end video exposure correction network, in which a dual-stream module is designed to deal with both underexposure and overexposure factors, enhancing the illumination based on Retinex theory. Extensive experiments based on various metrics and user studies demonstrate the significance of our dataset and the effectiveness of our method. The code and dataset are available at https://github.com/kravrolens/VECNet.

Abstract:
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which are generated from the multi-modal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight design and efficiency of our method. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.
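
One way to read "weighting coefficients reorganize the weight update matrices" is as an input-dependent linear combination of a small bank of update matrices added to a frozen weight. The sketch below follows that reading; it is an assumption-laden illustration rather than MMCA's actual formulation, and the names `adapted_weight`, `updates`, and `coeff_proj` are hypothetical.

```python
import numpy as np

def adapted_weight(w, updates, mm_embedding, coeff_proj):
    """Return w + sum_k c_k * updates[k], with c = mm_embedding @ coeff_proj.

    w:            (D_out, D_in) frozen weight of a visual-encoder layer
    updates:      (K, D_out, D_in) bank of weight update matrices (assumed)
    mm_embedding: (E,) multi-modal embedding for the current image-text pair
    coeff_proj:   (E, K) hypothetical linear map producing the K coefficients
    """
    coeffs = mm_embedding @ coeff_proj               # (K,)
    delta = np.tensordot(coeffs, updates, axes=1)    # (D_out, D_in)
    return w + delta

w = np.zeros((2, 3))
updates = np.ones((2, 2, 3))
mm_embedding = np.array([1.0, 0.0])
coeff_proj = np.array([[1.0, 2.0], [0.0, 0.0]])
w_eff = adapted_weight(w, updates, mm_embedding, coeff_proj)
```

Under this reading, the same image paired with two different expressions yields two different effective weights, so the visual features are no longer identical across queries.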

Abstract:
Learning the prior knowledge of the 3D human-object spatial relation is crucial for reconstructing human-object interaction from images and understanding how humans interact with objects in 3D space. Previous works learn this prior from the latest-released human-object interaction dataset collected in controlled environments. However, due to the domain divergence, these methods are limited by the data from which the prior was learned and fail to generalize to real-world data with high diversity. To overcome this limitation, we present a 2D-supervised method that learns the 3D human-object spatial relation prior purely from 2D images in the wild. Our method utilizes a flow-based neural network to learn the prior distribution of the 2D human-object keypoint layout and viewports for each image in the dataset. The effectiveness of the prior learned from 2D images is demonstrated on the human-object reconstruction task by applying the prior to tune the relative pose between the human and the object during the post-optimization stage. To validate and benchmark our method on in-the-wild images, we collect the WildHOI dataset from the YouTube website, which consists of various interactions with 8 objects in real-world scenarios. We conduct experiments on the indoor BEHAVE dataset and the outdoor WildHOI dataset. The results show that our method achieves performance nearly comparable to fully 3D supervised methods on the BEHAVE dataset, even though it uses only 2D layout information, and outperforms previous methods in terms of generality and interaction diversity on in-the-wild images. The code and the dataset are available at https://huochf.github.io/WildHOI/ for research purposes.

Abstract:
Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins. The code is available at https://github.com/yushuntang/DCT.
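
The idea of conditioning the query, key, and value on vectors generated from the class token can be sketched in a few lines. This is a minimal single-head sketch under assumptions: the abstract does not say how the conditioners are incorporated, so adding them as offsets is a guess, and `generator` is a hypothetical linear map standing in for the learned generator network.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditioned_attention(tokens, wq, wk, wv, generator):
    """Single-head self-attention with additive domain conditioners.

    tokens:    (N, D), with tokens[0] taken as the class token
    wq/wk/wv:  (D, D) projection matrices
    generator: (D, 3*D) hypothetical map from class token to (cq, ck, cv)
    """
    d = tokens.shape[1]
    cq, ck, cv = np.split(tokens[0] @ generator, 3)  # three (D,) conditioners
    q = tokens @ wq + cq    # assumed: conditioners enter as additive offsets
    k = tokens @ wk + ck
    v = tokens @ wv + cv
    attn = softmax(q @ k.T / np.sqrt(d))             # (N, N) attention profile
    return attn @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
no_conditioning = np.zeros((8, 24))
out = conditioned_attention(tokens, wq, wk, wv, no_conditioning)
```

With an all-zero generator the layer reduces to ordinary self-attention, so in this sketch only the generator would need to be updated online during test-time adaptation.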

Abstract:
While neural radiance fields (NeRF) have shown promise in novel view synthesis, their implicit representation limits explicit control over object manipulation. Existing research has proposed the integration of explicit geometric proxies to enable deformation. However, these methods face two primary challenges: firstly, the time-consuming and computationally demanding tetrahedralization process; and secondly, handling complex or thin structures often leads to either excessive, storage-intensive tetrahedral meshes or poor-quality ones that impair deformation capabilities. To address these challenges, we propose DeformRF, a method that seamlessly integrates the manipulability of tetrahedral meshes with the high-quality rendering capabilities of feature grid representations. To avoid ill-shaped tetrahedra and tetrahedralization for each object, we propose a two-stage training strategy. Starting with an almost-regular tetrahedral grid, our model initially retains key tetrahedra surrounding the object and subsequently refines object details using finer-granularity mesh in the second stage. We also present the concept of recursively subdivided tetrahedra to create higher-resolution meshes implicitly. This enables multi-resolution encoding while only necessitating the storage of the coarse tetrahedral mesh generated in the first training stage. We conduct a comprehensive evaluation of our DeformRF on both synthetic and real-captured datasets. Both quantitative and qualitative results demonstrate the effectiveness of our method for novel view synthesis and deformation tasks. Project page: https://ustc3dv.github.io/DeformRF/

Abstract:
Drawing freehand sketches of mechanical components on multimedia devices for AI-based engineering modeling has become a new trend. However, its development is being impeded because existing works cannot produce suitable sketches for data-driven research. These works either generate sketches lacking a freehand style or utilize generative models not originally designed for this task, resulting in poor effectiveness. To address this issue, we design a two-stage generative framework mimicking the human sketching behavior pattern, called MSFormer, which is the first to produce human-like freehand sketches tailored for mechanical components. The first stage employs Open CASCADE technology to obtain multi-view contour sketches from mechanical components, filtering perturbing signals for the ensuing generation process. Meanwhile, we design a view selector to simulate viewpoint selection tasks during human sketching for picking out information-rich sketches. The second stage translates contour sketches into freehand sketches by a transformer-based generator. To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. Furthermore, we utilize a CLIP vision encoder and a new loss function incorporating the Hausdorff distance to enhance the generalizability and robustness of the model. Extensive experiments demonstrate that our approach achieves state-of-the-art performance for generating freehand sketches in the mechanical domain. Project page: https://mcfreeskegen.github.io/.
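
For reference, the Hausdorff distance mentioned above measures how far two point sets are from each other in the worst case. The sketch below computes the standard symmetric Hausdorff distance between two sets of 2D points, such as sampled stroke points; it shows the metric only, not the paper's loss function, and the function name `hausdorff` is an assumption.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (N, 2) and b (M, 2)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).max()   # farthest point of a from its nearest b
    b_to_a = d.min(axis=0).max()   # farthest point of b from its nearest a
    return max(a_to_b, b_to_a)

contour = np.array([[0.0, 0.0], [1.0, 0.0]])
sketch = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 3.0]])
dist = hausdorff(contour, sketch)
```

Here every contour point is matched exactly by the sketch, but the stray sketch point at (1, 3) lies 3 units from the nearest contour point, so the distance is 3. This sensitivity to the single worst outlier is what makes the metric useful for penalizing strokes that drift away from the contour.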

Abstract:
We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. Aiming at generating customized characters wearing the target garments with diverse text prompts, the image controllability is the most critical issue, i.e., to preserve the garment details and maintain faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage the joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image to the source garment. Extensive experiments demonstrate that our Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is available at https://github.com/ShineChen1024/MagicClothing.

Abstract:
Modeling in Computer Vision has evolved to MLPs. Vision MLPs naturally lack local modeling capability, for which the simplest remedy is to combine them with convolutional layers. Convolution, known for its sliding-window scheme, suffers from the redundancy and limited parallelism of that scheme. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and parallelizable method to exploit locality. To this end, we propose a new MLP module, namely Shifted-Pillars-Concatenation (SPC), that consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation on the maps to aggregate local features. The SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer. Then, we build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in a hybrid model of sMLPNet. Extensive experiments show Caterpillar's excellent performance on both small-scale and ImageNet-1k classification benchmarks, along with remarkable scalability and transfer capability. The code is available at https://github.com/sunjin19126/Caterpillar.
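
The two SPC steps can be sketched with plain array operations. This is a rough sketch under assumptions, not the released implementation: zero padding at the borders, a shift of one position, and a per-direction linear map that reduces the channels to a quarter (so the concatenation restores the original width) are all guesses, and the names `shift` and `spc` are hypothetical.

```python
import numpy as np

def shift(x, dy, dx):
    """Shift an (H, W, C) map by (dy, dx) with zero padding, no wrap-around."""
    h, w, _ = x.shape
    out = np.zeros_like(x)
    src = x[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def spc(x, weights, step=1):
    """Pillars-Shift followed by Pillars-Concatenation.

    weights: four (C, C // 4) matrices, one per direction (assumed shapes).
    """
    directions = [(-step, 0), (step, 0), (0, -step), (0, step)]  # up/down/left/right
    # Step 1: four neighboring maps; step 2: per-map linear transform.
    maps = [shift(x, dy, dx) @ w for (dy, dx), w in zip(directions, weights)]
    return np.concatenate(maps, axis=-1)     # aggregate along channels

x = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
weights = [np.ones((4, 1)) for _ in range(4)]
y = spc(x, weights)
```

Each output position then mixes information from its four neighbors using only shifts and matrix products, with no sliding window involved.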

Abstract:
The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniques to enhance recommender systems. However, these methods often rely on simplistic random augmentation or intuitive cross-view information, which can introduce irrelevant noise and fail to accurately align the multi-modal context with user-item interaction modeling. To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. The proposed framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning, better aligning multi-modal feature information with collaborative relation modeling. Our approach leverages diffusion models' generative capabilities to automatically generate a user-item graph that is aware of different modalities, enabling the incorporation of useful multi-modal knowledge in modeling user-item interactions. We conduct extensive experiments on three public datasets, demonstrating the superiority of our DiffMM over various competitive baselines.

Abstract:
Recent advancements in machine learning have fueled research on multimodal tasks, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://github.com/oncescuandreea/DTU_text_audio.

Abstract:
Existing blind image quality assessment (BIQA) models are susceptible to biases related to distortion intensity and domain. Intensity bias refers to the relatively accurate perception of severe distortions but larger estimation errors for mild distortions, while domain bias stems from the discrepancies between synthetic and authentic distortion properties. This work introduces a unified learning framework towards addressing these distortion biases. We integrate distortion perception and restoration methods to mitigate intensity bias, where images with minor distortions, which are easily restorable, serve as references for mildly distorted images, while severe distortions benefit directly from distortion perception. The restoration modules employ a combined image-level and feature-level denoising approach, and then an intensity-aware cross-attention mechanism is designed for adaptive handling of intensity bias. To tackle domain bias, we introduce a distortion domain recognition task based on the intrinsic differences between distortion domains and use intra-domain similarity for weighting the quality scores from these domains. Experimental results show that the proposed method achieves state-of-the-art performance on multiple synthetic and authentic distortion datasets. Code and models will be available at https://github.com/xxVENTAZEDxx/Distortion-Debiased-BIQA

Abstract:
Although the diffusion model has achieved remarkable performance in the field of image generation, its high inference delay hinders its wide application in edge devices with scarce computing resources. Therefore, many training-free sampling methods have been proposed to reduce the number of sampling steps required for diffusion models. However, they perform poorly under a very small number of sampling steps. Thanks to the emergence of knowledge distillation technology, existing training-based methods have achieved excellent results at very low step numbers. However, the current methods mainly focus on designing novel diffusion model sampling methods with knowledge distillation. How to transfer better diffusion knowledge from teacher models is a more valuable problem but rarely studied. Therefore, we propose Relational Diffusion Distillation (RDD), a novel distillation method tailored specifically for distilling diffusion models. Unlike existing methods that simply align teacher and student models at pixel level or feature distributions, our method introduces cross-sample relationship interaction during the distillation process and alleviates the memory constraints induced by multiple sample interactions. Our RDD significantly enhances the effectiveness of the progressive distillation framework within the diffusion model. Extensive experiments on several datasets (e.g., CIFAR-10 and ImageNet) demonstrate that our proposed RDD leads to a 1.47 FID decrease with 1 sampling step compared to state-of-the-art diffusion distillation methods and achieves a 256x speed-up compared to the DDIM strategy. Code is available at https://github.com/cantbebetter2/RDD.

Abstract:
Recently, the Segment Anything Model (SAM) has become a research hotspot in the fields of multimedia and computer vision, exhibiting powerful yet versatile capabilities on various (un)conditional image segmentation tasks. Although SAM can support different types of segmentation prompts, we note that, compared to point- and box-guided segmentations, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in its default light-weight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM, one end-to-end and the other layer-wise. With minimal modifications, DITs can directly transform the image encoder of SAM into a stand-alone vision-language learner, in contrast to building another deep fusion branch, maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive benchmark datasets of RIS show that a simple end-to-end DIT can improve SAM by a large margin, while the layer-wise DIT can further boost the performance to state-of-the-art with much lower data and training expenditure. Our code is released at: https://github.com/wysnzzzz/DIT.

Abstract:
Image captioning, which generates natural language descriptions of images, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with humans by statistically fitting existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.

Abstract:
Recently, significant advancements have been made in supporting text-video retrieval by transferring large-scale image-text pre-training models through model adaptation, i.e., full fine-tuning, or prompt tuning, a parameter-efficient fine-tuning strategy. While full fine-tuning involves high computational costs, particularly with increasing model size, prompt tuning offers greater flexibility and efficiency by adjusting only a few learnable parameters. However, current prompt tuning methods rely on coarse visual and textual cues for the text-video retrieval task, neglecting the domain-specific features when performing the adaptation. This approach may lead to sub-optimal performance due to the incorporation of irrelevant and indiscriminate knowledge. To address this issue, we present Multi-grained Prompt Tuning (MPT) for text-video retrieval, which designs a variety of specific prompts to effectively explore semantic interaction across different modalities with diverse granularity. Specifically, we devise a multi-grained video encoder that employs spatial, temporal, and global prompts to transfer the base-generic knowledge from the image-text pre-trained model while comprehensively excavating determinative video-specific characteristics. Meanwhile, we introduce a novel multi-grained text encoder aimed at capturing various levels of textual clues through the utilization of word and phrase prompts. Extensive experiments on four benchmark datasets, i.e., MSR-VTT, ActivityNet, DiDeMo, and LSMDC, demonstrate that MPT achieves outstanding performance, surpassing state-of-the-art methods with negligible computational cost. The codebase is publicly available at: https://github.com/zchoi/MPT.

Abstract:
Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence in long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely-used tracking benchmarks demonstrate the superior performance of our method, showcasing notable enhancements compared to trackers relying solely on visual cues. The code will be available at https://github.com/SkyeSong38/ALTrack.

Abstract:
Remote photoplethysmography (rPPG) is a promising technique for non-contact physiological signal measurement. It has great potential applications in human health monitoring and emotion analysis. However, existing methods for the rPPG task ignore the long-tail phenomenon of physiological signal data, especially in multi-domain joint training. In addition, we find that the long-tail problem of the physiological label (phys-label) exists in different datasets, and the long-tail problem of some domains exists under the same phys-label. To tackle these problems, we propose a hierarchical balanced framework to mitigate the bias caused by domain and phys-label imbalance. Specifically, we propose anti-spurious domain center learning tailored to learning a domain-balanced embedding space. Then, we adopt compact-aware continuity regularization to estimate phys-label-wise imbalances and construct continuity between embeddings. Extensive experiments demonstrate that our method outperforms the state-of-the-art in cross-dataset and intra-dataset settings. Our code is available at https://github.com/pywin/HiBa.

Abstract:
Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given a random noise and neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed to run diffusion models in latent space efficiently. Such a formulation poses a crucial challenge: VFI expects that the output is deterministically equal to the ground truth intermediate frame, but LDMs randomly generate a diverse set of different images when the model runs multiple times. The diversity is due to the large cumulative variance (variance accumulated at each generation step) of generated latent representations in LDMs, making the sampling trajectory random. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations. Our experiments suggest that our method can improve together with the improvement of the autoencoder and achieve state-of-the-art performance in VFI, leaving strong potential for further enhancement. Our code is available at https://github.com/ZonglinL/ConsecutiveBrownianBridge.
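
The variance argument can be made concrete with the standard Brownian bridge, a Brownian motion pinned to fixed values at both ends. The sketch below uses only the textbook bridge between `x0` at time 0 and `xT` at time T, whose marginal has mean `x0 + (t/T)(xT - x0)` and variance `t(T - t)/T`; it is not the paper's consecutive formulation, and the function names are assumptions.

```python
import numpy as np

def bridge_marginal(x0, xT, t, T=1.0):
    """Mean and variance of a standard Brownian bridge at time t in [0, T]."""
    mean = x0 + (t / T) * (xT - x0)
    var = t * (T - t) / T            # zero at both endpoints, peaks at T / 2
    return mean, var

def sample_bridge(x0, xT, t, T=1.0, rng=None):
    """Draw one sample of the bridge state at time t."""
    rng = np.random.default_rng() if rng is None else rng
    mean, var = bridge_marginal(x0, xT, t, T)
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(x0))

x0 = np.zeros(4)                     # stand-in for one deterministic endpoint
xT = np.ones(4)                      # stand-in for the other endpoint
mid_mean, mid_var = bridge_marginal(x0, xT, 0.5)
end_sample = sample_bridge(x0, xT, 1.0, rng=np.random.default_rng(0))
```

An unpinned Brownian motion has variance t, which keeps growing with time, whereas the bridge's variance never exceeds T/4 and vanishes at the endpoints. This is the sense in which starting from deterministic values keeps the accumulated variance small.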

Abstract:
Research on Multi-modal Large Language Models (MLLMs) towards the multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step process in their pipelines: first, extracting visual tokens independently for each input image, and then aligning these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, the independent extraction of visual tokens for each image may result in different semantics being prioritized for different images in the first step, leading to a lack of preservation of linking information among images for subsequent LLM analysis. This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By involving bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis and align the semantics of different images before feeding them into the LLM. As the test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Different from most existing datasets for MLLM fine-tuning, our MmLINK dataset comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task prove the effectiveness of our SAM model, surpassing the state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://mccartney01.github.io/SAM.

Abstract:
Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge: concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which confines customization to limited diversity, i.e., in backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further introduce two metrics, i.e., the Latent Fisher divergence and the Wasserstein metric, to measure the distribution changes of non-customized and customized concepts, respectively. Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training diversities, while preserving non-customized knowledge. Notably, Infusion achieves this with remarkable efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single and multi-concept customized generation. Project page: https://zwl666666.github.io/infusion/.

Abstract:
Vision-language models such as CLIP are capable of mapping data from different modalities into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts. However, most existing methods overlook modality gaps in CLIP's encoded features, i.e., the text and image features lie far apart from each other, resulting in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors utilize local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into their respective subspaces to achieve alignment. Moreover, our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks. Extensive experiments on 11 datasets have demonstrated SSP's superior text-image alignment capabilities, outperforming the state-of-the-art alignment methods. The code is available at https://github.com/zhuhsingyuu/SSP
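
As a rough illustration of what a training-free "matrix calculation" for subspace projection can look like, the sketch below orthogonally projects a feature vector onto the span of a few basis vectors using NumPy's QR decomposition. This is generic linear algebra, not the SSP projectors themselves: how SSP selects local features and builds its subspaces is not stated in the abstract, and `project_onto_span`, the toy shapes, and the assumption of linearly independent basis vectors are all hypothetical.

```python
import numpy as np

def project_onto_span(basis_vectors, x):
    """Orthogonally project x (shape (d,)) onto the span of the rows of basis_vectors (shape (n, d)).

    Assumes the n rows are linearly independent and n <= d.
    """
    basis = np.asarray(basis_vectors, dtype=float)
    # Reduced QR yields orthonormal columns q (d, n) spanning the same subspace.
    q, _ = np.linalg.qr(basis.T)
    return q @ (q.T @ np.asarray(x, dtype=float))

if __name__ == "__main__":
    local_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # stand-in for local image features
    global_feat = np.array([1.0, 2.0, 3.0])                     # stand-in for an image or text feature
    print(project_onto_span(local_feats, global_feat))
```

Projecting twice gives the same result as projecting once, which is a quick sanity check that the operation is a true projection.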

Abstract:
Recent advancements in "deepfake" techniques have paved the way for generating various media forgeries. In response to the potential hazards of these media forgeries, many researchers engage in exploring detection methods, increasing the demand for high-quality media forgery datasets. Despite this, existing datasets have certain limitations. Firstly, most datasets focus on manipulating the visual modality and usually lack diversity, as only a few forgery approaches are considered. Secondly, the quality of media is often inadequate in clarity and naturalness. Meanwhile, the size of the dataset is also limited. Thirdly, it is commonly observed that real-world forgeries are motivated by identity, yet the identity information of the individuals portrayed in these forgeries within existing datasets remains under-explored. For detection, identity information could be an essential clue to boost performance. Moreover, official media concerning relevant identities on the Internet can serve as prior knowledge, aiding both the audience and forgery detectors in determining the true identity. Therefore, we propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots sourced from 324 wild videos of 54 celebrities collected from the Internet. The fake video shots involve 9 types of manipulation across visual, audio, and textual modalities. Additionally, IDForge provides an extra 214,438 real video shots as a reference set for the 54 celebrities. Correspondingly, we propose the Reference-assisted Multimodal Forgery Detection Network (R-MFDN), aiming at the detection of deepfake videos. Through extensive experiments on the proposed dataset, we demonstrate the effectiveness of R-MFDN on the multimedia detection task. The dataset is available at: https://github.com/xyyandxyy/IDForge.

Abstract:
Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.

Abstract:
Due to its high speed and low latency, DVS is frequently employed in motion deblurring. Ideally, high-quality events would adeptly capture intricate motion information. However, real-world events are generally degraded, thereby introducing significant artifacts into the deblurred results. In response to this challenge, we model the degradation of events and propose RDNet to improve the quality of image deblurring. Specifically, we first analyze the mechanisms underlying degradation and simulate paired events based on that analysis. These paired events are then fed into the first stage of RDNet for training the restoration model. The events restored in this stage serve as a guide for the second-stage deblurring process. To better assess the deblurring performance of different methods on real-world degraded events, we present a new real-world dataset named DavisMCR. This dataset incorporates events with diverse degradation levels, collected by manipulating environmental brightness and target object contrast. Our experiments are conducted on synthetic datasets (GOPRO), real-world datasets (REBlur), and the proposed dataset (DavisMCR). The results demonstrate that RDNet outperforms classical event denoising methods in event restoration. Furthermore, RDNet exhibits better performance in deblurring tasks compared to state-of-the-art methods. DavisMCR is available at https://github.com/Yeeesir/DVS_RDNet.

Abstract:
Recent advances in vision-language foundational models, such as CLIP, have demonstrated significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP necessitates a resource-intensive fine-tuning process. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at bolstering the efficacy of downstream tasks. While these approaches incorporate support sets to maintain data distribution consistency between knowledge cache and test sets, they often fall short in terms of generalization on the test set, particularly when faced with test data exhibiting substantial distributional variations. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter adeptly constructs support sets that closely mirror target distributions, utilizing instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method. Our contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization capabilities.

Abstract:
Cross-modal hashing has emerged as a promising technique for retrieving relevant information across distinct media types thanks to its low storage cost and high retrieval efficiency. However, the success of most existing methods heavily relies on large-scale well-annotated datasets, which are costly and scarce in the real world due to ubiquitous labeling noise. To tackle this problem, in this paper, we propose a novel framework, termed Noise Resistance Cross-modal Hashing (NRCH), to learn hashing with noisy labels by overcoming two key challenges, i.e., noise overfitting and error accumulation. Specifically, i) to mitigate the overfitting issue caused by noisy labels, we present a novel Robust Contrastive Hashing loss (RCH) to target homologous pairs instead of noisy positive pairs, thus avoiding overemphasizing noise. In other words, RCH forces the model to focus on more reliable positives instead of unreliable ones constructed from noisy labels, thereby enhancing the robustness of the model against noise; ii) to circumvent error accumulation, a Dynamic Noise Separator (DNS) is proposed to dynamically and accurately separate the clean and noisy samples by adaptively fitting the loss distribution, thus alleviating the adverse influence of noise on iterative training. Finally, we conduct extensive experiments on four widely used benchmarks to demonstrate the robustness of our NRCH against noisy labels for cross-modal retrieval. The code is available at: https://github.com/LonganWANG-cs/NRCH.git.

Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate highly toxic affirmative responses. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the fragility of VLMs and the exigency for new alignment strategies. Codes are available at https://github.com/roywang021/UMK. Disclaimer: This paper contains potentially disturbing and offensive content.

Abstract:
Generating face images with specific gaze information has attracted considerable attention in recent years. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting their application. In this paper, we present a novel gaze-controllable face generation task that overcomes these limitations. Our approach takes textual descriptions of human gaze and head behavior as input and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and an eye segmentation map. It provides a structured and detailed foundation for generating facial images. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate the face sketch from a text description. Experiments on the FFHQ dataset show the effectiveness of our method. Our dataset is available at https://github.com/hengfei-wang/TextGaze.

Abstract:
Multi-view learning methods often focus on improving decision accuracy, while neglecting the decision uncertainty, limiting their suitability for safety-critical applications. To mitigate this, researchers propose trusted multi-view learning methods that estimate classification probabilities and uncertainty by learning the class distributions for each instance. However, these methods assume that the data from each view can effectively differentiate all categories, ignoring the semantic vagueness phenomenon in real-world multi-view data. Our findings demonstrate that this phenomenon significantly suppresses the learning of view-specific evidence in existing methods. We propose a Consistent and Complementary-aware trusted Multi-view Learning (CCML) method to solve this problem. We first construct view opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Next, we dynamically decouple the consistent and complementary evidence. The consistent evidence is derived from the shared portions across all views, while the complementary evidence is obtained by averaging the differing portions across all views. We ensure that the opinion constructed from the consistent evidence strictly aligns with the ground-truth category. For the opinion constructed from the complementary evidence, we allow for potential vagueness in the evidence. We compare CCML with state-of-the-art baselines on one synthetic and six real-world datasets. The results validate the effectiveness of the dynamic evidence decoupling strategy and show that CCML significantly outperforms baselines on accuracy and reliability. The code is released at https://github.com/Lihong-Liu/CCML.
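
For readers unfamiliar with "opinions" made of belief masses and an uncertainty estimate, the following sketch shows the usual subjective-logic construction from non-negative evidence, as commonly used in evidential deep learning. It assumes this standard Dirichlet-based form; whether CCML uses exactly this mapping, and how it decouples consistent from complementary evidence, is not specified in the abstract. The function name and the toy evidence values are hypothetical.

```python
import numpy as np

def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Uses Dirichlet parameters alpha = evidence + 1 with strength S = sum(alpha),
    so that belief_k = evidence_k / S and uncertainty = K / S.
    """
    evidence = np.asarray(evidence, dtype=float)
    num_classes = evidence.shape[-1]
    strength = (evidence + 1.0).sum(axis=-1, keepdims=True)
    belief = evidence / strength
    uncertainty = (num_classes / strength).squeeze(-1)
    return belief, uncertainty

if __name__ == "__main__":
    for ev in ([9.0, 0.0, 0.0], [0.0, 0.0, 0.0]):
        b, u = opinion_from_evidence(ev)
        print(ev, b, float(u), float(b.sum() + u))
```

Belief masses and uncertainty always sum to one, so a view with no evidence at all yields full uncertainty rather than a confident but arbitrary prediction.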

Abstract:
Artwork analysis is an important and fundamental skill for art appreciation, which could enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging due to their subjective nature, diverse interpretations, and complex visual elements, requiring expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model ability, previous works on automatically analyzing artworks mainly focus on classification, retrieval, and other simple tasks, which is far from the goal of AI. To facilitate research progress, in this paper, we go a step further and compose comprehensive analyses, inspired by the remarkable perception and generation ability of large multimodal models. Specifically, we first propose the task of composing paragraph analysis for artworks, i.e., paintings in this paper, focusing only on visual characteristics to formulate a more comprehensive understanding of artworks. To support the research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a superior large multimodal model for composing painting analysis, dubbed GalleryGPT, which is slightly modified from and fine-tuned on the LLaVA architecture using our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its superb ability in art analysis and generalization. The codes and model are available at: https://github.com/steven640pixel/GalleryGPT.

Abstract:
Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interactions. In this paper, we innovatively propose the Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal and cross-task interactions. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale to the number of layers and tasks, we propose low-rank interaction-augmented decomposition to avoid memory explosion while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ a visual analysis and identify that different tasks have clear distinctions in proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method. Code is available at https://github.com/Kelvin-ywc/LPI.

Abstract:
Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making these methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy to separate the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information for HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to the recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while ensuring high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and real-world 4K (2160 × 3840) resolution images. Extensive experiments show that our HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy. The code is publicly available at https://github.com/gangweiX/HCVFlow.
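
The quartic growth mentioned above can be checked with a few lines of arithmetic. The sketch below counts the entries of a dense all-pairs cost volume, which stores one matching cost for every pair of feature-map pixels. The 1/8 feature resolution and 4-byte float entries are assumptions chosen for illustration (they are common in recurrent flow networks but are not stated in the abstract), and the function name is hypothetical.

```python
def all_pairs_cost_volume_bytes(height, width, downsample=8, bytes_per_entry=4):
    """Memory of a dense (h*w) x (h*w) cost volume at a downsampled feature resolution.

    downsample and bytes_per_entry are illustrative assumptions, not values from the paper.
    """
    h, w = height // downsample, width // downsample
    return (h * w) ** 2 * bytes_per_entry

if __name__ == "__main__":
    for size in ((1080, 1920), (2160, 3840)):
        print(size, all_pairs_cost_volume_bytes(*size) / 1e9, "GB")
```

Under these assumptions a single 4K (2160 × 3840) pair needs roughly 67 GB for the cost volume alone, and doubling both image dimensions multiplies the cost by 16.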

Abstract:
Few-Shot Class-Incremental Learning has shown remarkable efficacy in efficiently learning new concepts with limited annotations. Nevertheless, the heuristic few-shot annotations may not always cover the most informative samples, which largely restricts the capability of the incremental learner. We aim to start from a pool of large-scale unlabeled data and then annotate the most informative samples for incremental learning. Based on this premise, this paper introduces Active Class-Incremental Learning (ACIL). The objective of ACIL is to select the most informative samples from the unlabeled pool to effectively train an incremental learner, aiming to maximize the performance of the resulting model. Note that vanilla active learning algorithms suffer from class-imbalanced distribution among annotated samples, which restricts the ability of incremental learning. To achieve both class balance and informativeness in chosen samples, we propose the Class-Balanced Selection (CBS) strategy. Specifically, we first cluster the features of all unlabeled images into multiple groups. Then, for each cluster, we employ a greedy selection strategy to ensure that the Gaussian distribution of the sampled features closely matches the Gaussian distribution of all unlabeled features within the cluster. Our CBS can be plugged into CIL methods that are based on pretrained models with prompt tuning. Extensive experiments under the ACIL protocol across five diverse datasets demonstrate that CBS outperforms both random selection and other SOTA active learning approaches.

Abstract:
Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training (VLP) models to subtle yet intentionally designed perturbations in images and texts. Investigating multimodal systems' robustness via adversarial attacks is crucial in this field. Most multimodal attacks are sample-specific, generating a unique perturbation for each sample to construct adversarial samples. To the best of our knowledge, this is the first work to explore, through multimodal decision boundaries, the creation of a universal, sample-agnostic perturbation that applies to any image. Initially, we explore strategies to move sample points beyond the decision boundaries of linear classifiers, refining the algorithm to ensure successful attacks under the top-k accuracy metric. Based on this foundation, in visual-language tasks, we treat visual and textual modalities as reciprocal sample points and decision hyperplanes, guiding image embeddings to traverse text-constructed decision boundaries, and vice versa. This iterative process consistently refines a universal perturbation, ultimately identifying a singular direction within the input space which is exploitable to impair the retrieval performance of VLP models. The proposed algorithms support the creation of global perturbations or adversarial patches. Comprehensive experiments validate the effectiveness of our method, showcasing its data, task, and model transferability across various VLP models and datasets. Code: https://github.com/LibertazZ/MUAP

Abstract:
Specular highlight removal plays a pivotal role in multimedia applications, as it enhances the quality and interpretability of images and videos, ultimately improving the performance of downstream tasks such as content-based retrieval, object recognition, and scene understanding. Despite significant advances in deep learning-based methods, current state-of-the-art approaches often rely on additional priors or supervision, limiting their practicality and generalization capability. In this paper, we propose the Dual-Hybrid Attention Network for Specular Highlight Removal (DHAN-SHR), an end-to-end network that introduces novel hybrid attention mechanisms to effectively capture and process information across different scales and domains without relying on additional priors or supervision. DHAN-SHR consists of two key components: the Adaptive Local Hybrid-Domain Dual Attention Transformer (L-HD-DAT) and the Adaptive Global Dual Attention Transformer (G-DAT). The L-HD-DAT captures local inter-channel and inter-pixel dependencies while incorporating spectral domain features, enabling the network to effectively model the complex interactions between specular highlights and the underlying surface properties. The G-DAT models global inter-channel relationships and long-distance pixel dependencies, allowing the network to propagate contextual information across the entire image and generate more coherent and consistent highlight-free results. To evaluate the performance of DHAN-SHR and facilitate future research in this area, we compile a large-scale benchmark dataset comprising a diverse range of images with varying levels of specular highlights. Through extensive experiments, we demonstrate that DHAN-SHR outperforms 18 state-of-the-art methods both quantitatively and qualitatively, setting a new standard for specular highlight removal in multimedia applications. The code and dataset are available at https://github.com/CXH-Research/DHAN-SHR.

Abstract:
Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. Code: https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur

Abstract:
This paper presents a pilot study that explores the application of active learning, traditionally studied in the context of discriminative models, to generative models. We specifically focus on image synthesis personalization tasks. The primary challenge in conducting active learning on generative models lies in the open-ended nature of querying, which differs from the closed form of querying in discriminative models that typically target a single concept. We introduce the concept of anchor directions to transform the querying process into a semi-open problem. We propose a direction-based uncertainty sampling strategy to enable generative active learning and tackle the exploitation-exploration dilemma. Extensive experiments are conducted to validate the effectiveness of our approach, demonstrating that an open-source model can achieve superior performance compared to closed-source models developed by large companies, such as Google's StyleDrop. The source code is available at https://github.com/zhangxulu1996/GAL4Personalization.

Abstract:
Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation.

Abstract:
We present Sophia-in-Audition (SiA), a new frontier in virtual production, by employing the humanoid robot Sophia within an UltraStage environment composed of a controllable lighting dome coupled with multiple cameras. We demonstrate Sophia's capability to replicate iconic film segments, follow real performers, and perform a variety of motions and expressions, showcasing her versatility as a virtual actor. Key to this process is the integration of facial motion transfer algorithms and the UltraStage's controllable lighting and multi-camera setup, enabling dynamic performances that align with the director's vision. Our comprehensive user studies indicate positive audience reception towards Sophia's performances, highlighting her potential to reduce the uncanny valley effect in virtual acting. Additionally, the immersive lighting in dynamic clips was highly rated for its naturalness and its ability to mirror professional film standards. The paper presents a first-of-its-kind multi-view robot performance video dataset with dynamic lighting, offering valuable insights for future enhancements in humanoid robotic performers and virtual production techniques. This research contributes significantly to the field by presenting a unique virtual production setup, developing tools for sophisticated performance control, and providing a comprehensive dataset and user study analysis for diverse applications.

Abstract:
This paper presents Team Xaiofei's innovative approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages in face-voice matching by building upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and score polarization strategy. Our dual-branch structure serves as an auxiliary mechanism to better integrate and provide more comprehensive information. We also introduce a dynamic weighting mechanism for various sample pairs to optimize learning. Data augmentation techniques are employed to enhance the model's generalization across diverse conditions. Additionally, score polarization strategy based on age and gender matching confidence clarifies and accentuates the final results. Our methods demonstrate significant effectiveness, achieving an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset. Project page: https://github.com/cnzvan/Exploring-Robust-Face-Voice-Matching-in-Multilingual-Environments.
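
Since the results above are reported as an equal error rate (EER), a short sketch of how an EER can be computed from verification scores may help. This is a simple threshold sweep in NumPy, not the challenge's official scoring script; the function name, the convention that higher scores mean "same identity", and the toy scores are assumptions for illustration. The function returns a fraction in [0, 1], which would be multiplied by 100 to compare with percentage-style figures.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where false accepts and false rejects balance.

    scores: similarity per face-voice pair (higher = more likely a match, assumed).
    labels: 1 for genuine (matching) pairs, 0 for impostor pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = np.mean(accepted[~labels])   # impostor pairs wrongly accepted
        frr = np.mean(~accepted[labels])   # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

if __name__ == "__main__":
    print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
    print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

With few pairs the two error curves rarely cross exactly, so the sketch reports the average of the two rates at the closest threshold; production tools typically interpolate along the ROC curve instead.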

Abstract:
The innate correlation between a person's face and voice has recently emerged as a compelling area of study, especially within the context of multilingual environments. This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge, focusing on a contrastive learning-based chaining-cluster method to enhance face-voice association. This task involves the challenges of building biometric relations between auditory and visual modality cues and modelling the prosody interdependence between different languages while addressing both intrinsic and extrinsic variability present in the data. To handle these non-trivial challenges, our method employs supervised cross-contrastive (SCC) learning to establish robust associations between voices and faces in multi-language scenarios. Following this, we have specifically designed a chaining-cluster-based post-processing step to mitigate the impact of outliers often found in unconstrained in-the-wild data. We conducted extensive experiments to investigate the impact of language on face-voice association. The overall results were evaluated on the FAME public evaluation platform, where we achieved 2nd place. The results demonstrate the superior performance of our method, and we validate the robustness and effectiveness of our proposed approach. Code is available at https://github.com/colaudiolab/FAME24_solution.

Abstract:
Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient has been a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a developed test field, limitations of VQA evaluation, like the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. As characteristics of VQA have made such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, to systematically analyze the behaviour and compare the performance of various evaluators including LLM-based ones, we propose three key properties, i.e., Alignment, Consistency and Generalization, and a corresponding dataset Assessing VQA Evaluators (AVE) to facilitate analysis. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs. Related code and data are available at https://github.com/jihuishan/flexible_evaluation_for_vqa_mm24.

Abstract:
Recent advances in large multi-modality models (LMMs) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset, a novel resource containing 100k triplets of (image, quality text, distortion segmentation) to facilitate deep investigations into visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT4V, which helps improve the robustness of model training while also reducing the costs of data collection. With the QGround-100K dataset, we propose an LMM-based method equipped with multi-scale feature learning to learn models capable of performing both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to interactively respond to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis at a finer scale, establishing a new benchmark for future research in the area. Codes and dataset are available at https://github.com/Q-Future/Q-Ground.

Abstract:
Image captioning evaluation metrics can be divided into two categories, reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their solution of global image-text compatibility, often have a deficiency in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, breaking through the barriers of the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics like CLIP-S and PAC-S, and reference-based metrics like METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments. Our code is available at https://github.com/joeyz0z/HICE.

Abstract:
To address the occlusion issues in person Re-Identification (ReID) tasks, many methods have been proposed to extract part features by introducing external spatial information. However, due to missing part appearance information caused by occlusion and noisy spatial information from external models, these purely vision-based approaches fail to correctly learn the features of human body parts from limited training data and struggle to accurately locate body parts, ultimately leading to misaligned part features. To tackle these challenges, we propose a Prompt-guided Feature Disentangling method (ProFD), which leverages the rich pre-trained knowledge in the textual modality to help the model generate well-aligned part features. ProFD first designs part-specific prompts and utilizes noisy segmentation masks to preliminarily align visual and textual embeddings, enabling the textual prompts to have spatial awareness. Furthermore, to alleviate the noise from external masks, ProFD adopts a hybrid-attention decoder, ensuring spatial and semantic consistency during the decoding process to minimize noise impact. Additionally, to avoid catastrophic forgetting, we employ a self-distillation strategy, retaining pre-trained knowledge of CLIP to mitigate over-fitting. Evaluation results on the Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-ReID, and P-DukeMTMC datasets demonstrate that ProFD achieves state-of-the-art results.

Abstract:
Drone-based crowd tracking faces difficulties in accurately identifying and monitoring objects from an aerial perspective, largely due to their small size and close proximity to each other, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It specifically addresses the problem of cross-frame motion to enhance tracking accuracy and dependability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from the visual-language model, integrating appearance with motion cues. The framework utilizes the Hungarian algorithm to ensure the accurate matching of individuals across frames. Demonstrated on DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in scenarios captured by drones. Our code will be available at: https://github.com/Zebrabeast/DenseTrack.
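
The cross-frame matching step mentioned above is an assignment problem: each tracked individual in one frame is paired with at most one detection in the next frame so that the total cost is minimal. The sketch below solves a tiny instance by exhaustive search, which yields the same optimal pairing the Hungarian algorithm would find but only scales to a handful of objects. The Euclidean position cost and all names are assumptions for illustration; DenseTrack's actual cost combines motion and appearance cues that the abstract does not specify.

```python
from itertools import permutations
from math import dist


def match_by_position(prev_points, curr_points):
    """Return a list m where prev_points[i] is matched to curr_points[m[i]].

    Brute-force stand-in for the Hungarian algorithm; assumes both frames
    contain the same number of points.
    """
    n = len(prev_points)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(dist(prev_points[i], curr_points[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


# Hypothetical head positions in two consecutive frames.
prev_frame = [(0, 0), (10, 10), (20, 0)]
curr_frame = [(19, 1), (1, 1), (11, 9)]
print(match_by_position(prev_frame, curr_frame))
```

For realistic crowd sizes a polynomial-time solver is needed, since exhaustive search grows factorially with the number of objects.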

Abstract:
Nowadays, misinformation is widely spreading over various social media platforms and causes extremely negative impacts on society. To combat this issue, automatically identifying misinformation, especially that containing multimodal content, has attracted growing attention from the academic and industrial communities, and has given rise to an active research topic named Multimodal Misinformation Detection (MMD). Typically, existing MMD methods capture the semantic correlation and inconsistency between multiple modalities, but neglect some potential clues in multimodal content. Recent studies suggest that manipulated traces of the images in articles are non-trivial clues for detecting misinformation. Meanwhile, we find that the underlying intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Accordingly, in this work, we propose to detect misinformation by learning manipulation features that indicate whether the image has been manipulated, as well as intention features regarding the harmful and harmless intentions of the manipulation. Unfortunately, the manipulation and intention labels that make these features discriminative are unknown. To overcome the problem, we propose two weakly supervised signals as alternatives by introducing additional datasets on image manipulation detection and formulating two classification tasks as positive and unlabeled learning problems. Based on these ideas, we propose a novel MMD method, namely Harmfully Manipulated Images Matter in MMD (Hami-m3d). Extensive experiments across three benchmark datasets demonstrate that Hami-m3d consistently improves the performance of any MMD baseline.

Abstract:
Gaze target detection aims at determining the image location where a person is looking. While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor. In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during the training phase. AL-GTD achieves the best of all AUC results by utilizing only 40-50% of the training data, in contrast to state-of-the-art (SOTA) gaze target detectors requiring the entire training dataset to achieve the same performance. Importantly, AL-GTD quickly reaches satisfactory performance with 10-20% of the training data, showing the effectiveness of our acquisition function, which is able to acquire the most informative samples. We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime. Code is available at: https://github.com/francescotonini/al-gtd.

Abstract:
The key challenge of cross-modal domain-incremental learning (DIL) is to enable the learning model to continuously learn from novel data with different feature distributions under the same task without forgetting old ones. However, existing top-performing methods still cause high forgetting rates, by lacking intra-domain knowledge extraction and inter-domain common prompting strategy. In this paper, we propose a simple yet effective framework, CP-Prompt, by training limited parameters to instruct a pre-trained model to learn new domains and avoid forgetting existing feature distributions. CP-Prompt captures intra-domain knowledge by compositionally inserting personalized prompts into multi-head self-attention layers and then learns the inter-domain knowledge with a common prompting strategy. CP-Prompt shows superiority compared with state-of-the-art baselines among three widely evaluated DIL tasks. The source code is available at https://github.com/dannis97500/CP_Prompt.

Abstract:
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%. Our audio samples are available at https://groot-gaw.github.io/.

Abstract:
Domain generalization (DG) aims at solving distribution shift problems in various scenes. Existing approaches are based on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), which suffer from limited receptive fields or quadratic complexity issues. Mamba, as an emerging state space model (SSM), possesses superior linear complexity and global receptive fields. Despite this, it can hardly be applied to DG to address distribution shifts, due to the hidden state issues and inappropriate scan mechanisms. In this paper, we propose a novel framework for DG, named DGMamba, that excels in strong generalizability toward unseen domains and meanwhile has the advantages of global receptive fields and efficient linear complexity. Our DGMamba comprises two core components: Hidden State Suppressing (HSS) and Semantic-aware Patch Refining (SPR). In particular, HSS is introduced to mitigate the influence of hidden states associated with domain-specific features during output prediction. SPR strives to encourage the model to concentrate more on objects rather than context, consisting of two designs: Prior-Free Scanning (PFS) and Domain Context Interchange (DCI). Concretely, PFS aims to shuffle the non-semantic patches within images, creating more flexible and effective sequences from images, and DCI is designed to regularize Mamba with the combination of mismatched non-semantic and semantic information by fusing patches among domains. Extensive experiments on four commonly used DG benchmarks demonstrate that the proposed DGMamba achieves remarkably superior results to state-of-the-art models. The code will be made publicly available at https://github.com/longshaocong/DGMamba.
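
The idea behind Prior-Free Scanning, shuffling only the non-semantic patches while leaving object patches in place, can be sketched as follows. This is a hypothetical illustration rather than the DGMamba implementation: the patch labels, the boolean mask, and the function name are assumptions, and the abstract does not say how patches are classified as semantic or non-semantic.

```python
import random


def shuffle_non_semantic(patches, is_semantic, seed=None):
    """Permute only the patches flagged as non-semantic.

    Semantic (object) patches keep their original sequence positions.
    """
    rng = random.Random(seed)
    background_idx = [i for i, flag in enumerate(is_semantic) if not flag]
    source_idx = background_idx[:]
    rng.shuffle(source_idx)
    out = list(patches)
    for dst, src in zip(background_idx, source_idx):
        out[dst] = patches[src]
    return out


# Hypothetical patch sequence; the two "dog" patches are the semantic ones.
patch_sequence = ["sky0", "dog1", "dog2", "grass3", "sky4", "grass5"]
semantic_mask = [False, True, True, False, False, False]
print(shuffle_non_semantic(patch_sequence, semantic_mask, seed=0))
```

The output sequence contains the same patches as the input, with "dog1" and "dog2" still at positions 1 and 2.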

Abstract:
Face super-resolution aims to reconstruct a high-resolution face image from a low-resolution face image. Previous methods typically employ an encoder-decoder structure to extract facial structural features, where the direct downsampling inevitably introduces distortions, especially to high-frequency features such as edges. To address this issue, we propose a wavelet-based feature enhancement network, which mitigates feature distortion by losslessly decomposing the input feature into high and low-frequency components using the wavelet transform and processing them separately. To improve the efficiency of facial feature extraction, a full domain Transformer is further proposed to enhance local, regional, and global facial features. Such designs allow our method to perform better without stacking many modules as previous methods did. Experiments show that our method effectively balances performance, model size, and speed. Code link: https://github.com/PRIS-CV/WFEN.
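
To make the "lossless decomposition" claim concrete, the sketch below applies a one-level Haar transform to a 1D signal and reconstructs it exactly. It is only an illustration under stated assumptions: it uses the unnormalized average/difference form of the Haar wavelet on a list of numbers, whereas the network operates on 2D feature maps and the abstract does not name the wavelet it uses.

```python
def haar_decompose(signal):
    """Split an even-length sequence into low- and high-frequency halves."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) / 2 for a, b in pairs]   # local averages (coarse structure)
    high = [(a - b) / 2 for a, b in pairs]  # local differences (edges, detail)
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose: each (low, high) pair yields two samples."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


# Hypothetical 1D feature row.
row = [4, 2, 7, 7, 0, 10]
low_band, high_band = haar_decompose(row)
print(low_band, high_band)
print(haar_reconstruct(low_band, high_band))
```

Unlike strided downsampling, which simply discards samples, the two half-length bands together retain all the information in the input, so each can be processed separately without losing detail.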

Abstract:
In this work, we propose a hyperparameter optimization method named HyperTime to find hyperparameters robust to potential temporal distribution shifts in the unseen test data. Our work is motivated by an important observation that it is, in many cases, possible to achieve temporally robust predictive performance via hyperparameter optimization. Based on this observation, we leverage the 'worst-case-oriented' philosophy from the robust optimization literature to help find such robust hyperparameter configurations. HyperTime imposes a lexicographic priority order on average validation loss and worst-case validation loss over chronological validation sets. We perform a theoretical analysis on the upper bound of the expected test loss, which reveals the unique advantages of our approach. We also demonstrate the strong empirical performance of the proposed method on multiple machine learning tasks with temporal distribution shifts. The algorithm is available at https://microsoft.github.io/FLAML/.
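
One way to read the lexicographic priority described above is sketched below: first shortlist the configurations whose average validation loss is within a tolerance of the best average, then pick the one with the lowest worst-case loss across the chronological validation sets. The priority order, the tolerance parameter, and all names here are assumptions for illustration; the abstract does not spell them out, and this is not the FLAML API.

```python
def select_config(fold_losses, tolerance=0.0):
    """Pick a configuration by a two-level lexicographic rule.

    fold_losses maps a configuration name to its validation losses on
    chronologically ordered validation sets.
    Level 1 (assumed): average loss within `tolerance` of the best average.
    Level 2 (assumed): lowest worst-case loss among the shortlisted configs.
    """
    stats = {
        name: (sum(losses) / len(losses), max(losses))
        for name, losses in fold_losses.items()
    }
    best_avg = min(avg for avg, _ in stats.values())
    shortlist = [n for n, (avg, _) in stats.items() if avg <= best_avg + tolerance]
    return min(shortlist, key=lambda n: (stats[n][1], stats[n][0]))


# Hypothetical losses on three chronological validation sets.
candidates = {
    "config_a": [0.10, 0.10, 0.40],  # best average, but degrades on the latest set
    "config_b": [0.20, 0.21, 0.22],  # slightly worse average, stable over time
    "config_c": [0.50, 0.50, 0.50],
}
print(select_config(candidates, tolerance=0.0))
print(select_config(candidates, tolerance=0.02))
```

With zero tolerance the rule reduces to picking the best average ("config_a"); a small tolerance lets the temporally stable "config_b" win.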

Abstract:
Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or necessitate the utilization of high-resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that the local self-attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (i.e. feature reassembly of neighboring points). Therefore, we introduce local self-attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self-attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self-attention, thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU utilizes the feature of queries to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA-AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA-AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA-AQU consistently outperforms previous state-of-the-art upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU compared to the baseline models in the aforementioned four tasks, respectively.

Abstract:
Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, outperforming other baselines significantly (+25.15% on average in Attack Success Rate). Our codes can be found in https://github.com/Veee9/BadLANE.

Abstract:
Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality of the prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which allows prompting the foundation model with various modalities to segment objects precisely. We further propose the Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising the generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Code: https://github.com/PinxueGuo/X-Prompt.git

Abstract:
Image matching aims at identifying corresponding points between a pair of images. Currently, detector-free methods have shown impressive performance in challenging scenarios, thanks to their capability of generating dense matches and their global receptive field. However, performing feature interaction and proposing matches across the entire image is unnecessary, because not all image regions contribute to the matching process. Interacting and matching in unmatchable areas can introduce errors, reducing matching accuracy and efficiency. Meanwhile, the scale discrepancy issue still troubles existing methods. To address the above issues, we propose PRogressive dependency maxImization for Scale-invariant image Matching (PRISM), which jointly prunes irrelevant patch features and tackles the scale discrepancy. To do this, we first present a Multi-scale Pruning Module (MPM) to adaptively prune irrelevant features by maximizing the dependency between the two feature sets. Moreover, we design the Scale-Aware Dynamic Pruning Attention (SADPA) to aggregate information from different scales via a hierarchical design. Our method's superior matching performance and generalization capability are confirmed by leading accuracy across various evaluation benchmarks and downstream tasks. The code is publicly available at https://github.com/Master-cai/PRISM.

Abstract:
Contrastive deep graph clustering (CDGC) leverages the power of contrastive learning to group nodes into different clusters. The quality of contrastive samples is crucial for achieving better performance, making augmentation techniques a key factor in the process. However, the augmentation samples in existing methods are always predefined by human experience and agnostic to the downstream clustering task, thus leading to high human resource costs and poor performance. To overcome these limitations, we propose a Graph Node Clustering with Fully Learnable Augmentation, termed GraphLearner. It introduces learnable augmentors to generate high-quality and task-specific augmented samples for CDGC. GraphLearner incorporates two learnable augmentors specifically designed for capturing attribute and structural information. Moreover, we introduce two refinement matrices, including the high-confidence pseudo-label matrix and the cross-view sample similarity matrix, to enhance the reliability of the learned affinity matrix. During the training procedure, we notice the distinct optimization goals for training learnable augmentors and contrastive learning networks. In other words, we should guarantee both the consistency of the embeddings and the diversity of the augmented samples. To address this challenge, we propose an adversarial learning mechanism within our method. Besides, we leverage a two-stage training strategy to refine the high-confidence matrices. Extensive experimental results on six benchmark datasets validate the effectiveness of GraphLearner. The code and appendix of GraphLearner are available on GitHub at https://github.com/xihongyang1999/GraphLearner.

Abstract:
Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap. Our code and dataset are publicly accessible at https://github.com/siruzhong/UrbanCross.

Abstract:
Online person re-identification services face privacy breaches from potential data leakage and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propose an iterative method (PixelFade) to optimize pedestrian images into noise-like images to resist recovery attacks. We first give an in-depth study of protected images from previous privacy methods, which reveals that the chaos of protected images can disrupt the learning of recovery models. Specifically, we propose a Noise-guided Objective Function with the feature constraints of a specific authorization model, optimizing pedestrian images to normal-distributed noise images while preserving their original identity information as per the authorization model. To solve the above non-convex optimization problem, we propose a heuristic optimization algorithm that alternately performs the Constraint Operation and the Partial Replacement Operation. This strategy not only ensures that original pixels are replaced with noise to protect privacy, but also guides the images towards an improved optimization direction to effectively preserve discriminative features. Extensive experiments demonstrate that our PixelFade outperforms previous methods both in resisting recovery attacks and in Re-ID performance.

Abstract:
With the evolution of Text-to-Image (T2I) models, the quality defects of AI-Generated Images (AIGIs) pose a significant barrier to their widespread adoption. In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module. Based on the mechanisms of the Human Visual System (HVS) and syntax trees, the first two indicators can respectively identify the perception and alignment deficiencies, and the last module can apply targeted quality enhancement accordingly. Extensive experimentation reveals that when compared to alternative optimization methods, AIGIs after G-Refine outperform in 10+ quality metrics across 4 databases. This improvement significantly contributes to the practical application of contemporary T2I models, paving the way for their broader adoption. The code will be released on https://github.com/Q-Future/Q-Refine.

Abstract:
Real-world navigation often involves dealing with unexpected obstructions such as closed doors, moved objects, and unpredictable entities. However, mainstream Vision-and-Language Navigation (VLN) tasks typically assume instructions perfectly align with the fixed and predefined navigation graphs without any obstructions. This assumption overlooks potential discrepancies in actual navigation graphs and given instructions, which can cause major failures for both indoor and outdoor agents. To address this issue, we integrate diverse obstructions into the R2R dataset by modifying both the navigation graphs and visual observations, introducing an innovative dataset and task, R2R with UNexpected Obstructions (R2R-UNO). R2R-UNO contains various types and numbers of path obstructions to generate instruction-reality mismatches for VLN research. Experiments on R2R-UNO reveal that state-of-the-art VLN methods inevitably encounter significant challenges when facing such mismatches, indicating that they rigidly follow instructions rather than navigate adaptively. Therefore, we propose a novel method called ObVLN (Obstructed VLN), which includes a curriculum training strategy and virtual graph construction to help agents effectively adapt to obstructed environments. Empirical results show that ObVLN not only maintains robust performance in unobstructed scenarios but also achieves a substantial performance advantage with unexpected obstructions. The source code is available at https://github.com/honghd16/ObstructedVLN.

Abstract:
Humans exhibit a remarkable ability to learn quickly from a limited number of labeled samples, a capability that starkly contrasts with that of current machine learning systems. Unsupervised Few-Shot Learning (U-FSL) seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases. In this work, we first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks. Our findings highlight the respective limitations of MIM and CL in terms of discriminative and generalization abilities, which contribute to their underperformance in U-FSL contexts. To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM). MICM creatively combines the targeted object learning strength of CL with the generalized visual feature learning capability of MIM, significantly enhancing its efficacy in downstream few-shot learning inference. Extensive experimental analyses confirm the advantages of MICM, demonstrating significant improvements in both generalization and discrimination capabilities for few-shot learning. Our comprehensive quantitative evaluations further substantiate the superiority of MICM, showing that our two-stage U-FSL framework based on MICM markedly outperforms existing leading baselines. The repository of this project is available at https://github.com/iCGY96/MICM.

Abstract:
Although large multi-modality models (LMMs) have seen extensive exploration and application in various quality assessment studies, their integration into Point Cloud Quality Assessment (PCQA) remains unexplored. Given LMMs' exceptional performance and robustness in low-level vision and quality assessment tasks, this study aims to investigate the feasibility of imparting PCQA knowledge to LMMs through text supervision. To achieve this, we transform quality labels into textual descriptions during the fine-tuning phase, enabling LMMs to derive quality rating logits from 2D projections of point clouds. To compensate for the loss of perception in the 3D domain, structural features are extracted as well. These quality logits and structural features are then combined and regressed into quality scores. Our experimental results affirm the effectiveness of our approach, showcasing a novel integration of LMMs into PCQA that enhances model understanding and assessment accuracy. We hope our contributions can inspire subsequent investigations into the fusion of LMMs with PCQA, fostering advancements in 3D visual quality analysis and beyond. The code is available at https://github.com/zzc-1998/LMM-PCQA.

Abstract:
Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNNs) or Transformers to learn quality-aware feature representations, achieving commendable performance on natural scene images. However, when applied to AI-Generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs caused by the uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of the semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture of experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets and two traditional IQA datasets show that MA-AGIQA achieves state-of-the-art performance and demonstrate its superior generalization capabilities in assessing the quality of AGIs. The code is available at https://github.com/wangpuyi/MA-AGIQA.

Abstract:
With the rapid development of generative models, AI-Generated Content (AIGC) has increased exponentially in daily life. Among its forms, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating videos of high perceptual quality, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models, along with each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from the text-video alignment and video fidelity perspectives, then leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjectively aligned predictions, validating its effectiveness. The dataset and code are available at https://github.com/QMME/T2VQA.

Abstract:
Recently, Parameter Efficient Fine-Tuning (PEFT), which adjusts or introduces a small number of trainable parameters to calibrate pre-trained models on downstream tasks, has been a hot research topic. However, existing PEFT methods within the traditional fine-tuning framework have two main shortcomings: 1) They overlook the explicit association between trainable parameters and downstream knowledge. 2) They neglect the interaction between the intrinsic task-agnostic knowledge of pre-trained models and the task-specific knowledge of downstream tasks. These oversights lead to insufficient utilization of knowledge and suboptimal performance. To address these issues, we propose a novel fine-tuning framework, named GIST, that can be seamlessly integrated into current PEFT methods in a plug-and-play manner. Specifically, our framework first introduces a trainable token, called the Gist token, when applying PEFT methods on downstream tasks. This token serves as an aggregator of the task-specific knowledge learned by the PEFT methods and builds an explicit association with downstream tasks. Furthermore, to facilitate explicit interaction between task-agnostic and task-specific knowledge, we introduce the concept of knowledge interaction via a Bidirectional Kullback-Leibler Divergence objective. As a result, PEFT methods within our framework can enable the pre-trained model to understand downstream tasks more comprehensively by fully leveraging both types of knowledge. Extensive experiments on 35 datasets demonstrate the universality and scalability of our framework. Notably, the PEFT method within our GIST framework achieves up to a 2.25% increase on the VTAB-1K benchmark with an addition of just 0.8K parameters (0.009‰ of ViT-B/16). The code is available at https://github.com/JCruan519/GIST.
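As a rough illustration of the Bidirectional Kullback-Leibler Divergence objective mentioned above, the following minimal sketch sums the KL divergence in both directions between two predicted class distributions. The abstract does not state which two distributions are compared, so treating them as two sets of logits (for example, one from a class token and one from the Gist token) is an assumption, and the names `softmax`, `kl_divergence`, and `bidirectional_kl` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bidirectional_kl(logits_a, logits_b):
    """Sum of KL(p || q) and KL(q || p), which is symmetric in its inputs.

    Assumption: the two inputs stand for the task-agnostic and
    task-specific predictions; the abstract does not specify them.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    return kl_divergence(p, q) + kl_divergence(q, p)
```

The two-directional sum is zero when both predictions agree and grows as they diverge, so minimizing it pulls each distribution toward the other rather than treating one as a fixed target.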

Abstract:
Visual abductive reasoning aims to make likely explanations for visual observations. We propose a simple yet effective Region Conditioned Adaptation (RCA), a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately at fine- and coarse-grained levels. Adapters are used for fine-tuning CLIP models for downstream tasks, and we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model. Finally, we train our new model with a modified contrastive loss to regress the visual feature simultaneously toward features of literal descriptions and plausible explanations. The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous SOTAs, ranking 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs CPT-CLIP 29.58, higher is better). We also validate that RCA generalizes to local perception benchmarks like RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.

Abstract:
Video Snapshot Compressive Imaging (SCI) uses a low-speed 2D camera to capture high-speed scenes as snapshot compressed measurements, followed by a reconstruction algorithm to retrieve the high-speed video frames. Fast-evolving mobile devices and existing high-performance video SCI reconstruction algorithms motivate us to develop mobile reconstruction methods for real-world applications. Yet, it is still challenging to deploy previous reconstruction algorithms on mobile devices due to their complex inference process, let alone achieve real-time mobile reconstruction. To the best of our knowledge, there is no video SCI reconstruction model designed to run on mobile devices. To this end, we present an effective approach for video SCI reconstruction, dubbed MobileSCI, which can run at real-time speed on mobile devices for the first time. Specifically, we first build a U-shaped 2D convolution-based architecture, which is much more efficient and mobile-friendly than previous state-of-the-art reconstruction methods. Besides, an efficient feature mixing block, based on channel splitting and shuffling mechanisms, is introduced as a novel bottleneck block of our proposed MobileSCI to alleviate the computational burden. Finally, a customized knowledge distillation strategy is utilized to further improve the reconstruction quality. Extensive results on both simulated and real data show that our proposed MobileSCI can achieve superior reconstruction quality with high efficiency on mobile devices. In particular, we can reconstruct a 256x256x8 snapshot compressed measurement with real-time performance (about 35 FPS) on an iPhone 15. Code is available at https://github.com/mcao92/MobileSCI.

Abstract:
With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced to reduce transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, which overlooks the fact that the content viewed on head-mounted displays (HMDs) is actually a rendered viewport instead of an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, which is the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR allows obtaining LR ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In our ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and ERP, enabling end-to-end training of the ResVR pipeline. Furthermore, a spherical pixel shape representation technique is innovatively derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that our ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead. Code is available at https://github.com/lwq20020127/ResVR.

Abstract:
Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies, and the effect of the design of each component is still unclear. In addition, the iterative denoising process incurs considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and the inference process. Based on this in-depth analysis, we tailor each component for efficient high-quality human motion generation. Despite its promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods.

Abstract:
Human gait recognition is crucial in multimedia, enabling identification through walking patterns without direct interaction and enhancing integration across various media forms in real-world applications like smart homes, healthcare, and non-intrusive security. LiDAR's ability to capture depth makes it pivotal for robotic perception and holds promise for real-world gait recognition. In this paper, based on a single LiDAR, we present the Hierarchical Multi-representation Feature Interaction Network (HMRNet) for robust gait recognition. Prevailing LiDAR-based gait datasets primarily derive from controlled settings with predefined trajectories, leaving a gap with real-world scenarios. To facilitate LiDAR-based gait recognition research, we introduce FreeGait, a comprehensive gait dataset from large-scale, unconstrained settings, enriched with multi-modal and varied 2D/3D data. Notably, our approach achieves state-of-the-art performance on the prior dataset (SUSTech1K) and on FreeGait. https://4dvlab.github.io/project_page/FreeGait.html

Abstract:
A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code are available at https://github.com/mvrl/PSM.

Abstract:
Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at https://github.com/jiexuanyan/CPRFL.
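The Asymmetric Loss adopted above can be sketched as follows. This is a minimal version of the commonly used formulation for multi-label classification, which down-weights easy negatives with a focusing exponent and discards very easy negatives with a probability margin. The default values of `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions and are not taken from the abstract.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss summed over the labels of one sample.

    probs:   predicted per-class probabilities in [0, 1]
    targets: 1 for a positive label, 0 for a negative label
    Hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style weight (1 - p)^gamma_pos.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so that
            # negatives with p <= margin contribute nothing.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total
```

With a larger exponent on the negative side than on the positive side, the many easy negatives in a multi-label image contribute little to the loss, which is the suppression of negative samples the abstract refers to.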

Abstract:
Building a unified model for general low-level vision tasks holds significant research and practical value. Current methods encounter several critical issues. Multi-task restoration approaches can address multiple degradation-to-clean restoration tasks, but their applicability to tasks with different target domains (e.g., image stylization) is limited. Methods like PromptGIP can handle multiple input-target domains but rely on the Masked Autoencoder (MAE) paradigm. Consequently, they are tied to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of a backbone network suitable for general tasks. Besides, a new prompt cross-attention is introduced to facilitate interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, significantly outperforming existing methods both quantitatively and qualitatively. Codes are available at https://github.com/chxy95/GenLV.

Abstract:
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, but often neglect the inherent physical knowledge embedded within videos. Furthermore, their high computational demands could impede their application to high-resolution videos. To address these constraints, we introduce a novel approach called Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of our PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used datasets demonstrate the effectiveness and efficiency of the proposed PastNet compared with a range of state-of-the-art methods, particularly in high-resolution scenarios. Our code is available at https://github.com/easylearningscores/PastNet.

Abstract:
Generative adversarial networks (GAN) and generative diffusion models (DM) have been widely used in real-world image super-resolution (Real-ISR) to enhance the image perceptual quality. However, these generative models are prone to generating visual artifacts and false image structures, resulting in unnatural Real-ISR results. Based on the fact that natural images exhibit high self-similarities, i.e., a local patch can have many similar patches to it in the whole image, in this work we propose a simple yet effective self-similarity loss (SSL) to improve the performance of generative Real-ISR models, enhancing the hallucination of structural and textural details while reducing the unpleasant visual artifacts. Specifically, we compute a self-similarity graph (SSG) of the ground-truth image, and enforce the SSG of Real-ISR output to be close to it. To reduce the training cost and focus on edge areas, we generate an edge mask from the ground-truth image, and compute the SSG only on the masked pixels. The proposed SSL serves as a general plug-and-play penalty, which could be easily applied to the off-the-shelf Real-ISR models. Our experiments demonstrate that, by coupling with SSL, the performance of many state-of-the-art Real-ISR models, including those GAN and DM based ones, can be largely improved, reproducing more perceptually realistic image details and eliminating many false reconstructions and visual artifacts. Codes and supplementary material are available at https://github.com/ChrisDud0257/SSL
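The masked self-similarity comparison described above can be illustrated with a small sketch on grayscale images stored as nested lists. For each masked pixel it compares the surrounding patch with patches in a local search window, normalizes the similarities into a distribution, and penalizes the difference between the distributions of the output and the ground truth. The patch size, search window, exponential similarity kernel, and L1 comparison are all assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
import math

def patch(img, r, c, radius):
    """Flattened square patch centred at (r, c), clamping at the borders."""
    h, w = len(img), len(img[0])
    return [img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)]

def self_similarity(img, r, c, patch_radius=1, search_radius=2, scale=1.0):
    """Normalized similarity of the patch at (r, c) to its neighbours."""
    h, w = len(img), len(img[0])
    centre = patch(img, r, c, patch_radius)
    weights = []
    for dr in range(-search_radius, search_radius + 1):
        for dc in range(-search_radius, search_radius + 1):
            rr = min(max(r + dr, 0), h - 1)
            cc = min(max(c + dc, 0), w - 1)
            other = patch(img, rr, cc, patch_radius)
            dist = sum((a - b) ** 2 for a, b in zip(centre, other))
            weights.append(math.exp(-dist / scale))  # assumed kernel
    total = sum(weights)  # at least 1.0, since the centre matches itself
    return [wgt / total for wgt in weights]

def self_similarity_loss(output, target, mask, **kwargs):
    """Mean L1 gap between self-similarity vectors on masked pixels only."""
    diffs = []
    for r, row in enumerate(mask):
        for c, on in enumerate(row):
            if on:
                s_out = self_similarity(output, r, c, **kwargs)
                s_gt = self_similarity(target, r, c, **kwargs)
                gap = sum(abs(a - b) for a, b in zip(s_out, s_gt))
                diffs.append(gap / len(s_gt))
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Restricting the loop to masked pixels mirrors the edge mask in the abstract: the cost scales with the number of edge pixels rather than with the full image size.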

Abstract:
Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple transformer layers. In an attempt to bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly implements the denoising process on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, we incorporate a novel Masked Diffusion Transformer. This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures, thereby expediting the learning process and leading to coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces the denoising computation by leveraging previously calculated results, thereby achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6× faster than traditional diffusion transformers and an inference speed that is 5.7× faster than the standard diffusion model. Our code is available at https://xiaofenmao.github.io/web-project/MDT-A2G/

Abstract:
Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT.
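The teacher-to-student step described above can be sketched as a pseudo-labeling pass over unlabeled target images. The sketch assumes the teacher is any callable that returns detections as dictionaries with a `score` field, and that low-confidence detections are dropped with a fixed threshold, which is a common practice in teacher-student self-training but is not stated in the abstract. `build_pseudo_labels`, `teacher_predict`, and the threshold value are hypothetical.

```python
def build_pseudo_labels(teacher_predict, target_images, score_threshold=0.8):
    """Collect confident teacher detections to supervise a student detector.

    teacher_predict: callable mapping an image to a list of detections,
                     each a dict with "box", "label", and "score" keys
                     (an assumed interface, not the authors' API).
    target_images:   dict mapping image ids to unlabeled target images.
    """
    pseudo = {}
    for image_id, image in target_images.items():
        kept = [d for d in teacher_predict(image)
                if d["score"] >= score_threshold]
        if kept:  # skip images with no confident detection
            pseudo[image_id] = kept
    return pseudo
```

Because the teacher is only used to produce these labels offline, the student that is deployed keeps its original architecture, which is consistent with the abstract's claim that inference speed is not compromised.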

Abstract:
Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding to further enhance the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.

Abstract:
Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF's implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, visual jitter, and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, a Speaker-specific Motion Translator and a Dynamic Gaussian Renderer. The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on an NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

Abstract:
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This approach considers the matching between each candidate video and the query, but it incurs a significant time cost that increases notably as the number of candidates grows. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://github.com/Lilidamowang/T2VIndexer-generativeSearch.

Abstract:
In the field of multi-modal learning, model parameters are typically large, necessitating the use of parameter-efficient fine-tuning (PEFT) techniques. These methods have been pivotal in enhancing training efficiency for downstream tasks in almost all situations. However, directly applying PEFT methods struggles to fully address the intricate demands of multi-modal tasks, such as multi-modal sarcasm detection (MSD), which demands the extraction and comparison of cues from different modalities. MSD, particularly when reliant on textual and visual modalities, faces challenges in identifying sarcasm's incongruity. This issue often arises from the lack of intermodality interaction during tuning, resulting in a disconnect between textual and visual information. In this paper, we introduce a novel approach called Bi-directional Adapter (BA), designated as MoBA. This approach is designed to minimize training parameters while enhancing the model's ability to interpret sarcasm across modalities. By facilitating an exchange between textual and visual information through a low-rank representation, our method adeptly captures the nuances of sarcastic expressions with a reduced number of training parameters. Our empirical studies, carried out on two publicly accessible and emerging datasets, demonstrate that our model substantially improves sarcasm detection accuracy. These findings indicate that our approach provides a more reliable and efficient solution to address the complexities of MSD.

Abstract:
Data-free knowledge distillation (DFKD) has emerged as a pivotal technique in the domain of model compression, substantially reducing the dependency on the original training data. Nonetheless, conventional DFKD methods that employ synthesized training data are prone to the limitations of inadequate diversity and discrepancies in distribution between the synthesized and original datasets. To address these challenges, this paper introduces an innovative approach to DFKD through diverse diffusion augmentation (DDA). Specifically, we revise the paradigm of common data synthesis in DFKD to a composite process through leveraging diffusion models subsequent to data synthesis for self-supervised augmentation, which generates a spectrum of data samples with similar distributions while retaining controlled variations. Furthermore, to mitigate excessive deviation in the embedding space, we introduce an image filtering technique grounded in cosine similarity to maintain fidelity during the knowledge distillation process. Comprehensive experiments conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets showcase the superior performance of our method across various teacher-student network configurations, outperforming the contemporary state-of-the-art DFKD methods. Code will be available at: https://github.com/SLGSP/DDA.
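The cosine-similarity filtering step mentioned above can be sketched as follows: an augmented sample is kept only if its embedding stays close enough to the embedding of the synthesized image it was derived from. Which network produces the embeddings and the threshold value are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_augmented(original_emb, augmented_embs, threshold=0.75):
    """Return indices of augmented samples whose embedding stays close
    to the original one. The threshold is an illustrative assumption."""
    return [i for i, emb in enumerate(augmented_embs)
            if cosine_similarity(original_emb, emb) >= threshold]
```

For example, with an original embedding of `[1.0, 0.0]`, an augmented embedding of `[1.0, 0.1]` passes the filter while `[0.0, 1.0]` and `[-1.0, 0.0]` are rejected, so only variants that remain near the original in embedding space reach the distillation stage.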

Abstract:
Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial association. Code is available at https://github.com/MorningStarOvO/EmDepart.

Abstract:
Geometry and color information provided by point clouds are both crucial for 3D scene understanding. These two pieces of information characterize different aspects of point clouds, but existing methods lack an elaborate design for their discrimination and relevance. Hence, we explore a 3D self-supervised paradigm that can better utilize the relations of point cloud information. Specifically, we propose a universal 3D scene pre-training framework via Geometry-Color Contrast (Point-GCC), which aligns geometry and color information using a Siamese network. To accommodate practical application tasks, we design (i) hierarchical supervision with point-level contrast and reconstruction, and object-level contrast based on a novel deep clustering module, to close the gap between pre-training and downstream tasks; (ii) an architecture-agnostic backbone to adapt to various downstream models. Benefiting from the object-level representation associated with downstream tasks, Point-GCC can directly evaluate model performance, and the results demonstrate the effectiveness of our methods. Transfer learning results on a wide range of tasks also show consistent improvements across all datasets, e.g., new state-of-the-art object detection results on the SUN RGB-D and S3DIS datasets. Codes are released on GitHub.

Abstract:
Industrial parks are critical to urban economic growth. Yet, their development often encounters challenges stemming from imbalances between industrial requirements and urban services, underscoring the need for strategic planning and operations. This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban data including street views, corporate, socio-economic, and geospatial information, capturing the complex relationships and semantics within industrial parks. Alongside this, we present the IndustryScopeGPT framework, which leverages Large Language Models (LLMs) with Monte Carlo Tree Search to enhance tool-augmented reasoning and decision-making in Industrial Park Planning and Operation (IPPO). Our work significantly improves site recommendation and functional planning, demonstrating the potential of combining LLMs with structured datasets to advance industrial park management. This approach sets a new benchmark for intelligent IPPO research and lays a robust foundation for advancing urban industrial development. The dataset and related code are available at https://github.com/Tongji-KGLLM/IndustryScope.

Abstract:
Different from traditional video retrieval, sign language retrieval is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details being drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features with similar semantic information from intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Besides the offline RGB encoder, the whole framework only contains learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets. Code will be available at https://github.com/longtaojiang/SEDS.

Abstract:
In the field of artificial intelligence, AI models are frequently described as 'black boxes' due to the obscurity of their internal mechanisms. This has ignited research interest in model interpretability, especially in attribution methods that offer precise explanations of model decisions. Current attribution algorithms typically evaluate the importance of each parameter by exploring the sample space. A large number of intermediate states are introduced during the exploration process, which may reach the model's Out-of-Distribution (OOD) space. Such intermediate states will impact the attribution results, making it challenging to grasp the relative importance of features. In this paper, we first define the local space and its relevant properties, and we propose the Local Attribution (LA) algorithm that leverages these properties. The LA algorithm comprises both targeted and untargeted exploration phases, which are designed to effectively generate intermediate states for attribution that thoroughly encompass the local space. Compared to the state-of-the-art attribution methods, our approach achieves an average improvement of 38.21% in attribution effectiveness. Extensive ablation studies in our experiments also validate the significance of each component in our algorithm. Our code is available at: https://github.com/LMBTough/LA/

Abstract:
Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. The conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). VeCAF optimizes a parametric data selection model by incorporating the training objective of the model being tuned. Effectively, this guides the PVM towards the performance goal with improved data and computational efficiency. With the ever-growing feasibility of acquiring labels and natural language annotations of image data through web-scale crawling, we exploit the inherent semantic richness of the text embedding space and utilize text embeddings of image annotations to augment PVM image features for better data selection and finetuning. Furthermore, the flexibility of text-domain augmentation gives VeCAF the unique ability to handle out-of-distribution scenarios without external augmented data. Extensive experiments show the leading performance and high efficiency of VeCAF, which is superior to baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF needs up to 3.3× fewer training batches to reach the target performance compared to full fine-tuning and achieves an accuracy improvement of 2.8% over active SOTA fine-tuning methods with the same number of batches. Our code is now available at https://github.com/RoyZry98/VeCAF-Pytorch.

Abstract:
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.

Abstract:
In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.

Abstract:
The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, resulting in MLLMs falling short of aesthetics perception capabilities. To address the above challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K pieces of human natural language feedback, which are collected via progressive questions, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune the open-sourced general foundation models, achieving multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than the state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project Page: https://yipoh.github.io/aes-expert/.

Abstract:
In recent years, immersive communication has emerged as a compelling alternative to traditional video communication methods. One prospective avenue for immersive communication involves augmenting the user's immersive experience through the transmission of three-dimensional (3D) talking heads (THs). However, transmitting 3D THs poses significant challenges due to their complex and voluminous nature, often leading to pronounced distortion and a compromised user experience. Addressing this challenge, we introduce the 3D Talking Heads Quality Assessment (THQA-3D) dataset, comprising 1,000 sets of distorted and 50 original TH mesh sequences (MSs), to facilitate quality assessment in 3D TH transmission. A subjective experiment, characterized by a novel interactive approach, is conducted with recruited participants to assess the quality of MSs in the THQA-3D dataset. Leveraging this dataset, we also propose a multimodal Quality-of-Experience (QoE) method incorporating a Large Quality Model (LQM). This method involves frontal projection of MSs and subsequent rendering into videos, with quality assessment facilitated by the LQM and a variable-length video memory filter (VVMF). Additionally, tone-lip coherence and silence detection techniques are employed to characterize audio-visual coherence in 3D MS streams. Experimental evaluation demonstrates the proposed method's superiority, achieving state-of-the-art performance on the THQA-3D dataset and competitiveness on other QoE datasets. Both the THQA-3D dataset and the QoE model have been publicly released at https://github.com/zyj-2000/THQA-3D

Abstract:
Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantic gaps and missing labels in a multi-label image. To tackle this challenge, we propose Text-Region Matching for optimizing Multi-Label prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or pixels, which contributes to bridging the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we further introduce multimodal contrastive learning to narrow the semantic gap between textual and visual modalities and establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art methods by a significant margin. Our code is available here.

Abstract:
In this paper, we introduce a novel approach to single-image super-resolution (SISR) that balances perceptual quality and distortion through multi-objective optimization (MOO). Traditional pixel-based distortion metrics like PSNR and SSIM often fail to align with human perceptual quality, resulting in blurry outputs despite high scores. To address this, we propose the Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework, which dynamically adjusts loss weights during training. This reduces the need for manual hyperparameter tuning and lessens computational demands compared to AutoML. Our method conceptualizes the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions, optimized to achieve an optimal perception-distortion Pareto frontier. Extensive experiments demonstrate that MOBOSR surpasses current state-of-the-art methods in both perception and distortion, significantly advancing the perception-distortion Pareto frontier. Our work lays a foundation for future exploration of the balance between perceptual quality and fidelity in image restoration tasks. Source codes and pretrained models are available at: https://github.com/ZhuKeven/MOBOSR.
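
To make the notion of a perception-distortion Pareto frontier concrete, the following minimal sketch filters a set of candidate models down to the non-dominated ones. It is not the MOBOSR optimizer. The candidate names and scores are invented, and the convention that the distortion score is higher-is-better (as with PSNR) while the perceptual score is lower-is-better is an assumption.

```python
def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    Each candidate is (name, distortion_score, perceptual_score), where a higher
    distortion_score and a lower perceptual_score are assumed to be better.
    """
    front = []
    for name, dist, perc in candidates:
        dominated = any(
            # The other candidate is at least as good on both axes...
            other_dist >= dist and other_perc <= perc
            # ...and strictly better on at least one of them.
            and (other_dist > dist or other_perc < perc)
            for _, other_dist, other_perc in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical results for four loss-weight settings.
results = [
    ("A", 30.0, 0.20),
    ("B", 29.0, 0.10),
    ("C", 28.5, 0.15),  # worse than B on both axes
    ("D", 30.5, 0.30),
]
print(pareto_front(results))  # ['A', 'B', 'D']
```

Setting C is dropped because B beats it on both axes; A, B, and D each represent a different trade-off and therefore all stay on the frontier.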

Abstract:
Human motion prediction is crucial for human-centric multimedia understanding and interaction. Current methods typically rely on ground truth human poses as observed input, which is not practical for real-world scenarios where only raw visual sensor data is available. To implement these methods in practice, a preliminary pose estimation phase is essential. However, such two-stage approaches often lead to performance degradation due to the accumulation of errors. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes the density of information, resulting in the loss of fine-grained features. In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which receives the raw LiDAR point cloud as input and forecasts future 3D human poses directly. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motions for further refinement of prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments. https://4dvlab.github.io/project_page/LiDARHMP.html

Abstract:
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet. However, this reliance poses privacy risks, as hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information. Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection. However, they are designed for unimodal classification, and their use in MCL remains largely unexplored. We first explore this context by evaluating the performance of existing methods on image-caption pairs, and find that they do not generalize effectively to multimodal data and have limited ability to build shortcuts due to the lack of labels and the dispersion of pairs in MCL. In this paper, we propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples. It extends the Error-Minimization (EM) framework to optimize both image noise and an additional text trigger, thereby enlarging the optimized space and effectively misleading the model to learn the shortcut between the noise features and the text trigger. Specifically, we adopt projected gradient descent to solve the noise minimization problem and use HotFlip to approximate the gradient and replace words to find the optimal text trigger. Extensive experiments demonstrate the effectiveness of MEM, with post-protection retrieval results nearly half of random guessing, and its high transferability across different models. Our code is available at https://github.com/thinwayliu/Multimodal-Unlearnable-Examples
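
The projected gradient descent step mentioned above can be sketched on a toy problem. The code below minimizes a simple quadratic loss with respect to a perturbation while keeping every component within an L-infinity budget `eps`. The quadratic loss stands in for a real model's training loss, and all names and values (`pgd_error_minimize`, `toy_loss`, `eps`, `step`) are assumptions for illustration. The HotFlip text-trigger search is not covered.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def toy_loss(x, delta, target):
    """Stand-in loss: squared distance between the perturbed input and a target."""
    return sum((xi + di - ti) ** 2 for xi, di, ti in zip(x, delta, target))


def pgd_error_minimize(x, target, eps=0.1, step=0.02, iters=20):
    """Find a bounded perturbation that lowers the toy loss.

    Each iteration takes a signed gradient step downhill and then projects the
    perturbation back onto the box [-eps, eps] in every coordinate.
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        # Gradient of the quadratic toy loss with respect to delta.
        grad = [2.0 * (xi + di - ti) for xi, di, ti in zip(x, delta, target)]
        # Descend, because error minimization reduces the loss.
        delta = [di - step * sign(g) for di, g in zip(delta, grad)]
        # Projection onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


x = [0.5, 0.2]
target = [0.0, 0.25]
delta = pgd_error_minimize(x, target)
print(toy_loss(x, [0.0, 0.0], target), toy_loss(x, delta, target))
```

The sign of the update is what separates this from adversarial attacks: those ascend the loss, while error-minimizing noise descends it so that the perturbed sample looks already learned.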

Abstract:
Detecting out-of-distribution (OOD) inputs is a principal task for ensuring the safety of deploying deep-neural-network classifiers in open-set scenarios. OOD samples can be drawn from arbitrary distributions and exhibit deviations from in-distribution (ID) data in various dimensions, such as foreground features (e.g., objects in CIFAR100 images vs. those in CIFAR10 images) and background features (e.g., textural images vs. objects in CIFAR10). Existing methods can confound foreground and background features in training, failing to utilize the background features for OOD detection. This paper considers the importance of feature disentanglement in out-of-distribution detection and proposes the simultaneous exploitation of both foreground and background features to support the detection of OOD inputs. To this end, we propose a novel framework that first disentangles foreground and background features from ID training samples via a dense prediction approach, and then learns a new classifier that can evaluate the OOD scores of test images from both foreground and background features. It is a generic framework that allows for a seamless combination with various existing OOD detection methods. Extensive experiments show that our approach 1) can substantially enhance the performance of four different state-of-the-art (SotA) OOD detection methods on multiple widely-used OOD datasets with diverse background features, and 2) achieves new SotA performance on these benchmarks. Code is available at https://github.com/mala-lab/DFB.

Abstract:
The modality gap between RGB and thermal infrared (TIR) images is a crucial but often overlooked issue in existing RGBT tracking methods. It can be observed that the modality gap mainly lies in the image style difference. In this work, we propose a novel Coupled Knowledge Distillation framework called CKD, which pursues common styles of different modalities to break the modality gap, for high-performance RGBT tracking. In particular, we introduce two student networks and employ the style distillation loss to make their style features as consistent as possible. By alleviating the style difference between the two student networks, we can break the modality gap well. However, the distillation of style features might harm the content representations of the two modalities in the student networks. To handle this issue, we take the original RGB and TIR networks as the teachers, and distill their content knowledge into the two student networks respectively via a style-content orthogonal feature decoupling scheme. We couple the above two distillation processes in an online optimization framework to form new feature representations of RGB and thermal modalities without a modality gap. In addition, we incorporate a masked modeling strategy and a multi-modal candidate token elimination strategy into CKD to improve tracking robustness and efficiency respectively. Extensive experiments on five standard RGBT tracking datasets validate the effectiveness of the proposed method against state-of-the-art methods while achieving the fastest tracking speed of 96.4 FPS.

Abstract:
Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, where most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both training and inference phases, and capture intricate relationships among different body parts collectively; 3) these methods treat all sign glosses uniformly, failing to account for differences in complexity levels regarding skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism to focus on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.
The artefacts and code related to this study are publicly available online (https://github.com/mpuu00001/Siformer.git).

Abstract:
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity, in that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) the presence of numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining strategy, assisting the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.

Abstract:
Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose IC-Mapper, an instance-centric online mapping framework, which comprises two primary components: 1) Instance-centric temporal association module: For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) Instance-centric spatial fusion module: We integrate features from the detected instances of the current frame with BEV features and spatially sampled points from the historical map. Then, we concatenate point sets with the same ID to achieve real-time map expansion and updating. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods. Code will be released on https://github.com/Brickzhuantou/IC-Mapper.
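
A minimal sketch of cross-frame instance association using both feature and geometric cues is given below. It is not the IC-Mapper module. The cost weights, the greedy one-to-one matching, the `max_cost` threshold, and the instance names are all assumptions chosen to keep the example short. A real system might use learned matching or Hungarian assignment instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def associate(prev, curr, w_feat=0.5, w_geo=0.5, max_cost=1.0):
    """Greedily match current detections to previous instances.

    prev and curr map an identifier to (feature_vector, center_xy).
    Returns {current_id: previous_id}; unmatched detections are left out.
    """
    pairs = []
    for pid, (p_feat, p_center) in prev.items():
        for cid, (c_feat, c_center) in curr.items():
            feature_cost = 1.0 - cosine(p_feat, c_feat)
            geometric_cost = math.dist(p_center, c_center)
            pairs.append((w_feat * feature_cost + w_geo * geometric_cost, pid, cid))

    matches, used_prev = {}, set()
    for cost, pid, cid in sorted(pairs):
        if cost > max_cost:
            break  # every remaining pair is too dissimilar to be the same instance
        if pid in used_prev or cid in matches:
            continue  # enforce one-to-one matching
        matches[cid] = pid
        used_prev.add(pid)
    return matches


# Hypothetical map instances from the previous frame and detections in the current one.
prev = {"lane_1": ([1.0, 0.0], (0.0, 0.0)), "lane_2": ([0.0, 1.0], (5.0, 0.0))}
curr = {
    "det_a": ([0.9, 0.1], (0.2, 0.0)),
    "det_b": ([0.1, 0.9], (5.1, 0.0)),
    "det_c": ([1.0, 1.0], (20.0, 20.0)),  # far from everything: treated as new
}
print(associate(prev, curr))
```

Detections that inherit a previous identifier can then have their point sets concatenated with the historical ones, while unmatched detections such as `det_c` would start new instances.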

Abstract:
Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our work proposes a novel approach in scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation. The project is available at https://ainnovatelab.github.io/Context-aware-Indoor-PCG.

Abstract:
Image steganography is the process of hiding secret data in a cover image by subtle perturbation. Recent studies show that it is feasible to use a fixed neural network for data embedding and extraction. Such Fixed Neural Network Steganography (FNNS) demonstrates favorable performance without the need for training networks, making it more practical for real-world applications. However, the stego-images generated by the existing FNNS methods exhibit high distortion, which makes them prone to detection by steganalysis tools. To deal with this issue, we propose a Cover-separable Fixed Neural Network Steganography, namely Cs-FNNS. In Cs-FNNS, we propose a Steganographic Perturbation Search (SPS) algorithm to directly encode the secret data into an imperceptible perturbation, which is combined with an AI-generated cover image for transmission. By accessing the same deep generative models, the receiver can reproduce the cover image using a pre-agreed key, to separate the perturbation in the stego-image for data decoding. Such an encoding/decoding strategy focuses on the secret data and eliminates the disturbance of the cover images, hence achieving a better performance. We apply our Cs-FNNS to the steganographic task of hiding secret images within cover images. Through comprehensive experiments, we demonstrate the superior performance of the proposed method in terms of visual quality and undetectability. Moreover, we show the flexibility of our Cs-FNNS in terms of hiding multiple secret images for different receivers. Code is available at https://github.com/albblgb/Cs-FNNS

Abstract:
Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. There are two main approaches to constructing SNNs. Direct training methods require much memory, while conversion methods offer a simpler and more efficient option. However, current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. Converting Transformers to SNNs is challenging because of the presence of non-linear modules. In this paper, we propose an Expectation Compensation Module to preserve the accuracy of the conversion. The core idea is to use information from the previous T time-steps to calculate the expected output at time-step T. We also propose a Multi-Threshold Neuron and the corresponding Parallel Parameter normalization to address the challenge of large time steps needed for high accuracy, aiming to reduce network latency and power consumption. Our experimental results demonstrate that our approach achieves state-of-the-art performance. For example, we achieve a top-1 accuracy of 88.60% with only a 1% loss in accuracy using 4 time steps while consuming only 35% of the original power of the Transformer. To our knowledge, this is the first successful Artificial Neural Network (ANN) to SNN conversion for Spiking Transformers that achieves high accuracy, low latency, and low power consumption on complex datasets. The source codes of the proposed method are available at https://github.com/h-z-h-cell/Transformer-to-SNN-ECMT.

Abstract:
Recently, diffusion models have made significant strides in synthesizing realistic 2D human images based on provided text prompts. Building upon this, researchers have extended 2D text-to-image diffusion models into the 3D domain for generating human textures (UV Maps). However, some important problems about UV Map Generative models are still not solved, i.e., how to generate personalized texture maps for any given face image, and how to define and evaluate the quality of these generated texture maps. To solve the above problems, we introduce a novel method, UVMap-ID, which is a controllable and personalized UV Map generative model. Unlike traditional large-scale training methods in 2D, we propose to fine-tune a pre-trained text-to-image diffusion model which is integrated with a face fusion module for achieving ID-driven customized generation. To support the finetuning strategy, we introduce a small-scale attribute-balanced training dataset, including high-quality textures with labeled text and Face ID. Additionally, we introduce some metrics to evaluate the multiple aspects of the textures. Finally, both quantitative and qualitative analyses demonstrate the effectiveness of our method in controllable and personalized UV Map generation.

Abstract:
3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at https://github.com/xingyi-li/iControl3D.

Abstract:
We propose RainyScape, an unsupervised framework to reconstruct pristine scenes from a collection of multi-view rainy images. RainyScape consists of two main modules: a neural rendering module and a rain-prediction module that incorporates a predictor network and a learnable latent embedding that captures the rain characteristics of the scene. Specifically, leveraging the spectral bias property of neural networks, we first optimize the neural rendering pipeline to obtain a low-frequency scene representation. Subsequently, we jointly optimize the two modules, driven by the proposed adaptive direction-sensitive gradient-based reconstruction loss, which encourages the network to distinguish between scene details and rain streaks, facilitating the propagation of gradients to the relevant components. Extensive experiments on both the classic neural radiance field and the recently proposed 3D Gaussian splatting demonstrate the superiority of our method in effectively eliminating rain streaks and rendering clean images, achieving state-of-the-art performance. The constructed high-quality dataset, source code, and supplementary material are publicly available at https://github.com/lyuxianqiang/RainyScape.

Abstract:
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host a comprehensive leaderboard (https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) to track the progress of multi-modality learning research. The toolkit is released on GitHub (https://github.com/open-compass/VLMEvalKit) and is actively maintained.

Abstract:
Depression is a prevalent mental health disorder that significantly impacts individuals' lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, many end-to-end deep learning methods have leveraged facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce more computational cost due to the selection of CNN-based backbones and redundant facial features. To address the above limitations, by considering the timing correlation of facial expressions, we propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. By harnessing the bidirectional nature and proficiently addressing long-term dependencies, the Facial Motion Modeling Module (FMMM) is designed in FacialPulse to fully capture temporal features. Since the proposed FMMM has parallel processing capabilities and a gate mechanism to mitigate gradient vanishing, this module can also significantly boost the training speed. Besides, to effectively use facial landmarks in place of original images and thereby decrease information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors and further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse in recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 21% compared to baselines, and the recognition speed increased by 100% compared to state-of-the-art methods. Code is released at https://github.com/volatileee/FacialPulse.

Abstract:
Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationship between labeled and unlabeled data. To address this issue, we propose a semi-supervised multi-organ segmentation method called GuidedNet, which leverages the knowledge from labeled data to guide the training of unlabeled data. The primary goals of this study are to improve the quality of pseudo-labels for unlabeled data and to enhance the network's learning capability for both small and complex organs. A key concept is that voxel features from labeled and unlabeled data that are close to each other in the feature space are more likely to belong to the same class. On this basis, a 3D Consistent Gaussian Mixture Model (3D-CGMM) is designed to leverage the feature distributions from labeled data to rectify the generated pseudo-labels. Furthermore, we introduce a Knowledge Transfer Cross Pseudo Supervision (KT-CPS) strategy, which leverages the prior knowledge obtained from the labeled data to guide the training of the unlabeled data, thereby improving the segmentation accuracy for both small and complex organs. Extensive experiments on two public datasets, FLARE22 and AMOS, demonstrated that GuidedNet is capable of achieving state-of-the-art performance.

Abstract:
Ultra-high-definition (UHD) technology has attracted widespread attention due to its exceptional visual quality, but it also poses new challenges for low-light image enhancement (LLIE) techniques. UHD images inherently possess high computational complexity, leading existing UHD LLIE methods to employ high-magnification downsampling to reduce computational costs, which in turn results in information loss. The wavelet transform not only allows downsampling without loss of information, but also separates the image content from the noise. It enables state space models (SSMs) to avoid being affected by noise when modeling long sequences, thus making full use of the long-sequence modeling capability of SSMs. On this basis, we propose Wave-Mamba, a novel approach based on two pivotal insights derived from the wavelet domain: 1) most of the content information of an image exists in the low-frequency component and less in the high-frequency component; 2) the high-frequency component exerts a minimal influence on the outcomes of low-light enhancement. Specifically, to efficiently model global content information on UHD images, we propose a low-frequency state space block (LFSSBlock) by improving SSMs to focus on restoring the information of low-frequency sub-bands. Moreover, we propose a high-frequency enhance block (HFEBlock) for high-frequency sub-band information, which uses the enhanced low-frequency information to correct the high-frequency information and effectively restore the correct high-frequency details. Through comprehensive evaluation, our method has demonstrated superior performance, significantly outperforming current leading techniques while maintaining a more streamlined architecture. The code is available at https://github.com/AlexZou14/Wave-Mamba.
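
The claim that a wavelet transform allows downsampling without loss of information can be checked with a minimal sketch. It applies a single-level 2D Haar transform (in an unnormalized average/difference form, chosen here for simplicity) to a small image, producing four half-resolution sub-bands, and then reconstructs the image exactly. The function names `haar2d` and `ihaar2d` are hypothetical, and the specific wavelet used by Wave-Mamba is not stated in the abstract, so Haar is an assumption.

```python
def haar2d(img):
    """Single-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands: LL (low-frequency content)
    and LH, HL, HH (high-frequency detail).
    """
    h, w = len(img), len(img[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    lh = [[0.0] * (w // 2) for _ in range(h // 2)]
    hl = [[0.0] * (w // 2) for _ in range(h // 2)]
    hh = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 4   # block average
            lh[i // 2][j // 2] = (a + b - c - d) / 4   # top minus bottom
            hl[i // 2][j // 2] = (a - b + c - d) / 4   # left minus right
            hh[i // 2][j // 2] = (a - b - c + d) / 4   # diagonal difference
    return ll, lh, hl, hh


def ihaar2d(ll, lh, hl, hh):
    """Invert haar2d, recovering the full-resolution image."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, v, hz, d = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = s + v + hz + d
            img[2 * i][2 * j + 1] = s + v - hz - d
            img[2 * i + 1][2 * j] = s - v + hz - d
            img[2 * i + 1][2 * j + 1] = s - v - hz + d
    return img


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    bands = haar2d(image)
    print(bands[0])                    # low-frequency sub-band at half resolution
    print(ihaar2d(*bands) == image)    # reconstruction matches the input
```

Plain downsampling would keep only something like the LL band; keeping all four sub-bands is what makes the reduction in spatial resolution invertible.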

Abstract:
Zero-shot anomaly detection (ZSAD) methods detect anomalies without prior access to known normal or abnormal samples within target categories. Existing methods typically rely on pretrained multimodal models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image patch features to detect anomalies. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and a Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset. Code is available at https://github.com/CASIA-IVA-Lab/FiLo.
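
The similarity-based scoring that the abstract attributes to existing methods can be sketched as follows. A patch feature is compared against one "normal" and one "abnormal" text feature by cosine similarity, and a softmax over the two similarities gives an anomaly probability. This is a generic illustration of that baseline idea, not FiLo itself; the function names and the temperature value of 0.07 are assumptions, and the toy 2D vectors stand in for real encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_score(patch, normal_text, abnormal_text, temperature=0.07):
    """Softmax probability that a patch matches the 'abnormal' text feature."""
    s_normal = cosine(patch, normal_text) / temperature
    s_abnormal = cosine(patch, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)           # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)


if __name__ == "__main__":
    normal_text, abnormal_text = [1.0, 0.0], [0.0, 1.0]
    print(anomaly_score([0.9, 0.1], normal_text, abnormal_text))  # close to 0
    print(anomaly_score([0.1, 0.9], normal_text, abnormal_text))  # close to 1
```

Because each patch is scored on its own, a sketch like this makes the limitation noted in the abstract visible: nothing in the score accounts for the size or shape of an anomalous region.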

Abstract:
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for understanding human sentiment. Most existing works rely on superficial information and neglect contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs), which restricts their performance. In this paper, we propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and corresponding texts, simultaneously generating pertinent context. Besides, to reduce the noise in the context, we design a training-free contextual fusion mechanism. We evaluate WisdoM on both aspect-level and sentence-level MSA tasks on the Twitter2015, Twitter2017, and MSED datasets. Experiments on these three MSA benchmarks with several advanced LVLMs show that our approach brings consistent and significant improvements (up to +6.3% F1 score). Code is available at https://github.com/DreamMr/WisdoM.

Abstract:
Ancient artifacts are an important medium for cultural preservation and restoration. However, many physical copies of artifacts are either damaged or lost, leaving a blank space in archaeological and historical studies that calls for techniques to re-visualize these artifacts. Despite the significant advancements in open-domain text-to-image synthesis, existing approaches fail to capture the important domain knowledge presented in the textual descriptions of artifacts, resulting in errors in recreated images such as incorrect shapes and patterns. In this paper, we propose a novel knowledge-aware artifact image synthesis approach that brings lost historical objects accurately into their visual forms. We use a pretrained diffusion model as backbone and introduce three key techniques to enhance the text-to-image generation framework: 1) we construct prompts with explicit archaeological knowledge elicited from large language models (LLMs); 2) we incorporate additional textual guidance to correlated historical expertise in a contrastive manner; 3) we introduce further visual-semantic constraints on edge and perceptual features that enable our model to learn more intricate visual details of the artifacts. Compared to existing approaches, our proposed model produces higher-quality artifact images that align better with the implicit details and historical knowledge contained within written documents, thus achieving significant improvements both across automatic metrics and in human evaluation.

Abstract:
Although Federated Learning has been widely studied in recent years, each communication round still incurs high overhead for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block, and has each client upload a specific parameter block, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor of 1/N lower than those of existing methods, where N is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms.
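
The idea of uploading only one parameter block per client can be illustrated with a minimal sketch. A flat parameter vector is split into blocks, each client sends back a single block, and the server averages the uploads it receives for each block. This is a simplified illustration under stated assumptions, not the FedBCGD algorithm: it omits the shared block, client drift control, and variance reduction, and the names `split_blocks` and `aggregate` are hypothetical.

```python
def split_blocks(params, n_blocks):
    """Split a flat parameter list into contiguous blocks of near-equal size."""
    size = (len(params) + n_blocks - 1) // n_blocks   # ceiling division
    return [params[i:i + size] for i in range(0, len(params), size)]


def aggregate(global_params, uploads, n_blocks):
    """Average per-block client uploads into the global parameters.

    uploads is a list of (block_index, block_values) pairs, one per client.
    Blocks that no client uploaded keep their previous global values.
    """
    blocks = split_blocks(global_params, n_blocks)
    for k in range(len(blocks)):
        received = [values for index, values in uploads if index == k]
        if received:
            # Element-wise mean over all clients that uploaded block k.
            blocks[k] = [sum(column) / len(received) for column in zip(*received)]
    return [p for block in blocks for p in block]


if __name__ == "__main__":
    global_params = [0.0, 0.0, 0.0, 0.0]
    uploads = [(0, [1.0, 3.0]), (0, [3.0, 5.0]), (1, [2.0, 2.0])]
    print(aggregate(global_params, uploads, n_blocks=2))
```

In this toy setting each client transmits half of the parameters instead of all of them, which is the source of the per-round saving that scales with the number of blocks.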

Abstract:
LiDAR-based 3D object detection has seen impressive advances in recent times. However, deploying trained 3D detectors in the real world often yields unsatisfactory performance when the distribution of the test data significantly deviates from the training data due to different weather conditions, object sizes, etc. A key factor in this performance degradation is the diminished generalizability of pre-trained models, which creates a sharp loss landscape during training. Such sharpness, when encountered during testing, can precipitate significant performance declines, even with minor data variations. To address the aforementioned challenges, we propose dual-perturbation optimization (DPO) for Test-time Adaptation in 3D Object Detection (TTA-3OD). We minimize the sharpness to cultivate a flat loss landscape to ensure model resiliency to minor data variations, thereby enhancing the generalization of the adaptation process. To fully capture the inherent variability of the test point clouds, we further introduce adversarial perturbation to the input BEV features to better simulate the noisy test environment. As the dual perturbation strategy relies on trustworthy supervision signals, we utilize a reliable Hungarian matcher to filter out pseudo-labels sensitive to perturbations. Additionally, we introduce early Hungarian cutoff to avoid error accumulation from incorrect pseudo-labels by halting the adaptation process. Extensive experiments across three types of transfer tasks demonstrate that the proposed DPO significantly surpasses previous state-of-the-art approaches, specifically on Waymo → KITTI, outperforming the most competitive baseline by 57.72% in AP3D and reaching 91% of the fully supervised upper bound. Our code is available at https://github.com/Jo-wang/DPO.

Abstract:
Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and thereby gained ever-increasing attention in skeleton action recognition. However, the existing transformer-based approaches heavily rely on the naive attention mechanism for capturing the spatiotemporal features, which falls short in learning discriminative representations for actions that exhibit similar motion patterns. To address this challenge, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer), specifically designed for recognizing similar skeletal actions with subtle discriminative motions. First, we introduce a frequency-aware attention module to unweave skeleton frequency representations by embedding joint features into frequency attention maps, aiming to distinguish the discriminative movements based on their frequency coefficients. Subsequently, we develop a mixed transformer architecture to incorporate spatial features with frequency features to model the comprehensive frequency-spatial patterns. Additionally, a temporal transformer is proposed to extract the global correlations across frames. Extensive experiments show that FreqMixFormer outperforms SOTA on 3 popular skeleton action recognition datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA. Our project is publicly available at: https://github.com/wenhanwu95/FreqMixFormer.

Abstract:
High-resolution point cloud (HRPCD) anomaly detection (AD) plays a critical role in precision machining and high-end equipment manufacturing. Although many 3D-AD methods have been proposed recently, they still cannot meet the requirements of the HRPCD-AD task. There are several challenges: i) it is difficult to directly capture HRPCD information due to the large number of points at the sample level; ii) advanced transformer-based methods usually obtain anisotropic features, leading to degradation of the representation; iii) the proportion of abnormal areas is very small, which makes them difficult to characterize. To address these challenges, we propose a novel group-level feature-based network, called Group3AD, which has a significantly efficient representation ability. First, we design an Intercluster Uniformity Network (IUN) to present the mapping of different groups in the feature space as several clusters, and obtain a more uniform distribution between clusters representing different parts of the point clouds in the feature space. Then, an Intracluster Alignment Network (IAN) is designed to encourage groups within the cluster to be distributed tightly in the feature space. In addition, we propose an Adaptive Group-Center Selection (AGCS) based on geometric information to improve the pixel density of potential anomalous regions during inference. The experimental results verify the effectiveness of our proposed Group3AD, which surpasses Reg3D-AD by a margin of 5% in terms of object-level AUROC on Real3D-AD. We provide the code and supplementary information on our website: https://github.com/M-3LAB/Group3AD.

Abstract:
Lane detection (LD) is an essential component of autonomous driving systems, providing fundamental functionalities like adaptive cruise control and automated lane centering. Existing LD benchmarks primarily focus on evaluating common cases, neglecting the robustness of LD models against environmental illusions such as shadows and tire marks on the road. This research gap poses significant safety challenges since these illusions exist naturally in real-world traffic situations. For the first time, this paper studies the potential threats caused by these environmental illusions to LD and establishes LanEvil, the first comprehensive benchmark for evaluating the robustness of LD against this natural corruption. We systematically design 14 prevalent yet critical types of environmental illusions (e.g., shadow, reflection) that cover a wide spectrum of real-world influencing factors in LD tasks. Based on real-world environments, we create 94 realistic and customizable 3D cases using the widely used CARLA simulator, resulting in a dataset comprising 90,292 sampled images. Through extensive experiments, we benchmark the robustness of popular LD methods using LanEvil, revealing substantial performance degradation (-5.37% Accuracy and -10.70% F1-Score on average), with shadow effects posing the greatest risk (-7.39% Accuracy). Additionally, we assess the performance of commercial auto-driving systems OpenPilot and Apollo through collaborative simulations, demonstrating that the proposed environmental illusions can lead to incorrect decisions and potential traffic accidents. To defend against environmental illusions, we propose the Attention Area Mixing (AAM) approach using hard examples, which yields a significant robustness improvement (+3.76%) under illumination effects. We hope our paper can contribute to advancing more robust auto-driving systems in the future. Part of our dataset and demos can be found at https://lanevil.github.io/.

Abstract:
Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvQDet that can detect 7 state-of-the-art query-based attacks with >99% detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks. Code is available at https://github.com/xinwong/AdvQDet.

Abstract:
In this paper, we delve into the concept of interpretable image enhancement, a technique that enhances image quality by adjusting filter parameters with easily understandable names such as "Exposure" and "Contrast". Unlike using predefined image editing filters, our framework utilizes learnable filters that acquire interpretable names through training. Our contribution is two-fold. Firstly, we introduce a novel filter architecture called an image-adaptive neural implicit lookup table, which uses a multilayer perceptron to implicitly define the transformation from input feature space to output color space. By incorporating image-adaptive parameters directly into the input features, we achieve highly expressive filters. Secondly, we introduce a prompt guidance loss to assign interpretable names to each filter. We evaluate visual impressions of enhancement results, such as exposure and contrast, using a vision and language model along with guiding prompts. We define a constraint to ensure that each filter affects only the targeted visual impression without influencing other attributes, which allows us to obtain the desired filter effects. Experimental results show that our method outperforms existing predefined filter-based methods, thanks to the filters optimized to predict target results. Our source code is available at https://github.com/satoshi-kosugi/PG-IA-NILUT.

Abstract:
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis. Yet, compared to text-to-image (T2I) editing, text-to-video (T2V) editing suffers from poor temporal consistency and structure, due to insufficient pre-training data, limited model editability, or extensive tuning costs. To address this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework that achieves high-quality T2V editing by integrating various T2I and T2V LDMs. Specifically, FLDM utilizes a hyper-parameter with an update schedule to effectively fuse image and video latents during the denoising process. This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency, ultimately generating high-quality videos. It is worth noting that FLDM can serve as a versatile plugin, applicable to off-the-shelf image and video LDMs, to significantly enhance the quality of video editing. Extensive quantitative and qualitative experiments on popular T2I and T2V LDMs demonstrate that FLDM achieves superior editing quality compared to state-of-the-art T2V editing methods.

Abstract:
The weak diagram analysis abilities of LLMs and Multimodal LLMs greatly limit their application scenarios for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of academic papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance. The dataset, code, and model are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.

Abstract:
In recent years, Multi-Modality Image Fusion (MMIF) has been applied to many fields, which has attracted many scholars to endeavour to improve the fusion performance. However, the prevailing focus has predominantly been on the architecture design, rather than the training strategies. As a low-level vision task, image fusion is supposed to quickly deliver output images for observation and supporting downstream tasks. Thus, superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained by three carefully designed supervisions. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency for the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception clues from the source images. Third, an innovative dynamic refresh training strategy is used to collaborate history parameters and current supervision during training, together with an adaptive adjust function to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method exhibits promising advantages in terms of model efficiency and complexity, with superior performance in multiple image fusion tasks and downstream pedestrian detection application. The code of this work is publicly available at https://github.com/yanglinDeng/MMDRFuse.

Abstract:
In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component's role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement.
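
The amplitude and phase decomposition described at the start of this abstract can be sketched on a 1D signal. The snippet computes a discrete Fourier transform, splits it into amplitude and phase, scales the amplitude while copying the phase unchanged, and transforms back. It illustrates the baseline strategy the abstract attributes to earlier methods, not DMFourLLIE itself; the function names are hypothetical, and a direct O(n^2) DFT is used only to keep the example dependency-free.

```python
import cmath


def dft(x):
    """Discrete Fourier transform of a real or complex sequence (O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split(spectrum):
    """Separate a spectrum into amplitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge(amplitude, phase):
    """Recombine amplitude and phase into a complex spectrum."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


if __name__ == "__main__":
    dark = [0.1, 0.4, 0.2, 0.3]                 # toy low-light intensities
    amplitude, phase = split(dft(dark))
    brighter = idft(merge([2 * a for a in amplitude], phase))
    print(brighter)                             # each sample is about doubled
```

Scaling every amplitude by the same factor only rescales brightness, which is why the phase, carrying the structure, is the component the abstract argues deserves explicit modeling.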

Abstract:
In recent years, Few-Shot Object Detection (FSOD) has gained widespread attention and made significant progress due to its ability to build models with good generalization power using extremely limited annotated data. The fine-tuning based paradigm currently dominates this field: detectors are initially pre-trained on base classes with sufficient samples and then fine-tuned on novel ones with few samples. However, the scarcity of labeled samples of novel classes greatly interferes with precisely fitting their data distribution, thus hampering performance. To address this issue, we propose a new framework for FSOD, namely Prototype-based Soft-labels and Test-Time Learning (PS-TTL). Specifically, we design a Test-Time Learning (TTL) module that employs a mean-teacher network for self-training to discover novel instances from test data, allowing detectors to learn better representations and classifiers for novel classes. Furthermore, we notice that even though relatively low-confidence pseudo-labels exhibit classification confusion, they still tend to recall foreground. We thus develop a Prototype-based Soft-labels (PS) strategy that assesses similarities between low-confidence pseudo-labels and category prototypes and uses them as soft-labels to unleash their potential, which substantially mitigates the constraints posed by few-shot samples. Extensive experiments on both the VOC and COCO benchmarks show that PS-TTL achieves state-of-the-art performance, highlighting its effectiveness. The code and model are available at https://github.com/gaoyingjay/PS-TTL.

Abstract:
Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. We first construct diverse and comprehensive multimodal few-shot datasets fitted to the original data distribution. To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE, which guides a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity to guide ChatGPT in generating relevant knowledge and employs self-reflection to refine the knowledge; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to align with JMERE's required output format. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F1 scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model. Code and data are released at https://github.com/YuanLi95/KECPM.

Abstract:
Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between the movie shots and its trailer music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information called CMTD and, accordingly, train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements. The code is available at https://github.com/Dixin-Lab/Automatic-Movie-Trailer-Generator.
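
For readers unfamiliar with Sinkhorn matching, the sketch below runs the standard entropic Sinkhorn iterations on a small cost matrix. It is the plain balanced optimal transport version with fixed marginals, not the partial or inverse formulation described in the abstract, and the function name, `eps`, and `iters` are illustrative assumptions.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    # cost: n x m list of lists; a, b: marginals that each sum to 1.
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)]
              for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    # Transport plan: diag(u) * kernel * diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)]
            for i in range(n)]
```

In a shot-matching setting, rows could stand for movie shots and columns for music shots, with a low cost meaning the two latent representations are close; the resulting plan then places most mass on well-matched pairs.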

Abstract:
Cinemagraph creates captivating video experience by combining elements of still photography and subtle motion. However, most existing cinemagraph video generation lacks depth information, being restricted within 2-dimensional (2D) image space. We advance cinemagraph from 2D image space to 3-dimensional (3D) space with high quality by proposing LoopGaussian. It is based on 3D Gaussian modeling, taking advantage of the 3D Gaussian Splatting (3D-GS) technique that has significantly improved the field of novel view synthesis. Here is a brief overview of our new approach: It employs 3D-GS to reconstruct 3D Gaussian point clouds from multi-view images of static scenes, where shape regularization is used to prevent blurring or artifacts caused by object deformation. To maintain local continuity between scenes, it then clusters the 3D Gaussian points by the proposed SuperGaussian algorithm using features acquired by an autoencoder tailored for 3D Gaussian. Similarities between clusters are used to derive an Eulerian motion field for describing velocities across the entire scene. The estimated Eulerian motion field drives the movement of the 3D Gaussian points, based on which a 3D Cinemagraph is generated through bidirectional animation. The resulting 3D Cinemagraph exhibits natural and seamlessly loopable dynamics. Experiment results validate the effectiveness of the proposed approach, demonstrating high-quality and visually appealing video generation.

Abstract:
Recent AIGC systems possess the capability to generate digital multimedia content based on human language instructions, such as text, image and video. However, when it comes to speech, existing methods related to human instruction-to-speech generation exhibit two limitations. Firstly, they require the division of inputs into content prompt (transcript) and description prompt (style and speaker), instead of directly supporting human instruction. This division is less natural in form and does not align with other AIGC models. Secondly, the practice of utilizing an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthens the generated speech following human instructions. Furthermore, our model architecture and training strategies allow for the simultaneous support of combining speech prompt and descriptive human instruction for expressive speech synthesis, which is a first-of-its-kind attempt. Codes, models and demos are at: https://github.com/thuhcsi/VoxInstruct.
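
The abstract mentions multiple Classifier-Free Guidance strategies without detailing them. As a minimal sketch of the generic CFG rule only, the snippet below extrapolates from unconditional to conditional next-token logits; the function name and the use of plain lists of logits are assumptions for illustration, not VoxInstruct's implementation.

```python
def cfg_combine(cond_logits, uncond_logits, scale):
    # Generic classifier-free guidance on logits:
    #   guided = uncond + scale * (cond - uncond)
    # scale = 1 recovers the conditional logits; scale > 1 pushes
    # the prediction further toward the conditioning signal.
    return [u + scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

With `scale=0` the conditioning is ignored entirely, which is a convenient sanity check when wiring guidance into a sampler.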

Abstract:
Multi-object tracking in traffic videos is a crucial research area, offering immense potential for enhancing traffic monitoring accuracy and promoting road safety measures through the utilisation of advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes, which cannot well simulate the challenges encountered in complex traffic scenarios. To address this gap, we introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios. To validate the complexity and challenges presented by TrafficMOT, we conducted comprehensive empirical studies using three different settings: fully-supervised, semi-supervised, and a recent powerful zero-shot foundation model Tracking Anything Model (TAM). The experimental results highlight the inherent complexity of this dataset, emphasising its value to drive advancements in the field of traffic monitoring and multi-object tracking. Code and data are available at the project page: https://lihaoliu-cambridge.github.io/trafficmot/

Abstract:
Automatically narrating videos in natural language complying with user requests, i.e., the Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round, which cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet (operation, position, attribute) to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (i.e., OPA) and compare it with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics considering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the task challenges of fine-grained multi-modal semantics understanding and processing. Our datasets, codes, and evaluation tools are available at https://github.com/yaolinli/VCE.

Abstract:
Continual learning (CL) breaks off the one-way training manner and enables a model to adapt to new data, semantics and tasks continuously. However, current CL methods mainly focus on single tasks. Besides, CL models are plagued by catastrophic forgetting and semantic drift due to the lack of old data, which often occurs in remote-sensing interpretation because of the intricate fine-grained semantics. In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception for universal interpretation in remote sensing images. Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously. To inherit the knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting. Furthermore, we also propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception. Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model, and also prove that joint optimization can boost sub-task CL efficiency with over 13% relative improvement on panoptic quality. The project page is available at https://github.com/YBIO/CPP.

Abstract:
Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) presents a formidable challenge, which aims to match pedestrian images across visible and infrared modalities without any annotations. Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle. Most existing works primarily focus on shielding the model from the harmful effects of noise, neglecting to calibrate noisy pseudo-labels usually associated with hard samples, which will compromise the robustness of the model. To address this issue, we design a Robust Pseudo-label Learning with Neighbor Relation (RPNR) framework for USVI-ReID. To be specific, we first introduce a straightforward yet potent Noisy Pseudo-label Calibration module to correct noisy pseudo-labels. Due to the high intra-class variations, noisy pseudo-labels are difficult to calibrate completely. Therefore, we introduce a Neighbor Relation Learning module to reduce high intra-class variations by modeling potential interactions between all samples. Subsequently, we devise an Optimal Transport Prototype Matching module to establish reliable cross-modality correspondences. On that basis, we design a Memory Hybrid Learning module to jointly learn modality-specific and modality-invariant information. Comprehensive experiments conducted on two widely recognized benchmarks, SYSU-MM01 and RegDB, demonstrate that RPNR outperforms the current state-of-the-art GUR with an average Rank-1 improvement of 10.3%. The code is available at https://github.com/XiangboYin/RPNR.

Abstract:
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for DFER with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic level), we propose to extract the face segmentation masks and landmarks based on each frame (middle semantic level) and utilize the Multi-modal Large Language Model (MLLM) to further generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we also adopt Parameter-Efficient Fine-Tuning (PEFT) to enable efficient adaptation of large pre-trained models (i.e., CLIP) for this task. Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Project page: https://haroldchen19.github.io/FineCLIPER-Page/

Abstract:
Considering that image editing and manipulation technologies pose significant threats to the authenticity and security of image content, research on image regional manipulation detection has always been a critical issue. The accelerated advancement of generative AI significantly enhances the viability and effectiveness of generative regional editing methods and has led to their gradual replacement of traditional image editing tools or algorithms. However, current research primarily focuses on traditional image tampering, and there remains a lack of a comprehensive dataset containing images edited with abundant generative regional editing methods. We endeavor to fill this vacancy by constructing the GRE dataset, a large-scale generative regional editing detection dataset with the following advantages: 1) Integration of a logical and simulated editing pipeline, leveraging multiple large models in various modalities. 2) Inclusion of various editing approaches with distinct characteristics. 3) Provision of comprehensive benchmark and evaluation of SOTA methods across related domains. 4) Analysis of the GRE dataset from multiple dimensions including necessity, rationality, and diversity. Extensive experiments and in-depth analysis demonstrate that this larger and more comprehensive dataset will significantly enhance the development of detection methods for generative editing. The corresponding repository is at https://github.com/ICTMCG/GRE.

Abstract:
Recent advancements in diffusion models trained on large-scale data have enabled the generation of indistinguishable human-level images, yet they often produce harmful content misaligned with human values, e.g., social bias, and offensive content. Despite extensive research on Large Language Models (LLMs), the challenge of Text-to-Image (T2I) model alignment remains largely unexplored. Addressing this problem, we propose LiVO (Lightweight Value Optimization), a novel lightweight method for aligning T2I models with human values. LiVO only optimizes a plug-and-play value encoder to integrate a specified value principle with the input prompt, allowing the control of generated images over both semantics and values. Specifically, we design a diffusion model-tailored preference optimization loss, which theoretically approximates the Bradley-Terry model used in LLM alignment but provides a more flexible trade-off between image quality and value conformity. To optimize the value encoder, we also develop a framework to automatically construct a text-image preference dataset of 86k (prompt, aligned image, violating image, value principle) samples. Without updating most model parameters and through adaptive value selection from the input prompt, LiVO significantly reduces harmful outputs and achieves faster convergence, surpassing several strong baselines and taking an initial step towards ethically aligned T2I models. Warning: This paper involves descriptions and images depicting discriminatory, pornographic, bloody, and horrific scenes.
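
The Bradley-Terry model referenced above has a simple closed form. The sketch below shows only that generic pairwise preference likelihood and its negative log-likelihood, with scalar scores standing in for whatever quantity a model assigns to the aligned and violating images; it is not LiVO's diffusion-tailored loss, and the function names are assumptions.

```python
import math

def bt_preference_prob(score_preferred, score_rejected):
    # Bradley-Terry: P(preferred beats rejected) is the logistic
    # function of the score difference.
    diff = score_preferred - score_rejected
    return 1.0 / (1.0 + math.exp(-diff))

def bt_loss(score_preferred, score_rejected):
    # Negative log-likelihood of the observed preference.
    return -math.log(bt_preference_prob(score_preferred, score_rejected))
```

Minimizing this loss over (aligned image, violating image) pairs raises the score of the aligned sample relative to the violating one.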

Abstract:
Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLM) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in the gap between image and region representations. Directly using CLIP for OVD causes inaccurate region classification. We find the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations can align better with text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves the classification accuracy for regions by addressing the gap between images and regions caused by shape deformation. SIA-OVD achieves substantial improvements over representative methods on the COCO-OVD benchmark. The code is available at https://github.com/PKU-ICST-MIPL/SIA-OVD_ACMMM2024.

Abstract:
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments demonstrate our model's effectiveness and unique explainability across multiple datasets. Our code and data are available at: https://github.com/JethroJames/CREST

Abstract:
In recent years, large-scale pre-trained diffusion models have demonstrated their outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, which diverges from user input prompts. The underlying reason behind the inaccurate generated results lies in the model's difficulty in sampling from specific intervals of the initial noise distribution corresponding to the prompt. Moreover, it is challenging to directly optimize the initial distribution, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the powerful potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated contents with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm to utilize historical data for network training and to prevent the optimized distribution from deviating too far from the original policy, restraining excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method is 10 times faster than the SOTA approach.
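
The abstract does not spell out its ratio clipping algorithm. One common way to realize the idea of reusing historical samples while limiting drift from the old policy is the PPO-style clipped surrogate, sketched below under that assumption; the function name and `eps=0.2` are illustrative and not taken from FIND.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    # ratio: new_policy_prob / old_policy_prob for a stored sample.
    # The ratio is clipped to [1 - eps, 1 + eps], and the smaller of
    # the clipped and unclipped objectives is kept, so moving far
    # from the old policy earns no extra reward.
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, any ratio above `1 + eps` yields the same objective value, which removes the incentive to push the optimized distribution further in a single update.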

Abstract:
Outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degrade the performance of visual tasks and multimedia applications. Videos naturally provide redundant temporal cues that allow rain removal with higher stability. Traditional video deraining methods heavily rely on optical flow estimation and kernel-based approaches, which have a limited receptive field. Transformer architectures, while enabling long-term dependencies, bring about a significant increase in computational complexity. Recently, the linear-complexity operator of state space models (SSMs) has facilitated efficient long-term temporal modeling, which is crucial for removing rain streaks and raindrops in videos. However, its uni-dimensional sequential processing of videos destroys the local correlations across the spatio-temporal dimension by distancing adjacent pixels. To address this, we present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. We also introduce a difference-guided dynamic contrastive locality learning strategy to enhance the patch-level self-similarity learning ability of the proposed network. Extensive experiments on four synthesized video deraining datasets and real-world rainy videos demonstrate the superiority of our network in the removal of rain streaks and raindrops. Our code and results are available at https://github.com/TonyHongtaoWu/RainMamba.
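
To show why a Hilbert scan preserves locality better than a row-by-row scan, the sketch below generates a 2D Hilbert curve ordering with the classic index-to-coordinate conversion. It covers a single square frame whose side is a power of two; how RainMamba extends the scan across the temporal dimension is not described in the abstract, so this is only a 2D illustration with assumed function names.

```python
def hilbert_d2xy(side, d):
    # Map index d along the Hilbert curve to (x, y) on a
    # side x side grid, where side is a power of two.
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square.
                x = s - 1 - x
                y = s - 1 - y
            # Transpose the current sub-square.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_scan_order(side):
    # Sequence of pixel coordinates in Hilbert order.
    return [hilbert_d2xy(side, d) for d in range(side * side)]
```

Consecutive positions in this ordering are always spatial neighbors, whereas a raster scan jumps a full row width at every line break.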

Abstract:
Action detection and understanding provide the foundation for the generation and interaction of multimedia content. However, existing methods mainly focus on constructing complex relational inference networks, overlooking the judgment of detection effectiveness. Moreover, these methods frequently generate detection results with cognitive abnormalities. To solve the above problems, this study proposes a cognitive effectiveness network based on fuzzy inference (Cefdet), which introduces the concept of 'cognition-based detection' to simulate human cognition. First, a fuzzy-driven cognitive effectiveness evaluation module (FCM) is established to introduce fuzzy inference into action detection. FCM is combined with human action features to simulate the cognition-based detection process, which clearly locates the position of frames with cognitive abnormalities. Then, a fuzzy cognitive update strategy (FCS) is proposed based on the FCM, which utilizes fuzzy logic to re-detect the cognition-based detection results and effectively update the results with cognitive abnormalities. Experimental results demonstrate that Cefdet exhibits superior performance against several mainstream algorithms on the public datasets, validating its effectiveness and superiority.

Abstract:
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data fosters biased learning and hallucinations as models tend to make similar unwarranted assumptions. To address this issue, we collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. Strong improvements across multiple benchmarks demonstrate the effectiveness of our approach. Further, we develop a general-purpose Context-AwaRe Abstention (CARA) detector to identify samples lacking sufficient context and enhance model accuracy by abstaining from responding if the required context is absent. CARA exhibits generalization to new benchmarks it wasn't trained on, underscoring its utility for future VLU benchmarks in detecting or cleaning samples with inadequate context. Finally, we curate a Context Ambiguity and Sufficiency Evaluation (CASE) set to benchmark the performance of insufficient context detectors. Overall, our work represents a significant advancement in ensuring that vision-language models generate trustworthy and evidence-based outputs in complex real-world scenarios. GitHub link: https://github.com/JunzhangLiu/CARA

Abstract:
Handling varying computational resources is a critical issue in modern AI applications. Adaptive deep networks, featuring the dynamic employment of multiple classifier heads among different layers, have been proposed to address classification tasks under varying computing resources. Existing approaches typically utilize the last classifier supported by the available resources for inference, as they believe that the last classifier always performs better across all classes. However, our findings indicate that earlier classifier heads can outperform the last head for certain classes. Based on this observation, we introduce the Collaborative Decision Making (CDM) module, which fuses the multiple classifier heads to enhance the inference performance of adaptive deep networks. CDM incorporates an uncertainty-aware fusion method based on evidential deep learning (EDL), which utilizes the reliability (uncertainty values) from the first c-1 classifiers to improve the c-th classifier's accuracy. We also design a balance term that reduces fusion saturation and unfairness issues caused by EDL constraints to improve the fusion quality of CDM. Finally, a regularized training strategy that uses the last classifier to guide the learning process of early classifiers is proposed to further enhance the CDM module's effect, called the Guided Collaborative Decision Making (GCDM) framework. The experimental evaluation demonstrates the effectiveness of our approaches. Results on ImageNet datasets show CDM and GCDM obtain 0.4% to 2.8% accuracy improvement (under varying computing resources) on popular adaptive networks. The code is available at https://github.com/Meteor-Stars/GCDM_AdaptiveNet.
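
The uncertainty values mentioned above can be illustrated with the usual evidential deep learning construction, in which non-negative per-class evidence parameterizes a Dirichlet distribution. The sketch below computes belief masses and the overall uncertainty for one classifier head; the fusion rule and balance term of CDM are not described in the abstract and are not reproduced here, and the function name is an assumption.

```python
def edl_opinion(evidence):
    # evidence: non-negative evidence per class from one head.
    # Dirichlet parameters are alpha_k = evidence_k + 1.
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)  # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    # Belief masses and uncertainty always sum to 1.
    return belief, uncertainty
```

A head that has collected no evidence reports an uncertainty of 1, while a head with strong evidence for one class reports a low uncertainty, which is the kind of signal a fusion module could use to weigh earlier heads against the last one.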

Abstract:
Novel View Synthesis plays a crucial role by generating new 2D renderings from multi-view images of 3D scenes. However, capturing high-speed scenes with conventional cameras often leads to motion blur, hindering the effectiveness of 3D reconstruction. To address this challenge, high-frame-rate dense 3D reconstruction emerges as a vital technique, enabling detailed and accurate modeling of real-world objects or scenes in various fields, including Virtual Reality or embodied AI. Spike cameras, a novel type of neuromorphic sensor, continuously record scenes with an ultra-high temporal resolution, showing potential for accurate 3D reconstruction. Despite their promise, existing approaches, such as applying Neural Radiance Fields (NeRF) to spike cameras, encounter challenges due to the time-consuming rendering process. To address this issue, we make the first attempt to introduce 3D Gaussian Splatting (3DGS) into spike cameras in high-speed capture, providing 3DGS as dense and continuous clues of views, then constructing SpikeGS. Specifically, to train SpikeGS, we establish computational equations between the rendering process of 3DGS and the processes of instantaneous imaging and exposing-like imaging of the continuous spike stream. Besides, we build a very lightweight but effective mapping process from spikes to instant images to support training. Furthermore, we introduce a new spike-based 3D rendering dataset for validation. Extensive experiments demonstrate that our method achieves high-quality novel view rendering, proving the tremendous potential of spike cameras in modeling 3D scenes. Code and data are available at https://github.com/Leozhangjiyuan/SpikeGS.

Abstract:
Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects by off-the-shelf detectors, then perform emotion classification through the late fusion of subject and context features. However, the complicated paradigm suffers from disjoint training stages and limited fine-grained interaction between subject-context elements. To address the challenge, we present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT), for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a "decouple-then-fuse" manner. The decoupled query tokens (subject queries and context queries) gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement and a 6.46% average precision gain on CAER-S and EMOTIC datasets, respectively. Code and models are available at: https://github.com/Sampson-Lee/DSCT.

Abstract:
In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC2) framework to address this challenge. PC2 offers a threefold strategy: firstly, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC2's pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC2 showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at https://github.com/alipay/PC2-NoiseofWeb.

Abstract:
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly. To mitigate the effects from these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, an attention-based modulation module is leveraged to understand scene text images by neatly perceiving the local and global dependence of images, despite the shape of the text. Meanwhile, a diffusion-based module is developed to enhance the text prior, hence offering better guidance for the SR network to generate SR images with higher semantic accuracy. Additionally, a multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code is available at https://github.com/jdfxzzy/PEAN.

Abstract:
High dynamic range (HDR) video rendering from low dynamic range (LDR) videos where frames are of alternate exposures encounters significant challenges, due to the exposure change and absence at each time stamp. The exposure change and absence make existing methods generate flickering HDR results. In this paper, we propose a novel paradigm to render HDR frames via completing the absent exposure information, so that the exposure information is complete and consistent. Our approach involves interpolating neighbor LDR frames in the time dimension to reconstruct LDR frames for the absent exposures. Combining the interpolated and given LDR frames, the complete set of exposure information is available at each time stamp. This benefits the fusing process for HDR results, reducing noise and ghosting artifacts and therefore improving temporal consistency. Extensive experimental evaluations on standard benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting the importance of completing absent exposures in HDR video rendering. The code is available at https://github.com/cuijiahao666/NECHDR.

Abstract:
The hand-drawn 2D animation workflow is typically initiated with the creation of sketch keyframes. Subsequent inbetweens are crafted manually for smoothness, which is a labor-intensive process, so the prospect of automatic animation sketch interpolation has become highly appealing. Yet, common frame interpolation methods are generally hindered by two key issues: 1) limited texture and colour details in sketches, and 2) exaggerated alterations between two sketch keyframes. To overcome these issues, we propose a novel deep learning method - Sketch-Aware Interpolation Network (SAIN). This approach incorporates multi-level guidance that formulates region-level correspondence, stroke-level correspondence and pixel-level dynamics. A multi-stream U-Transformer is then devised to characterize sketch inbetweening patterns using these multi-level guides through the integration of self / cross-attention mechanisms. Additionally, to facilitate future research on animation sketch inbetweening, we constructed a large-scale dataset - STD-12K, comprising 30 sketch animation series in diverse artistic styles. Comprehensive experiments on this dataset convincingly show that our proposed SAIN surpasses the state-of-the-art interpolation methods. Our code and dataset are available at https://github.com/none-master/SAIN.

Abstract:
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable performance on various visual-language understanding and generation tasks. However, MLLMs occasionally generate content inconsistent with the given images, which is known as "hallucination". Prior works primarily center on evaluating hallucination using standard, unperturbed benchmarks, which overlook the prevalent occurrence of perturbed inputs in real-world scenarios, such as image cropping or blurring, that are critical for a comprehensive assessment of MLLMs' hallucination. In this paper, to bridge this gap, we propose Hallu-PI, the first benchmark designed to evaluate Hallucination in MLLMs within Perturbed Inputs. Specifically, Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types. Each image is accompanied by detailed annotations, which include fine-grained hallucination types, such as existence, attribute, and relation. We equip these annotations with a rich set of questions, making Hallu-PI suitable for both discriminative and generative tasks. Extensive experiments on 12 mainstream MLLMs, such as GPT-4V and Gemini-Pro Vision, demonstrate that these models exhibit significant hallucinations on Hallu-PI, which is not observed in unperturbed scenarios. Furthermore, our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations. We also design two baselines specifically for perturbed scenarios, namely Perturbed-Reminder and Perturbed-ICL. We hope that our study will bring researchers' attention to the limitations of MLLMs when dealing with perturbed inputs, and spur further investigations to address this issue. Our code and datasets are publicly available at https://github.com/NJUNLP/Hallu-PI.

Abstract:
Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which we design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code is available.

Abstract:
In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool and a Sparse Feature Fusion (SFF) module. The design of the Prompt Pool is aimed at integrating information from different modalities, while inherent prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the SFF module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, have proven that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of Modality Missingness and multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information. Code can be obtained at https://github.com/Xinji-Mai/UMBEnet.

Abstract:
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of target objects described in queries. Secondly, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category and pseudo queries based on their spatial relationships. The proposed ResVG model significantly improves the model's ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments to validate the effectiveness of our methods on five datasets. Code is available at https://github.com/minghangz/ResVG.

Abstract:
The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which limits the number of negatives available to the model. To address the lack of positives, we propose a data generation method that leverages a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and are designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for the low-resource scenario. Our code and data are released at https://github.com/BUAADreamer/SPN4CIR.
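
The role of extra negatives in a contrastive objective can be illustrated with a minimal sketch. It assumes a standard InfoNCE-style loss with dot-product similarity; the function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the SPN4CIR code.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: -log softmax score of the positive.

    `negatives` may mix in-batch negatives with additional precomputed
    ("static") negative embeddings; the loss treats them identically.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

# Toy 2-D embeddings (hypothetical values, for illustration only).
query = [1.0, 0.0]
positive = [0.9, 0.1]
in_batch_negatives = [[0.0, 1.0]]
static_negatives = [[-1.0, 0.0], [0.5, 0.5]]

loss_small = info_nce(query, positive, in_batch_negatives)
loss_large = info_nce(query, positive, in_batch_negatives + static_negatives)
```

Every added negative contributes one more term to the softmax denominator, so `loss_large` exceeds `loss_small` here; a larger pool of negatives gives the objective more candidates to push away from the query.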

Abstract:
Sparse and noisy images (SNIs), like those in spatial gene expression data, pose significant challenges for effective representation learning and clustering, which are essential for thorough data analysis and interpretation. In response to these challenges, we propose Dual Advancement of Representation Learning and Clustering (DARLC), an innovative framework that leverages contrastive learning to enhance the representations derived from masked image modeling. Simultaneously, DARLC integrates cluster assignments in a cohesive, end-to-end approach. This integrated clustering strategy addresses the "class collision problem" inherent in contrastive learning, thus improving the quality of the resulting representations. To generate more plausible positive views for contrastive learning, we employ a graph attention network-based technique that produces denoised images as augmented data. As such, our framework offers a comprehensive approach that improves the learning of representations by enhancing their local perceptibility, distinctiveness, and the understanding of relational semantics. Furthermore, we utilize a Student's t mixture model to achieve more robust and adaptable clustering of SNIs. Extensive experiments, conducted across 12 different types of datasets consisting of SNIs, demonstrate that DARLC surpasses the state-of-the-art methods in both image clustering and generating image representations that accurately capture gene interactions. Code is available at https://github.com/zipging/DARLC.

Abstract:
Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from the strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges to automatically assessing the free-form text output of LVLMs. To effectively leverage the annotations available and reduce the manual efforts required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Through extensive experiments and analysis in ReForm-Eval, we demonstrate the comprehensiveness and reliability of ReForm-Eval in assessing various LVLMs. Our benchmark and evaluation framework are now available at https://github.com/FudanDISC/ReForm-Eval

Abstract:
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to using text or graph modalities, the investigation of utilizing images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions, and furthermore, incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.

Abstract:
3D object detection plays a pivotal role in autonomous driving and robotics, demanding precise interpretation of Bird's Eye View (BEV) images. The dynamic nature of real-world environments necessitates the use of dynamic query mechanisms in 3D object detection to adaptively capture and process the complex spatio-temporal relationships present in these scenes. However, prior implementations of dynamic queries have often faced difficulties in effectively leveraging these relationships, particularly when it comes to integrating temporal information in a computationally efficient manner. Addressing this limitation, we introduce a framework built on a dynamic query evolution strategy, which harnesses K-means clustering and Top-K attention mechanisms for refined spatio-temporal data processing. By dynamically segmenting the BEV space and prioritizing key features through Top-K attention, our model achieves a real-time, focused analysis of pertinent scene elements. Our extensive evaluation on the nuScenes and Waymo datasets showcases a marked improvement in detection accuracy, setting a new benchmark in the domain of query-based BEV object detection. Our dynamic query evolution strategy has the potential to push the boundaries of current BEV methods with enhanced adaptability and computational efficiency. Project page: https://github.com/Jiawei-Yao0812/QE-BEV
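
As a rough illustration of the Top-K attention idea mentioned above, the sketch below keeps only the k highest-scoring keys for a single query and applies a softmax over that subset. It assumes scaled dot-product scores; the function name and toy inputs are hypothetical, and the K-means segmentation of the BEV space is not shown. It is not the QE-BEV implementation.

```python
import math

def top_k_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score.

    Returns the attended output vector and the sorted indices of kept keys.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Indices of the k best-matching keys; all other keys are ignored.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(scores[i] - m) for i in top}
    total = sum(weights.values())
    out = [0.0] * len(values[0])
    for i, w in weights.items():
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, sorted(top)

# Toy inputs (hypothetical values, for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
output, kept = top_k_attention(query, keys, values, k=2)
```

Only keys 0 and 3 survive the selection in this toy case, so the output is a weighted mix of their two value vectors and the poorly matching keys contribute nothing.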

Abstract:
Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, aiming to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring the data used for training has high informational value. Data-effective learning plays a profound role in accelerating foundation model training, reducing computational costs, and saving data storage, which is very important as the volume of medical data in recent years has grown beyond many people's expectations. However, due to the lack of standards and a comprehensive benchmark, medical data-effective learning is poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show the baseline MedDEL can achieve performance comparable to the original large dataset with only 5% of the data. Establishing such an open data-effective learning benchmark is crucial for the medical foundation model research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions. The benchmark can be accessed at https://github.com/shadow2469/Data-Effective-Learning-A-Comprehensive-Medical-Benchmark.git.

Abstract:
Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align the TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking heads. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency.

Abstract:
This paper introduces a new Segment Anything Model with Depth Perception (DSAM) for Camouflaged Object Detection (COD). DSAM exploits the zero-shot capability of SAM to realize precise segmentation in the RGB-D domain. It consists of the Prompt-Deeper Module and the Finer Module. The Prompt-Deeper Module utilizes knowledge distillation and the Bias Correction Module to achieve the interaction between RGB features and depth features, especially using depth features to correct erroneous parts in RGB features. Then, the interacted features are combined with the box prompt in SAM to create a prompt with depth perception. The Finer Module explores the possibility of accurately segmenting highly camouflaged targets from a depth perspective. It uncovers depth cues in areas missed by SAM through mask reversion, self-filtering, and self-attention operations, compensating for its defects in the COD domain. DSAM represents the first step towards the SAM-based RGB-D COD model. It maximizes the utilization of depth features while synergizing with RGB features to achieve multimodal complementarity, thereby overcoming the segmentation limitations of SAM and improving its accuracy in COD. Experimental results on COD benchmarks demonstrate that DSAM achieves excellent segmentation performance and reaches the state-of-the-art (SOTA) on COD benchmarks with less consumption of training resources. The code will be available at https://github.com/guobaoxiao/DSAM.

Abstract:
High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we design with this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

Abstract:
Traditional federated learning mainly focuses on parallel settings (PFL), which can suffer significant communication and computation costs. In contrast, one-shot and sequential federated learning (SFL) have emerged as innovative paradigms to alleviate these costs. However, the issue of non-IID (non-independent and identically distributed) data persists as a significant challenge in one-shot and SFL settings, exacerbated by the restricted communication between clients. In this paper, we improve the one-shot sequential federated learning for non-IID data by proposing a local model diversity-enhancing strategy. Specifically, to leverage the potential of local model diversity for improving model performance, we introduce a local model pool for each client that comprises diverse models generated during local training, and propose two distance measurements to further enhance the model diversity and mitigate the effect of non-IID data. Consequently, our proposed framework can improve the global model performance while maintaining low communication costs. Extensive experiments demonstrate that our method exhibits superior performance to existing one-shot PFL methods and achieves better accuracy compared with state-of-the-art one-shot SFL methods on both label-skew and domain-shift tasks (e.g., 6%+ accuracy improvement on the CIFAR-10 dataset). Our code and supplementary material are available online: https://github.com/NaiboWang/FedELMY.

Abstract:
In the realm of autonomous driving, achieving precise 3D reconstruction of the driving environment is critical for ensuring safety and effective navigation. Neural Radiance Fields (NeRF) have shown promise in creating highly detailed and accurate models of complex environments. However, the application of NeRF in autonomous driving scenarios encounters several challenges, primarily due to the sparsity of viewpoints inherent in camera trajectories and the constraints on data collection in unbounded outdoor scenes, which typically occur along predetermined paths. This limitation not only reduces the available scene information but also poses significant challenges for NeRF training, as the sparse and path-distributed observational data leads to under-representation of the scene's geometry. In this paper, we introduce HarmonicNeRF, a novel approach for outdoor self-supervised monocular scene reconstruction. HarmonicNeRF capitalizes on the strengths of NeRF and enhances surface reconstruction accuracy by augmenting the input space with geometry-informed synthetic views. This is achieved through the application of spherical harmonics to generate novel radiance values, taking into careful consideration the color observations from the limited available real-world views. Additionally, our method incorporates proxy geometry to effectively manage occlusion, generating radiance pseudo-labels that circumvent the limitations of traditional image-warping techniques, which often fail in sparse data conditions typical of autonomous driving environments. Extensive experiments conducted on the KITTI, Argoverse, and NuScenes datasets demonstrate our approach establishes new benchmarks in synthesizing novel depth views and reconstructing scenes, significantly outperforming existing methods. Project page: https://github.com/Jiawei-Yao0812/HarmonicNeRF

Abstract:
Few-shot open-set recognition (FSOR) is a challenging task that requires a model to recognize known classes and identify unknown classes with limited labeled data. Existing approaches, particularly Negative-Prototype-Based methods, generate negative prototypes based solely on known class data. However, as the unknown space is infinite while the known space is limited, these methods suffer from limited representation capability. To address this limitation, we propose a novel approach, termed Diversified Negative Prototypes Generator (DNPG), which adopts the principle of "learning unknowns from unknowns." Our method leverages the unknown space information learned from base classes to generate more representative negative prototypes for novel classes. During the pre-training phase, we learn the unknown space representation of the base classes. This representation, along with inter-class relationships, is then utilized in the meta-learning process to construct negative prototypes for novel classes. To prevent prototype collapse and ensure adaptability to varying data compositions, we introduce the Swap Alignment (SA) module. Our DNPG model, by learning from the unknown space, generates negative prototypes that cover a broader unknown space, thereby achieving state-of-the-art performance on three standard FSOR datasets. The repository of this project is available at https://github.com/iCGY96/DNPG.

Abstract:
The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first involves reconstructing target video frames from source frames within the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, the integration of a diffusion model with a variance adapter allows for the generation of diverse and controllable facial animations. This method not only demonstrates AniTalker's capability to create detailed and realistic facial movements but also underscores its potential in crafting dynamic avatars for real-world applications. Synthetic results can be viewed at https://github.com/X-LANCE/AniTalker.

Abstract:
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our code and data are open at https://PanoSent.github.io/.

Abstract:
This work aims to address multi-view perspective RGB generation from text prompts given Bird's-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen viewpoints, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test-time. Specifically, MVPbev firstly projects given BEV semantics to perspective view with camera parameters, empowering the model to generalize to unseen viewpoints. Then we introduce a multi-view attention module where special initialization and de-noising processes are introduced to explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Last but not least, MVPbev further allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Our extensive experiments on NuScenes demonstrate that our method is capable of generating high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing the state-of-the-art methods under various evaluation metrics. We further demonstrate the advances of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis.
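
The first step above, projecting BEV content into a perspective view with camera parameters, can be sketched with a plain pinhole model. The sketch assumes a flat ground plane, a level forward-facing camera, and camera coordinates with x to the right, y downward and z forward; the function name and all numeric values are hypothetical and do not come from the MVPbev code.

```python
def project_bev_point(x, z, cam_height, fx, fy, cx, cy):
    """Project a ground-plane BEV point into pixel coordinates.

    x: lateral offset from the camera (metres, positive to the right)
    z: forward distance from the camera (metres)
    cam_height: camera height above the ground plane (metres)
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point)
    Returns (u, v) in pixels, or None if the point is behind the camera.
    """
    if z <= 0:
        return None
    # With y pointing down, a ground point sits cam_height below the camera.
    u = fx * x / z + cx
    v = fy * cam_height / z + cy
    return (u, v)

# Hypothetical intrinsics and camera height, for illustration only.
fx = fy = 1000.0
cx, cy = 800.0, 450.0
near = project_bev_point(0.0, 10.0, 1.5, fx, fy, cx, cy)
far = project_bev_point(0.0, 50.0, 1.5, fx, fy, cx, cy)
behind = project_bev_point(0.0, -5.0, 1.5, fx, fy, cx, cy)
```

Points straight ahead land on the image column `cx`, and ground points farther from the camera move up toward the horizon row `cy`. Because the mapping depends only on the camera parameters, the same BEV layout can be re-projected for a camera pose not seen during training.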

Abstract:
Human-object interactions (HOI) detection aims at capturing human-object pairs in images and corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias from the real world, existing methods mostly struggle with rare human-object pairs and lead to sub-optimal results. Recently, with the development of the generative model, a straightforward approach is to construct a more balanced dataset based on a group of supplementary samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset cannot significantly boost the performance. To alleviate the above problem, we present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On one hand, considering the crucial role of human-object pairs information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the issue of losing important context information caused by the traditional discriminator-style alignment method, we employ a context-enhanced image reconstruction module to improve the model's learning ability of contextual cues. Extensive experiments have shown that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.

Abstract:
In challenging low-light and adverse weather conditions, thermal vision algorithms, especially object detection, have exhibited remarkable potential, contrasting with the frequent struggles encountered by visible vision algorithms. Nevertheless, the efficacy of thermal vision algorithms driven by deep learning models remains constrained by the paucity of available training data samples. To this end, this paper introduces a novel approach termed the edge-guided conditional diffusion model (ECDM). This framework aims to produce meticulously aligned pseudo thermal images at the pixel level, leveraging edge information extracted from visible images. By utilizing edges as contextual cues from the visible domain, the diffusion model achieves meticulous control over the delineation of objects within the generated images. To alleviate the impact of visible-specific edge information that should not appear in the thermal domain, a two-stage modality adversarial training (TMAT) strategy is proposed to filter it out from the generated images by differentiating the visible and thermal modalities. Extensive experiments on LLVIP demonstrate ECDM's superiority over existing state-of-the-art approaches in terms of image generation quality. The pseudo thermal images generated by ECDM also help to boost the performance of various thermal object detectors by up to 7.1 mAP. Code is available at https://github.com/lengmo1996/ECDM.

Abstract:
This paper presents the first study to explore the potential of parameter quantization for multimodal large language models to alleviate the significant resource constraint encountered during vision-language instruction tuning. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. This method is grounded in two key innovations: (1) The learning of group-wise scale factors for quantized LLM weights to mitigate the quantization error arising from activation outliers and achieve more effective vision-language instruction tuning; (2) The implementation of a multimodal warmup that progressively integrates linguistic and multimodal training samples, thereby preventing overfitting of the quantized model to multimodal data while ensuring stable adaptation of multimodal large language models to downstream vision-language tasks. Extensive experiments demonstrate that models quantized by QSLAW perform on par with, or even surpass, their full-precision counterparts, while facilitating up to 1.4 times reduction in VL tuning time and GPU consumption.
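
To make the notion of group-wise scale factors concrete, here is a minimal sketch of plain group-wise uniform quantization, in which each group of weights gets its own scale. It is not QSLAW itself: the abstract describes scales that are learned, whereas this sketch simply sets each scale from the group's largest absolute value, and the function names, group size, and bit width are all assumptions.

```python
from typing import List, Tuple


def quantize_groupwise(
    weights: List[float], group_size: int, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize weights to signed integers with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit codes
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        # Assumption: abs-max scale; a learned scale would replace this value.
        scale = absmax / qmax if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(
    codes: List[int], scales: List[float], group_size: int
) -> List[float]:
    """Recover approximate weights by multiplying each code by its group scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weights: the second group contains a large outlier (10.0).
weights = [0.1, -0.2, 0.05, 0.7, 10.0, -3.0, 2.5, 0.0]
codes, scales = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(codes, scales, group_size=4)
```

Because the outlier only inflates the scale of its own group, the small weights in the first group keep a fine quantization step, which is the usual motivation for per-group rather than per-tensor scales.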

Abstract:
In personalized federated learning (PFL), it is widely recognized that achieving both high model generalization and effective personalization poses a significant challenge due to their conflicting nature. As a result, existing PFL methods can only manage a trade-off between these two objectives. This raises an interesting question: Is it feasible to develop a model capable of achieving both objectives simultaneously? Our paper presents an affirmative answer, and the key lies in the observation that deep models inherently exhibit hierarchical architectures, which produce representations with various levels of generalization and personalization at different stages. A straightforward approach stemming from this observation is to select multiple representations from these layers and combine them to concurrently achieve generalization and personalization. However, the number of candidate representations is commonly huge, which makes this method infeasible due to high computational costs. To address this problem, we propose DualFed, a new method that directly yields dual representations corresponding to generalization and personalization respectively, thereby simplifying the optimization task. Specifically, DualFed inserts a personalized projection network between the encoder and classifier. The pre-projection representations capture generalized information shareable across clients, while the post-projection representations capture task-specific information on local clients. This design minimizes the mutual interference between generalization and personalization, thereby achieving a win-win situation. Extensive experiments show that DualFed can outperform other FL methods. Code is available at https://github.com/GuogangZhu/DualFed.

Abstract:
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate that BDoG is able to achieve state-of-the-art results in ScienceQA and MMBench with significant improvements over previous methods. The source code can be accessed at https://github.com/thecharm/BDoG.

Abstract:
Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the "affective gap", limits the applicability of pre-training knowledge for VER tasks. In contrast, the explicit emotional expression and high information density of the textual modality eliminate the "affective gap". Therefore, we propose borrowing the knowledge from a pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the "affective gap" significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released on https://github.com/wdqqdw/PACL.

Abstract:
Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing humans at a distance in a non-intrusive way, without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet)-based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If convolution blocks are directly replaced with visual transformer blocks, the model may not enhance the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with lower memory and computational complexity than multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. It can also aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field. Furthermore, we propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, i.e., Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait.

Abstract:
A large body of research centers on Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets for a given query. Among these efforts, the transfer of Foundation Models (FMs), such as CLIP, to the remote sensing domain shows promising results. However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. Specifically, we devise an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs to mitigate their deviations from the optimal embedding space during alignment. Moreover, we introduce a Keyword Explicit Reasoning (KER) module to facilitate the positive role of subtle key concept differences. Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the necessity for extra pretraining on remote sensing data. Extensive experiments on three popular benchmark datasets validate that our proposed EBAKER method outperforms state-of-the-art methods with less training data. Our source code will be released soon.

Abstract:
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function word for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% with a 4 times faster speed. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be available at https://github.com/Gyann-z/FDP.

Abstract:
In the realm of Medical Visual Language Models (Med-VLMs), the quest for universal efficient fine-tuning mechanisms remains paramount yet largely unexplored, especially given that researchers in interdisciplinary fields are often extremely short of training resources. Given the unique challenges in the medical domain, such as limited data scope and significant domain-specific requirements, evaluating and adapting Parameter-Efficient Fine-Tuning (PEFT) methods specifically for Med-VLMs is essential. Most current PEFT methods for Med-VLMs have yet to be comprehensively investigated and mainly focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked and remains understudied. In this paper, we explore an alternative to traditional PEFT methods, in particular the impact of fine-tuning Layer Normalization (LayerNorm) layers, Feedforward Networks and Attention layers on Med-VLMs. Our comprehensive studies span both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms across tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings reveal unique insights into the effects of intrinsic parameter fine-tuning methods when adapting Med-VLMs to downstream tasks, and show that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also retains the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments show LayerNorm fine-tuning's superior adaptability and scalability, particularly in the context of large-scale Med-VLMs. We hope this work will contribute to the ongoing discourse on optimizing efficient fine-tuning strategies for Med-VLMs.
The code has been released at https://github.com/TIMMY-CHAN/Intrinstic_tuning.

Abstract:
Humanoid Reaction Synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live, work, and communicate. However, it is difficult to learn the diverse interaction patterns of multiple humans and generate physically plausible reactions. In this work, we propose a Forward Dynamics Guided 4D Imitation method to generate physically plausible human-like reactions. The learned policy is capable of generating physically plausible and human-like reactions in real time, significantly improving inference speed (33x) and the quality of reactions compared with existing methods. Our experiments on the InterHuman and Chi3D datasets, along with ablation studies, demonstrate the effectiveness of our approach. More visualizations are available at https://yunzeliu.github.io/PhysReaction/.

Abstract:
Multimodal emotion recognition in conversation (MERC) seeks to identify the speakers' emotions expressed in each utterance, offering significant potential across diverse fields. The challenge of MERC lies in balancing speaker modeling and context modeling, encompassing both long-distance and short-distance contexts, as well as addressing the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships effectively. Nevertheless, the majority of these methods utilize a fixed fully connected structure to link all utterances, relying on convolution to interpret complex context. This approach can inherently heighten the redundancy in contextual messages and excessive graph network smoothing, particularly in the context of long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections by variational hypergraph autoencoder (VHGAE), and employs contrastive learning to mitigate uncertainty factors during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against the state-of-the-art methods on IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work at https://github.com/yzjred/-HAUCL.

Abstract:
Human behavior anomaly detection aims to identify unusual human actions, playing a crucial role in intelligent surveillance and other areas. The current mainstream methods still adopt reconstruction or future frame prediction techniques. However, reconstructing or predicting low-level pixel features easily enables the network to achieve overly strong generalization ability, allowing anomalies to be reconstructed or predicted as effectively as normal data. Different from their methods, inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network (MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network. Specifically, we first utilize the Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore latent motion features. Then, the RGB encoder guides the mask encoder, which takes masked RGB frames as input, to explore the latent appearance feature. Additionally, we design a Behavior-Scene Matching Module to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on ShanghaiTech and UBnormal datasets, with AUC of 86.9% and 74.3%, respectively. The code is available at https://github.com/molu-ggg/GENet.

Abstract:
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.
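
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update y = Wx + (alpha / r) * B(Ax) and why it needs far fewer trainable parameters than the full weight matrix. It is a generic illustration of LoRA, not MiniGPT-3D's code; the layer sizes, rank, and function names are assumptions, and the counts it produces are unrelated to the 47.8M figure quoted above.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, x: Vector) -> Vector:
    """Multiply matrix M (rows x cols) by vector x (cols)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen weight W plus a scaled low-rank update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    rank = len(A)
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A plus B."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the full dense weight matrix."""
    return d_in * d_out
```

For a hypothetical 4096 x 4096 layer with rank 8, the adapter has 65,536 trainable parameters against 16,777,216 in the dense matrix, a 256-fold reduction for that layer.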

Abstract:
Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets merely focus on simple events and are either limited to shorter videos or brief sentences, which hinders the model from evolving toward stronger multimodal understanding capabilities. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and are paired with accurately localized human-written synopses. Each paragraph in the synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated to each other and contain a wealth of abstract expressions summarizing video storylines and specific descriptions portraying event details, which enables the model to learn multimodal perception on more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority in long-term multi-paragraph video grounding over prior state-of-the-art methods. Dataset and code are publicly available. Project page: https://synopground.github.io/.

Abstract:
We present NNVISR - an open-source filter plugin for the VapourSynth video processing framework, which facilitates the application of neural networks for various kinds of video enhancing tasks, including denoising, super resolution, interpolation, and spatio-temporal super-resolution. NNVISR fills the gap between video enhancement neural networks and video processing pipelines, by accepting any network that enhances a group of frames, and handling all other network agnostic details during video processing. NNVISR is publicly released at https://github.com/tongyuantongyu/vs-NNVISR.

Abstract:
Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches for hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook the fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). Particularly when synthesizing interactive grasps, the method enables the precise control of grasp synthesis through either user-specified or algorithmically predicted Semantic Contact Map. Specifically, to optimally utilize contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module utilizes contact maps alongside object point clouds to generate realistic grasps. We evaluate the evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on GRAB and ARCTIC datasets verify the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects. Our code is available at https://github.com/adventurer-w/ClickDiff.

Abstract:
Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose TagOOD, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of in-distribution (IND) object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks. Code is available at: https://github.com/Jarvisgivemeasuit/tagood.
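
The final scoring step, distance between class centers and test samples, can be sketched as follows. This is a simplified stand-in rather than TagOOD's method: the centers here are plain per-class feature means instead of being learned by a lightweight network, Euclidean distance is one assumed choice of metric, and the feature vectors, labels, and threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def class_centers(features: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Compute one center per class as the mean of its feature vectors."""
    centers: Dict[str, List[float]] = {}
    for label, vecs in features.items():
        dim = len(vecs[0])
        centers[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return centers


def ood_score(x: Sequence[float], centers: Dict[str, List[float]]) -> float:
    """Distance to the nearest class center; larger means more likely OOD."""
    return min(math.dist(x, c) for c in centers.values())


def is_ood(x: Sequence[float], centers: Dict[str, List[float]], threshold: float) -> bool:
    """Flag a sample as OOD when it is far from every class center."""
    return ood_score(x, centers) > threshold


# Hypothetical 2-D object features for two in-distribution classes.
train_features = {
    "cat": [[0.0, 0.0], [2.0, 0.0]],
    "dog": [[10.0, 10.0], [10.0, 12.0]],
}
centers = class_centers(train_features)
```

In practice the threshold would be chosen on held-out in-distribution data, for example so that a fixed fraction of known samples falls below it.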

Abstract:
In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches face challenges in maintaining consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of samples for training, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that works seamlessly for both indoor and outdoor scenes, without needing extensive training data and GPU clusters. Firstly, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of the conventional metric bins and enables better adaptation to large depth gaps of scenes during training. Secondly, we propose a "divide and conquer" solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate the depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on most never-before-seen datasets, especially maintaining consistent accuracy across indoor and outdoor scenes. The code can be found here.

Abstract:
Graph neural networks (GNNs) have a wide range of applications in multimedia. Recent studies have shown that GNNs are vulnerable to link stealing attacks, which infer the existence of edges in the target GNN's training graph. Existing attacks are usually based on the assumption that links exist between two nodes that share similar posteriors; however, they fail to focus on links that do not hold under this assumption. To this end, we propose LinkThief, an improved link stealing attack that combines generalized structure knowledge with node similarity, in a scenario where the attackers' background knowledge contains a partially leaked target graph and a shadow graph. Specifically, to equip the attack model with insights into the link structure spanning both the shadow graph and the target graph, we introduce the idea of creating a Shadow-Target Bridge Graph and extracting edge subgraph structure features from it. Through theoretical analysis from the perspective of privacy theft, we first explore how to implement the aforementioned ideas. Building upon the findings, we design the Bridge Graph Generator to construct the Shadow-Target Bridge Graph. Then, the subgraph around the link is sampled by the Edge Subgraph Preparation Module. Finally, the Edge Structure Feature Extractor is designed to obtain generalized structure knowledge, which is combined with node similarity to form the features provided to the attack model. Extensive experiments validate the correctness of the theoretical analysis and demonstrate that LinkThief still effectively steals links without extra assumptions. Our code is available at https://github.com/octopusStar218/LinkThief-MM2024.

Abstract:
Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, simply concatenating SER and RE serially can lead to severe error propagation, and it fails to handle cases like multi-line entities in real scenarios. To address these issues, this paper introduces a novel framework, PEneo (Pair Extraction new decoder option), which performs document pair extraction in a unified pipeline, incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking. This approach alleviates the error accumulation problem and can handle the case of multi-line entities. Furthermore, to better evaluate the model's performance and to facilitate future research on pair extraction, we introduce RFUND, a re-annotated version of the commonly used FUNSD and XFUND datasets, to make them more accurate and cover realistic situations. Experiments on various benchmarks demonstrate PEneo's superiority over previous pipelines, boosting the performance by a large margin (e.g., 19.89%-22.91% F1 score on RFUND-EN) when combined with various backbones like LiLT and LayoutLMv3, showing its effectiveness and generality. Codes and the new annotations are available at https://github.com/ZeningLin/PEneo.

Abstract:
Multiple Instance Learning (MIL) represents the predominant framework in Whole Slide Image (WSI) classification, covering aspects such as sub-typing, diagnosis, and beyond. Current MIL models predominantly rely on instance-level features derived from pretrained models such as ResNet. These models segment each WSI into independent patches and extract features from these local patches, leading to a significant loss of global spatial context and restricting the model's focus to merely local features. To address this issue, we propose a novel MIL framework, named SAM-MIL, that emphasizes spatial contextual awareness and explicitly incorporates spatial context by extracting comprehensive, image-level information. The Segment Anything Model (SAM) represents a pioneering visual segmentation foundation model that can capture segmentation features without the need for additional fine-tuning, rendering it an outstanding tool for extracting spatial context directly from raw WSIs. Our approach includes the design of group feature extraction based on spatial context and a SAM-Guided Group Masking strategy to mitigate class imbalance issues. We implement a dynamic mask ratio for different segmentation categories and supplement these with representative group features of categories. Moreover, SAM-MIL divides instances to generate additional pseudo-bags, thereby augmenting the training set, and introduces consistency of spatial context across pseudo-bags to further enhance the model's performance. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that our proposed SAM-MIL model outperforms existing mainstream methods in WSI classification. Our open-source implementation code is available at https://github.com/FangHeng/SAM-MIL.

Abstract:
Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (3D GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. To achieve initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical yet challenging problems: 1) the final shapes are still similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., "a dog", not from lexically richer (or harder) texts, e.g., "a dog is sitting on the top of the airplane". To address these problems, this paper proposes a novel general framework to boost 3D GS initialization for text-to-3D generation based on lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling spatial interaction among the 3D Gaussians and semantic interaction between Gaussians and texts. Specifically, we first construct a voxelized representation, where each voxel holds a 3D Gaussian with its position, scale, and rotation fixed while setting opacity as the sole factor to determine a position's occupancy. We then design an initialization network mainly consisting of two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. Such a design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments on lexically simple, medium, and hard texts show the superiority of our framework in high-quality 3D GS initialization over existing methods, e.g., Shap-E. Also, our framework can be seamlessly plugged into state-of-the-art training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
The project code is available at https://vlislab22.github.io/DreamInit/.

Abstract:
In this paper, we explore the use of large language models (LLMs) to enhance video moment retrieval (VMR) by integrating general knowledge and pseudo-events as priors. We address the limitations of LLMs in generating continuous outputs, such as salience scores and inter-frame embeddings, which are critical for capturing inter-frame relations. To address these limitations, we propose using LLM encoders, which refine inter-concept relations in multimodal embeddings effectively, even without textual training. Our feasibility study shows that this capability extends to other embeddings like BLIP and T5 when they exhibit similar patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. The LLM encoder's ability to refine concept relations can help the model achieve a balanced understanding of the foreground concepts (e.g., persons, faces) and background concepts (e.g., street, mountains) rather than focusing only on the visually dominant foreground concepts. Additionally, we utilize pseudo-events, identified via event detection, to guide accurate moment prediction within event boundaries, reducing distractions from adjacent moments. Our plug-in approach for semantic refinement and pseudo-event regulation demonstrates state-of-the-art VMR performance through experimental validation. The source code can be accessed at https://github.com/fletcherjiang/LLMEPET.

Abstract:
In the realm of medical image analysis, self-supervised learning (SSL) techniques have emerged to alleviate labeling demands, while still facing the challenge of training data scarcity owing to escalating resource requirements and privacy constraints. Numerous efforts employ generative models to generate high-fidelity, unlabeled 3D volumes across diverse modalities and anatomical regions. However, the intricate and indistinguishable anatomical structures within the abdomen pose a unique challenge to abdominal CT volume generation compared to other anatomical regions. To address this overlooked challenge, we introduce the Locality-Aware Diffusion (Lad), a novel method tailored for exquisite 3D abdominal CT volume generation. We design a locality loss to refine crucial anatomical regions and devise a condition extractor to integrate abdominal priors into generation, thereby enabling the generation of large quantities of high-quality abdominal CT volumes essential for SSL tasks without the need for additional data such as labels or radiology reports. Volumes generated through our method demonstrate remarkable fidelity in reproducing abdominal structures, achieving a decrease in FID score from 0.0034 to 0.0002 on the AbdomenCT-1K dataset, closely mirroring authentic data and surpassing current methods. Extensive experiments demonstrate the effectiveness of our method in self-supervised organ segmentation tasks, yielding improved mean Dice scores on two abdominal datasets. These results underscore the potential of synthetic data to advance self-supervised learning in medical image analysis.

Abstract:
Cross-center data heterogeneity and annotation unreliability significantly challenge the intelligent diagnosis of diseases using brain signals. A notable example is the EEG-based diagnosis of neurodegenerative diseases, which features subtler abnormal neural dynamics typically observed in small-group settings. To advance this area, in this work, we introduce a transferable framework employing Manifold Attention and Confidence Stratification (MACS) to diagnose neurodegenerative disorders based on EEG signals sourced from four centers with unreliable annotations. The MACS framework's effectiveness stems from these features: 1) The Augmentor generates various EEG-represented brain variants to enrich the data space; 2) The Switcher enhances the feature space for trusted samples and reduces overfitting on incorrectly labeled samples; 3) The Encoder uses the Riemannian manifold and Euclidean metrics to capture spatiotemporal variations and dynamic synchronization in EEG; 4) The Projector, equipped with dual heads, monitors consistency across multiple brain variants and ensures diagnostic accuracy; 5) The Stratifier adaptively stratifies learned samples by confidence levels throughout the training process; 6) Forward and backpropagation in MACS are constrained by confidence stratification to stabilize the learning system amid unreliable annotations. Our subject-independent experiments, conducted on both neurocognitive and movement disorders using cross-center corpora, have demonstrated superior performance compared to existing related algorithms. This work not only improves EEG-based diagnostics for cross-center and small-setting brain diseases but also offers insights into extending MACS techniques to other data analyses, tackling data heterogeneity and annotation unreliability in multimedia and multimodal content understanding. We have released our code here: https://github.com/ICI-BCI/EEG-MACS.

Abstract:
Long-term time series forecasting is a long-standing challenge in various applications. A central issue in time series forecasting is that methods should expressively capture long-term dependency. Furthermore, time series forecasting methods should be flexible when applied to different scenarios. Although Fourier analysis offers an alternative to effectively capture reusable and periodic patterns to achieve long-term forecasting in different scenarios, existing methods often assume high-frequency components represent noise and should be discarded in time series forecasting. However, we conduct a series of motivation experiments and discover that the role of certain frequencies varies depending on the scenarios. In some scenarios, removing high-frequency components from the original time series can improve the forecasting performance, while in other scenarios, removing them is harmful to forecasting performance. Therefore, it is necessary to treat the frequencies differently according to specific scenarios. To achieve this, we first reformulate the time series forecasting problem as learning a transfer function of each frequency in the Fourier domain. Further, we design Frequency Dynamic Fusion (FreDF), which individually predicts each Fourier component, and dynamically fuses the output of different frequencies. Moreover, we provide a novel insight into the generalization ability of time series forecasting and propose the generalization bound of time series forecasting. Then we prove FreDF has a lower bound, indicating that FreDF has better generalization ability. Extensive experiments conducted on multiple benchmark datasets and ablation studies demonstrate the effectiveness of FreDF.
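
As an illustration of what "removing high-frequency components" means in the motivation experiments above, the following is a minimal sketch, not the FreDF model itself. It assumes a naive discrete Fourier transform and a hard cutoff; the helper names (`dft`, `idft`, `low_pass`) and the toy signal are hypothetical and chosen only for this example.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns a list of complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(x, cutoff):
    """Zero every bin whose folded frequency index min(k, n - k) exceeds cutoff."""
    n = len(x)
    spectrum = dft(x)
    kept = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spectrum)]
    return [v.real for v in idft(kept)]


# Toy series (assumption): a slow sinusoid plus a weaker fast sinusoid.
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
fast = [0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
series = [s + f for s, f in zip(slow, fast)]

filtered = low_pass(series, cutoff=5)
# Largest deviation between the filtered series and the slow component alone.
max_err = max(abs(a - b) for a, b in zip(filtered, slow))
```

In this toy case the fast component is pure nuisance, so discarding it recovers the slow signal. The abstract's point is that in other scenarios the discarded bins carry predictive information, which is why a fixed cutoff like this one is not always appropriate.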

Abstract:
Recent advancements in deep learning have greatly advanced the field of infrared small object detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve comparable performance to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model's performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released at https://github.com/O937-blip/SimIR.

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.

Abstract:
Spiking Neural Networks (SNNs) have received widespread attention due to their unique neuronal dynamics and low-power nature. Previous research empirically shows that SNNs with Poisson coding are more robust than Artificial Neural Networks (ANNs) on small-scale datasets. However, it is still unclear in theory how the adversarial robustness of SNNs is derived, and whether SNNs can still maintain their adversarial robustness advantage on large-scale dataset tasks. This work theoretically demonstrates that the inherent adversarial robustness of SNNs stems from their Poisson coding. We reveal the conceptual equivalence of Poisson coding and randomized smoothing in defense strategies, and analyze in depth the trade-off between accuracy and adversarial robustness in SNNs via the proposed Randomized Smoothing Coding (RSC) method. Experiments demonstrate that the proposed RSC-SNNs show remarkable adversarial robustness, surpassing ANNs and achieving state-of-the-art robustness results on the large-scale ImageNet dataset.
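
For readers unfamiliar with Poisson coding, the sketch below shows the common rate-coding approximation in which a normalized input intensity is used as the per-step firing probability. It is an illustrative assumption about the input encoding only, not the proposed RSC method; the function names and the sample intensity are hypothetical.

```python
import random


def poisson_encode(intensity, num_steps, rng):
    """Return a binary spike train; each step fires with probability `intensity`.

    This is the usual per-time-step Bernoulli approximation of a Poisson
    spike process, with `intensity` assumed to be normalized to [0, 1].
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [1 if rng.random() < intensity else 0 for _ in range(num_steps)]


def firing_rate(spikes):
    """Fraction of time steps that contain a spike."""
    return sum(spikes) / len(spikes)


rng = random.Random(0)  # fixed seed so the example is reproducible
pixel = 0.3             # hypothetical normalized pixel intensity
spikes = poisson_encode(pixel, 10_000, rng)
rate = firing_rate(spikes)  # close to 0.3 for a long enough spike train
```

Because every presentation of the same input yields a different random spike train, the network effectively sees a randomly perturbed version of the input each time, which is the intuition behind the link to randomized smoothing drawn in the abstract.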

Abstract:
Reconstructing dynamic scenes from video sequences is a highly promising task in the multimedia domain. While previous methods have made progress, they often struggle with slow rendering and managing temporal complexities such as significant motion and object appearance/disappearance. In this paper, we propose SaRO-GS as a novel dynamic scene representation capable of achieving real-time rendering while effectively handling temporal complexities in dynamic scenes. To address the issue of slow rendering speed, we adopt a Gaussian primitive-based representation and optimize the Gaussians in 4D space, which facilitates real-time rendering with the assistance of 3D Gaussian Splatting. Additionally, to handle temporally complex dynamic scenes, we introduce a Scale-aware Residual Field. This field considers the size information of each Gaussian primitive while encoding its residual feature and aligns with the self-splitting behavior of Gaussian primitives. Furthermore, we propose an Adaptive Optimization Schedule, which assigns different optimization strategies to Gaussian primitives based on their distinct temporal properties, thereby expediting the reconstruction of dynamic regions. Through evaluations on monocular and multi-view datasets, our method has demonstrated state-of-the-art performance.

Abstract:
Burst super-resolution (BurstSR) utilizes signal information from multiple adjacent frames taken in succession to restore rich textures. However, due to hand tremors and other image degradation factors, even recent BurstSR methods struggle to reconstruct finely textured images. On the other hand, reference-based super-resolution (RefSR) leverages a high-fidelity reference (Ref) image to recover detailed contents. Nevertheless, if there is no correspondence between the Ref and the low-resolution (LR) images, the output is degraded. To overcome the limitations of existing BurstSR and RefSR methods, we introduce reference-based burst super-resolution (RefBSR), which utilizes burst frames and a high-resolution (HR) external Ref image. RefBSR can restore the HR image by properly fusing the benefits of burst frames and a Ref image. To this end, we propose the first RefBSR framework that consists of Ref-burst feature matching and burst feature-aware Ref texture transfer (BRTT) modules. In addition, our method adaptively integrates features with better quality between Ref and burst features using Ref-burst adaptive feature fusion (RBAF). To train and evaluate our method, we provide a new dataset of Ref-burst pairs collected by commercial smartphones. The proposed method achieves state-of-the-art performance compared to both existing RefSR and BurstSR methods, and we demonstrate its effectiveness through comprehensive experiments. The source codes and the dataset are available at https://github.com/SeonggwanKo/RefBSR.

Abstract:
The Vision Transformer has attained remarkable success in various computer vision applications. However, the large computational costs and complex design limit its ability to handle large feature maps. Existing research predominantly focuses on constraining attention to small local regions, which reduces the number of tokens participating in the attention computation while overlooking computational demands caused by the feed-forward layer in the Vision Transformer block. In this paper, we introduce Group Vision Transformer (GVT), a relatively simple and efficient variant of Vision Transformer, aiming to improve attention computation. The core idea of our model is to divide and group the entire Transformer layer, instead of only the attention part, into multiple independent branches. This approach offers two advantages: (1) it helps reduce parameters and computational complexity; (2) it enhances the diversity of the learned features. We conduct a comprehensive analysis of the impact of different numbers of groups on model performance, as well as their influence on parameters and computational complexity. Our proposed GVT demonstrates competitive performance in several common vision tasks. For example, our GVT-Tiny model achieves 84.8% top-1 accuracy on ImageNet-1K, 51.4% box mAP and 45.2% mask mAP on MS COCO object detection and instance segmentation, and 50.1% mIoU on ADE20K semantic segmentation, outperforming the CAFormer-S36 model by 0.3% in ImageNet-1K top-1 accuracy, 1.2% in box mAP, 1.0% in mask mAP on MS COCO object detection and instance segmentation, and 1.2% in mIoU on ADE20K semantic segmentation, with similar model parameters and computational complexity. Code is accessible at https://github.com/yaoppeng/GVT.

Abstract:
Transforming the multi-round vanilla Federated Learning (FL) into one-shot FL (OFL) significantly reduces the communication burden and makes a big leap toward practical deployment. However, we note that existing OFL methods all build on model lossy reconstruction (i.e., aggregating while partially discarding local knowledge in clients' models), which attains one-shot at the cost of degraded inference performance. By identifying the root cause of stressing too much on finding a one-fit-all model, this work proposes a novel one-shot FL framework by embodying each local model as an independent expert and leveraging a Mixture-of-Experts network to maintain all local knowledge intact. A dedicated self-supervised training process is designed to tune the network, where the sample generation is guided by approximating underlying distributions of local data and making distinct predictions among experts. Notably, the framework also fuels FL with flexible, data-free aggregation and heterogeneity tolerance. Experiments on 4 datasets show that the proposed framework maintains the one-shot efficiency, facilitates superior performance compared with 8 OFL baselines (+5.54% on CIFAR-10), and even attains over ×4 performance gain compared with 3 multi-round FL methods, while only requiring less than 85% trainable parameters. Our code will be available at https://github.com/zenghui9977/IntactOFL.

Abstract:
With the rapid development of high-bit-depth display devices, bit-depth expansion (BDE) algorithms that extend low-bit-depth images to high-bit-depth images have received increasing attention. Due to the sensitivity of bit-depth distortions to tiny numerical changes in the least significant bits, the nuanced degradation differences in the training process may lead to varying degradation data distributions, causing the trained models to overfit specific types of degradations. This paper focuses on the problem of blind video BDE, proposing a degradation prediction and embedding framework, and designing a video BDE network based on a recurrent structure and dual-frame alignment fusion. Experimental results demonstrate that the proposed model can outperform some state-of-the-art (SOTA) models in terms of banding artifact removal and color correction, avoiding overfitting to specific degradations and obtaining better generalization ability across multiple datasets. The code is available at https://github.com/duanpanjun/BVBDE.

Abstract:
Recent advances in continuous super-resolution (SR) have made substantial progress towards universal SR models, which are characterized by using a single deep neural network (DNN) to fulfill arbitrary-scale SR tasks. When deployed on resource-constrained platforms, however, a trained DNN model usually requires experience-demanding and laborious manual effort to compress the model to fit a predetermined compute budget. This paper proposes an inference-time adaptive network width optimization method for arbitrary-scale SR modules, dubbed Scalable Super-Resolution Neural Operator (SSRNO), which is capable of efficient performance-preserving deployment on various mobile or edge devices with only a user input parameter indicating the desired compression rate. SSRNO realizes the continuous parameterization of SRNO (CVPR2023) by virtue of two novel contributions. First, we propose the Integral Neural Network (INN) formulation for the Galerkin-type attention, which is an indispensable component for spatial discretization invariant SR neural networks. Second, we further propose an adaptive layer-wise compression rate estimation mechanism, which allows for flexible adaptation to varying capacity across the neural network layers. Extensive experiments validate its superior overall performance over existing continuous SR models in terms of reconstruction accuracy, model scalability, and throughput. For instance, compared with the baseline SRNO, a typical configuration of SSRNO can achieve a model size compression of up to 62% and an over 2x speedup in situations where resources are limited, while it can also expand itself to keep the PSNR degradation within 0.1 dB when the limitations are alleviated. The code is available at https://github.com/ZXS-Labs/SSRNO.

Abstract:
Micro-actions involve low-amplitude movements of the human body, which poses challenges for common action recognition methods. This paper focuses on the extremely small regions of the human body involved, as well as the severe long-tail distribution in micro-action recognition. An intuitive yet effective instance-aware data preprocessing step is designed to enlarge the movement of the human body and alleviate multi-scale variance by using a pretrained human detector. The long-tail distribution brings severe data imbalance that is difficult to solve directly. To simplify this problem, we propose a novel coarse-grained focal loss that focuses on misclassification at the coarse-grained level by introducing adaptive weights. Two-level supervision, using both the fine-grained and coarse-grained annotations, further improves model performance. Finally, our method achieved 3rd place and 2nd place in MAC 2024 Track 1 and Track 2 respectively, which demonstrates the effectiveness and generalization of our proposed method. Our code is available at https://github.com/ilovepose/instance-aware-fine-grained-micro-action-recognition.
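
As background for the coarse-grained focal loss mentioned above, the sketch below implements only the standard single-sample focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t), which down-weights well-classified examples. The coarse-grained grouping and adaptive weights of the proposed loss are not described in the abstract and are deliberately not reproduced here; the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical logits: an easy (confident, correct) and a hard (wrong) sample.
easy = [4.0, 0.0, 0.0]
hard = [0.0, 2.0, 0.0]

ce_easy = focal_loss(easy, target=0, gamma=0.0)
fl_easy = focal_loss(easy, target=0, gamma=2.0)
ce_hard = focal_loss(hard, target=0, gamma=0.0)
fl_hard = focal_loss(hard, target=0, gamma=2.0)
```

Comparing `fl_easy / ce_easy` with `fl_hard / ce_hard` shows the intended effect: the easy sample's loss shrinks far more than the hard sample's, so training gradients concentrate on misclassified examples, which is what makes focal-style losses attractive under long-tail imbalance.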

Abstract:
The advent of High Dynamic Range/Wide Color Gamut (HDR/WCG) display technology has made significant progress in providing exceptional richness and vibrancy for the human visual experience. However, the widespread adoption of HDR/WCG images is hindered by their substantial storage requirements, imposing significant bandwidth challenges during distribution. Besides, HDR/WCG images are often tone-mapped into Standard Dynamic Range (SDR) versions for compatibility, necessitating the usage of inverse Tone Mapping (iTM) techniques to reconstruct their original representation. In this work, we propose a meta-transfer learning framework for practical HDR/WCG media transmission by embedding image-wise metadata into their SDR counterparts for later iTM reconstruction. Specifically, we devise a meta-learning strategy to pre-train a lightweight multilayer perceptron (MLP) model that maps SDR pixels to HDR/WCG ones on an external dataset, resulting in a domain-wise iTM model. Subsequently, for the transfer learning process of each HDR/WCG image, we present a spatial-aware online mining mechanism to select challenging training pairs to adapt the meta-trained model to an image-wise iTM model. Finally, the adapted MLP, embedded as metadata, is transmitted alongside the SDR image, facilitating the reconstruction of the original image on HDR/WCG displays. We conduct extensive experiments and evaluate the proposed framework with diverse metrics. Compared with existing solutions, our framework shows superior performance in fidelity, minimal latency, and negligible overhead. The codes are available at https://github.com/pjliu3/MLP_iTM.

Abstract:
Whether it is an e-commerce platform or a short video platform, the effective use of multi-modal data plays an important role in the recommendation system. More and more researchers are exploring how to effectively use multimodal signals to entice more users to buy goods or watch short videos. Some studies have added multimodal features as side information to the model and achieved certain results. In practice, the purchase behavior of users mainly depends on some personalized intentions. However, it is difficult for neural networks to process noise information and extract high-level intention information effectively. To investigate the benefits of latent intentions and leverage them effectively for recommendation, we propose a Multimodal-aware Multi-intention Learning method for recommendation (MMIL). Specifically, we first construct a multi-intention recommendation framework based on the probability distribution relationship among predicted objectives and intentions, while avoiding intention overfitting. We then design an intention representation learning module to learn accurate multiple intention representations based on intention prototypes. Further, we propose a multi-modal intention perceptron module to learn multi-modal intention representations. In addition, we design self-supervised intention-based and representation-based contrastive objectives to achieve cross-modal representation alignment. On three real-world data sets, the proposed MMIL method outperforms other advanced techniques. The effectiveness of intention modeling and intention alignment is verified by comprehensive experiments. The source code is available at: https://github.com/ml-mindset/MMIL.

Abstract:
As one of the most fundamental computer vision problems, image feature matching aims to establish correct correspondences between two-view images. Existing studies enhance the descriptions of feature points with a graph neural network (GNN), identifying correspondences with the predicted assignment matrix. However, this pipeline easily falls into a suboptimal result during training because the solution space is extremely complex, and it has no access to a prior that can guide information propagation and network convergence. In this paper, we propose a novel method called DiffGlue that introduces the Diffusion Model into the sparse image feature matching framework. Concretely, based on the incrementally iterative diffusion and denoising processes, DiffGlue can be guided by the prior from the Diffusion Model and trained step by step on the optimization path, approaching the optimal solution progressively. Besides, it contains a special Assignment-Guided Attention as a bridge to merge the Diffusion Model and sparse image feature matching, which injects the inherent prior into the GNN, thereby improving message delivery. Extensive experiments reveal that DiffGlue converges faster and better, outperforming state-of-the-art methods on several applications such as homography estimation, relative pose estimation, and visual localization. The code is available at https://github.com/SuhZhang/DiffGlue.

Abstract:
The constrained data scale in low-level vision often induces a severe overfitting hazard for restoration networks, necessitating the adoption of the pre-training paradigm. Mirroring the success of the high-level pre-training approaches, recent methods in the low-level community aim to derive general visual representation from extensive data with synthesized degradation. In this paper, we propose a new perspective beyond the data-driven image pre-training paradigm for low-level vision, building upon the following examination. First, unlike the semantic extraction prevalent in high-level vision, low-level vision primarily focuses on the continuous and content-agnostic pixel-level regression, indicating that the diversified contents inherent in large-scale data are potentially unnecessary for low-level vision pre-training. Second, considering that low-level degradations are highly relevant to the frequency spectrum, we discern that the low-level pre-training paradigm can be implemented in the Fourier space with fostered degradation sensibility. Therefore, we develop an Image-free Pre-training (IFP) paradigm, a novel low-level pre-training approach that requires only a single randomly sampled Gaussian noise image, streamlining complicated data collection and synthesis procedures. The principle of the IFP involves reconstructing the original Gaussian noise from the randomly perturbed counterpart with partially masked spectrum bands, facilitating the capability for robust spectrum representation extraction in response to the capricious downstream degradations. Extensive experiments demonstrate the significant improvements brought by IFP to various downstream tasks, such as a 1.31 dB boost in low-light enhancement for Restormer, and improvements of 1.2 dB in deblurring and 2.42 dB in deraining for Uformer. Code is publicly available at https://github.com/siywang541/IFP.

Abstract:
Existing open-vocabulary object detectors require an accurate and compact vocabulary pre-defined during inference. Their performance is largely degraded in real scenarios where the underlying vocabulary may be indeterminate and often exponentially large. To have a more comprehensive understanding of this phenomenon, we propose a new setting called Large-and-Open Vocabulary object Detection, which simulates real scenarios by testing detectors with large vocabularies containing thousands of unseen categories. The vast unseen categories inevitably lead to an increase in category distractors, severely impeding the recognition process and leading to unsatisfactory detection results. To address this challenge, we propose a Large and Open Vocabulary Detector (LOVD) with two core components, termed the Image-to-Region Filtering (IRF) module and Cross-View Verification (CV2) scheme. To relieve the category distractors of the given large vocabularies, IRF performs image-level recognition to build a compact vocabulary relevant to the image scene out of the large input vocabulary, followed by region-level classification upon the compact vocabulary. CV2 further enhances the IRF by conducting image-to-region filtering in both global and local views and produces the final detection categories through a two-branch voting mechanism. Compared to the prior works, our LOVD is more scalable and robust to large input vocabularies, and can be seamlessly integrated with predominant detection methods to improve their open-vocabulary performance. The code can be found at https://github.com/Altria-luo/LOVD.
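
The two-step idea behind image-to-region filtering (first shrink the vocabulary with image-level scores, then classify regions only within the shrunken vocabulary) can be sketched as follows. This is a toy illustration under stated assumptions, not the LOVD implementation: top-k selection, the function names, and all scores are hypothetical.

```python
def compact_vocabulary(image_scores, k):
    """Keep the k categories with the highest image-level scores (assumed rule)."""
    ranked = sorted(image_scores, key=image_scores.get, reverse=True)
    return ranked[:k]


def classify_region(region_scores, vocabulary):
    """Pick the best-scoring category for one region, restricted to `vocabulary`."""
    return max(vocabulary, key=lambda name: region_scores.get(name, float("-inf")))


# Hypothetical image-level scores over a (tiny) large vocabulary.
image_scores = {"dog": 0.9, "leash": 0.7, "wolf": 0.2, "fox": 0.1}
# Hypothetical region-level scores where a distractor narrowly wins.
region_scores = {"wolf": 0.55, "dog": 0.50, "leash": 0.10, "fox": 0.05}

unfiltered = classify_region(region_scores, list(image_scores))
vocab = compact_vocabulary(image_scores, k=2)
filtered = classify_region(region_scores, vocab)
```

Here the region classifier alone prefers the distractor `wolf`, while restricting it to the image-level compact vocabulary yields `dog`, mirroring how filtering is meant to suppress category distractors.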

Abstract:
Language-based image colorization aims to convert grayscale images to plausible and visually pleasing color images with language guidance, enjoying wide applications in historical photo restoration and the film industry. Existing methods mainly leverage large language models and diffusion models to incorporate language guidance into the colorization process. However, it is still a great challenge to build accurate correspondence between the gray image and the semantic instructions, leading to mismatched, overflowing and under-saturated colors. In this paper, we introduce a novel coarse-to-fine framework, COlorfulness COntrollable Language-based Colorization (COCO-LC), that effectively reinforces the image-text correspondence with coarsely colorized results. In addition, a multi-level condition that leverages both low-level and high-level cues of the gray image is introduced to realize accurate semantic-aware colorization without color overflows. Furthermore, we condition COCO-LC with a scale factor to determine the colorfulness of the output, flexibly meeting the different needs of users. We validate the superiority of COCO-LC over state-of-the-art image colorization methods in accurate, realistic and controllable colorization through extensive experiments. The code and demo will be released at https://lyf1212.github.io/COCO-LC.

Abstract:
Recently, automatic multi-domain fake news detection has attracted widespread attention. Many methods achieve domain adaptation by modeling domain category gate networks and domain-invariant features. However, existing multi-domain fake news detection faces three main challenges: (1) Inter-domain modal semantic deviation, where similar texts and images carry different meanings across various domains. (2) Inter-domain modal dependency deviation, where the dependence on different modalities varies across domains. (3) Inter-domain knowledge dependency deviation, where the reliance on cross-domain knowledge and domain-specific knowledge differs across domains. To address these issues, we propose a Multi-modal Multi-Domain Fake News Detection Model (MMDFND). MMDFND incorporates domain embeddings and attention mechanisms into a progressive hierarchical extraction network to achieve domain-adaptive domain-related knowledge extraction. Furthermore, MMDFND utilizes Stepwise Pivot Transformer networks and adaptive instance normalization to effectively utilize information from different modalities and domains. We validate the effectiveness of MMDFND through comprehensive comparative experiments on two real-world datasets and conduct ablation experiments to verify the effectiveness of each module, achieving state-of-the-art results on both datasets. The source code is available at https://github.com/yutchina/MMDFND.

Abstract:
Multi-task dense prediction plays an important role in the field of computer vision and has an abundant array of applications. Its main purpose is to reduce the amount of network training parameters by sharing network parameters while using the correlation between tasks to improve overall performance. We propose a task-conditional network that handles one task at a time and shares most network parameters to achieve these goals. Inspired by adapter tuning, we propose an adapter module that focuses on both spatial- and channel-wise information to extract features from the frozen encoder backbone. This approach not only reduces the number of training parameters, but also saves training time and memory resources by attaching a parallel adapter pathway to the encoder. We additionally use learnable task prompts to model different tasks and use these prompts to adjust some parameters of adapters to fit the network to diverse tasks. These task-conditional adapters are also applied to the decoder, which enables the entire network to switch between various tasks, producing better task-specific features and achieving excellent performance. Extensive experiments on two challenging multi-task benchmarks, NYUD-v2 and PASCAL-Context, show that our approach achieves state-of-the-art performance with excellent parameter, time, and memory efficiency. The code is available at https://github.com/jfzleo/Task-Conditional-Adapter.

Abstract:
Missed polyps are the major risk factor for colorectal cancer. To minimize misdiagnosis, many methods have been developed. However, they either rely on laborious instance-level annotations, require labeling of prompt points, or lack the ability to filter noisy proposals and detect polyps integrally, resulting in severe challenges in this area. In this paper, we propose a novel Cooperation-Based network (CBNet), a two-stage polyp detection framework supervised by image labels that removes wrong proposals through classification in collaboration with segmentation and obtains a more accurate detector by aggregating adaptive multi-level regional features. Specifically, we design a Cooperation-Based Region Proposal Network (CBRPN) to reduce the negative impact of noise by deleting proposals without polyps, enabling our network to capture polyp features. Moreover, to enhance the location integrity and classification precision of polyps, we aggregate multi-level region of interest (ROI) features under the guidance of the backbone classification layer in a module named the Adaptive ROI Fusion Module (ARFM). Extensive experiments on public and private datasets show that our method achieves state-of-the-art performance among weakly supervised methods and even outperforms full supervision on some metrics. All code is available at https://github.com/dxqllp/CBNet.

Abstract:
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in videos according to associated audio cues, where both modalities are affected by noise to different extents, such as the blending of background noises in audio or the presence of distracting objects in video. Most existing methods focus on learning interactions between modalities at high semantic levels but are incapable of filtering low-level noise or achieving fine-grained representational interactions during the early feature extraction phase. Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. SelM employs a State Space Model for noise reduction and robust feature selection. By imposing additional bidirectional constraints on audio and visual embeddings, it is able to precisely identify crucial features corresponding to sound-emitting targets. To fill the existing gap in early fusion within AVS, SelM introduces a dual alignment mechanism specifically engineered to facilitate intricate spatio-temporal interactions between audio and visual streams, achieving more fine-grained representations. Moreover, we develop a cross-level decoder for layered reasoning, significantly enhancing segmentation precision by exploring the complex relationships between audio and visual information. SelM achieves state-of-the-art performance in AVS tasks, especially in the challenging Audio-Visual Semantic Segmentation subset. The code can be found at https://github.com/Cyyzpoi/SelM.

Abstract:
The 3D Gaussian Splatting (3D-GS) method has recently sparked a revolution in novel view synthesis with its remarkable visual effects and fast rendering speed. However, its reliance on simple spherical harmonics for color representation leads to subpar performance in complex scenes, particularly with effects like specular highlights and light refraction. Also, 3D-GS adopts a periodic split strategy, which significantly increases the model's disk space and hinders rendering efficiency. To tackle these challenges, we propose Gaussian Splatting with Neural Basis Extension (GSNB), a novel approach that substantially enhances the performance of 3D-GS in demanding scenes while reducing storage consumption. Drawing inspiration from basis functions, GSNB utilizes a lightweight MLP to share feature coefficients with Spherical Harmonics (SH). This extends the color calculation of 3D Gaussians, resulting in more accurate visual effect modeling. This combination allows GSNB to achieve remarkable results even in scenes with challenging lighting and reflection conditions. Additionally, GSNB uses pre-computation to bake the MLP's output, thereby alleviating inference workload and subsequent speed loss. Furthermore, to leverage the capabilities of Neural Basis Extension and eliminate redundant Gaussians, we propose a new importance criterion to prune the converged Gaussian model and obtain a more compact representation through re-optimization. Our experimental results demonstrate that our method delivers high-quality rendering in most scenarios and effectively reduces redundant Gaussians without compromising rendering speed. The code is available at https://github.com/Dojizz/GSNB/.

Abstract:
Few-shot learning (FSL) usually trains models on data from one set of classes, but tests them on data from a different set of classes, providing a few labeled support samples of the unseen classes as a reference for the trained model. Due to the lack of target-relevant training data, there is usually high generalization error with respect to the test classes. In this work, we conduct empirical explorations and propose an ensemble method (namely QuickBoost), which is efficient and effective for improving the generalization of FSL. Specifically, QuickBoost includes an alternative-architecture pretrained encoder with a one-vs-all binary classifier (namely FSL-Forest) based on random forest algorithm, and is ensembled with the off-the-shelf FSL models via logit-level averaging. Experiments on three benchmarks demonstrate that our method achieves state-of-the-art performance with good efficiency. Codes are available at https://github.com/WendyBaiYunwei/FSL-QuickBoost.
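
The logit-level averaging described above can be sketched in a few lines. This is a minimal illustration, assuming both models emit one logit per class in the same class order; the function names `average_logits` and `predict` are hypothetical and not taken from the QuickBoost code base.

```python
from typing import List


def average_logits(logits_a: List[float], logits_b: List[float]) -> List[float]:
    """Ensemble two models by averaging their per-class logits element-wise."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]


def predict(logits: List[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Toy scores: an off-the-shelf FSL model and a second, forest-style scorer.
    fsl_model_logits = [2.0, 0.0, 1.0]
    forest_logits = [0.0, 3.0, 1.0]
    combined = average_logits(fsl_model_logits, forest_logits)
    print(combined, predict(combined))
```

Here the first model alone would pick class 0, while the averaged scores favor class 1, which shows how a second scorer can change the ensemble's decision.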

Abstract:
Recent studies have shown impressive progress in universal style transfer which can integrate arbitrary styles into content images. However, existing approaches struggle with low aesthetics and disharmonious patterns in the final results. To address this problem, we propose AesStyler, a novel Aesthetic Guided Universal Style Transfer method. Specifically, our approach introduces the aesthetic assessment model, trained on a dataset with human-assessed aesthetic scores, into the universal style transfer task to accurately capture aesthetic features that universally resonate with human aesthetic preferences. Unlike previous methods which only consider aesthetics of specific style images, we propose to build a Universal Aesthetic Codebook (UAC) to harness universal aesthetic features that encapsulate the global aspects of aesthetics. Aesthetic features are fed into a novel Universal and Style-specific Aesthetic-Guided Attention (USAesA) module to guide the style transfer process. USAesA empowers our model to integrate the aesthetic attributes of both universal and style-specific aesthetic features with style features and facilitates the fusion of these aesthetically enhanced style features with content features. Extensive experiments and user studies have demonstrated that our approach generates aesthetically more harmonious and pleasing results than the state-of-the-art methods, both aesthetic-free and aesthetic-aware. The code is available at: https://github.com/zwandering/AesStyler.

Abstract:
This paper introduces HINER, a novel neural representation for compressing hyperspectral images (HSI) and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angle Mapper with the L1 loss, we can supervise the global and local information within each spectral band, thereby enhancing the overall reconstruction quality. For downstream classification on compressed HSI, we theoretically demonstrate that the task accuracy is related not only to the classification loss but also to the reconstruction fidelity through a first-order expansion of the accuracy degradation, and accordingly adapt the reconstruction by introducing Adaptive Spectral Weighting. Owing to the monotonic mapping of HINER between wavelengths and spectral bands, we propose Implicit Spectral Interpolation for data augmentation by adding random variables to input wavelengths during classification model training. Experimental results on various HSI datasets demonstrate the superior compression performance of HINER compared to existing learned methods and traditional codecs. Our model is lightweight and computationally efficient, and it maintains high accuracy for downstream classification tasks even on decoded HSIs at high compression ratios. Our materials will be released at https://github.com/Eric-qi/HINER.

Abstract:
Understanding a meme is a challenging task, due to the metaphorical information contained in the meme that requires intricate interpretation to grasp its intended meaning fully. In previous works, attempts have been made to facilitate computational understanding of memes through introducing human-annotated metaphors as extra input features into machine learning models. However, these approaches mainly focus on formulating linguistic representation of a metaphor (extracted from the texts appearing in memes), while ignoring the connection between the metaphor and corresponding visual features (e.g., objects in meme images). In this paper, we argue that a more comprehensive understanding of memes can only be achieved through a joint modelling of both visual and linguistic features of memes. To this end, we propose an approach to generate Multimodal Metaphorical feature for Meme Classification, named MMMC. MMMC derives visual characteristics from linguistic attributes of metaphorical concepts, which more effectively convey the underlying metaphorical concept, leveraging a text-conditioned generative adversarial network. The linguistic and visual features are then integrated into a set of multimodal metaphorical features for classification purpose. We perform extensive experiments on a benchmark metaphorical meme dataset, MET-Meme. Experimental results show that MMMC significantly outperforms existing baselines on the task of emotion classification and intention detection. Our code and dataset are available at https://github.com/liaolianfoka/MMMC.

Abstract:
3D occupancy prediction (OCC) aims to estimate and predict the semantic occupancy state of the surrounding environment, which is crucial for scene understanding and reconstruction in the real world. However, existing methods for 3D OCC mainly rely on surround-view camera images, whose performance is still insufficient in some challenging scenarios, such as low-light conditions. To this end, we propose a new multi-modal fusion network for 3D occupancy prediction by fusing features of LiDAR point clouds and surround-view images, called FusionOcc. Our model fuses features of these two modalities in 2D and 3D space, respectively. By integrating the depth information from point clouds, a cross-modal fusion module is designed to predict a 2D dense depth map, enabling accurate depth estimation and a better transition of 2D image features into 3D space. In addition, features of voxelized point clouds are aligned and merged with image features converted by a view-transformer in 3D space. Experiments show that FusionOcc establishes the new state of the art on the Occ3D-nuScenes dataset, achieving an mIoU score of 35.94% (without visibility mask) and 56.62% (with visibility mask), showing an average improvement of 3.42% compared to the best previous method. Our work provides a new baseline for further research in multi-modal fusion for 3D occupancy prediction. Code will be made publicly available at https://github.com/ShuoZhang-code/FusionOcc.

Abstract:
Video Moment Retrieval (MR) tasks involve predicting the moment described by a given natural language or spoken language query in an untrimmed video. In this paper, we propose a novel Maskable Retentive Network (MRNet) to address two key challenges in MR tasks: cross-modal guidance and video sequence modeling. Our approach introduces a new retention mechanism into the multimodal Transformer architecture, incorporating modality-specific attention modes. Specifically, we employ the Unlimited Attention for language-related attention regions to maximize cross-modal mutual guidance. Then, we introduce the Maskable Retention for video-only attention region to enhance video sequence modeling, that is, recognizing two crucial characteristics of video sequences: 1) bidirectional, decaying, and non-linear temporal associations between video clips, and 2) sparse associations of key information semantically related to the query. We propose a bidirectional decay retention mask to explicitly model temporal-distant context dependencies of video sequences, along with a learnable sparse retention mask to adaptively capture strong associations relevant to the target event. Extensive experiments conducted on five popular benchmarks ActivityNet Captions, TACoS, Charades-STA, ActivityNet Speech, and QVHighlights for MR tasks demonstrate the significant improvements achieved by our method over existing approaches. Code is available at https://github.com/xian-sh/MRNet.
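
To make the idea of a bidirectional decay mask concrete, the sketch below builds a symmetric matrix whose entry for clips i and j shrinks exponentially with their temporal distance. The exponential form `gamma ** abs(i - j)` is an assumption borrowed from the decay used in retentive networks; the abstract does not give MRNet's actual formula, and the function name and the `gamma` value are hypothetical.

```python
from typing import List


def bidirectional_decay_mask(num_clips: int, gamma: float) -> List[List[float]]:
    """Build a symmetric mask with entries gamma ** |i - j|.

    Nearby clips get weights close to 1; distant clips decay toward 0 in
    both temporal directions. Illustrative only, not the MRNet definition.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return [
        [gamma ** abs(i - j) for j in range(num_clips)]
        for i in range(num_clips)
    ]


if __name__ == "__main__":
    for row in bidirectional_decay_mask(3, 0.5):
        print(row)
```

Such a mask would typically be multiplied element-wise with the video-to-video attention scores, so that each clip attends more strongly to its temporal neighbors than to distant clips.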

Abstract:
A key objective in eXplainable Artificial Intelligence (XAI) is to create intelligent systems capable of reasoning and explaining real-world data to facilitate reliable decision-making. Recent studies have acknowledged the importance of providing user-friendly and verifiable explanations to facilitate trustworthy Visual Question Answering (VQA) systems. This paper aims to promote explainable VQA from both data and method perspectives. First, we propose a new Standard Multimodal Explanation (SME) dataset and a new Few-Shot Multimodal Explanation for VQA (FS-MEVQA) task, which aims to generate the multimodal explanation of the underlying reasoning process for solving visual questions with few training samples. Our SME dataset includes 1,028,230 samples composed of questions, images, answers, and multimodal explanations, which can facilitate research in both traditional MEVQA and FS-MEVQA. To the best of our knowledge, this is the first large-scale dataset with joint language-vision explanations based on standard English and additional visual grounding tokens. Second, we propose a training-free Multimodal Explaining Agent (MEAgent) method based on an LLM agent with multimodal open-world tools to infer answers and generate multimodal explanations for visual questions. Our MEAgent can learn multimodal explanation from merely N(=16) training samples and leverage open-world abilities to perform FS-MEVQA on test samples. Comprehensive experimental results evaluated by language quality metrics, a visual detection metric, and visual attribution metrics on our SME dataset indicate the superiority of our method for FS-MEVQA. Our code and data are available at https://github.com/LivXue/FS-MEVQA.

Abstract:
The latest Text-to-Speech (TTS) systems can produce speech with voice quality and naturalness comparable to human speech. Yet the demand for large amounts of high-quality data from target speakers remains a significant challenge. Particularly for long-form expressive reading, training speech from the target speaker that covers rich contextual information is needed. In this paper, a novel design of a context-aware speech pre-trained model is developed for expressive TTS based on contrastive learning. The model can be trained with abundant speech data without explicitly labelled speaker identities. It captures the intricate relationship between the speech expression of a spoken sentence and the contextual text information. By incorporating cross-modal text and speech features into the TTS model, it enables the generation of coherent and expressive speech, which is especially beneficial when there is a scarcity of target speaker data. The pre-trained model is evaluated first in the task of Context-Speech retrieval and then as an integral part of a zero-shot TTS system. Experimental results demonstrate that the pretraining framework effectively learns Context-Speech representations and significantly enhances the expressiveness of synthesized speech. Audio demos are available at: https://ccsp2024.github.io/demo/.

Abstract:
In real-world photography, local motion blur often arises from the interplay between moving objects and stationary backgrounds during exposure. Existing deblurring methods face challenges in addressing local motion deblurring due to (i) the presence of arbitrary localized blurs and uncertain blur extents; (ii) the limited ability to accurately identify specific blurs resulting from ambiguous motion boundaries. These limitations often lead to suboptimal solutions when estimating blur maps and generating final deblurred images. To that end, we propose a novel method named Motion-Uncertainty-Guided Network (MUGNet), which harnesses a probabilistic representational model to explicitly address the intricacies stemming from motion uncertainties. Specifically, MUGNet consists of two key components, i.e., a motion-uncertainty quantification (MUQ) module and a motion-masked separable attention (M2SA) module, serving complementary purposes. Concretely, MUQ aims to learn a conditional distribution for accurate and reliable blur map estimation, while the M2SA module enhances the representation of regions influenced by local motion blur and static background by promoting the establishment of extensive global interactions. We demonstrate the superiority of our MUGNet with extensive experiments. The code is publicly available at: https://github.com/zeyuxiao1997/MUGNet.

Abstract:
As machine learning advances, machine learning as a service (MLaaS) in the cloud brings convenience to human lives but also privacy risks, as powerful neural networks used for generation, classification or other tasks can also become privacy snoopers. This motivates privacy preservation in the inference phase. Many approaches for preserving privacy in the inference phase introduce multi-objective functions, training models to remove specific private information from users' uploaded data. Although effective, these adversarial learning-based approaches suffer not only from convergence difficulties, but also from limited generalization beyond the specific privacy for which they are trained. To address these issues, we propose a method for privacy preservation in the inference phase by removing task-irrelevant information, which requires no knowledge of the privacy attacks nor introduction of adversarial learning. Specifically, we introduce a metric to distinguish task-irrelevant information from task-relevant information, and achieve more efficient metric estimation to remove task-irrelevant features. The experiments demonstrate the potential of our method in several tasks. Our code will be available at: https://github.com/iwhoyoung/PriFU.

Abstract:
ISP (Image Signal Processor) serves as a pipeline converting unprocessed raw images to sRGB images, positioned before nearly all visual tasks. Due to the varying spectral sensitivities of cameras, raw images captured by different cameras exist in different color spaces, making it challenging to deploy ISP across cameras with consistent performance. To address this challenge, it is intuitive to incorporate a raw-to-raw mapping (mapping raw images across camera color spaces) module into the ISP. However, the lack of paired data (i.e., images of the same scene captured by different cameras) makes it difficult to train a raw-to-raw model using supervised learning methods. In this paper, we aim to achieve ISP generalization by proposing the first unsupervised raw-to-raw model. To be specific, we propose a CSTPP (Color Space Transformation Parameters Predictor) module to predict the space transformation parameters in a patch-wise manner, which can accurately perform color space transformation and flexibly manage complex lighting conditions. Additionally, we design a CycleGAN-style training framework to realize unsupervised learning, overcoming the deficiency of paired data. Our proposed unsupervised model achieves performance comparable to that of the state-of-the-art semi-supervised method on the raw-to-raw task. Furthermore, to assess its ability to generalize the ISP model across different cameras, we formulate, for the first time, a cross-camera ISP task and demonstrate the performance of our method through extensive experiments. The code is released at https://github.com/ydxxxx/Unsupervised-Raw-to-raw-Mapping.

Abstract:
Classifying videos differs from classifying images in the need to capture information about what has happened, instead of what is in the frames. Conventional methods typically follow the data-driven approach, which uses transformer-based attention models to extract and aggregate the features of video frames as the representation of the entire video. However, this approach tends to extract the object information of frames and may face difficulties in classifying classes that describe events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for the spatio-temporal modeling of both the in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which simultaneously builds the in-frame causal graph with the background and foreground information and models their cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module is introduced to eliminate the spurious correlations in contexts and objects via the back- and front-door interventions, respectively. The former involves visual context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets verify that ECRL better captures the cross-frame correlations to describe videos with event-level features. The source code has been released at https://github.com/wyqcrystal/ECRL.

Abstract:
With the massive emergence of multi-modal data, cross-modal retrieval (CMR) has become one of the hot topics. Thanks to fast retrieval and efficient storage, cross-modal hashing (CMH) provides a feasible solution for large-scale multi-modal data. Previous CMH methods typically learn common hash codes directly to fuse different modalities. Although they have achieved some success, limitations remain: 1) These approaches often prioritize reducing the heterogeneity in multi-modal data by learning consensus hash codes, yet they may sacrifice modality-specific information. 2) They frequently utilize pairwise similarities to guide hashing learning and neglect class distribution correlations. To overcome these two issues, we propose a novel Distribution Consistency Guided Hashing (DCGH) framework. Specifically, we first learn the modality-specific representation to extract the private discriminative information. Further, we learn consensus hash codes from the private representation by consensus hashing learning, thereby merging the specifics with consistency. Finally, we propose distribution consistency learning to guide hash codes to follow a similar class distribution principle between multi-modal data, thereby exploring more consistent information. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of our DCGH on both fully paired and partially paired CMR tasks. The code is available at: https://github.com/sunyuan-cs/2024-MM-DCGH.

Abstract:
Vision-based intrusion detection has many applications in life environments, e.g., security, intelligent monitoring, and autonomous driving. Previous works improve the performance of intrusion detection under unknown environments by introducing unsupervised domain adaptation (UDA) methods. However, these works do not fully fulfill the practical requirements due to the performance gap between UDA and fully supervised methods. To address the problem, we develop a new and vital active domain adaptation intrusion detection task, namely Ada-iD. Our aim is to query and annotate the most informative samples of the target domain at the lowest possible cost, striving for a balance between achieving high performance and keeping low annotation expenses. Specifically, we propose a multi-task joint active domain adaptation intrusion detection framework, namely ADAID-YOLO. It consists of a lower branch for detection and an upper branch for segmentation. Further, three effective strategies are designed to better achieve the Ada-iD task: 1) An efficient Dynamic Diffusion Pseudo-Labeling method (DDPL) is introduced to get Pseudo ground truth to help identify areas of uncertainty in segmentation. 2) An Enhanced Region Impurity and Prediction Uncertainty sampling strategy (Enhanced-RIPU) is proposed to better capture the uncertainty of the segmentation region. 3) A Multi-Element Joint sampling strategy (MEJ) is designed to calculate the uncertainty of the detection comprehensively. Finally, comprehensive experiments and comparisons are conducted on multiple dominant intrusion detection datasets. The results show that our method can outperform other classic and promising active domain adaptation methods and reach current SOTA performance, even surpassing the performance of UDA and full supervision on Normal-Foggy with only 0.1% and 10% data annotation, respectively. Available code: https://github.com/1012537710/Ada-iD.

Abstract:
Dataset distillation, also known as dataset condensation, offers a possibility for compressing a large-scale dataset into a small-scale one (i.e., distilled dataset) while achieving similar performance during model training. This method effectively tackles the challenges of training efficiency and storage cost posed by the large-scale dataset. Existing dataset distillation methods can be categorized into Optimization-Oriented (OO)-based and Distribution-Matching (DM)-based methods. Since OO-based methods require bi-level optimization to alternately optimize the model and the distilled data, they face challenges due to high computational overhead in practical applications. Thus, DM-based methods have emerged as an alternative by aligning the prototypes of the distilled data to those of the original data. Although efficient, these methods overlook the diversity of the distilled data, which will limit the performance of evaluation tasks. In this paper, we propose a novel Diversified Semantic Distribution Matching (DSDM) approach for dataset distillation. To accurately capture semantic features, we first pre-train models for dataset distillation. Subsequently, we estimate the distribution of each category by calculating its prototype and covariance matrix, where the covariance matrix indicates the direction of semantic feature transformations for each category. Then, in addition to the prototypes, the covariance matrices are also matched to obtain more diversity for the distilled data. However, since the distilled data are optimized by multiple pre-trained models, the training process will fluctuate severely. Therefore, we match the distilled data of the current pre-trained model with the historical integrated prototypes. Experimental results demonstrate that our DSDM achieves state-of-the-art results on both image and speech datasets. Code is available at https://github.com/Li-Hongcheng/DSDM.
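
The step of matching covariance matrices in addition to prototypes can be illustrated with a small per-class loss. This is a sketch under stated assumptions: features are plain Python lists, the prototype is the feature mean, the covariance is the population covariance, and both terms use a squared difference. The function names and the `cov_weight` parameter are hypothetical, and the loss actually used by DSDM may differ.

```python
from typing import List

Matrix = List[List[float]]


def prototype(features: Matrix) -> List[float]:
    """Class prototype: the mean feature vector."""
    n, d = len(features), len(features[0])
    return [sum(f[k] for f in features) / n for k in range(d)]


def covariance(features: Matrix) -> Matrix:
    """Population covariance matrix of the feature vectors."""
    n, d = len(features), len(features[0])
    mu = prototype(features)
    return [
        [sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in features) / n for j in range(d)]
        for i in range(d)
    ]


def matching_loss(real: Matrix, distilled: Matrix, cov_weight: float = 1.0) -> float:
    """Squared prototype distance plus weighted squared covariance distance."""
    mu_r, mu_d = prototype(real), prototype(distilled)
    cov_r, cov_d = covariance(real), covariance(distilled)
    d = len(mu_r)
    proto_term = sum((mu_r[k] - mu_d[k]) ** 2 for k in range(d))
    cov_term = sum((cov_r[i][j] - cov_d[i][j]) ** 2 for i in range(d) for j in range(d))
    return proto_term + cov_weight * cov_term


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 2.0]]
    collapsed = [[1.0, 1.0], [1.0, 1.0]]  # same mean as `real`, but no diversity
    print(matching_loss(real, collapsed, cov_weight=0.0))
    print(matching_loss(real, collapsed, cov_weight=1.0))
```

In the toy example, the collapsed set matches the prototype exactly, so a prototype-only loss cannot tell it apart from the real data, while the covariance term penalizes its lack of diversity.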

Abstract:
Binarized Vision Transformers (BiViTs) aim to facilitate the efficient and lightweight utilization of Vision Transformers (ViTs) on devices with limited computational resources. Yet, the current approach to binarizing ViT leads to a substantial performance decrease compared to the full-precision model, posing obstacles to practical deployment. Through an empirical study, we reveal that spatial interaction (SI) is a critical factor that impacts performance due to the lack of token-level correlation, but previous work ignores this factor. To this end, we design a ViT binarization approach dubbed SI-BiViT to incorporate spatial interaction in the binarization process. Specifically, an SI module is placed alongside the Multi-Layer Perceptron (MLP) module to form a dual-branch structure. This structure not only leverages knowledge from pre-trained ViTs by distilling over the original MLP, but also enhances spatial interaction via the introduced SI module. Correspondingly, we design a decoupled training strategy to train these two branches more effectively. Importantly, our SI-BiViT is orthogonal to existing binarized ViT approaches and can be directly plugged into them. Extensive experiments demonstrate the strong flexibility and effectiveness of SI-BiViT by plugging our method into four classic ViT backbones supporting three downstream tasks: classification, detection, and segmentation. In particular, SI-BiViT enhances the classification performance of binarized ViTs by an average of 10.52% in Top-1 accuracy compared to the previous state-of-the-art. Code is available at https://github.com/VL-Group/SI-BiViT.

Abstract:
Visual effects synthesis, which aims at enhancing raw footage with virtual elements for greater expressiveness, is crucial in the film and television industry. As the demand for detailed and realistic effects escalates in modern production, professionals are compelled to allocate substantial time and resources to this endeavor. Thus, there is an urgent need to explore more convenient and less resource-intensive methods, such as incorporating the burgeoning Artificial Intelligence Generated Content (AIGC) technology. However, research into this potential integration has yet to be conducted. As the first work to establish a connection between visual effects synthesis and AIGC technology, we start by carefully setting up two paradigms according to whether pre-produced effects are needed: synthesis with reference effects and synthesis without reference effects. Following this, we compile a dataset by processing a collection of effects videos and scene videos, which contains a wide variety of effect categories and scenarios, adequately covering the common effects seen in the film and television industry. Furthermore, we explore the capabilities of a pre-trained text-to-video model to synthesize visual effects within these two paradigms. The experimental results demonstrate that the pipeline we established can effectively produce impressive visual effects synthesis outcomes, thereby evidencing the significant potential of existing AIGC technology for application in visual effects synthesis tasks. Our dataset can be found at https://github.com/ruffiann/MagicVFX.

Abstract:
With the development of deep learning, traffic forecasting technology has made significant progress and is being applied in many practical scenarios. However, various events held in cities, such as sporting events, exhibitions, concerts, etc., have a significant impact on traffic patterns of surrounding areas, causing current advanced prediction models to fail in this case. In this paper, to broaden the applicable scenarios of traffic forecasting, we focus on modeling the impact of events on traffic patterns and propose an event traffic forecasting problem with multimodal inputs. We outline the main challenges of this problem: diversity and sparsity of events, as well as insufficient data. To address these issues, we first use textual modal data containing rich semantics to describe the diverse characteristics of events. Then, we propose a simple yet effective multi-modal event traffic forecasting model that uses pre-trained text and traffic encoders to extract the embeddings and fuses the two embeddings for prediction. Encoders pre-trained on large-scale data have powerful generalization abilities to cope with the challenge of sparse data. Next, we design an efficient large language model-based event description text generation pipeline to build multi-modal event traffic forecasting datasets, ShenzhenCEC and SuzhouIEC. Experiments on two real-world datasets show that our method achieves state-of-the-art performance compared with eight baselines, reducing mean absolute error during the event peak period by 4.26%. Code is available at: https://github.com/2448845600/EventTrafficForecasting.

Abstract:
Black-box domain adaptation treats the source domain model as a black box. During the transfer process, the only available information about the target domain is the noisy labels output by the black-box model. This poses significant challenges for domain adaptation. Conventional approaches typically tackle the black-box noisy label problem from two aspects: self-knowledge distillation and pseudo-label denoising, both achieving limited performance due to limited knowledge information. To mitigate this issue, we explore the potential of off-the-shelf vision-language (ViL) multimodal models with rich semantic information for black-box domain adaptation by introducing an Adversarial Experts Model (AEM). Specifically, our target domain model is designed as one feature extractor and two classifiers, trained over two stages: In the knowledge transferring stage, with a shared feature extractor, the black-box source model and the ViL model act as two distinct experts for joint knowledge contribution, guiding the learning of one classifier each. While contributing their respective knowledge, the experts are also updated due to their own limitation and bias. In the adversarial alignment stage, to further distill expert knowledge to the target domain model, adversarial learning is conducted between the feature extractor and the two classifiers. A new consistency-max loss function is proposed to measure two classifier consistency and further improve classifier prediction certainty. Extensive experiments on multiple datasets demonstrate the effectiveness of our approach. Code is available at https://github.com/singinger/AEM.

Abstract:
Panoramic audio-visual saliency detection is to segment the most attention-attractive regions in 360° panoramic videos with sound. To meticulously delineate the detected salient regions and effectively model human attention shift, we extend this task to more fine-grained instance scenarios: identifying salient object instances and inferring their saliency ranks. In this paper, we propose the first instance-level framework that can simultaneously be applied to segmentation and ranking of multiple salient objects in panoramic videos. Specifically, it consists of a distortion-aware pixel decoder to overcome panoramic distortions, a sequential audio-visual fusion module to integrate audio-visual information, and a spatio-temporal object decoder to separate individual instances and predict their saliency scores. Moreover, owing to the absence of such annotations, we create the ground-truth saliency ranks for the PAVS10K benchmark. Extensive experiments demonstrate that our model is capable of achieving state-of-the-art performance on the PAVS10K for both saliency detection and ranking tasks. The code is available at https://github.com/ruohaoguo/pavsodr.

Abstract:
Engaging in conversational recommendations within a specific scenario represents a promising paradigm in the real world. Scenario-relevant situations often affect conversations and recommendations from two closely related aspects: varying the appeal of items to users, namely situated item representation, and shifting user interests in the targeted items, namely situated user preference. We highlight that considering those situational factors is crucial, as this aligns with the realistic conversational recommendation process in the physical world. However, doing so is challenging and remains under-explored. In this work, we take a first step toward bridging this gap and introduce a novel setting: Situated Conversational Recommendation Systems (SCRS). We observe an emerging need for high-quality datasets, and building one from scratch requires tremendous human effort. To this end, we construct a new benchmark, named SCREEN, via a role-playing method based on multimodal large language models. We take two multimodal large language models to play the roles of a user and a recommender, simulating their interactions in a co-observed scene. Our SCREEN comprises over 20k dialogues across 1.5k diverse situations, providing a rich foundation for exploring situational influences on conversational recommendations. Based on SCREEN, we propose three subtasks worth exploring and evaluate several representative baseline models. Our evaluations suggest that the benchmark is of high quality, establishing a solid experimental basis for future research. The code and data are available at https://github.com/DongdingLin/SCREEN.

Abstract:
Reconstructing 3D scenes from multi-view images is challenging, especially under extreme scenarios. We propose Event-ID, an event-based intrinsic decomposition framework that leverages events and images for stable decomposition under extreme scenarios. Our method is based on two observations: event cameras maintain good imaging quality under blurry or poorly exposed scenarios, and event signals from different viewpoints exhibit similarity in diffuse regions while varying in specular regions. We establish an event-based reflectance model and introduce an event-based warping method to extract specular clues. Our two-stage framework constructs a radiance field and decomposes the scene into normal, material, and lighting. Experimental results demonstrate superior performance compared to state-of-the-art methods. Our project can be found at https://zehaoc.github.io/EventID.github.io/

Abstract:
In practical data collection processes, certain views may become partially unavailable due to sensor failures or equipment issues, leading to the problem of incomplete multi-view clustering (IMVC). While some IMVC methods employing prototype completion achieve satisfactory performance, almost all of them implicitly assume correct alignment of prototypes across all views. However, during prototype generation, different networks may generate different cluster centers, so the prototypes produced from different views may be misaligned, i.e., prototype noisy correspondence. To address this issue, we propose Robust Prototype Completion for Incomplete Multi-view Clustering (RPCIC), which mitigates the impact of noisy correspondence in prototypes. Specifically, RPCIC first utilizes a cross-view contrastive learning module to obtain consistent feature representations across different views. Subsequently, we devise a robust contrastive loss for the produced prototypes, aiming to alleviate the influence of noisy correspondence within them. Finally, we employ a prototype fusion-based strategy to complete the missing data. Comprehensive experiments demonstrate that RPCIC outperforms 11 state-of-the-art methods in terms of both performance and robustness. The code is available at https://github.com/hl-yuan/RPCIC.

Abstract:
The rising prominence of Virtual Reality (VR), Mixed Reality (MR), and Extended Reality (XR) devices is transforming various industries by offering immersive and interactive experiences. Despite this, there is a notable gap in research and development of frameworks that seamlessly integrate real-world scenes into virtual environments for collaborative use. Current methods suffer from long reconstruction times and latency issues, rendering them impractical for real-time collaborative applications. This paper introduces Room2XR, a framework designed to dynamically reconstruct real-world scenes into semantically similar virtual representations using CAD models, and share them for remote collaboration. Room2XR employs advanced scene understanding technologies to create detailed virtual environments, enabling multiple users to interact with and manipulate these spaces in real time. By leveraging efficient algorithms and data transmission methods, Room2XR ensures accessibility and usability even on low-bandwidth networks. The source code and detailed documentation are provided in the following two repositories: (1) https://github.com/HenryGuo2003/Room2XR-Unity and (2) https://hub.docker.com/r/kumarhiranya/vrrec

Abstract:
We introduce the DEMON Challenge, defined as a benchmark for demonstrative instruction following, to ACM Multimedia 2024. The DEMON Challenge aims to assess the ability of models and systems to comprehend demonstrative instructions consisting of multiple, interleaved, and multimodal contexts that demonstrate the information required to complete a task. These instructions are curated from a diverse range of multi-modal datasets, spanning various fields and scenarios, to ensure comprehensive coverage and challenge diversity. The challenge details and participation information are available at https://dcdmllm.github.io/DEMON-challenge/.

Abstract:
Advances in multimedia technology and its widespread application in education have made multimedia learning increasingly important. Knowledge Tracing (KT) is the key technology for achieving adaptive multimedia learning, aiming to monitor the degree of knowledge acquisition and predict students' performance during the learning process. Current KT research is dedicated to improving performance on KT problems by integrating the most advanced deep learning techniques. However, this has led to increasingly complex models, which reduce model usability and divert researchers' attention away from exploring the core issues of KT. This paper aims to tackle the fundamental challenges of KT tasks, including the knowledge state representation and the core architecture design, and investigates a novel KT model that is both simple and powerful. We revisit the KT task and propose the ReKT model. First, taking inspiration from the decision-making process of human teachers, we model the knowledge state of students from three distinct perspectives: questions, concepts, and domains. Second, building upon human cognitive development models, such as constructivism, we design a Forget-Response-Update (FRU) framework to serve as the core architecture for the KT task. The FRU is composed of just two linear regression units, making it an extremely lightweight framework. Extensive comparisons were conducted with 22 state-of-the-art KT models on 7 publicly available datasets. The experimental results demonstrate that ReKT outperforms all the comparative methods in question-based KT tasks, and consistently achieves the best (in most cases) or near-best performance in concept-based KT tasks. Furthermore, in comparison to other KT core architectures like Transformers or LSTMs, the FRU achieves superior prediction performance with only approximately 38% of the computing resources. Our exploration of the ReKT model, which is both simple and powerful, can offer new insights for future KT research. The code can be found at https://github.com/lilstrawberry/ReKT.

Abstract:
Multimodal Relation Extraction (MRE) has achieved great improvements. However, modern MRE models are easily affected by irrelevant objects during multimodal alignment, a problem referred to as error sensitivity. The main reason is that visual features are not fully aligned with textual features, and the reasoning process may suppress redundant and noisy information at the risk of losing critical information. In light of this, we propose a Caption-Aware Multimodal Relation Extraction Network with Mutual Information Maximization (CAMIM). Specifically, we first generate detailed image captions through a Large Language Model (LLM). Then, the Caption-Aware Module (CAM) hierarchically aligns the fine-grained visual entities and textual entities for reasoning. In addition, to preserve crucial information within different modalities, we leverage a Mutual Information Maximization method to regulate the multimodal reasoning module. Experiments show that our model outperforms the state-of-the-art MRE models on the benchmark dataset MNRE. Further ablation studies confirm that our Caption-Aware Module and Mutual Information Maximization method are both pluggable and effective. Our code is available at https://github.com/zefanZhang-cn/CAMIM.

Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) has attracted more attention in recent years, focusing on retrieving a specific image based on a query composed of a reference image and a relative text without training samples. Specifically, the relative text describes the differences between the two images. Prevailing ZS-CIR methods employ image-to-text (I2T) models to convert the query image into a single caption, which is further merged with the relative text by text-fusion approaches to form a composed text for retrieval. However, these methods neglect the fact that ZS-CIR entails considering not only the final similarity between the composed text and retrieved images but also the semantic increment during the compositional editing process. To address this limitation, this paper proposes a training-free method called Semantic Editing Increment for ZS-CIR (SEIZE) to retrieve the target image based on the query image and text without training. Firstly, we employ a pre-trained captioning model to generate diverse captions for the reference image and prompt Large Language Models (LLMs) to perform breadth compositional reasoning based on these captions and relative text, thereby covering the potential semantics of the target image. Then, we design a semantic editing search to incorporate the semantic editing increment contributed by the relative text into the retrieval process. Concretely, we comprehensively consider relative semantic increment and absolute similarity as the final retrieval score, which is subsequently utilized to retrieve the target image in the CLIP feature space. Extensive experiments on three public datasets demonstrate that our proposed SEIZE achieves the new state-of-the-art performance. The code is publicly available at https://github.com/yzy-bupt/SEIZE.
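
The scoring idea in the SEIZE abstract above, combining absolute similarity with a relative semantic increment, can be illustrated with a minimal sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the increment is taken to be the gain in cosine similarity when moving from the original caption to the composed text, the two terms are combined with a hypothetical `weight`, and plain Python lists stand in for CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_score(original_text, composed_text, image, weight=1.0):
    """Hypothetical score: absolute similarity plus weighted semantic increment.

    original_text: embedding of the reference-image caption (before editing)
    composed_text: embedding of the caption merged with the relative text
    image:         embedding of a candidate image
    """
    absolute = cosine(composed_text, image)
    # Assumed definition: how much the edit moved the text toward this image.
    increment = absolute - cosine(original_text, image)
    return absolute + weight * increment


def rank_candidates(original_text, composed_text, images, weight=1.0):
    """Return candidate indices sorted from best to worst score."""
    scores = [
        retrieval_score(original_text, composed_text, img, weight)
        for img in images
    ]
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)
```

Under this assumed definition, a candidate that already matched the unedited caption gains nothing from the increment term, so images that specifically reflect the requested change are ranked higher.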

Abstract:
Domain shift significantly hinders crowd counting performance in unseen domains. Domain adaptation methods tackle this issue using target domain images but falter when acquiring these images is difficult. Moreover, they demand additional training time for fine-tuning. To address this issue, we propose an Uncertainty-Guided Style Diversity Augmentation (UGSDA) method, enabling the models to be trained solely on the source domain and directly generalized to various target domains. This is achieved by generating sufficiently diverse and realistic samples during the training process. Specifically, our UGSDA method incorporates three tailor-designed components: the Global Styling Elements Extraction (GSEE) module, the Local Uncertainty Perturbations (LUP) module, and the Density Distribution Consistency (DDC) loss. The GSEE extracts global style elements from the feature space of the whole source domain. The LUP aims to obtain uncertainty perturbations from the batch-level input to form style distributions beyond the source domain, which are used to generate diversified stylized samples together with global style elements. To regulate the extent of perturbations, the DDC loss imposes constraints between the source samples and the stylized samples, ensuring the stylized samples maintain a higher degree of realism and reliability. Comprehensive experiments validate the superiority of our approach, demonstrating its strong generalization capabilities across various datasets and models. Code is available at https://github.com/gcding/UGSDA-pytorch.

Abstract:
Multimodal Sentiment Analysis (MSA) has witnessed remarkable progress and gained increasing attention over the past decade. However, current MSA methodologies primarily rely on global representations extracted from different modalities, such as the mean of all token representations, to construct sophisticated fusion networks. These approaches often overlook the valuable details present in local representations, which consist of fused representations of several consecutive tokens. Additionally, the integration of multiple local representations and the fusion of local and global information present significant challenges. To address these limitations, we propose the Global-Local Modal (GLoMo) Fusion framework. It comprises two essential components: (i) modality-specific mixture of experts layers that integrate diverse local representations within each modality, and (ii) a global-guided fusion module that effectively combines global and local representations. The former component leverages specialized expert networks to automatically select and integrate crucial local representations from each modality, while the latter ensures the preservation of global information during the fusion process. We evaluate GLoMo on various datasets, encompassing tasks in multimodal sentiment analysis, multimodal humor detection, and multimodal emotion recognition. Extensive experiments demonstrate that GLoMo outperforms existing state-of-the-art models, validating the effectiveness of our proposed framework. Our code is publicly available at https://github.com/YetZzzzzz/GLoMo.

Abstract:
Event-based human pose estimation has gained popularity due to the benefits of high temporal resolution and high dynamic range offered by event cameras. The inherent spatial sparsity of event data makes discarding less significant regions a straightforward and effective way to decrease the computation. However, implementing this operation in CNNs poses a challenge, as it disrupts the regularity of dense convolutional workloads. In this paper, we propose an adaptive vision transformer, a novel efficient backbone for human pose estimation with event cameras. Specifically, we present two adaptive patch and token sampling approaches based on the characteristics of events, thereby reducing the computational load while still achieving comparable performance. Firstly, we design an adaptive patch sampling scheme to eliminate inactive patches by assessing the entropy of the events before they are fed into the transformer. Secondly, we further propose an adaptive token reduction strategy to selectively remove less informative tokens in transformer layers through a dynamic token pruning algorithm. To exploit event-based visual cues in human pose estimation tasks, we construct a large-scale frame-event-based dataset, dubbed Event Multi Movement HPE (EventMM HPE). The dataset provides annotation frequencies up to 240 Hz. Extensive experiments demonstrate that our proposed approach outperforms existing state-of-the-art methods in estimation accuracy. The source code and dataset are available at https://github.com/doublemanyu/Adaptive-Vision-Transformer-for-Event-Based-HPE.
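
The entropy-based patch sampling described in the abstract above can be illustrated with a small sketch. The abstract does not specify how entropy is computed or how patches are thresholded, so the choices here are assumptions: the event stream is accumulated into a 2D grid of per-pixel event counts, the entropy of a patch is the Shannon entropy of its normalized counts, and `threshold` is a hypothetical cut-off.

```python
import math


def patch_entropy(patch):
    """Shannon entropy (in bits) of the event-count distribution in one patch."""
    counts = [c for row in patch for c in row if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0  # a patch with no events carries no information
    return -sum((c / total) * math.log2(c / total) for c in counts)


def select_active_patches(frame, patch_size, threshold):
    """Return (patch_row, patch_col) indices of patches worth keeping.

    frame is a list of rows of non-negative event counts; patches whose
    entropy does not exceed the (assumed) threshold are dropped before
    they would be tokenized.
    """
    height, width = len(frame), len(frame[0])
    kept = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                row[left:left + patch_size]
                for row in frame[top:top + patch_size]
            ]
            if patch_entropy(patch) > threshold:
                kept.append((top // patch_size, left // patch_size))
    return kept
```

Note that, under this particular definition, a patch whose events all fall on a single pixel also has zero entropy and is discarded, which may or may not match the behaviour intended by the authors.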

Abstract:
Text-to-image (T2I) diffusion models enjoy great popularity, and many individuals and companies build their applications based on publicly released T2I diffusion models. Previous studies have demonstrated that backdoor attacks can induce T2I diffusion models to generate unsafe target images through textual triggers. However, existing backdoor attacks typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance of T2I diffusion models. To address these issues, we propose EvilEdit, a training-free and data-free backdoor attack against T2I diffusion models. EvilEdit directly edits the projection matrices in the cross-attention layers to achieve projection alignment between a trigger and the corresponding backdoor target. We preserve the functionality of the backdoored model using a protected whitelist to ensure the semantics of non-trigger words are not accidentally altered by the backdoor. We also propose a visual target attack, EvilEdit VTA, enabling adversaries to use specific images as backdoor targets. We conduct empirical experiments on Stable Diffusion, and the results demonstrate that EvilEdit can backdoor T2I diffusion models within one second with up to 100% success rate. Furthermore, our EvilEdit modifies only 2.2% of the parameters and maintains the model's performance on benign prompts. Our code is available at https://github.com/haowang-cqu/EvilEdit.

Abstract:
Although multi-view learning has achieved remarkable progress over the past decades, most existing methods implicitly assume that all views (or modalities) are well-aligned. In practice, however, collecting fully aligned views is challenging due to complexities and discordances in time and space, resulting in the Partially View-unaligned Problem (PVP), such as audio-video asynchrony caused by network congestion. While some methods are proposed to align the unaligned views by learning view-invariant representations, almost all of them overlook specific information across different views for complementarity, limiting performance improvement. To address these problems, we propose a robust framework, dubbed VariatIonal ConTrAstive Learning (VITAL), designed to learn both common and specific information simultaneously. To be specific, each data sample is first modeled as a Gaussian distribution in the latent space, where the mean estimates the most probable common information, while the variance indicates view-specific information. Second, by using variational inference, VITAL conducts intra- and inter-view contrastive learning to preserve common and specific semantics in the distribution representations, thereby achieving comprehensive perception. As a result, the common representation (mean) could be used to guide category-level realignment, while the specific representation (variance) complements sample semantic information, thereby boosting overall performance. Finally, considering the abundance of False Negative Pairs (FNPs) generated by unsupervised contrastive learning, we propose a robust loss function that seamlessly incorporates FNP rectification into the contrastive learning paradigm. Empirical evaluations on eight benchmark datasets reveal that VITAL outperforms ten state-of-the-art deep clustering baselines, demonstrating its efficacy in both partially and fully aligned scenarios. The Code is available at https://github.com/He-Changhao/2024-MM-VITAL.

Abstract:
Video anomaly detection has garnered widespread attention in industry and academia in recent years due to its significant role in public security. However, many existing methods overlook the influence of scenes on anomaly detection. These methods simply label the occurrence of certain actions or objects as anomalous. In reality, scene context plays a crucial role in determining anomalies. For example, running on a highway is anomalous, while running on a playground is normal. Therefore, understanding the scene is essential for effective anomaly detection. In this work, we aim to address the challenge of scene-dependent weakly supervised video anomaly detection by decoupling scenes. Specifically, we propose a novel text-driven scene-decoupled (TDSD) framework, consisting of a TDSD module (TDSDM) and fine-grained visual augmentation (FVA) modules. The scene-decoupled module extracts semantic information from scenes, while the FVA module assists in fine-grained visual enhancement. We validate the effectiveness of our approach by constructing two scene-dependent datasets and achieve state-of-the-art results on scene-agnostic datasets as well. Code is available at https://github.com/shengyangsun/TDSD.

Abstract:
This paper delves into federated class-incremental learning (FCiL), where new classes appear continually or even privately to local clients. However, existing FCiL methods suffer from the problem of spatial-temporal catastrophic forgetting, i.e., forgetting the previously learned knowledge over time and the client-specific information owned by different clients. Additionally, private class and knowledge heterogeneity amongst local clients further exacerbate spatial-temporal forgetting, making FCiL challenging to apply. To address these issues, we propose Federated Class-specific Binary Classifier (FedCBC), an innovative approach to transferring and fusing knowledge across both temporal and spatial perspectives. FedCBC consists of two novel components: (1) continual personalization that distills previous knowledge from a global model to multiple local models, and (2) selective knowledge fusion that enhances knowledge integration of the same class from divergent clients and shares private knowledge with other clients. Extensive experiments using three newly-formulated metrics (termed GA, KRS, and KRT) demonstrate the effectiveness of the proposed approach. Our code is now hosted at: https://github.com/SkyOfBeginning/FedCBC.

Abstract:
Mainstream multimodal recommender systems are designed to learn user interest by analyzing user-item interaction graphs. However, what they learn about user interest is incomplete because historical interactions only record items that best match user interest (i.e., the first-order interest), while suboptimal items are absent. To fully exploit user interest, we propose a Second-Order Interest Learning (SOIL) framework to retrieve second-order interest from unrecorded suboptimal items. In this framework, we build a user-item interaction graph augmented by second-order interest, an interest-aware item-item graph for the visual modality, and a similar graph for the textual modality. In our work, all three graphs are constructed from user-item interaction records and multimodal feature similarity. As in other graph-based approaches, we apply graph convolutional networks to each of the three graphs to learn representations of users and items. To improve the exploitation of both first-order and second-order interest, we optimize the model by implementing contrastive learning modules for user and item representations at both the user-item and item-item levels. The proposed framework is evaluated on three real-world public datasets in online shopping scenarios. Experimental results verify that our method is able to significantly improve prediction performance. For instance, our method outperforms the previous state-of-the-art method MGCN by an average of 8.1% in terms of Recall@10. Code: https://github.com/TL-UESTC/SOIL.
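
For readers unfamiliar with the Recall@10 metric cited above, the following sketch shows the commonly used per-user definition: the fraction of a user's held-out relevant items that appear in the top K recommendations, averaged over users. The exact evaluation protocol of the paper is not given in the abstract, so this is the standard formulation rather than the authors' code, and the function names are illustrative.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of relevant items that appear in the top-k of a ranking."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)


def mean_recall_at_k(rankings, ground_truth, k=10):
    """Average Recall@k over users that have at least one relevant item.

    rankings:     dict mapping user -> list of recommended items, best first
    ground_truth: dict mapping user -> set of held-out relevant items
    """
    users = [u for u in ground_truth if ground_truth[u]]
    total = sum(recall_at_k(rankings[u], ground_truth[u], k) for u in users)
    return total / len(users)
```

For example, a user with two held-out items, one of which appears in the top K, contributes 0.5 to the average.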

Abstract:
Cross-view consensus representation plays a critical role in hyperspectral image (HSI) clustering. Recent multi-view contrastive clustering methods utilize contrastive loss to extract contextual consensus representation. However, these methods have a fatal flaw: contrastive learning may treat similar heterogeneous views as positive sample pairs and dissimilar homogeneous views as negative sample pairs. At the same time, the data representation learned via self-supervised contrastive loss is not specifically designed for clustering. Thus, to tackle this challenge, we propose a novel multi-view clustering method, i.e., Enhanced Multi-View Contrastive Clustering (EMVCC). First, the spatial multi-view is designed to learn the diverse features for contrastive clustering, and the globally relevant information of the spectrum view is extracted by a Transformer, enhancing the spatial multi-view differences between neighboring samples. Then, a joint self-supervised loss is designed to constrain the consensus representation from different perspectives to efficiently avoid false negative pairs. Specifically, to preserve the diversity of multi-view information, the features are enhanced by using a probabilistic contrastive loss, and the data is projected into a semantic representation space, ensuring that similar samples in this space are closer in distance. Finally, we design a novel clustering loss that aligns the view feature representation with high-confidence pseudo-labels to encourage the network to learn cluster-friendly features. In the training process, the joint self-supervised loss is used to optimize the cross-view features. Extensive experiments on numerous benchmarks verify the superiority of EMVCC in comparison to some state-of-the-art clustering methods. The code is available at https://github.com/YiLiu1999/EMVCC.

Abstract:
Visual prompting is an efficient methodology for finetuning pretrained visual models by introducing a small number of learnable parameters while keeping the backbone frozen. However, most existing visual prompting methods learn a shared prompt for all samples, making it challenging to grasp distinct characteristics among diverse samples, thereby limiting the model's performance. While other methods partially address this issue through sample clustering and learning multiple prompts, they still struggle to capture nuanced differences among instances and incur significant parameter overhead. Therefore, to comprehensively and efficiently leverage discriminative characteristics of individual instances, we propose an Instance Visual Prompting method, called InsVP. Initially, the instance image prompt is introduced to extract both crucial and nuanced discriminative information from the original image itself and is overlaid onto the input image. Furthermore, the instance feature prompt is designed to capture both commonalities and characteristics among individual instances, fed into the model's intermediate layers to facilitate feature extraction. Consequently, the instance image and feature prompts complement each other, enhancing the adaptation ability of pretrained models to extract discriminative features from individual instances. Extensive experiments on various large-scale benchmarks show that our InsVP achieves superior performance exceeding the state-of-the-art methods at a lower parameter cost. The code is available at https://github.com/zhoujiahuan1991/MM2024-InsVP .

Abstract:
As one of the crucial elements in human-robot interaction, responsive listening head generation has attracted considerable attention from researchers. It aims to generate a listening head video based on the speaker's audio and video as well as a reference listener image. However, existing methods exhibit two limitations: 1) the generation capability of their models is limited, resulting in generated videos that are far from real ones, and 2) they mostly employ autoregressive generative models, which cannot mitigate the risk of error accumulation. To tackle these issues, we propose Listenformer, which leverages the powerful temporal modeling capability of transformers for generation. It can perform non-autoregressive prediction with the proposed two-stage training method, simultaneously achieving temporal continuity and overall consistency in the outputs. To fully utilize the information from the speaker inputs, we design an audio-motion attention fusion module, which improves the correlation of audio and motion features for accurate response. Additionally, a novel decoding method called sliding window with a large shift is proposed for Listenformer, demonstrating both excellent computational efficiency and effectiveness. Extensive experiments show that Listenformer outperforms the existing state-of-the-art methods on the ViCo and L2L datasets. In addition, a perceptual user study demonstrates the comprehensive performance of our method in generation diversity, identity preservation, speaker-listener synchronization, and attitude matching. Our code is available at https://liushenme.github.io/ListenFormer.github.io/.

Abstract:
3D novelty detection plays a crucial role in various real-world applications, especially in safety-critical fields such as autonomous driving and intelligent surveillance systems. However, existing 3D novelty detection methods are constrained by the scarcity of 3D data, which may impede the model's ability to learn adequate representations, thereby impacting detection accuracy. To address this challenge, we propose a Unified Learning Framework (UniL) for facilitating novelty detection. During the pretraining phase, UniL assists the point cloud encoder in learning information from other modalities, aligning visual, textual, and 3D features within the same feature space. Additionally, we introduce a novel Multimodal Supervised Contrastive Loss (MSC Loss) to improve the model's ability to cluster samples from the same category in feature space by leveraging label information during pretraining. Furthermore, we propose a straightforward yet powerful scoring method, Depth Map Error (DME), which assesses the discrepancy between projected depth maps before and after point cloud reconstruction during novelty detection. Extensive experiments conducted on 3DOS demonstrate the effectiveness of our approach, significantly enhancing the performance of the unsupervised VAE method in 3D novelty detection. Code is available at https://github.com/EugeneWon9/UniL.
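
The Depth Map Error idea above, comparing depth maps projected from a point cloud before and after reconstruction, can be sketched as follows. The abstract does not state the projection, resolution, or distance used, so this sketch makes several assumptions: a single orthographic projection along the z axis, points with x and y in [0, 1), a hypothetical `background` depth for empty pixels, and mean absolute difference as the discrepancy.

```python
def depth_map(points, resolution=4, background=0.0):
    """Orthographic depth map along +z for points with x, y in [0, 1).

    Each pixel stores the smallest z of the points that fall into it;
    pixels with no points receive the (assumed) background value.
    """
    grid = [[None] * resolution for _ in range(resolution)]
    for x, y, z in points:
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        if grid[row][col] is None or z < grid[row][col]:
            grid[row][col] = z
    return [[background if d is None else d for d in row] for row in grid]


def depth_map_error(original, reconstructed, resolution=4):
    """Mean absolute per-pixel difference between the two depth maps.

    A larger value means the reconstruction deviates more from the input,
    which a novelty detector could treat as a higher novelty score.
    """
    before = depth_map(original, resolution)
    after = depth_map(reconstructed, resolution)
    total = sum(
        abs(p - q)
        for row_before, row_after in zip(before, after)
        for p, q in zip(row_before, row_after)
    )
    return total / (resolution * resolution)
```

A practical version would likely render several viewpoints and use a perspective camera; the single-view orthographic case is kept here only to make the comparison step easy to follow.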

Abstract:
Contrastive multi-view clustering is widely recognized for its effectiveness in mining feature representations across views via contrastive learning (CL), and it has gained significant attention in recent years. Most existing methods focus on feature-level and/or cluster-level CL, but two shortcomings remain. First, feature-level CL is limited by the influence of anomalies and heavily noisy data, resulting in insufficient mining of discriminative feature representations. Second, cluster-level CL lacks the guidance of global information and is always restricted by local diversity information. In this paper, we Learn dUal enhanCed rEpresentation for Contrastive Multi-view Clustering (LUCE-CMC) to effectively address the above challenges. The method mainly contains two parts, i.e., enhanced feature-level CL (En-FeaCL) and enhanced cluster-level CL (En-CluCL). Specifically, we first adopt a shared encoder to learn shared feature representations between multiple views and then obtain cluster-relevant information that is beneficial to the clustering results. Moreover, we design a reconstitution approach to force the model to concentrate on learning features that are critical to reconstructing the input data, reducing the impact of noisy data and maximizing the sufficient discriminative information of different views in helping the En-FeaCL part. Finally, instead of contrasting the view-specific clustering results as most existing methods do, in the En-CluCL part we make the cluster-level information richer by contrasting the cluster assignment from each view with the cluster assignment obtained from the shared fused features. The end-to-end training methods of the proposed model are mutually reinforcing and beneficial. Extensive experiments conducted on multi-view datasets show that the proposed LUCE-CMC outperforms established baselines to a considerable extent. The source code is released at https://github.com/ShizheHu.

Abstract:
Open-ended VideoQA presents a significant challenge due to the absence of fixed options, requiring the identification of the correct answer from a vast pool of candidate answers. Previous approaches typically apply a classifier or similarity comparison to the fused features to yield a prediction directly, lacking coarse-to-fine filtering over the numerous candidates. Gradually refining the probability distribution of candidates can achieve more precise prediction. Thus, we propose the DiffAns model, which integrates the diffusion model to handle the open-ended VideoQA task, simulating the gradual process by which humans answer open-ended questions. Specifically, we first diffuse the true answer label into a random distribution (forward process). Then, under the guidance of an answer-aware condition generated from the video and question, the model iteratively denoises to obtain the correct probability distribution (backward process). This equips the model with the capability to progressively refine the random probability distribution of candidates, ultimately predicting the correct answer. We conduct experiments on three challenging open-ended VideoQA datasets, surpassing existing SoTA methods. Extensive experiments further explore and analyse the impact of each module, as well as the design of the diffusion model, demonstrating the effectiveness of DiffAns. Our code is available at https://github.com/WanJJJh/DiffAns.
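
The forward process mentioned above, in which the true answer label is diffused into a random distribution, can be sketched with the standard Gaussian formulation used by DDPM-style models. The abstract does not state the noise schedule or the label encoding, so the linear schedule, the one-hot encoding, and the name `forward_diffuse` are assumptions made for illustration only.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_t) * x0, (1 - a_t) * I),
    where a_t is the cumulative product of (1 - beta) up to step t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

num_candidates, num_steps = 10, 1000
betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule

x0 = np.zeros(num_candidates)
x0[3] = 1.0  # one-hot label: candidate 3 is the true answer (assumed encoding)

rng = np.random.default_rng(0)
x_early = forward_diffuse(x0, 0, betas, rng)             # almost clean label
x_late = forward_diffuse(x0, num_steps - 1, betas, rng)  # almost pure noise
print(x_early.argmax(), np.cumprod(1.0 - betas)[-1])
```

At the first step the label is still clearly recoverable, while at the last step the signal coefficient is close to zero. The backward process would then train a network to reverse these steps conditioned on the video and question features.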

Abstract:
Semi-supervised medical image segmentation has gained increasing attention due to its potential to alleviate the manual annotation burden. Mainstream methods typically involve two subnets and impose a consistency objective to ensure that they produce consistent predictions for unlabeled data. However, they often ignore that the complementarity of model predictions is equally crucial. To realize the potential of the multi-subnet architecture, we propose a novel cross-view mutual learning method with a two-branch co-training framework. Specifically, we first introduce a novel conflict-based feature learning (CFL) that encourages the two subnets to learn distinct features from the same input. These distinct features are then decoded into complementary model predictions, allowing both subnets to understand the input from different views. More importantly, we propose a cross-view mutual learning (CML) to maximize the effectiveness of CFL. This approach requires only modifications to the model inputs and supervisory signals, and implements a heterogeneous consistency objective to fully explore the complementarity of model predictions. Consequently, the aggregated predictions can effectively capture both consistency and complementarity across two subnets. Experimental results on three public datasets demonstrate the superiority of CML over previous SoTA methods. Code is available at https://github.com/SongwuJob/CML.

Abstract:
Whole-slide image (WSI) classification methods play a crucial role in tumor diagnosis. Most of them use hematoxylin and eosin (H&E) stained images, while immunohistochemistry (IHC) staining provides molecular markers and protein expression information that highlights cancer regions. However, obtaining IHC-stained images incurs higher costs in practice. In this work, we propose a multi-modal denoising diffusion pre-training framework that harnesses the advantages of IHC staining to learn visual representations. The framework is trained with the H&E-to-IHC re-staining task and the IHC-stained image reconstruction task, which helps capture the structural similarity and staining difference between the two image modalities. The trained model can then provide IHC-guided features by taking only H&E-stained images as inputs. In addition, we build a new class-constraint contrastive loss to achieve semantic consistency between the dual-modal features from our pre-training framework. To integrate with WSI classifiers based on multi-instance learning, we further propose a bag feature augmentation strategy to extend bags with the features extracted by our pre-trained model. Experimental results on three datasets show that our pre-training framework effectively improves WSI classification and surpasses the state-of-the-art pre-training approaches. Code and model are released via https://github.com/lhaof/MDDP

Abstract:
The human brain has the capability to associate an unknown person's voice and face by leveraging their general relationship, referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the noisy speech inputs issue. To balance and enhance the intra-modal feature learning and inter-modal correlation understanding, MFV-KSD utilizes a novel three-stage training strategy. Our experimental results demonstrated robust performance, achieving the first rank in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%. Details can be found in https://github.com/TaoRuijie/MFV-KSD.
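
For readers unfamiliar with the metric reported above, the Equal Error Rate is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from verification scores, by sweeping thresholds and taking the point where the two rates are closest. The scores and labels are made-up toy values, not FAME challenge data, and the function name `equal_error_rate` is hypothetical.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping every observed score as a threshold.

    scores: higher means "more likely the same identity".
    labels: True for genuine (matching) pairs, False for impostor pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(scores):
        accept = scores >= thr
        far = np.mean(accept[~labels])  # impostor pairs wrongly accepted
        frr = np.mean(~accept[labels])  # genuine pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

# Toy face-voice matching scores (illustrative only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

A lower EER is better; a system that separates genuine and impostor pairs perfectly reaches an EER of 0.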

Abstract:
While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, there is still a noticeable gap in the availability of such datasets within the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. FLIP-80M is constructed by leveraging the large openly available image-text-pair dataset LAION-5B and a mixed-method approach to filter face-related pairs from both visual and linguistic perspectives. Our curation process involves face detection, face caption classification, text de-noising, and synthesis-based image augmentation. As a result, FLIP-80M stands as the largest face-text dataset to date. To evaluate the potential of our dataset, we fine-tune the CLIP model using the proposed FLIP-80M, to create FLIP (Facial Language-Image Pretraining) and assess its representation capabilities across various downstream tasks. Our experiments demonstrate that our FLIP model achieves state-of-the-art results in a range of face analysis tasks, including face parsing, face alignment, and face attribute classification. The dataset and models are available at https://github.com/ydli-ai/FLIP.

Abstract:
Pansharpening is an important technique for remote sensing imaging systems to obtain high-resolution multispectral images. Existing deep learning-based methods mostly rely on using pseudo-groundtruth multi-spectral images for supervised learning. The whole training process only remains at the scale of reduced resolution, which means that the impact of the degradation process is ignored and high-quality images cannot be guaranteed at full resolution. To address the challenge, we propose a new unsupervised framework that does not rely on pseudo-groundtruth but uses the invariance of the degradation process to build a consistent loss function on the original scale for network training. Specifically, we first introduce the operator learning method to build an exact mapping function from multi-spectral to panchromatic images and decouple both spectral and texture features. Then, through joint training, operators and convolutional networks can learn the spatial degradation process and spectral degradation process at full resolution, respectively. By introducing them to build consistency constraints, we can train the pansharpening network at the original full resolution. Our approach can be applied to existing pansharpening methods, improving their usability on original data, which matches practical application requirements. The experimental results on different kinds of satellite datasets demonstrate that the proposed network outperforms state-of-the-art methods both visually and quantitatively. Our code is available at https://github.com/quycruin/Qvac.

Abstract:
Image customization involves learning the subject from provided concept images and generating it within textual contexts, typically yielding alterations of attributes such as style or background. Prevailing methods primarily rely on fine-tuning techniques, wherein a unified latent embedding is employed to characterize various concept attributes. However, attribute entanglement makes it difficult for customized results to escape the influence of subject-irrelevant attributes (e.g., style and background). To overcome these issues, we propose Equilibrated Diffusion, an innovative method that achieves equilibrated image customization by decoupling entangled concept attributes from a frequency-aware perspective, thus harmonizing textual and visual consistency. Unlike conventional approaches that employ a shared latent embedding and tuning process to learn a concept, our Equilibrated Diffusion draws inspiration from the correlation between high- and low-frequency components with image style and content, decomposing the concept accordingly in the frequency domain. Through independently optimizing concept embeddings in the frequency domain, the denoising model not only enriches its comprehension of style attributes irrelevant to subject identity but also inherently augments its aptitude for accommodating novel stylized descriptions. Furthermore, by combining different frequency embeddings, our model retains the spatially original customization capability. We further design a diffusion process guided by subject masks to alleviate the influence of background attributes, thereby strengthening text alignment. To ensure subject-related information consistency, Residual Reference Attention (RRA) is incorporated into the denoising model of spatial attention computation, effectively preserving structural details. 
Experimental results demonstrate that Equilibrated Diffusion surpasses other competitors with better subject consistency while closely adhering to text descriptions, thus validating the superiority of our approach. The code is available at https://github.com/maple-research-lab/EqDiff.
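
The frequency-aware view described in this abstract rests on splitting an image into low- and high-frequency components. The sketch below shows one generic way to do that with a circular low-pass mask in the Fourier domain. It is not the decomposition used by Equilibrated Diffusion, which the abstract does not specify; the ideal low-pass filter, the cutoff `radius`, and the function name `split_frequencies` are assumptions made for illustration.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a 2D grayscale image into low- and high-frequency parts.

    An ideal circular low-pass mask of the given radius is applied to the
    centered spectrum; the high-frequency part is the residual, so that
    low + high reproduces the input.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    high = image - low
    return low, high

rng = np.random.default_rng(0)
image = rng.random((64, 64))
low, high = split_frequencies(image, radius=8)
print(low.shape, high.shape)
```

The low-frequency part carries the coarse layout of the image, while the residual carries edges and fine texture.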

Abstract:
Remote photoplethysmography (rPPG) measurement aims to estimate physiological signals by analyzing subtle skin color changes induced by heartbeats in facial videos. Existing methods primarily rely on the fundamental video frame features or vanilla facial ROI (region of interest) features. Recognizing the varying light absorption and reactions of different facial regions over time, we adopt a new perspective to conduct a more fine-grained exploration of the key clues present in different facial regions within each frame and across temporal frames. Concretely, we propose a novel clustering-driven remote physiological measurement framework called Cluster-Phys, which employs a facial ROI prototypical clustering module to adaptively cluster the representative facial ROI features as facial prototypes and then update facial prototypes with highly semantic correlated base ROI features. In this way, our approach can mine facial clues from a more compact and informative prototype level rather than the conventional video/ROI level. Furthermore, we also propose a spatial-temporal prototype interaction module to learn facial prototype correlation from both spatial (across prototypes) and temporal (within prototype) perspectives. Extensive experiments are conducted on both intra-dataset and cross-dataset tests. The results show that our Cluster-Phys achieves significant performance improvement with less computation consumption. The source code will be available at https://github.com/VUT-HFUT/ClusterPhys.

Abstract:
Multimodal Emotion Recognition (MER) may encounter incomplete multimodal scenarios caused by sensor damage or privacy protection in practical applications. Existing incomplete multimodal learning methods focus on learning better joint representations across modalities. However, our investigation shows that they fall short in learning the unimodal representations, which are rather discriminative as well. Instead, we propose a novel framework named Mixture of Modality Knowledge Experts (MoMKE) with two-stage training. In unimodal expert training, each expert learns the unimodal knowledge from the corresponding modality. In experts mixing training, both unimodal and joint representations are learned by leveraging the knowledge of all modality experts. In addition, we design a special Soft Router that can enrich the modality representations by dynamically mixing the unimodal representations and the joint representations. Various incomplete multimodal experiments on three benchmark datasets showcase the robust performance of MoMKE, especially under severely incomplete conditions. Visualization analysis further reveals the considerable value of unimodal and joint representations. Code is released at https://github.com/wxxv/MoMKE.

Abstract:
Camera relocalization is the challenging task of estimating camera pose within a known scene, with wide applications in Virtual Reality (VR), Augmented Reality (AR), robotics, etc. Most existing learning-based methods invariably utilize all the information within an image for pose estimation. Although these methods have demonstrated leading pose accuracy in some cases, they still fall short of handling challenging viewpoints robustly without sacrificing localization accuracy on viewpoints that are easier to localize. In this paper, we propose a novel two-branch camera pose estimation framework: one branch utilizes keypoint-guided partial scene coordinate regression, while the other employs full scene coordinate regression to assess the credibility of image poses, thereby enabling more accurate camera localization. In particular, we devise a keypoint selection method predicated on matching rates, which is designed to measure the matching quality between a 3D keypoint and 2D keypoints across views. With these selected 3D keypoints, we can generate a 2D supervision mask with the ground-truth camera pose to supervise the keypoint prediction from the keypoint selection network. Meanwhile, we further refine the 2D supervision mask through optimization with reprojection errors on the scene coordinate network, which estimates the scene coordinates for points within the scene that truly warrant attention and also enhances the localization performance. We also introduce a gated camera pose estimation strategy on the two-branch pose estimation framework, employing an updated keypoint selection network for images with higher credibility and a more robust network for difficult viewpoints. By adopting an effective curriculum learning scheme, we achieve higher accuracy within a training span of just 20 minutes. Our method's superior performance is validated through rigorous experimentation. 
The code is released at https://github.com/DUT-ICCD/KP-Guided-Reloc.

Abstract:
Domain Adaptive Object Detection (DAOD) aims to improve the adaptation of the detector to the unlabeled target domain using the labeled source domain. Recent advances leverage a self-training framework to enable a student model to learn the target domain knowledge using pseudo labels generated by a teacher model. Despite great successes, such category-level consistency supervision suffers from poor-quality pseudo labels and fails to fully explore the contextual target domain knowledge. To mitigate the problem, we propose a stochastic context consistency reasoning network with the self-training framework. Firstly, we introduce a stochastic complementary masking module (SCM) to generate complementary masked images, thus preventing the network from over-relying on specific visual clues. Secondly, we design an inter-changeable context consistency reasoning module (Inter-CCR), which constructs an inter-context consistency paradigm to capture the texture and contour details in the target domain by aligning the predictions of the student model for complementary masked images. Meanwhile, we develop an intra-changeable context consistency reasoning module (Intra-CCR), which constructs an intra-context consistency paradigm to strengthen the utilization of context relations by utilizing pseudo labels to supervise the predictions of the student model. Experimental results on three DAOD benchmarks demonstrate that our method outperforms current state-of-the-art methods by a large margin. Code is released at https://github.com/HDUyiming/SOCCER.
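
The complementary masking step described above can be sketched as follows: a random patch-level mask is drawn, and its logical complement gives the second view, so that every pixel is visible in exactly one of the two masked images. The patch size, the keep ratio, and the function name `complementary_masks` are assumptions made for illustration; the abstract does not describe how the SCM module samples its masks.

```python
import numpy as np

def complementary_masks(height, width, patch, ratio, rng):
    """Return a random patch-wise boolean mask and its complement."""
    grid_h = -(-height // patch)  # ceiling division
    grid_w = -(-width // patch)
    keep = rng.random((grid_h, grid_w)) < ratio
    # Expand each grid cell to a patch x patch block, then crop to size.
    mask = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    mask = mask[:height, :width]
    return mask, ~mask

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
mask, comp = complementary_masks(64, 64, patch=16, ratio=0.5, rng=rng)

view_a = image * mask[..., None]  # first masked image
view_b = image * comp[..., None]  # complementary masked image
print(mask.mean(), comp.mean())
```

Because the two views never share a visible patch, asking the student model to agree on both forces it to rely on context rather than on any single visual clue.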

Abstract:
Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only limited information, which is insufficient to capture the visual details in images. As a result, such vision-semantics alignment is inherently biased, leading to suboptimal integration outcomes. In this paper, we avoid such biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the few-shot encoder and CLIP's vision encoder on the same image. This alignment is accomplished through a linear projection layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/zhuolingli/FewVS.

Abstract:
Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGBT tracking aims to leverage information from RGB and TIR images for stable and robust tracking. However, existing RGBT tracking methods face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in the cross-modal interaction. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, which is a simplified cross-modal interaction network including modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction. Specifically, our approach first extracts modality-shared features from RGB and TIR features with a cross-attention mechanism. Furthermore, we design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens. We evaluate our model on three widely-used benchmark datasets, and extensive experiments demonstrate that our method achieves state-of-the-art performance. All the source code is released at https://github.com/Liqiu-Chen/IIMF.

Abstract:
Image Aesthetic Quality Assessment (IAQA) aims to simulate users' visual perception to judge the aesthetic quality of images. In social media, users' aesthetic experiences are often reflected in their textual comments regarding the aesthetic attributes of images. To fully explore the attribute information perceived by users for evaluating image aesthetic quality, this paper proposes an image aesthetic quality assessment method based on attribute-driven multimodal hierarchical prompts. Unlike existing IAQA methods that utilize multimodal pre-training or straightforward prompts for model learning, the proposed method leverages attribute comments and quality-level text templates to hierarchically learn the aesthetic attributes and quality of images. Specifically, we first leverage users' aesthetic attribute comments to perform prompt learning on images. The learned attribute-driven multimodal features can comprehensively capture the semantic information of image aesthetic attributes perceived by users. Then, we construct text templates for different aesthetic quality levels to further facilitate prompt learning through semantic information related to the aesthetic quality of images. The proposed method can explicitly simulate users' aesthetic judgment of images to obtain more precise aesthetic quality. Experimental results demonstrate that the proposed IAQA method based on hierarchical prompts outperforms existing methods significantly on multiple IAQA databases. Our source code is public at https://github.com/GitHub-Ju/AMHP.

Abstract:
Cross-area evaluation poses a significant challenge for ground-to-aerial geo-localization, in which the training and testing data are captured from entirely distinct areas. However, current methods struggle in cross-area evaluation due to their emphasis solely on learning global information from single-scale features. Some efforts alleviate this problem but rely on complex and specific technologies like pre-processing and hard sample mining. To this end, we propose a pure end-to-end solution, free from task-specific techniques, termed the Multi-scale Feature Representation Generalization Network (MFRGN) to improve generalization. Specifically, we introduce multi-scale features and explicitly utilize them through a novel global-local information representation structure with two flows to bolster feature representations. In the global flow, we present a lightweight Self and Cross Attention Module (SCAM) to efficiently learn global embeddings. In the local flow, we develop a Global-Prompt Attention Block (GPAB) to capture discriminative features with the global embeddings as prompts. As a result, our approach generates robust descriptors representing multi-scale global and local information, thereby enhancing the model's invariance to scene variations. Extensive experiments on benchmarks show that our MFRGN achieves competitive performance in same-area evaluation and improves cross-area generalization by a significant margin compared to SOTA methods. Our code is available at https://github.com/ytao-wang/MFRGN.

Abstract:
Blind Face Restoration (BFR) aims to restore high-quality face images from low-quality images with unknown degradation. Previous GAN-based or ViT-based methods have shown promising results but lose identity details when degradation is severe, while recent diffusion-based methods work at the image level and are slow at inference. To restore images with any degradation type at high quality while spending less time than classic diffusion-based methods, we propose LD-BFR, a novel BFR framework that integrates the strengths of both vector quantization and latent diffusion. First, we employ a Dual Cross-Attention vector quantization to restore the degraded image in a global manner. Then we utilize the restored high-quality quantized feature as the guidance in our latent diffusion model to generate high-quality restored images with rich details. With the help of the proposed high-quality feature injection module, our LD-BFR effectively injects the high-quality feature as a condition to guide the generation of our latent diffusion model. Extensive experiments demonstrate the superior performance of our model over the SOTA BFR methods. The code is available at: https://github.com/YuzhenD/LD-BFR.git

Abstract:
Change detection identifies differences between images captured at different times. Real-world change detection faces challenges posed by the diverse and intricate nature of change areas, while current datasets and algorithms are often limited to simpler, consistent changes, reducing their effectiveness in practical applications. Existing dual-branch methods process images independently, risking the loss of change information due to insufficient early interaction. In contrast, single-stream approaches, though improving early integration, lack efficacy in capturing complex changes. To address these limitations, we introduce a novel single-stream framework, the Multi-scale Change-Aware Transformer (MCAT), which features the Dynamic Change-Aware Attention module and the Multi-scale Change-Enhanced Aggregator. The Dynamic Change-Aware Attention module, integrating local self-attention and cross-temporal attention, conducts dynamic iteration on image differences, thereby targeting feature extraction of change areas. The Multi-scale Change-Enhanced Aggregator enables the model to adapt to various scales and complex shapes through local change enhancement and multi-scale aggregation strategies. To overcome the limitations of existing datasets regarding the scale diversity and morphological complexity of change areas, we construct the Mining Area Change Detection dataset. The dataset offers a diverse array of change areas that span multiple scales and exhibit complex shapes, providing a robust benchmark for change detection. Extensive experiments demonstrate that our model outperforms existing methods, especially for irregular and multi-scale changes. Code and dataset are available at https://github.com/chh11/MCAT.

Abstract:
With the rise of new 3D representations like NeRF and 3D Gaussian splatting, creating realistic 3D scenes is easier than ever before. However, the incompatibility of these 3D representations with existing editing software has also introduced unprecedented challenges to 3D editing tasks. Although recent advances in text-to-image generative models have made some progress in 3D editing, these methods either lack precision or require users to manually specify the editing areas in 3D space, complicating the editing process. To overcome these issues, we propose Edit3D, an innovative 3D editing method designed to enhance editing quality. Specifically, we propose a multi-turn editing framework and introduce an attention-driven open-set segmentation (ADSS) technique within this framework. ADSS allows for more precise segmentation of parts, which enhances the editing precision and minimizes interference with pixels in areas that are not being edited. Additionally, we propose a fine-tuning phase, intended to further improve the overall editing quality without compromising the training efficiency. Experiments demonstrate that Edit3D effectively adjusts 3D scenes based on textual instructions. Through continuous and multiple turns of editing, it achieves more intricate combinations, enhancing the diversity of 3D editing effects. Code is available at https://github.com/PeterouZh/Edit3D.

Abstract:
The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video, image-to-video generation, video editing, and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and necessitate substantial computation resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control and prevents the realization of some specific camera controls, such as various camera movements in films. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving objects region based on the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions together, giving our model more controllable and flexible camera control. 
Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control. Project Page: https://sjtuplayer.github.io/projects/MotionMaster.

Abstract:
In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has predominantly concentrated on the training paradigms tailored for high-quality resources. However, owing to the challenges inherent in real-world data collection, audio-visual data are frequently affected by modality distortion, which encompasses audio-visual asynchrony, video noise and audio noise. The recognition accuracy of existing AVSR methods is significantly compromised when multiple modality distortions coexist in low-resource data. In light of the above challenges, we propose PCD: cluster-Prompt with Contrastive Decomposition, a robust framework for modality-distortion speech recognition, specifically devised to transfer the pre-trained knowledge from the high-resource domain to the targeted domain by leveraging contrast-augmented prompts. In contrast to previous studies, we take into consideration the possibility of various types of distortion in both the audio and visual modalities. Concretely, we design bespoke prompts to delineate each modality distortion, guiding the model to achieve speech recognition applicable to various distortion scenarios with very few learnable parameters. To materialize the prompt mechanism, we employ multiple cluster-based strategies that better suit the pre-trained audio-visual model. Additionally, we design a contrastive decomposition mechanism to restrict the explicit relationships among various modality conditions, given their shared task knowledge and disparate modality priors. Extensive results on the LRS2 dataset demonstrate that PCD achieves state-of-the-art performance for audio-visual speech recognition under the constraints of distorted resources. Code is available at https://github.com/ballooncatt/PCD.

Abstract:
Multimodal medical data, such as brain scans and non-imaging clinical records like demographics and neuropsychology examinations, play an important role in diagnosing neurodegenerative disorders, e.g., Alzheimer's disease (AD) and Parkinson's disease (PD). However, the disease-relevant information is overwhelmed by the high-dimensional image scans and the massive non-imaging data, making it a challenging task to fuse multimodal medical inputs efficiently. Recent multimodal learning methods adopt deep encoders to extract features and simple concatenation or alignment techniques for feature fusion, which suffer from representation degeneration due to the vast amount of irrelevant information. To address this challenge, we propose a deep self-weighted multimodal relevance weighting approach, which leverages clustering-based contrastive learning and eliminates intra- and inter-modal irrelevancy. The learned relevance score is integrated as a gate with a multimodal attention transformer to provide an improved fusion for the final diagnosis. Our proposed model, called SMART (Self-weighted Multimodal Attention-and-Relevance gated Transformer), is extensively evaluated on three public AD/PD datasets and achieves state-of-the-art (SOTA) performance in the diagnostics of neurodegenerative disorders. Our source code is available at https://github.com/Qybc/SMART.

Abstract:
With global occurrences of crowd crushes and stampedes, dense crowd simulation has been drawing great attention. In this research, our goal is to simulate dense crowd motions under six classic motion patterns, more specifically, to generate subsequent motions of dense crowds from the given initial states. Since dense crowds share similarities with fluids, such as continuity and fluidity, one common approach for dense crowd simulation is to construct hydrodynamics-based models, which consider dense crowds as fluids, guide crowd motions with Navier-Stokes equations, and conduct dense crowd simulation by solving governing equations. Despite the proposal of these models, dense crowd simulation faces multiple challenges, including the difficulty of directly solving Navier-Stokes equations due to their nonlinear nature, the neglect of distinctive crowd characteristics which fluids lack, and the gaps in the evaluation and validation of crowd simulation models. To address the above challenges, we build a hydrodynamic model, which captures the crowd physical properties (continuity, fluidity, etc.) with Navier-Stokes equations and reflects the crowd social properties (sociality, personality, etc.) with operators that describe crowd interactions and crowd-environment interactions. To tackle the computational problem, we propose to solve the governing equation based on Navier-Stokes equations using neural networks, and introduce the Hydrodynamics-Informed Neural Network (HINN), which preserves the structure of the governing equation in its network architecture. To facilitate the evaluation, we construct a new dense crowd motion video dataset called Dense Crowd Flow Dataset (DCFD), containing six classic motion patterns (line, curve, circle, cross, cluster and scatter) and 457 video clips, which can serve as the ground truth for various objective metrics. Numerous experiments are conducted using HINN to simulate dense crowd motions under six motion patterns with video clips from DCFD. Objective evaluation metrics that concern authenticity, fidelity and diversity demonstrate the superior performance of our model in dense crowd simulation compared to other simulation models. Our code and dataset are available at https://github.com/shanshan-zys/HINN.

Abstract:
Multi-Domain Recommendation (MDR) aims to leverage data from multiple domains to enhance recommendations through overlapping users or items. However, extreme overlap sparsity in some applications makes it challenging for existing multi-domain models to capture domain-shared information. Moreover, the sparse overlapping users or items result in a cold start problem in every single domain and hinder feature space alignment of different domains, posing a challenge for joint optimization across domains. Nevertheless, in multi-domain short video recommendation, we identify two key characteristics that can greatly alleviate the overlap sparsity issue and enable domain alignment. (1) The following relations between users and publishers exhibit strong preferences and a concentration effect, as popular video publishers, who constitute a small portion of all users, are followed by a majority of users across various domains. (2) The tag tree structure shared by all videos can help facilitate multi-grained alignment across multiple domains. Based on these characteristics, we propose tag tree-guided multi-grained alignment with publisher enhancement for multi-domain video recommendation. Our model integrates publisher and tag nodes into the user-video bipartite graph as central nodes, enabling user and video alignment across all domains via graph propagation. Then, we propose a tag tree-guided decomposition method to obtain hierarchical graphs for multi-grained alignment. Further, we design tree-guided contrastive learning methods to capture the intra-level and inter-level node relations respectively. Finally, extensive experiments on two real-world short video recommendation datasets demonstrate the effectiveness of our model. Our code is available at https://github.com/17231087/TGMAPE.git.

Abstract:
Temporal Sentence Grounding (TSG), which aims to localize events in untrimmed videos with a given language query, has been widely studied in the last decades. However, researchers have recently demonstrated that previous approaches are severely limited in out-of-distribution generalization, and have thus proposed the De-biased TSG (DTSG) challenge, which requires models to overcome their weakness on outlier test samples. In this paper, we design a novel framework, termed Counterfactually-Augmented Event Matching (CAEM), which incorporates counterfactual data augmentation to learn event-query joint representations that resist the training bias. Specifically, it consists of three components: (1) A Temporal Counterfactual Augmentation module that generates counterfactual video-text pairs by temporally delaying events in the untrimmed video, enhancing the model's capacity for counterfactual thinking. (2) An Event-Query Matching model that is used to learn joint representations and predict corresponding matching scores for each event candidate. (3) A Counterfact-Adaptive Framework (CAF) that incorporates the counterfactual consistency rules on the matching process of the same event-query pairs, further mitigating the bias learned from training sets. We conduct thorough experiments on two widely used DTSG datasets, i.e., Charades-CD and ActivityNet-CD, to evaluate our proposed CAEM method. Extensive experimental results show our proposed CAEM method outperforms recent state-of-the-art methods on all datasets. Our implementation code is available at https://github.com/CFM-MSG/CAEM_Code.

Abstract:
While current CNN-based low-light image enhancement (LIE) approaches have achieved significant progress, they often fail to generate better perceptual quality, which requires restoring better details and more natural colors. To address these problems, we set a new path, called PercepLIE, by presenting the VQGAN with Multi-luminance Detail Compensation (MDC) and Global Color Adjustment (GCA). Specifically, observing that the latent light features of low-light images are quite different from those captured in normal light, we utilize VQGAN to explore the latent light representation of normal-light images to help the estimation of the low-light and normal-light mapping. Furthermore, we employ Gamma correction with varying Gamma values on the gradient to create multi-luminance details, forming the basis for our MDC module to facilitate better detail estimation. To optimize the colors of low-light input images, we introduce a simple yet effective GCA module that is based on spatially-varying representation between the estimated normal-light images in this module and low-light inputs. By combining the VQGAN with MDC and GCA within a stage-wise training mechanism, our method generates images with finer details and natural colors and achieves favorable performance on both synthetic and real-world datasets in terms of perceptual quality metrics including NIQE, PI, and LPIPS. The source codes will be made available at https://github.com/supersupercong/PercepLIE.
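As a rough illustration of the Gamma-correction step mentioned above, the following sketch applies several Gamma values to one row of values normalized to [0, 1]. It is not the PercepLIE implementation: the function names, the chosen Gamma values, and the use of a plain list as a stand-in for a normalized gradient map are all assumptions made for this example.

```python
def gamma_correct(values, gamma):
    """Apply v ** gamma to values already normalized to [0, 1]."""
    return [v ** gamma for v in values]


def multi_luminance(values, gammas=(0.5, 1.0, 2.0)):
    """Return one corrected copy per Gamma value (hypothetical helper).

    Gamma < 1 brightens dark values, Gamma > 1 darkens them, and
    Gamma == 1 leaves the input unchanged.
    """
    return {g: gamma_correct(values, g) for g in gammas}


# Stand-in for one row of a normalized gradient map (assumed data).
row = [0.0, 0.04, 0.25, 1.0]
variants = multi_luminance(row)
```

Each entry of `variants` emphasizes detail in a different luminance range, which is the general idea behind building several luminance versions of the same input.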

Abstract:
Reconstructing garments from monocular videos has attracted considerable attention as it provides a convenient and low-cost solution for clothing digitization. In reality, people wear clothing with countless variations and multiple layers. Existing studies attempt to extract garments from a single video. They either generalize poorly due to reliance on limited clothing templates or struggle to handle the intersections of multi-layered clothing, leading to a lack of physical plausibility. Besides, there are inevitable and undetectable overlaps in a single video that hinder researchers from modeling complete and intersection-free multi-layered clothing. To address the above limitations, in this paper, we propose a novel method to reconstruct multi-layered clothing from multiple monocular videos sequentially, which surpasses existing work in generalization and robustness against penetration. For each video, neural fields are employed to implicitly represent the clothed body, from which the meshes with frame-consistent structures are explicitly extracted. Next, we implement a template-free method for extracting a single garment by back-projecting the image segmentation labels of different frames onto these meshes. In this way, multiple garments can be obtained from these monocular videos and then aligned to form the whole outfit. However, intersection always occurs due to overlapping deformation in the real world and perceptual errors in monocular videos. To this end, we innovatively introduce a physics-aware module that combines neural fields with a position-based simulation framework to fine-tune the penetrating vertices of garments, robustly ensuring intersection-free results. Additionally, we collect a mini dataset with fashionable garments to evaluate the quality of clothing reconstruction comprehensively. We release our code and data at https://github.com/SMY19999/IF-Garments.

Abstract:
Generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), frequently produce samples that are indistinguishable from genuine ones, posing challenges for individuals in discerning between real and synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voice signals presents significant challenges to privacy and security. In the field of deepfake audio detection, the majority of models achieving higher detection accuracy currently employ self-supervised pre-trained models. However, with the ongoing development of deepfake audio generation algorithms, maintaining high discrimination accuracy against new algorithms grows more challenging. To enhance the sensitivity of deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. Specifically, utilizing the pre-trained XLS-R enables our model to extract diverse audio features from its various layers, each providing distinct discriminative information. Utilizing the SLS classifier, our model captures sensitive contextual information across different layer levels of audio features, effectively employing this information for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with a specific Equal Error Rate (EER) of 1.92% on the ASVspoof 2021 DF dataset and 7.46% on the In-the-Wild dataset. Codes and data can be found at https://github.com/QiShanZhang/SLSforADD.
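For readers unfamiliar with the Equal Error Rate reported above, the sketch below shows one simple way to estimate it from detector scores: sweep a threshold, and report the operating point where the false acceptance rate and the false rejection rate are closest. The function name, the convention that higher scores mean "bona fide", and the toy scores are assumptions for illustration; this is not the evaluation script used by the authors or by ASVspoof.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a sample is accepted as bona fide when its
    score is >= the threshold.
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance: spoofed audio accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio rejected as spoofed.
        frr = sum(b < t for b in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (made up for this example).
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(bona_fide, spoof)
```

On these toy scores one bona fide sample and one spoofed sample are misclassified at the best threshold, so both error rates equal one quarter.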

Abstract:
Screening similar but non-target images in text-based image retrieval is crucial for pinpointing the user's desired images accurately. However, conventional methods mainly focus on enhancing text-image matching performance, often failing to identify images that exactly match the retrieval intention because of the query quality. User-provided queries frequently lack adequate information for screening similar but non-target images, especially when the target database (DB) contains numerous similar images. Therefore, a novel approach is needed to extract valuable information from users for effective screening. In this paper, we propose a DB question generation (DQG) model to enhance exact cross-modal image retrieval performance. Our DQG model learns to generate effective questions that precisely screen similar but non-target images using DB content information. By answering only the questions generated by our model, users can reach their desired images even within DBs with similar content. Experimental results on publicly available datasets show that our proposed approach can significantly improve exact cross-modal image retrieval performance. Code will be publicly available at https://github.com/yanarin/DQG.

Abstract:
Image Aesthetic Assessment (IAA) aims to objectively predict generic or personalized evaluations of aesthetic or fine-grained multi-attributes, based on visual or multimodal inputs. Previously, researchers have designed diverse and specialized methods for specific IAA tasks, based on different input-output situations. Is it possible to design a universal IAA framework applicable to the whole IAA task taxonomy? In this paper, we explore this issue and propose a modular IAA framework, dubbed AesMamba. Specifically, we use the Visual State Space Model (VMamba), instead of CNNs or ViTs, to learn comprehensive representations of aesthetic-related attributes, because VMamba can efficiently achieve both global and local effective receptive fields. Afterward, a modal-adaptive module is used to automatically produce the integrated representations, conditioned on the type of input. In the prediction module, we propose a Multitask Balanced Adaptation (MBA) module to boost task-specific features, with emphasis on the tail instances. Finally, we formulate the personalized IAA task as a multimodal learning problem by converting a user's anonymous subject characters to a text prompt. This prompting strategy effectively employs the semantics of flexibly selected characters for inferring individual preferences. AesMamba can be applied to diverse IAA tasks through flexible combination of these modules. Extensive experiments on numerous datasets demonstrate that AesMamba consistently achieves superior or competitive performance on all IAA tasks in comparison with previous SOTA methods. The code has been released at https://github.com/AiArt-Gao/AesMamba.

Abstract:
Inscriptions on ancient steles, as carriers of culture, encapsulate the humanistic thoughts and aesthetic values of our ancestors. However, these relics often deteriorate due to environmental and human factors, resulting in significant information loss. Since the advent of inscription rubbing technology over a millennium ago, archaeologists and epigraphers have devoted immense effort to manually restoring these cultural imprints, endeavoring to unlock the storied past within each rubbing. This paper approaches this challenge as a multi-modal task, aiming to establish a novel benchmark for the inscription restoration from rubbings. In doing so, we construct the Chinese Inscription Rubbing Image (CIRI) dataset, which includes a wide variety of real inscription rubbing images characterized by diverse calligraphy styles, intricate character structures, and complex degradation forms. Furthermore, we develop a synthesis approach to generate "intact-degraded" paired data, mirroring real-world degradation faithfully. On top of the datasets, we propose a baseline framework that achieves visual consistency and textual integrity through global and local diffusion-based restoration processes and explicit incorporation of domain knowledge. Comprehensive evaluations confirm the effectiveness of our pipeline, demonstrating significant improvements in visual presentation and textual integrity. The project is available at: https://github.com/blackprotoss/CIRI.

Abstract:
Temporal Automatic White Balance (TAWB) corrects the color cast within each frame, while ensuring consistent illumination across consecutive frames. Unlike conventional AWB, TAWB has received limited research attention for an extended period. However, the growing popularity of short-form videos has increased focus on video color experiences. To further advance research on TAWB, we aim to address the bottlenecks associated with datasets, models, and benchmarks. 1) Dataset challenge: Currently, only one TAWB dataset (BCC), captured with a single camera, is available. It lacks temporal continuity due to challenges in capturing realistic illuminations and dynamic raw data. In response, we meticulously designed an acquisition strategy based on the actual distribution pattern of illuminations and created a comprehensive TAWB dataset named CTA comprising 6 cameras that offer 12K continuous illuminations. Furthermore, we employed video frame interpolation techniques, extending the captured static raw data into dynamic form and ensuring continuous illumination. 2) Model challenge: Both of the prevailing TAWB methods rely on LSTM. However, the fixed gating mechanism of LSTM often fails to adapt to varying content or illuminations, resulting in unstable illumination estimation. In response, we propose CTANet, which integrates cross-frame attention and RepViT for self-adjustment to content and illumination variations. Additionally, the mobile-friendly design of RepViT enhances the portability of CTANet. 3) Benchmark challenge: To date, there is no benchmark of TAWB methods across illumination and camera types. To address this, we establish a benchmark by conducting a comparative analysis of 8 cutting-edge AWB and TAWB methods with CTANet across 3 typical illumination scenes and 7 cameras from two representative datasets. The dataset and code are available at https://github.com/ChunxiaoLe/CTA-Dataset.

Affiliations: School of Information Science and Technology, Haixi Institutes of Chinese Academy of Sciences, Hangzhou Normal University, China ; School of Digital Media Technology, Hangzhou Dianzi University, China ; School of Computer Science and Engineering, Tianjin University of Technology, China ; Quanzhou Institute of Equipment Manufacturing, China ; the Department of Technology, Management and Economics, Technical University of Denmark, Denmark ; Trustworthy and General AI Lab, School of Engineering, Westlake University

Abstract:
Locating lesions is the primary goal of colonoscopy examinations. 3D perception techniques can enhance the accuracy of lesion localization by restoring 3D spatial information of the colon. However, existing methods focus on the local depth estimation of a single frame and neglect the precise global positioning of the colonoscope, thus failing to provide the accurate 3D location of lesions. The root cause of this shortfall is twofold: Firstly, existing methods treat colon depth and colonoscope pose estimation as independent tasks or design them as parallel sub-task branches. Secondly, the light source in the colon environment moves with the colonoscope, leading to brightness fluctuations among consecutive frames. To address these two issues, we propose ColVO, a novel deep learning-based Visual Odometry framework, which can continuously estimate colon depth and colonoscopic pose using two key components: a deep couple strategy for depth and pose estimation (DCDP) and a light consistent calibration mechanism (LCC). DCDP utilizes multimodal fusion and loss function constraints to couple the depth and pose estimation modes, ensuring seamless alignment of geometric projections between consecutive frames. Meanwhile, LCC accounts for brightness variations by recalibrating the luminosity values of adjacent frames, enhancing ColVO's robustness. A comprehensive evaluation of ColVO on colon odometry benchmarks reveals its superiority over state-of-the-art methods in depth and pose estimation. We also demonstrate two valuable applications: immediate polyp localization and complete 3D reconstruction of the intestine. The code for ColVO is available at https://github.com/HNUicda/CoIVO.

Abstract:
Remarkable progress has been made in hyperspectral image (HSI) denoising. However, the majority of existing methods are predominantly confined to the spatial-spectral domain, overlooking the untapped potential inherent in the Fourier domain. This paper presents a novel approach to address HSI denoising by bridging the information from the Fourier and spatial-spectral domains. Our method highlights key insights into the Fourier properties within spatial and spectral domains through the Fourier transform. Specifically, we note that the amplitude inherently embodies noise and photon reflection characteristics, while the phase holds structural information. These insights unveil new perspectives on the physical properties of HSIs, motivating us to leverage complementary information exchange between Fourier and spatial-spectral domains. To this end, we introduce the Fourier-prior Integration Denoising Network (FIDNet), a potent yet straightforward approach that utilizes Fourier insights to synergistically interact with spatial-spectral domains for superior HSI denoising. In FIDNet, we independently extract spatial and Fourier features through dual branches and merge these representations to enhance spectral evolution modeling through the inherent structure consistency constraints and continuing reflection variation revealed in the Fourier prior. Our proposed method demonstrates robust generalization across synthetic and real-world benchmark datasets, and achieves results comparable with state-of-the-art methods in both quantitative quality and visual results. The code is available at https://github.com/MIV-XJTU/FIDNet.
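The amplitude and phase decomposition that the abstract refers to can be sketched on a short 1D signal with a naive discrete Fourier transform. This is only a minimal illustration using Python's standard library, not FIDNet code: the helper names and the toy signal are assumptions, and a real pipeline would use a fast FFT over image or spectral axes.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(), including the 1/n normalization."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def split_amplitude_phase(spectrum):
    """Separate each coefficient into its magnitude and its angle."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine_amplitude_phase(amplitude, phase):
    """Rebuild complex coefficients from magnitude and angle."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# Toy signal (assumed data).
signal = [1.0, 2.0, 0.0, -1.0]
amplitude, phase = split_amplitude_phase(dft(signal))
recovered = [c.real for c in idft(combine_amplitude_phase(amplitude, phase))]
```

Recombining the two parts and inverting the transform recovers the original signal up to floating-point error, which is what makes it possible to process amplitude and phase separately and then merge them.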

Abstract:
Existing multi-modal image fusion algorithms are typically designed for high-quality images and fail to tackle degradation (e.g., low light, low resolution, and noise), which restricts image fusion from unleashing the potential in practice. In this work, we present Degradation-Robust Multi-modality image Fusion (DRMF), leveraging the powerful generative properties of diffusion models to counteract various degradations during image fusion. Our critical insight is that generative diffusion models driven by different modalities and degradation are inherently complementary during the denoising process. Specifically, we pre-train multiple degradation-robust conditional diffusion models for different modalities to handle degradations. Subsequently, the diffusion priori combination module is devised to integrate generative priors from pre-trained uni-modal models, enabling effective multi-modal image fusion. Extensive experiments demonstrate that DRMF excels in infrared-visible and medical image fusion, even under complex degradations. Our code is available at https://github.com/Linfeng-Tang/DRMF.

Abstract:
Continual learning has emerged as a framework that trains the model on a sequence of tasks without forgetting previously learned knowledge, and it has been applied in multiple multimodal scenarios. Recently, prompt-based continual learning has achieved excellent domain adaptability and knowledge transfer through prompt generation. However, existing methods mainly focus on designing the architecture of a generator, neglecting the importance of providing effective guidance for training the generator. To address this issue, we propose Generating Prompts in Latent Space (GPLS), which considers prompts as latent variables to account for the uncertainty of prompt generation and aligns with the fact that prompts are inserted into the hidden layer outputs and exert an implicit influence on classification. GPLS adopts a trainable encoder to encode task and feature information into prompts with the reparameterization technique, and provides refined and targeted guidance for the training process through the evidence lower bound (ELBO) related to Mahalanobis distance. Extensive experiments demonstrate that GPLS achieves state-of-the-art performance on various benchmarks. Our code is available at https://github.com/Hifipsysta/GPLS.

Abstract:
Recently, dynamic convolution has shown a performance boost for CNN-related networks in medical image segmentation. The core idea is to replace the static convolutional kernel with a linear combination of multiple convolutional kernels, conditioned on an input-dependent attention function. However, the existing dynamic convolution design suffers from two limitations: i) The convolutional kernels are weighted by enforcing a single-dimensional attention function upon the input maps, overlooking the synergy in multi-dimensional information. This results in sub-optimal computations of convolution kernels. ii) The linear kernel aggregation is inefficient, restricting the model's capacity to learn more intricate patterns. In this paper, we rethink the dynamic convolution design to address these limitations and propose multi-dimensional aggregation dynamic convolution (MAGIC). Specifically, our MAGIC introduces a dimensional-reciprocal fusion module to capture correlations among input maps across the spatial, channel, and global dimensions simultaneously for computing convolutional kernels. Furthermore, we design a kernel recalculation module, which enhances the efficiency of aggregation through learning the interaction between kernels. As a drop-in replacement for regular convolution, our MAGIC can be flexibly integrated into prevalent pure CNN or hybrid CNN-Transformer backbones. Extensive experiments on four benchmarks demonstrate that our MAGIC outperforms regular convolution and existing dynamic convolution. Code is available at: https://github.com/Segment82/MAGIC.
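The baseline idea described above, a kernel formed as an attention-weighted linear combination of several candidate kernels, can be sketched in 1D as follows. This illustrates generic dynamic convolution only, not the MAGIC modules: the attention here is a softmax over hypothetical per-kernel scores derived from the input mean, and all names and values are assumptions for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_kernel(x, kernels, score_weights):
    """Mix candidate kernels with input-dependent attention weights.

    The attention score of kernel i is score_weights[i] * mean(x),
    a deliberately simple stand-in for a learned attention function.
    """
    mean = sum(x) / len(x)
    attention = softmax([w * mean for w in score_weights])
    size = len(kernels[0])
    mixed = [sum(a * k[i] for a, k in zip(attention, kernels)) for i in range(size)]
    return mixed, attention


def conv1d_valid(x, kernel):
    """Sliding dot product without padding (cross-correlation, as in most DL libraries)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]


# Toy input and two candidate kernels (assumed data).
x = [1.0, 2.0, 3.0, 4.0]
kernels = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
mixed, attention = dynamic_kernel(x, kernels, score_weights=[0.0, 0.0])
output = conv1d_valid(x, mixed)
```

With zero score weights the attention is uniform, so the mixed kernel is simply the average of the candidates; non-zero weights make the mixture depend on the input.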

Abstract:
Driving scene topology reasoning aims to understand the objects present in the current road scene and model their topology relationships to provide guidance information for downstream tasks. Previous approaches fail to adequately facilitate interactions among traffic objects and neglect to incorporate scene information into topology reasoning, thus limiting the comprehensive exploration of potential correlations among objects and diminishing the practical significance of the reasoning results. Besides, the lack of constraints on lane direction may introduce erroneous guidance information and lead to a decrease in topology prediction accuracy. In this paper, we propose a novel topology reasoning framework, dubbed TSTGT, to address these issues. Specifically, we design a divide-and-conquer topology graph Transformer to infer the lane-lane and lane-traffic topology relationships separately, which can effectively aggregate the local and global object information in the driving scene and facilitate the topology relationship learning. Additionally, a traffic scene-assisted reasoning module is devised and combined with the topology graph Transformer to enhance the practical significance of lane-traffic topology. In terms of lane detection, we develop a point-wise matching strategy to infer lane centerlines with correct directions, thereby improving the topology reasoning accuracy. Extensive experimental results on the Openlane-V2 benchmark validate the superiority of our TSTGT over state-of-the-art methods and the effectiveness of our proposed modules. The code is available at https://github.com/rongfu-dsb/TSTGT.

Abstract:
3D reconstruction remains a pivotal topic in contemporary multimedia research, with multi-view stereo (MVS) methods being instrumental. Traditional MVS algorithms have demonstrated robust performance in large-scale scenes. However, these methods often struggle with depth estimation in non-textured regions, resulting in inconsistencies across different areas of the reconstructed 3D model. This paper introduces a novel MVS framework, General Sampling Non-local Texture Features (GST-MVS), designed to enhance depth estimation by leveraging the intrinsic relationship between image texture and object depth. Our approach optimizes the 3D reconstruction process, yielding superior results, especially in low-textured regions. Comprehensive evaluations on the ETH3D dataset indicate that GST-MVS achieves dense and accurate 3D point cloud reconstructions, outperforming existing MVS algorithms. The GST-MVS framework is open-source and accessible on GitHub at https://github.com/Jasmine-tjy/GST-mvs. Additionally, a Docker image providing the required execution environment is available at https://hub.docker.com/r/tangjas111/cmake_cuda_opencv/tags.

Abstract:
Deep video compression has attracted increasing attention in recent years due to its end-to-end optimization ability. However, most existing neural video compression (NVC) models focus on incorporating sophisticated motion or residual coding networks for successive frames leveraging spatial-temporal redundancy removal, neglecting the efficient motion representation and essential structure of scaled prediction for motion dynamics. To resolve this problem, this paper proposes a novel model, named scaled hierarchical bi-directional prediction structure, which effectively captures temporal correlation among frames while considering the quality variation when managing the reference frames. This paper first introduces parameter-shared motion codecs and efficient information fusion strategies to obtain predictive features more precisely. Subsequently, scaled motions from temporal contexts are learned as a bi-directional prior for motion representation. Additionally, the concept of trustworthy motion modeling is proposed to represent the effectiveness of reference information, measuring the reliability of predictive accuracy in complex motions, camera rotations and occlusions. Extensive experimental results demonstrate that our approach offers significant advantages over state-of-the-art bi-directional NVC models in coding efficiency. The proposed method has been adopted as the latest reference model by the Moving Picture, Audio, and Data Coding by Artificial Intelligence (MPAI) end-to-end video coding (EEV) standard. The code is available at: https://github.com/yefeng00/DVC_with_Scaled_Hierarchical_Bi_directional_Motion_Model.

Abstract:
The Visual Spatial Description Challenge (VSD) is the first competition event focused on visual spatial understanding, organized under the auspices of the ACM Multimedia Conference 2024. The goal of the VSD challenge is to assess the ability of models and systems to comprehend spatial concepts, relationships and other semantics from a scene presented with visual appearance. The VSD challenge provides two benchmark datasets for three subtasks, i.e., visual spatial relationship classification, single spatial description generation, and open-ended spatial description generation. The challenge details are available at https://lllogen.github.io/vsd-challenge.github.io/.

Abstract:
Complaints are pivotal expressions within e-commerce communication, yet the intricate nuances of human interaction present formidable challenges for AI agents to grasp comprehensively. While recent attention has been drawn to analyzing complaints within a multimodal context, relying solely on text and images is insufficient for organizations. The true value lies in the ability to pinpoint complaints within the intricate structures of discourse, scrutinizing them at a granular aspect level. Our research delves into the discourse structure of e-commerce video-based product reviews, pioneering a novel task we term Aspect-Level Complaint Detection from Discourse (ACDD). Embedded in a multimodal framework, this task entails identifying aspect categories and assigning complaint/non-complaint labels at a nuanced aspect level. To facilitate this endeavour, we have curated a unique multimodal product review dataset, meticulously annotated at the utterance level with aspect categories and associated complaint labels. To support this undertaking, we introduce a Multimodal Aspect-Aware Complaint Analysis (MAACA) model that incorporates a novel pre-training strategy and a global feature fusion technique across the three modalities. Additionally, the proposed framework leverages a moment retrieval step to identify the relevant portion of the clip, crucial for accurately detecting the fine-grained aspect categories and conducting aspect-level complaint detection. Extensive experiments conducted on the proposed dataset showcase that our framework outperforms unimodal and bimodal baselines, offering valuable insights into the application of video-audio-text representation learning frameworks for downstream tasks. The dataset and code are available at: https://github.com/rdev12/MAACA.

Abstract:
Domain generalization 3D segmentation aims to learn the point clouds with unknown distributions. Feature augmentation has been proven to be effective for domain generalization. However, each point of the 3D segmentation scene contains uncertainty in the target domain, which affects model generalization. This paper proposes the Domain Generalization-Aware Uncertainty Introspective Learning (DGUIL) method, including Potential Uncertainty Modeling (PUM) and Momentum Introspective Learning (MIL), to deal with the point uncertainty in domain shift. Specifically, PUM explores the underlying uncertain point cloud features and generates the different distributions for each point. The PUM enhances the point features over an adaptive range, which provides various information for simulating the distribution of the target domain. Then, MIL is designed to learn generalized feature representation in uncertain distributions. The MIL utilizes uncertainty correlation representation to measure the predicted divergence of knowledge accumulation, which learns to carefully judge and understand divergence through uncertainty introspection loss. Finally, extensive experiments verify the advantages of the proposed method over current state-of-the-art methods. The code will be available at https://github.com/ChicalH/DGUIL.

Abstract:
Achieving 3D hand-object pose estimation in interaction scenarios is challenging due to the severe occlusion generated during the interaction. Existing methods address this issue by utilizing the correlation between the hand and object poses as additional cues. They usually first extract the hand and object features from their respective regions and then refine them with each other. However, this paradigm disregards the role of a broad range of image context. To address this problem, we propose a novel and robust approach that learns a broad range of context by imposing priors. First, we build this approach using stacked transformer decoder layers. These layers are required for extracting image-wide context and regional hand or object features by constraining cross-attention operations. We share the context decoder layer parameters between the hand and object pose estimations to avoid interference in the context-learning process. This imposes a prior, indicating that the hand and object are mutually the most important context for each other, significantly enhancing the robustness of obtained context features. Second, since they play different roles, we provide customized feature maps for the context, hand, and object decoder layers. This strategy facilitates the disentanglement of these layers, reducing the feature learning complexity. Finally, we conduct extensive experiments on the popular HO3D and Dex-YCB databases. The experimental results indicate that our method significantly outperforms state-of-the-art approaches and can be applied to other hand pose estimation tasks. Code is available at https://github.com/zskuang58/LCP.

Abstract:
The fusion of visible and infrared images aims to produce high-quality fusion images with rich textures and salient target information. Existing methods lack interactivity and flexibility in the execution of fusion. It is unfeasible to express the requirements to modify the fusion effect, and the different regions in the source images are treated equally across the identical fusion model, which causes fusion homogenization and low distinction. Besides, their pre-defined fusion strategies invariably lead to monotonous effects, which are insufficiently comprehensive. They fail to adequately consider data credibility, scene illumination, and noise degradation inherent in the source information. To address these issues, we propose the Text-driven and Region-aware Flexible visible and infrared image fusion, termed as TeRF. On the one hand, we propose a flexible image fusion framework with multiple large language and vision models, which facilitates the visual-text interaction. On the other hand, we aggregate comprehensive fine-tuning paradigms for the different fusion requirements to build a unified fine-tuning pipeline. It allows the linguistic selection of the regions and effects, yielding visually appealing fusion outcomes. Extensive experiments demonstrate the competitiveness of our method both qualitatively and quantitatively compared to existing state-of-the-art methods. Our code is publicly available at https://github.com/Baixuzx7/TeRF.

Abstract:
Images corrupted by rain streaks often lose vital frequency information for perception, and image deraining aims to solve this problem, which relies on global and local degradation modeling. Recent studies have demonstrated the effectiveness and efficiency of Mamba for perceiving global and local information by exploiting local correlation among patches. However, few attempts have been made to extend it with frequency analysis for image deraining, which limits its ability to perceive global degradation that is relevant to frequency modeling (e.g., Fourier transform). In this paper, we propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining. The core of our method lies in extending Mamba with frequency analysis from two perspectives: extending it with frequency bands for exploiting frequency correlation, and connecting it with the Fourier transform for global degradation modeling. Specifically, FreqMamba introduces complementary triple interaction structures including spatial Mamba, frequency-band Mamba, and Fourier global modeling. Frequency-band Mamba decomposes the image into sub-bands of different frequencies to allow 2D scanning from the frequency dimension. Furthermore, leveraging Mamba's unique data-dependent properties, we use rainy images at different scales to provide degradation priors to the network, thereby facilitating efficient training. Extensive experiments show that our method outperforms state-of-the-art methods both visually and quantitatively. Our code is available at: https://github.com/aSleepyTree/FreqMamba.

Abstract:
The recent advancements in cross-modal transformers have demonstrated their superior performance in RGB-D segmentation tasks by effectively integrating information from both RGB and depth modalities. However, existing methods often overlook the varying levels of informative content present in each modality, treating them equally and using models of the same architecture. This oversight can potentially hinder segmentation performance, especially considering that RGB images typically contain significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation based approach that focuses on guided multimodal fusion, with an emphasis on leveraging the primary RGB modality. In our approach, we utilize a model trained exclusively on the RGB modality as the teacher, guiding the learning process of a student model that fuses both RGB and depth modalities. To prioritize information from the primary RGB modality while leveraging the depth modality, we incorporate primary focused feature reconstruction and a selective alignment scheme. This integration enhances the overall feature fusion, resulting in improved segmentation results. We evaluate our proposed method on the NYU Depth V2 and SUN-RGBD datasets, and the experimental results demonstrate the effectiveness of PrimKD. Specifically, our approach achieves mIoU scores of 57.8 and 52.5 on these two datasets, respectively, surpassing existing counterparts by 1.5 and 0.4 mIoU. The code is available at https://github.com/xiaoshideta/PrimKD.

Abstract:
The perception of image aesthetics is built upon the understanding of semantic content. However, how to evaluate the aesthetic quality of images with diversified semantic backgrounds remains challenging in image aesthetics assessment (IAA). To address the dilemma, this paper presents a semantics-aware image aesthetics assessment approach, which first analyzes the semantic content of images and then models the aesthetic distinctions among images from two perspectives, i.e., aesthetic attribute and aesthetic level. Concretely, we propose two strategies, dubbed tag matching and contrastive ranking, to extract knowledge pertaining to image aesthetics. The tag matching identifies the semantic category and the dominant aesthetic attributes based on predefined tag libraries. The contrastive ranking is designed to uncover the comparative relationships among images with different aesthetic levels but similar semantic backgrounds. In the process of contrastive ranking, the impact of long-tailed distribution of aesthetic data is also considered by balanced sampling and traversal contrastive learning. Extensive experiments and comparisons on three benchmark IAA databases demonstrate the superior performance of the proposed model in terms of both prediction accuracy and alleviating long-tailed effect. The code will be public at https://github.com/yzc-ippl/TMCR.

Abstract:
Volume electron microscopy (vEM) is becoming a prominent technique in three-dimensional (3D) cellular visualization. vEM collects a series of two-dimensional (2D) images and reconstructs ultrastructures at the nanometer scale by rational axial interpolation between neighboring sections. However, section damage inevitably occurs in the sample preparation and imaging process as a result of manual operational errors or occasional mechanical failures. The damaged regions present blurry and contaminated structure information, or even local blank holes. Despite significant progress in single-image inpainting, it is still a great challenge to recover missing biological structures that satisfy 3D structural continuity among sections. In this paper, we propose an optical flow-based serial section inpainting architecture to effectively combine the 3D structure information from neighboring sections and 2D image features from surrounding regions. We design a two-stage reference generation strategy to predict a rational and detailed intermediate state image from coarse to fine. Then, a GAN-based inpainting network is adopted to integrate all reference information and guide the restoration of missing structures, while ensuring consistent distribution of pixel values across the 2D image. Extensive experimental results demonstrate the superiority of our method over existing inpainting tools. Our code is available at https://github.com/chengyr1999/FlowInpaint/.

Abstract:
Knowledge Tracing (KT) is a critical service in distance education, predicting students' future performance based on their responses to learning resources. The reasonable assessment of the knowledge state, along with accurate response prediction, is crucial for KT. However, existing KT methods prioritize fitting results and overlook attention to the problem-solving process. They equate the knowledge students memorize before problem-solving with the knowledge that can be acquired or applied during problem-solving, leading to dramatic fluctuations in knowledge states between mastery and non-mastery, with low interpretability. This paper explores knowledge transformation in problem-solving and proposes an interpretable model, Problem-solving Knowledge Tracing (PSKT). Specifically, we first present a knowledge-centered problem representation that enhances its expression by adjusting problem variability. Then, we meticulously designed a Sequential Neural Network (SNN) with three stages: (1) Before problem-solving, we model students' personalized problem space and simulate their acquisition of problem-related knowledge through a gating mechanism. (2) During problem-solving, we evaluate knowledge application and calculate response with a four-parameter IRT. (3) After problem-solving, we quantify student knowledge internalization and forgetting using an incremental indicator. The SNN, inspired by problem-solving and constructivist learning theories, is an interpretable model that attributes learner performance to subjective problems (difficulty, discrimination), objective knowledge (knowledge acquisition and application), and behavior (guessing and slipping). Experimental results show PSKT's advantages in prediction accuracy, reasonable knowledge state assessment, and learning process explanation. The code is available at https://github.com/Oia-10/PSKT.
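
The four-parameter IRT mentioned above can be illustrated with the standard textbook four-parameter logistic (4PL) response function. The sketch below is a generic formulation, not the PSKT implementation; the function name `irt_4pl`, the argument names, and the mapping of the upper asymptote to "1 minus slip probability" are assumptions made for illustration.

```python
import math


def irt_4pl(theta, discrimination, difficulty, guess, upper):
    """Probability of a correct response under a generic 4PL IRT model.

    theta          -- learner ability (hypothetical scalar knowledge state)
    discrimination -- slope of the curve around the difficulty point
    difficulty     -- ability level at which the curve is halfway up
    guess          -- lower asymptote (chance of guessing correctly)
    upper          -- upper asymptote (assumed here to be 1 - slip probability)
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guess + (upper - guess) * logistic


if __name__ == "__main__":
    # When ability equals difficulty, the result is the midpoint of the asymptotes.
    print(irt_4pl(theta=0.0, discrimination=1.0, difficulty=0.0, guess=0.2, upper=0.9))
```

In this form, the difficulty and discrimination parameters describe the problem, while the two asymptotes bound the response probability from below (guessing) and above (slipping).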

Abstract:
Phrase Grounding (PG) aims to locate objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding from seen categories to unseen ones) have been proposed, respectively. However, for real-world applications these two approaches are limited due to limited annotations and a limited set of categories during training. In this paper, we propose a framework of zero-shot PG under weak supervision. Specifically, our PG framework is built on triple alignment strategies. Firstly, we propose a region-text alignment (RTA) strategy to build region-level attribute associations via CLIP. Secondly, we propose a domain alignment (DomA) strategy by minimizing the difference between distributions of seen classes in the training and those of the pre-training. Thirdly, we propose a category alignment (CatA) strategy by considering both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and weakly-supervised methods. Our code is available at https://github.com/LinPengyue/ZS-WSG.

Abstract:
The heterogeneity of medical images poses significant challenges to accurate disease diagnosis. To tackle this issue, the impact of such heterogeneity on the causal relationship between image features and diagnostic labels should be incorporated into model design, which however remains underexplored. In this paper, we propose a mixed prototype correction for causal inference (MPCCI) method, aimed at mitigating the impact of unseen confounding factors on the causal relationships between medical images and disease labels, so as to enhance the diagnostic accuracy of deep learning models. The MPCCI comprises a causal inference component based on front-door adjustment and an adaptive training strategy. The causal inference component employs a multi-view feature extraction (MVFE) module to establish mediators, and a mixed prototype correction (MPC) module to execute causal interventions. Moreover, the adaptive training strategy incorporates both information purity and maturity metrics to maintain stable model training. Experimental evaluations on four medical image datasets, encompassing CT and ultrasound modalities, demonstrate the superior diagnostic accuracy and reliability of the proposed MPCCI. The code will be available at https://github.com/Yajie-Zhang/MPCCI.
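
For reference, the front-door adjustment that the causal inference component builds on has a standard textbook form. The formula below is the generic identity, not the MPCCI derivation; the symbols $X$ (image features), $M$ (mediator), and $Y$ (diagnostic label) are assumed names chosen to match the abstract's description.

```latex
P\bigl(Y = y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P\bigl(Y = y \mid M = m, X = x'\bigr)\, P(X = x')
```

The outer sum passes through the mediator, and the inner sum averages over the feature distribution, which is what allows the effect of $X$ on $Y$ to be estimated without observing the confounders directly.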

Abstract:
Deep learning-based all-in-one image restoration methods have garnered significant attention in recent years due to their capability of addressing multiple degradation tasks. These methods focus on extracting task-oriented information to guide the unified model and have achieved promising results through elaborate architecture design. They commonly adopt a simple mix training paradigm, and the proper optimization strategy for all-in-one tasks has been scarcely investigated. This oversight neglects the intricate relationships and potential conflicts among various restoration tasks, consequently leading to inconsistent optimization rhythms. In this paper, we extend and redefine the conventional all-in-one image restoration task as a multi-task learning problem and propose a straightforward yet effective active-reweighting strategy, dubbed Art, to harmonize the optimization of multiple degradation tasks. Art is a plug-and-play optimization strategy designed to mitigate hidden conflicts among multi-task optimization processes. Through extensive experiments on a diverse range of all-in-one image restoration settings, Art has been demonstrated to substantially enhance the performance of existing methods. When incorporated into the AirNet and TransWeather models, it achieves average improvements of 1.16 dB and 1.21 dB on PSNR, respectively. We hope this work will provide a principled framework for collaborating multiple tasks in all-in-one image restoration and pave the way for more efficient and effective restoration models, ultimately advancing the state-of-the-art in this critical research domain. Code and pre-trained models are available at our project page https://github.com/Aitical/Art.
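
To put the reported PSNR gains in context, the sketch below computes PSNR from its usual definition and converts a gain in dB into the corresponding reduction in mean squared error. It is a minimal illustration on flat pixel sequences, not the authors' evaluation code; the function names and the 8-bit peak value of 255 are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have no error
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([10, 20, 30], [12, 18, 33]))
    print(mse_ratio_for_gain(1.16))
```

Because PSNR is logarithmic in the error, a gain of about 1.16 dB corresponds to the MSE dropping to roughly 77% of its previous value.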

Abstract:
Estimating 3D human poses from monocular images is an important research area with many practical applications. However, the depth ambiguity of 2D solutions limits their accuracy in actions where occlusion exists or where slight centroid shifts can result in significant 3D pose variations. In this paper, we introduce a novel multimodal approach to mitigate the depth ambiguity inherent in monocular solutions by integrating spatial-aware pressure information. We first establish a data collection system with a pressure mat and a monocular camera, and construct a large-scale multimodal human activity dataset comprising over 600,000 frames of motion data. Utilizing this dataset, we propose a pressure image reconstruction network to extract pressure priors from monocular images. Subsequently, we introduce a Transformer-based multimodal pose estimation network to combine pressure priors with monocular images, achieving a world mean per joint position error of 51.6mm, outperforming state-of-the-art methods. Extensive experiments demonstrate the effectiveness of our multimodal 3D human pose estimation method across various actions and joints, highlighting the significance of spatial-aware pressure in improving the accuracy of monocular-vision-based methods. Our dataset is available at: https://github.com/LishuangZhan/SATPose.
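
The mean per joint position error quoted above is commonly computed as the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that generic definition for a single pose; it assumes both poses are already expressed in the same world coordinate frame with no root alignment, and the function name `mpjpe` is illustrative rather than taken from the authors' code.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between corresponding 3D joints.

    Both arguments are sequences of (x, y, z) tuples in the same frame;
    the result is in the same unit as the inputs (e.g. millimetres).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same joint count")
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(mpjpe(pred, gt))  # joint errors are 0 and 5, so the mean is 2.5
```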

Abstract:
As an important task in multimodal information extraction, Multimodal Named Entity Recognition (MNER) has recently attracted considerable attention. One key challenge of MNER lies in the lack of sufficient fine-grained annotated data, especially in low-resource scenarios. Although data augmentation is a widely used technique to tackle the above issue, it is challenging to simultaneously generate synthetic text-image pairs and their corresponding high-quality entity annotations. In this work, we propose a novel Generative Multimodal Data Augmentation (GMDA) framework for MNER, which contains two stages: Multimodal Text Generation and Multimodal Image Generation. Specifically, we first transform each annotated sentence into a linearized labeled sequence, and then train a Label-aware Multimodal Large Language Model (LMLLM) to generate the labeled sequence based on a label-aware prompt and its associated image. We further employ a Stable Diffusion model to generate the synthetic images that are semantically related to these sentences. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed GMDA framework, which consistently boosts the performance of several competitive methods for two subtasks of MNER in both full-supervision and low-resource settings. The low-resource dataset and source code are released at https://github.com/NUSTM/GMDA.

Abstract:
Multimodal knowledge graph (MKG) reasoning has attracted significant attention since impressive performance has been achieved by adding multimodal auxiliary information (i.e., texts and images) to the entities of traditional KGs. However, existing studies heavily rely on path-based methods for learning structural modality, failing to capture the complex structural interactions among multimodal entities beyond the reasoning path. In addition, existing studies have largely ignored the dynamic impact of different multimodal features on different decision facts for reasoning, which utilize asymmetric coattention to independently learn the static interplay between different modalities without dynamically joining the reasoning process. We propose a novel Dynamic Structure-aware representation learning method, namely DySarl, to overcome this problem and significantly improve the MKG reasoning performance. Specifically, we devise a dual-space multihop structural learning module in DySarl, aggregating the multihop structural features of multimodal entities via a novel message-passing mechanism. It integrates the message paradigms in Euclidean and hyperbolic spaces, effectively preserving the neighborhood information beyond the limited multimodal query paths. Furthermore, DySarl has an interactive symmetric attention module to explicitly learn the dynamic impacts of unimodal attention senders and multimodal attention targets on decision facts through a newly designed symmetric attention component and fact-specific gated attention unit, equipping DySarl with the dynamic associations between the multimodal feature learning and later reasoning. Extensive experiments show that DySarl achieves significantly improved reasoning performance on two public MKG datasets compared with that of the state-of-the-art baselines. Source codes are available at https://github.com/HUSTNLP-codes/DySarl.

Abstract:
Large pre-trained vision-language models like CLIP have shown amazing zero-shot recognition performance. To adapt pre-trained vision-language models to downstream tasks, recent studies have focused on the learnable context + class name paradigm, which learns continuous prompt contexts on downstream datasets. In practice, the learned prompt context tends to overfit the base categories and cannot generalize well to novel categories outside the training data. Recent works have also noticed this problem and have proposed several improvements. In this work, we draw a new insight based on empirical analysis: uninformative class names lead to degraded base-to-novel generalization performance in prompt learning, which is usually overlooked by existing works. Motivated by this, we advocate improving the base-to-novel generalization performance of prompt learning by enhancing the semantic richness of class names. We term our approach the Information Disengagement based Associative Prompt Learning (IDAPL) mechanism, which considers the associative yet decoupled learning of prompt context and class name embedding. IDAPL can effectively alleviate the overfitting of the learnable context to base classes while learning a more informative semantic representation of base classes by fine-tuning the class name embedding, leading to improved performance on both base and novel classes. Experimental results on eleven widely used few-shot learning benchmarks clearly validate the effectiveness of our proposed approach. Code is available at https://github.com/tiggers23/IDAPL.

Abstract:
People nowadays use smartphones to capture photos from multimedia platforms. The presence of moire patterns resulting from spectral aliasing can significantly degrade the visual quality of images, particularly in ultra-high-definition (UHD) images. However, existing demoireing methods have mostly been designed for low-definition images, making them unsuitable for handling moire patterns in UHD images due to their substantial memory requirements. In this paper, we propose a novel patch bilateral compensation network (P-BiC) for the demoire pattern removal in UHD images, which is memory-efficient and prior-knowledge-based. Specifically, we divide the UHD images into small patches and perform patch-level demoireing to maintain the low memory cost even for ultra-large image sizes. Moreover, a pivotal insight, namely that the green channel of an image remains relatively less affected by moire patterns, while the tone information in moire images is still well-retained despite color shifts, is directly harnessed for the purpose of bilateral compensation. The bilateral compensation is achieved by two key components in our P-BiC, i.e., a green-guided detail transfer (G2DT) module that complements distorted features with the intact content, and a style-aware tone adjustment (STA) module for the color adjustment. We quantitatively and qualitatively evaluate the effectiveness of P-BiC with extensive experiments. The code is publicly available at: https://github.com/zeyuxiao1997/P-BiC.

Abstract:
Recent years have witnessed remarkable advances in graph representation learning using Graph Neural Networks (GNNs). To fully exploit the unlabeled graphs, researchers pre-train GNNs on large-scale graph databases and then fine-tune these pre-trained Graph Models (GMs) for better performance in downstream tasks. Because different GMs are developed with diverse pre-training tasks or datasets, they can be complementary to each other for a more complete knowledge base. Naturally, a compelling question is emerging: How can we exploit the diverse knowledge captured by different GMs simultaneously in downstream tasks? In this paper, we make one of the first attempts to exploit multiple GMs to advance the performance in the downstream tasks. More specifically, for homogeneous GMs that share the same model architecture but are obtained with different pre-training tasks or datasets, we align each layer of these GMs and then aggregate them adaptively on a per-sample basis with a tailored Recurrent Aggregation Policy Network (RAPNet). For heterogeneous GMs with different model architectures, we design an alignment module to align the output of diverse GMs and a meta-learner to decide the importance of each GM conditioned on each sample automatically before aggregating the GMs. Extensive experiments in various downstream tasks from 3 domains reveal our dominance over each single GM. Additionally, our methods (UniGM) can achieve better performance with moderate computational overhead compared to alternative approaches including ensemble and model fusion. Also, we verify that our methods are not limited to graph data but could be flexibly applied to multiple modalities. The codes are available at https://github.com/monica309673/UniGM.

Abstract:
Autonomous driving (AD) is a typical application that requires effectively exploiting multimedia information. For AD, it is critical to ensure safety by detecting unknown objects in an open world, driving the demand for open world object detection (OWOD). However, existing OWOD methods treat generic objects beyond known classes in the train set as unknown objects and prioritize recall in evaluation. This encourages excessive false positives and endangers safety of AD. To address this issue, we restrict the definition of unknown objects to threatening objects in AD, and introduce a new evaluation protocol, which is built upon a new metric named U-ARecall, to alleviate biased evaluation caused by neglecting false positives. Under the new evaluation protocol, we re-evaluate existing OWOD methods and discover that they typically perform poorly in AD. Then, we propose a novel OWOD paradigm for AD based on fine-tuning foundational open-vocabulary models (OVMs), as they can exploit rich linguistic and visual prior knowledge for OWOD. Following this new paradigm, we propose a brand-new OWOD solution, which effectively addresses two core challenges of fine-tuning OVMs via two novel techniques: 1) the maintenance of open-world generic knowledge by a dual-branch architecture; 2) the acquisition of scenario-specific knowledge by the visual-oriented contrastive learning scheme. Besides, a dual-branch prediction fusion module is proposed to avoid post-processing and hand-crafted heuristics. Extensive experiments show that our proposed method not only surpasses classic OWOD methods in unknown object detection by a large margin (∼× U-ARecall), but also notably outperforms OVMs without fine-tuning in known object detection (∼ 20% K-mAP). Our codes are available at https://github.com/harrylin-hyl/AD-OWOD.

Abstract:
Human facial reactions play crucial roles in dyadic human-human interactions, where individuals (i.e., listeners) with varying cognitive process styles may display different but appropriate facial reactions in response to an identical behaviour expressed by their conversational partners. While several existing facial reaction generation approaches are capable of generating multiple appropriate facial reactions (AFRs) in response to each given human behaviour, they fail to take human's personalised cognitive process in AFRs generation. In this paper, we propose the first online personalised multiple appropriate facial reaction generation (MAFRG) approach which learns a unique personalised cognitive style from the target human listener's previous facial behaviours and represents it as a set of network weight shifts. These personalised weight shifts are then applied to edit the weights of a pre-trained generic MAFRG model, allowing the obtained personalised model to naturally mimic the target human listener's cognitive process in its reasoning for multiple AFRs generations. Experimental results show that our approach not only largely outperformed all existing approaches in generating more appropriate and diverse generic AFRs, but also serves as the first reliable personalised MAFRG solution. Our code is made available at https://github.com/xk0720/PerFRDiff.

Abstract:
The field of Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is undergoing a paradigm shift, transitioning from specialized models designed for individual tasks to more general retrieval models capable of managing various specialized scenarios. Inspired by the impressive generalization ability of the Contrastive Language-Image Pretraining (CLIP) model, we propose a CLIP-driven universal framework (Dr. CLIP), which leverages prompt learning to guide the synergy between CLIP and ZS-SBIR. Dr. CLIP can cover four variants of ZS-SBIR tasks (inter-category, intra-category, cross-datasets, and generalization). Moreover, we decompose the synergy into classification learning, metric learning, and ranking learning, and introduce three key components to enhance learning effectiveness: i) a forgetting suppression idea is applied to prevent catastrophic forgetting and constrain the feature distribution of the new categories in classification learning; ii) a domain balanced loss is proposed to address sample imbalance and establish effective cross-domain correlations in metric learning; iii) a pair-relation strategy is introduced to capture relevance and ranking relationships between instances in ranking learning. Finally, we reorganize and redivide three coarse-grained datasets and two fine-grained datasets to accommodate the training settings for four ZS-SBIR tasks. The comparison experiments confirm that our method surpasses the state-of-the-art (SOTA) methods by a significant margin (1.95% ~ 19.14% mAP), highlighting its generality and superiority. The code is available at https://github.com/x-28/CDUF.git.

Abstract:
Current state-of-the-art video quality assessment (VQA) models typically integrate various perceptual features to comprehensively represent video quality degradation. These models either directly concatenate features or fuse different perceptual scores while ignoring the domain gaps between cross-aware features, thus failing to adequately learn the correlations and interactions between different perceptual features. To this end, we analyze the independent effects and information gaps of quality- and semantic-aware features on video quality. Based on an analysis of the spatial and temporal differences between two aware features, we propose a semantic-Aware and quality-Aware Interaction Network (A2INet) for blind VQA. For spatial gaps, we introduce a cross-aware guided interaction module to enhance the interaction between semantic- and quality-aware features in a local-to-global manner. Considering temporal discrepancies, we design a cross-aware temporal modeling module to further perceive temporal content variation and quality saliency information, and perceptual features are regressed into quality score by a temporal network and a temporal pooling. Extensive experiments on six benchmark VQA datasets show that our model achieves state-of-the-art performance, and ablation studies further validate the effectiveness of each module. We also present a simple video sampling strategy to balance the effectiveness and efficiency of the model. The code for the proposed method will be released at https://github.com/JianjunXiang/A2INet.

Abstract:
News event discovery refers to the identification and detection of news events using multimodal data on social media. Currently, most works assume that the test set consists of known events. However, in real life, the emergence of new events is more frequent, which invalidates this assumption. In this paper, we propose a Dynamic Augmentation and Entropy Optimization (DAEO) model to address the scenario of generalized news event discovery, which requires the model to not only identify known events but also distinguish various new events. Specifically, we first introduce a multimodal augmentation module, which utilizes adversarial learning to enhance the multimodal representation capability. Secondly, we design an adaptive entropy optimization strategy combined with a self-distillation method, which uses multi-view pseudo-label consistency to improve the model's performance on both known and new events. In addition, we collect a multimodal news event discovery (MNED) dataset of 161,350 samples annotated with 66 real-world events. Extensive experimental results on the MNED dataset demonstrate the effectiveness of our proposed method. Our dataset is available on https://github.com/RetrainIt/MNED.

Abstract:
With the growing diversity of data sources, multi-view learning methods have attracted considerable attention. Among these, by modeling the multi-view data as multi-view graphs, multi-view Graph Neural Networks (GNNs) have shown encouraging performance on various multi-view learning tasks. Message passing is the critical mechanism empowering GNNs with superior capacity to process complex graph data. However, most multi-view GNNs are designed on the well-established overall framework, overlooking the intrinsic challenges of message passing in multi-view scenarios. To clarify this, we first revisit the message passing mechanism from a Laplacian smoothing perspective, revealing the key to designing multi-view message passing. Following the analysis, in this paper, we propose an enhanced GNN framework termed Confluent Graph Neural Networks (CGNN), with Cross-view Confluent Message Passing (CCMP) tailored for multi-view learning. Inspired by the optimization of an improved multi-view Laplacian smoothing problem, CCMP contains three sub-modules that enable the interaction between graph structures and consistent representations, which makes it aware of consistency and complementarity information across views. Extensive experiments on four types of data including multi-modality data demonstrate that our proposed model exhibits superior effectiveness and robustness. The code is available at https://github.com/shumanzhuang/CGNN.
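
The Laplacian smoothing view of message passing mentioned in this abstract can be illustrated with a minimal single-view sketch. It is not the CCMP module itself, whose sub-modules the abstract does not specify. The function name `laplacian_smoothing_step`, the step size `alpha`, and the choice of the symmetrically normalized adjacency with self-loops are all assumptions made for illustration.

```python
import math

def laplacian_smoothing_step(adj, feats, alpha=0.5):
    """One smoothing step X' = X - alpha * L X with L = I - A_hat,
    i.e. X' = (1 - alpha) * X + alpha * A_hat X.

    A_hat = D^{-1/2} (A + I) D^{-1/2} is an assumed normalization;
    adj is a dense adjacency matrix (list of lists), feats is n x d.
    """
    n = len(adj)
    # Add self-loops so every node keeps part of its own signal.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    a_hat = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    dim = len(feats[0])
    out = []
    for i in range(n):
        row = []
        for d in range(dim):
            # Aggregate neighbor features (the "message passing" part).
            agg = sum(a_hat[i][j] * feats[j][d] for j in range(n))
            row.append((1.0 - alpha) * feats[i][d] + alpha * agg)
        out.append(row)
    return out
```

With `alpha=0` the features are returned unchanged, and larger values pull each node's features toward those of its neighbors. A multi-view method would additionally need some rule for combining the per-view graphs, which this sketch leaves out.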

Abstract:
Omnidirectional vision systems provide a 360-degree panoramic view, enabling full environmental awareness in various fields, such as Advanced Driver Assistance Systems (ADAS) and Virtual Reality (VR). Existing omnidirectional stitching methods rely on a single specialized 360-degree camera. However, due to hardware limitations such as high mounting heights and blind spots, adapting these methods to vehicles of varying sizes and geometries is challenging. These challenges include limited generalizability due to the reliance on predefined stitching regions for fixed camera arrays, performance degradation from distance parallax leading to large depth differences, and the absence of suitable datasets with ground truth for multi-camera omnidirectional systems. To overcome these challenges, we propose a novel omnidirectional stitching framework and a publicly available dataset tailored for varying distance scenarios with multiple cameras. The framework, referred to as OmniStitch, consists of a Stitching Region Maximization (SRM) module for automatic adaptation to different vehicles with multiple cameras and a Depth-Aware Stitching (DAS) module to handle depth differences caused by distance parallax between cameras. In addition, we create and release an omnidirectional stitching dataset, called GV360, which provides ground truth images that maintain the perspective of the 360-degree FOV, designed explicitly for vehicle-agnostic systems. Extensive evaluations of this dataset demonstrate that our framework outperforms state-of-the-art stitching models, especially in handling varying distance parallax. The proposed dataset and code are publicly available in https://github.com/tngh5004/Omnistitch.

Abstract:
The significant advancement in face recognition has made face privacy protection a prominent research direction. Unlike de-identification, a recent class of face privacy protection schemes preserves identifiable information for face recognition. However, these schemes fail to support the revocation of a leaked identity, allowing attackers to potentially identify individuals and then pose security threats. In this paper, we explore the possibility of generating privacy-preserving faces (not features) that support cancelable biometric recognition. Specifically, we propose a cancelable face generator (CanFG), which removes the physical identity for privacy protection and embeds a virtual identity for face recognition. In particular, when leaked, the virtual identity can be revoked and renewed as another one, preventing re-identification by attackers. Benefiting from the designed distance-preserving identity transformation, CanFG can guarantee separability and preserve recognizability of virtual identities. Moreover, to make CanFG lightweight, we design a simple but effective training strategy, which allows CanFG to require only one (rather than two) network for achieving stable multi-objective learning. Extensive experimental results and sufficient security analyses demonstrate the ability of CanFG to effectively protect physical identity and support cancelable biometric recognition. Our code is available at https://github.com/daizigege/CanFG.

Abstract:
Single image reflection removal is a severely ill-posed problem, and it is very hard to separate the desirable transmission and undesirable reflection layers. Most existing single image reflection removal methods try to recover the transmission layer by exploiting cues that are extracted only from the given input image. However, there is abundant unutilized information in the form of millions of publicly available reflection-free images. Even though this information is easily available, utilizing it to effectively remove reflections is non-trivial. In this paper, we propose a novel method, termed R^2SFD, for improving single image reflection removal using a Semantic Feature Dictionary (SFD) constructed from a database of reflection-free images. The SFD is constructed using a novel Reflection Aware Feature Extractor (RAFENet) that extracts features invariant to the presence of reflections. The SFD and the input image are then passed to another novel network termed SFDNet. This network first extracts RAFENet features from the reflection-corrupted input image, searches for similar features in the SFD, and transfers the semantic content to generate the final output. To further improve reflection removal, we also introduce a Large Scale Reflection Removal (LSRR) dataset consisting of 2650 image pairs comprising a variety of real-world reflection scenarios. The proposed method achieves superior results both qualitatively and quantitatively compared to the state-of-the-art single image reflection removal methods on real public datasets as well as our LSRR dataset. We will release the dataset at https://github.com/ee19d005/r2sfd.

Abstract:
Speech-driven 3D facial animation has attracted considerable attention due to its extensive applicability across diverse domains. The majority of existing 3D facial animation methods ignore the avatar's expression, while emotion-controllable methods struggle with specifying the avatar's identity and portraying various emotional intensities, resulting in a lack of naturalness and realism in the animation. To address this issue, we first present an Emolib dataset containing 10,736 expression images with eight emotion categories, i.e., neutral, happy, angry, sad, fear, surprise, disgust, and contempt, where each image is accompanied by a corresponding emotion label and a 3D model with expression. Additionally, we present a novel 3D facial animation framework that operates with unpaired training data. This framework produces emotional facial animations aligned with the input face image, effectively conveying diverse emotional expressions and intensities. Our framework initially generates lip-synchronized and expression models separately. These models are then combined using a fusion network to generate face models that effectively synchronize with speech while conveying emotions. Moreover, the mouth structure is incorporated to create a comprehensive face model. This model is then fed into our skin-realistic renderer, resulting in a highly realistic animation. Experimental results demonstrate that our approach outperforms state-of-the-art 3D facial animation methods in terms of realism and emotional expressiveness while also maintaining precise lip synchronization. The Emolib dataset is available at https://github.com/yuminjing/Emolib.git.

Abstract:
Speech-driven 3D facial animation aims to synthesize 3D talking head animations with precise lip movements and rich stylistic expressions. However, existing methods exhibit two limitations: 1) they mostly focused on emotionless facial animation modeling, neglecting the importance of emotional expression, due to the lack of high-quality 3D emotional talking head datasets, and 2) several latest works treated emotional intensity as a global controllable parameter, akin to emotional or speaker style, leading to over-smoothed emotional expressions in their outcomes. To address these challenges, we first collect a 3D talking head dataset comprising five emotional styles with a set of coefficients based on the MetaHuman character model and then propose an end-to-end deep neural network, DEITalk, which conditions on speech and emotional style labels to generate realistic facial animation with dynamic expressions. To model emotional saliency variations in long-term audio contexts, we design a dynamic emotional intensity (DEI) modeling module and a dynamic positional encoding (DPE) strategy. The former extracts implicit representations of emotional intensity from speech features and utilizes them as local (high temporal frequency) emotional supervision, whereas the latter offers abilities to generalize to longer speech sequences. Moreover, we introduce an emotion-guided feature fusion decoder and a four-way loss function to generate emotion-enhanced 3D facial animation with controllable emotional styles. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods. Our video demo and dataset are available at https://github.com/KangShen-seu/DEITalk.

Abstract:
Latent diffusion model has demonstrated impressive efficacy in image generation and editing tasks. Recently, it has also promoted the advancement of image harmonization. However, methods involving latent diffusion model all face a common challenge: the severe image distortion introduced by the VAE component, while image harmonization is a low-level image processing task that relies on pixel-level evaluation metrics. In this paper, we propose Harmony-VAE, leveraging the input of the harmonization task itself to enhance the quality of decoded images. The input involving composite image contains the precise pixel level information, which can complement the correct foreground appearance and color information contained in denoised latents. Meanwhile, the inherent generative nature of diffusion models makes it naturally adapt to inverse image harmonization, i.e. generating synthetic composite images based on real images and foreground masks. We train an inverse harmonization diffusion model to perform data augmentation on two subsets of iHarmony4 and construct a new human harmonization dataset with prominent foreground objects. Extensive experiments demonstrate the effectiveness of our proposed Harmony-VAE and inverse harmonization model. Code and pretrained models are available at https://github.com/nicecv/DiffHarmony.

Abstract:
Video action recognition has been a hot research direction in computer vision, with most existing technologies focusing on coarse-grained macro-action recognition. However, fine-grained action recognition remains challenging. Micro-actions, which are fine-grained, low-intensity, and brief, are crucial for emotion recognition and psychological assessment applications. In this paper, we build on popular video action recognition frameworks as foundation models, introducing multi-auxiliary heads and hybrid loss optimization to advance micro-action recognition. Specifically, the Frame-Level prediction and Coarse-Grained Body-Action auxiliary heads work collaboratively with the Fine-Grained Micro-Action primary head to help the model perceive fine-grained details and capture keyframes. Incorporating F1 loss, ArcFace loss, and weighted multi-task loss improves training stability, convergence speed, and performance. Additionally, integrating the optical flow modality enriches the model's diversity, and we apply ensemble learning across all foundation models. Finally, our method achieves a 75.37% F1-mean on the MA-52 dataset, ranking 1st in the Micro-Action Analysis Grand Challenge in conjunction with ACM MM'24. The code is available at https://github.com/qklee-lz/ACMMM2024-MAC.
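
The abstract names an F1 loss without defining it. One common differentiable surrogate is the "soft F1" loss, sketched below for the binary case under that assumption. The function name `soft_f1_loss` and the smoothing constant `eps` are hypothetical, and the formulation used in the authors' code may differ.

```python
def soft_f1_loss(probs, labels, eps=1e-8):
    """Soft F1 loss for binary targets (an assumed surrogate).

    probs:  predicted probabilities in [0, 1]
    labels: ground-truth labels, each 0 or 1
    Counts of TP/FP/FN are replaced by sums of probabilities, so the
    result varies smoothly with the predictions.
    """
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1
```

Perfect predictions give a loss close to 0, and predictions that miss every positive give a loss of 1. A multi-class variant would typically compute this per class and average, which matches the F1-mean metric in spirit.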

Abstract:
Diffusion models show impressive performances in image generation with excellent perceptual quality. However, its tendency to introduce additional distortion prevents its direct application in image compression. To address the issue, this paper introduces a Consistency Guided Diffusion Model (CGDM) tailored for perceptual image compression, which integrates an end-to-end image compression model with a diffusion-based post-processing network, aiming to learn richer detail representations with less fidelity loss. In detail, the compression and post-processing networks are cascaded and a branch of consistency guided features is added to constrain the deviation in the diffusion process for better reconstruction quality. Furthermore, a Syntax driven Feature Fusion (SFF) module is constructed to take an extra ultra-low bitstream from the encoding end as input, guiding the adaptive fusion of information from the two branches. In addition, we design a globally uniform boundary control strategy with overlapped patches and adopt a continuous online optimization mode to improve both coding efficiency and global consistency. Extensive experiments validate the superiority of our method to existing perceptual compression techniques. Our project is publicly available at: https://ellisonkuang.github.io/CGDM.github.io/.

Abstract:
Current advancements in 3D human pose estimation have attained notable success by converting 2D poses into their 3D counterparts. However, this approach is inherently influenced by the errors introduced by 2D pose detectors and overlooks the intrinsic spatial information embedded within RGB images. To address these challenges, we introduce a versatile module called Adaptive Pose Pooling (APP), which is compatible with many existing 2D-to-3D lifting models. The APP module includes three novel sub-modules: Pose-Aware Offsets Generation (PAOG), Pose-Aware Sampling (PAS), and Spatial Temporal Information Fusion (STIF). First, we extract latent features of the multi-frame lifting model. Then, a 2D pose detector is utilized to extract multi-level feature maps from the image. After that, PAOG generates offsets according to the feature maps, and PAS uses these offsets to sample the feature maps. Finally, STIF fuses the PAS-sampled features with the latent features. This design allows the APP module to simultaneously capture spatial and temporal information. We conduct comprehensive experiments on two widely used datasets, Human3.6M and MPI-INF-3DHP, and employ various lifting models to demonstrate the efficacy of the APP module. Our results show that the proposed APP module consistently enhances the performance of lifting models, achieving state-of-the-art results. Significantly, our module achieves these performance boosts without necessitating alterations to the architecture of the lifting model. Our code is available at https://github.com/jinyanzhang/APP.

Abstract:
Recently, large pre-trained vision-language models, such as CLIP, have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment of high-level language features with fine-level vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which focuses on refining the aforementioned misalignment problem through bidirectional adaptation of both Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). In this way, our approach requires only a simple binary prompt to efficiently accomplish anomaly classification and segmentation tasks in zero-shot scenarios. Furthermore, we introduce its few-shot extension, SimCLIP+, integrating the relational information among vision embeddings and skillfully merging the cross-modal synergy information between vision and language to address downstream anomaly detection tasks. Extensive experiments on two challenging datasets prove the more remarkable generalization capacity of our method compared to the current SOTA approaches. Our code is available at https://github.com/CH-ORGI/SimCLIP.

Abstract:
Online Continual Learning (OCL) aims at learning a model through a sequence of single-pass data, usually encountering the challenges of catastrophic forgetting both between different learning stages and within a stage. Currently, existing OCL methods address these issues by replaying part of previous data but inevitably raise data privacy concerns and stand in contrast to the setting of online learning where data can only be accessed once. Moreover, their performance will dramatically drop without any replay buffer. In this paper, we propose a Non-Exemplar Online Continual Learning method named Progressive Prototype Evolving (PPE). The core of our PPE is to progressively learn class-specific prototypes during the online learning phase without reusing any previously seen data. Meanwhile, the progressive prototypes of the current learning stage, serving as the accumulated knowledge of different classes, are fed back to the model to mitigate intra-stage forgetting. Additionally, to resist inter-stage forgetting, we introduce the Prototype Similarity Preserving and Prototype-Guided Gradient Constraint modules which distill and leverage the historical knowledge conveyed by prototypes to regularize the one-way model learning. Consequently, extensive experiments on three widely used datasets demonstrate the superiority of the proposed PPE against the state-of-the-art exemplar-based OCL approaches. Our code is available at https://github.com/zhoujiahuan1991/MM24-PPE.
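
Learning class prototypes from a single-pass stream without storing past samples can be sketched with an incremental mean. This is only a generic illustration of exemplar-free prototype maintenance, not the PPE update rule, which the abstract does not spell out. The class name `PrototypeBank` and its methods are hypothetical.

```python
class PrototypeBank:
    """Keeps one running-mean prototype per class (assumed update rule)."""

    def __init__(self):
        self.protos = {}  # label -> prototype vector (list of floats)
        self.counts = {}  # label -> number of samples seen so far

    def update(self, label, feature):
        """Fold one feature into its class prototype; the sample itself
        is not stored, so no replay buffer is needed."""
        n = self.counts.get(label, 0)
        if n == 0:
            self.protos[label] = list(feature)
        else:
            proto = self.protos[label]
            # Incremental mean: p <- p + (f - p) / (n + 1)
            self.protos[label] = [p + (f - p) / (n + 1)
                                  for p, f in zip(proto, feature)]
        self.counts[label] = n + 1

    def nearest(self, feature):
        """Return the label whose prototype is closest in squared L2."""
        def sqdist(proto):
            return sum((p - f) ** 2 for p, f in zip(proto, feature))
        return min(self.protos, key=lambda c: sqdist(self.protos[c]))
```

Each sample is touched exactly once, which is consistent with the online setting described above. How the prototypes are fed back to regularize the model is specific to the paper and is not shown here.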

Abstract:
Building on recent breakthroughs in diffusion-based text-to-image synthesis (TIS), training-free text-guided image editing (TIE) has emerged as an indispensable aspect of modern image editing practices. This technique involves the modification of features within attention layers to alter objects or their attributes within images during the generation process. Despite its utility, current image editing algorithms face challenges, particularly when editing multiple objects in an image. In this paper, we introduce VICTORIA, a novel approach that augments TIE by incorporating linguistic knowledge into the manipulation of attention maps during image generation. VICTORIA capitalizes on mechanisms within self-attention layers to ensure spatial consistency between source and target images. Further, we design a novel loss function that refines cross-attention maps, ensuring their alignment with linguistic constraints, thereby enhancing the editing precision of multiple target objects. We also present a linguistic mask blending technique that aids in the retention of information in regions not subjected to modification. Experimental results across seven diverse datasets show that VICTORIA achieves significant improvements over state-of-the-art methods. Our work underscores the critical role and effectiveness of linguistic analysis in elevating the performance of TIE, with a specific emphasis on multi-object scenarios. The code is available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/VICTORIA.

Abstract:
Visual grounding is the task of locating the object referred to by a natural language description. To reduce annotation costs, recent research has focused on one-stage weakly supervised methods for visual grounding, which typically adopt the anchor-text matching paradigm. Despite their efficiency, we identify that anchor representations are often noisy and insufficient to describe object information, which inevitably hinders vision-language alignment. In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. Different from previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. In this way, QueryMatch re-formulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Based on QueryMatch, we further propose an innovative strategy for effective weakly supervised learning, namely Active Query Selection (AQS). In particular, AQS aims to enhance the effectiveness of query-based contrastive learning by actively selecting high-quality query features. Through this strategy, AQS can greatly benefit the weakly supervised learning of QueryMatch. To validate our approach, we conduct extensive experiments on three benchmark datasets of two grounding tasks, i.e., referring expression comprehension (REC) and segmentation (RES). Experimental results not only show the state-of-the-art performance of QueryMatch in the two tasks, e.g., over +5% IoU@0.5 on RefCOCO in REC and over +20% mIoU on RefCOCO in RES, but also confirm the effectiveness of AQS in weakly supervised learning. Source codes are available at https://github.com/TensorThinker/QueryMatch.

Abstract:
Immunohistochemistry (IHC) plays a crucial role in understanding disease mechanisms, diagnosing pathology and guiding treatment decisions. The precise analysis heavily depends on accurate nucleus segmentation. However, segmentation is challenging due to significant inter- and intra-nucleus variability in morphology and distribution, stemming from inherent characteristics, imaging techniques, tissue differences and other factors. While current deep learning-based methods have shown promising results, their generalization performance is limited, inevitably requiring specific training data. To address the problem, we propose a novel General framework for Nucleus Segmentation in IHC images (GeNSeg-Net). GeNSeg-Net effectively segments nuclei across diverse tissue types and imaging techniques with high variability using a small subset for training. It comprises an enhancement model and a segmentation model. Initially, all nuclei are enhanced to a uniform morphology with distinct features by the enhancement model through generation. The subsequent segmentation task is thereby simplified, leading to higher accuracy. We design a lightweight generator and discriminator to improve both enhancement quality and computational efficiency. Extensive experiments demonstrate the effectiveness of each component within GeNSeg-Net. Compared to existing methods, GeNSeg-Net achieves state-of-the-art (SOTA) segmentation accuracy and generalization performance on both private and public datasets, while maintaining highly competitive processing speed. Code is available at https://github.com/SikangSHU/GeNSeg-Net.

Abstract:
LiDAR-based 3D detection, as an essential technique in multimedia applications such as augmented reality and autonomous driving, has made great progress in recent years. However, the performance of a well-trained 3D detector is considerably degraded when deployed in unseen environments due to the severe domain gap. Traditional unsupervised domain adaptation methods, including co-training and mean-teacher frameworks, do not effectively bridge the domain gap as they struggle with noisy and incomplete pseudo-labels and are unable to capture domain-invariant features. In this work, we introduce a novel Co-training Mean-Teacher (CMT) framework for unsupervised domain adaptation in 3D object detection. Our framework enhances adaptation by leveraging both source and target domain data to construct a hybrid domain that aligns domain-specific features more effectively. We employ hard instance mining to enrich the target domain feature distribution and utilize class-aware contrastive learning to refine feature representations across domains. Additionally, we develop batch adaptive normalization to fine-tune the batch normalization parameters of the teacher model dynamically, promoting more stable and reliable learning. Extensive experiments across various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our CMT over existing state-of-the-art approaches in different adaptation scenarios. Codes are available at https://github.com/csj777/CMT.

Abstract:
Explaining which parts of the input images primarily contribute to the classification results predicted by deep models has been widely researched over the years, and many effective methods have been reported in the literature, for which deep Taylor decomposition (DTD) has served as the primary foundation due to the theoretical grounding provided by Taylor expansion and approximation. Recent research, however, has shown that the root of the Taylor decomposition could extend beyond local linearity, causing DTD to fail to deliver the expected performance. In this paper, we propose a universal root inference method to overcome this shortfall and strengthen the role of DTD in the explainability and interpretability of deep classification. In comparison with existing approaches, our proposed method features: (i) a theoretical establishment of the relationship between ideal roots and the propagated relevances; (ii) exploitation of gradient descent in learning a universal root inference; and (iii) constrained optimization of its final root selection. Extensive experiments, both quantitative and qualitative, validate that our proposed root inference is not only effective, but also delivers significantly improved performance in explaining a range of deep classifiers. We share our code via the link: https://github.com/meetxinzhang/XAI-RootInference.

Abstract:
Dynamic facial expression recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video sequences. However, the complex temporal modeling caused by noisy frames, along with the limited training data significantly hinder the further development of DFER. Previous efforts in this domain have been limited as they tackled these issues separately. Inspired by recent advances of pretrained vision-language models (e.g., CLIP), we propose to leverage it to jointly address the two limitations in DFER. Since the raw CLIP model lacks the ability to model temporal relationships and determine the optimal task-related textual prompts, we utilize DFER-specific domain knowledge, including characteristics of temporal correlations and relationships between facial behavior descriptions at different levels, to guide the adaptation of CLIP to DFER. Specifically, we propose enhancements to CLIP's visual encoder through the design of a hierarchical video encoder that captures both short- and long-term temporal correlations in DFER. Meanwhile, we align facial expressions with action units through prior knowledge to construct semantically rich textual prompts, which are further enhanced with visual contents. Furthermore, we introduce a class-aware consistency regularization mechanism that adaptively filters out noisy frames, bolstering the model's robustness against interference. Extensive experiments on three in-the-wild dynamic facial expression datasets demonstrate that our method outperforms the state-of-the-art DFER approaches. The code is available at https://github.com/liliupeng28/DK-CLIP.

Abstract:
Sarcasm is an intricate expression phenomenon and has garnered increasing attention in recent years, especially in multimodal contexts such as videos. Nevertheless, despite being a significant aspect of human sentiment, the effect of sarcasm is consistently overlooked in sentiment analysis. Videos with sarcasm often convey sentiments that diverge from or even contradict their explicit messages. Prior works mainly concentrate on simply modeling sarcasm and sentiment features by utilizing the Multi-Task Learning (MTL) framework, which we found introduces detrimental interplays between the sarcasm detection task and the sentiment analysis task. Therefore, this study explores the effective enhancement of video sentiment analysis through the incorporation of sarcasm information. To this end, we propose the Progressively Sentiment-oriented Sarcasm Refinement and Integration (PS2RI) framework, which focuses on modeling sentiment-oriented sarcasm features to enhance sentiment prediction. Instead of naively combining sarcasm detection and sentiment prediction under an MTL framework, PS2RI iteratively performs the sentiment-oriented sarcasm refinement and sarcasm integration operations within the sentiment recognition framework, in order to progressively learn sarcasm-aware sentiment features without suffering the detrimental interplays caused by information irrelevant to the sentiment analysis task. Extensive experiments are conducted to validate the effectiveness of our approach. Code is available at https://github.com/tiggers23/PS2RI.

Abstract:
Point clouds play a significant role in recent learning-based vision tasks, as they contain additional information about the physical space compared to 2D images. However, such a 3D data format also makes it more expensive to train a sophisticated network on large 3D datasets. Previous methods for point cloud compression focus on compacting the representation of each point cloud for better storage and transmission but ignore improvements in training efficiency. In this paper, we introduce a new open problem in the point cloud field, named point cloud condensation: can we condense a large point cloud dataset into a much smaller synthetic dataset while preserving the important information of the original large dataset? In other words, we explore the possibility of training a network on a smaller dataset of informative point clouds extracted from the original large dataset while maintaining similar network classification performance. Training on this small synthetic dataset could largely improve the training efficiency. To achieve this goal, we propose a two-stage approach to generate the synthetic dataset. We first introduce a nearest-feature-mean based strategy to initialize the synthetic dataset, and then formulate our goal as a parameter-matching problem, which we solve by introducing a gradient-matching strategy to iteratively refine the synthetic dataset. We conduct extensive experiments on various synthetic and real-scanned 3D object classification benchmarks, showing that our synthetic dataset could achieve almost the same performance with only 5% of the point clouds of the ScanObjectNN dataset compared to training with the full dataset. Codes are available at https://github.com/XLechter/PointCondensation.
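
The first stage, a nearest-feature-mean initialization, can be read as picking, for each class, the samples whose features lie closest to that class's mean feature. The sketch below follows that reading, which is an assumption, since the abstract does not give the exact procedure. The function name `nearest_feature_mean_init` and its arguments are hypothetical, and the gradient-matching refinement stage is not shown.

```python
def nearest_feature_mean_init(features, labels, per_class):
    """Select, per class, the indices of the samples nearest to the
    class mean feature (assumed interpretation of the init strategy).

    features: list of feature vectors, one per point cloud
    labels:   list of class labels, aligned with features
    Returns a dict mapping label -> list of selected sample indices.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    selected = {}
    for label, idxs in by_class.items():
        dim = len(features[idxs[0]])
        # Mean feature of the class.
        mean = [sum(features[i][d] for i in idxs) / len(idxs)
                for d in range(dim)]

        def sqdist(i):
            return sum((features[i][d] - mean[d]) ** 2 for d in range(dim))

        # Keep the per_class samples closest to the mean.
        selected[label] = sorted(idxs, key=sqdist)[:per_class]
    return selected
```

In practice the features would come from a pretrained or randomly initialized encoder applied to each point cloud, and the selected samples would seed the synthetic set before refinement.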

Abstract:
Sentiment analysis and complaint identification are key tools in mining user preferences by measuring the polarity and breach of expectations. Recent works on complaint identification identify aspect categories and classify them into complaint or non-complaint classes. However, aspect category-based complaint identification provides only high-level information about the features of products. In addition, it is also observed that the user sometimes does not complain about a specific aspect but expresses concern about it in a respectful way. Currently, unimodal and multimodal studies do not differentiate this thin line between complaint and concern. In this work, we propose the task of multimodal aspect term-based analysis beyond sentiments and complaints. It comprises two sub-tasks: (i) classification of the given aspect term into one of four classes, viz. praise, concern, complaint, and others, and (ii) identification of the cause of the praise, concern, and complaint classes. We propose a first benchmark explainable multimodal corpus annotated for aspect term-based complaints, praises, concerns, their corresponding causes, and sentiments. Further, we propose an effective technique for the joint learning of aspect term-based complaint/concern/praise identification and cause extraction tasks (primary tasks), where sentiment analysis is used as a secondary task to assist the primary tasks, and establish them as baselines for further research in this direction. The dataset has been made available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html and in the GitHub repository: https://github.com/20118/MAspectX.

Abstract:
Droplet-based microfluidic devices, with their high throughput and low power consumption, have found wide-ranging applications in the life sciences, such as drug discovery and cancer detection. However, the lack of real-time methods for accurately estimating droplet generation parameters has resulted in droplet microfluidic systems remaining largely offline-controlled, making it challenging to achieve efficient feedback in droplet generation. To meet the real-time requirements, it is imperative to minimize the data throughput of the collection system while employing parameter estimation algorithms that are both resource-efficient and highly effective. The spike camera, an innovative form of neuromorphic camera, facilitates high temporal resolution scene capture with comparatively low data throughput. In this paper, we propose a real-time evaluation method for high-speed droplet parameters based on spike-based microfluidic flow-focusing, named RTDE, that integrates a spike camera into the droplet collection system to efficiently capture information as a spike stream. To process the spike stream effectively, we develop a spike-based estimation algorithm for real-time droplet generation parameters. To validate the performance of our method, we collected spike-based droplet datasets (SDD), comprising synthetic and real data with varying flow velocities, frequencies, and droplet sizes. Experimental results on these datasets consistently demonstrate that our method achieves parameter estimations that closely match the ground truth values, showcasing high precision. Furthermore, comparative experiments with image-based parameter estimation methods highlight the superior time efficiency of our method, enabling real-time calculation of parameter estimations. Code and datasets are available at: https://github.com/Onetism/RTDE

Abstract:
Forensic person identification is of paramount importance in accidents and criminal investigations. Existing methods based on soft tissue or DNA can be unavailable if the body is badly decomposed, white-ossified, or charred. However, bones last a long time. This raises a natural question: can we learn to identify a person using bone data? We present a novel feature of bones called Neural Boneprint for personal identification. In particular, we exploit the thoracic skeletal data including chest radiographs (CXRs) and computed tomography (CT) images enhanced by the volume rendering technique (VRT) as an example to explore the availability of the neural boneprint. We then represent the neural boneprint as a joint latent embedding of VRT images and CXRs through a bidirectional cross-modality translation and contrastive learning. Preliminary experimental results on real skeletal data demonstrate the effectiveness of the Neural Boneprint for identification. We hope that this approach will provide a promising alternative for challenging forensic cases where conventional methods are limited. The code is available at https://github.com/CheltonNiu/Neural-Boneprint.git.

Abstract:
Collaborative autonomous driving with multiple vehicles usually requires data fusion from multiple modalities. To ensure effective fusion, the data from each individual modality must maintain a reasonably high quality. However, in collaborative perception, the quality of object detection based on a modality is highly sensitive to the relative pose errors among the agents. Such errors lead to feature misalignment and significantly reduce collaborative performance. To address this issue, we propose RoCo, a novel unsupervised framework to conduct iterative object matching and agent pose adjustment. To the best of our knowledge, our work is the first to model the pose correction problem in collaborative perception as an object matching task, which reliably associates common objects detected by different agents. On top of this, we propose a graph optimization process to adjust the agent poses by minimizing the alignment errors of the associated objects, and the object matching is redone based on the adjusted agent poses. This process is carried out iteratively until convergence. Experimental study on both simulated and real-world datasets demonstrates that the proposed framework RoCo consistently outperforms existing relevant methods in terms of collaborative object detection performance, and exhibits highly desirable robustness when the agents' pose information is highly noisy. Ablation studies are also provided to show the impact of its key parameters and components. The code is released at https://github.com/HuangZhe885/RoCo.
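The alternation between matching and pose adjustment can be sketched in a heavily simplified form. The code below is not RoCo's graph optimization: it swaps in nearest-neighbour matching and a translation-only least-squares update on 2D object centres, purely to illustrate the "match, adjust, re-match until convergence" loop. The function name `refine_translation` and its parameters are hypothetical.

```python
from math import dist

def refine_translation(ref_objects, other_objects, max_iters=20, tol=1e-6):
    """Alternate object matching and pose adjustment (translation only).
    Returns the (dx, dy) estimate that moves other_objects onto ref_objects.
    Simplified stand-in, not the paper's method."""
    dx = dy = 0.0
    for _ in range(max_iters):
        moved = [(x + dx, y + dy) for x, y in other_objects]
        # Matching step: pair each moved object with its nearest reference object.
        matches = [min(ref_objects, key=lambda r: dist(r, m)) for m in moved]
        # Adjustment step: the mean residual is the least-squares translation update.
        step_x = sum(r[0] - m[0] for r, m in zip(matches, moved)) / len(moved)
        step_y = sum(r[1] - m[1] for r, m in zip(matches, moved)) / len(moved)
        dx, dy = dx + step_x, dy + step_y
        if abs(step_x) < tol and abs(step_y) < tol:
            break  # converged: matches no longer change the estimate
    return dx, dy

# Objects seen by a reference agent, and the same objects reported by
# another agent whose pose is off by (-1.0, +0.5).
ref = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
other = [(-1.0, 0.5), (9.0, 0.5), (-1.0, 10.5)]
correction = refine_translation(ref, other)
```

Here the recovered correction is approximately (1.0, -0.5). A full pose correction would also estimate rotation and would need a matching step that tolerates missed or spurious detections.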

Abstract:
Incomplete Multi-View Clustering (IMVC) is crucial for multi-media data analysis. While graph learning-based IMVC methods have shown promise, they still have limitations. The prevalent first-order affinity graph often misclassifies out-neighborhood intra-cluster and in-neighbor inter-cluster samples, worsened by data incompleteness. These inaccuracies, combined with high computational demands, restrict their suitability for large-scale IMVC tasks. To address these issues, we propose a novel Fast and Scalable IMVC with duality Optimal graph Filtering (FSIMVC-OF). Specifically, we refine the clustering-friendly structure of the bipartite graph by learning an optimal filter within a consensus clustering framework. Instead of learning a sample-side filter, we optimize an anchor-side graph filter and apply it to the anchor side, ensuring computational efficiency with linear complexity, supported by the provable equivalence between these two types of graph filters. We present an alternative optimization algorithm with linear complexity. Extensive experimental analysis demonstrates the superior performance of FSIMVC-OF over current IMVC methods. The codes of this article are released in https://github.com/sroytik/FSIMVC-OF.

Abstract:
Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point level. Previous methods mainly transfer excellent zero-shot generalization capabilities from images to point clouds. However, directly transferring knowledge from images to point clouds faces two ambiguity problems. On the one hand, 2D models will generate wrong predictions when the image changes. On the other hand, directly mapping 3D points to 2D pixels by perspective projection fails to consider the visibility of 3D points in the camera view. The wrong geometric alignment of 3D points and 2D pixels causes semantic ambiguity. To tackle these two problems, we propose a framework named Affinity3D that intends to empower 3D semantic segmentation models to perceive novel samples. Our framework aggregates instances in 3D and recognizes them in 2D, leveraging the excellent geometric separation in 3D and the zero-shot capabilities of 2D models. Affinity3D involves an affinity module that rectifies the wrong predictions by comparing them with similar instances and a visibility module that prevents knowledge transfer from visible 2D pixels to invisible 3D points. Extensive experiments have been conducted on the SemanticKITTI and nuScenes datasets. Our framework achieves state-of-the-art performance on both datasets. Code is available at https://github.com/opjang5/Affinity3D.
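To make the visibility problem concrete, here is a minimal sketch of a depth test applied after perspective projection. The abstract does not say how Affinity3D decides visibility, so this uses a generic z-buffer check as a stand-in; the pinhole model, the `focal` parameter, the tolerance `tol`, and the function name `visible_points` are all assumptions.

```python
def visible_points(points, focal, width, height, tol=0.05):
    """Return indices of camera-frame 3D points (x, y, z) that win the
    depth test at the pixel they project to. Generic z-buffer sketch,
    not the paper's visibility module."""
    nearest = {}   # pixel -> smallest depth seen at that pixel
    pixel_of = {}  # point index -> pixel it projects to
    for i, (x, y, z) in enumerate(points):
        if z <= 0:
            continue  # behind the camera
        # Pinhole projection with the principal point at the image centre.
        u = int(round(focal * x / z + width / 2))
        v = int(round(focal * y / z + height / 2))
        if not (0 <= u < width and 0 <= v < height):
            continue  # outside the image
        pixel_of[i] = (u, v)
        nearest[(u, v)] = min(z, nearest.get((u, v), z))
    # A point is visible if no other point at its pixel is clearly closer.
    return [i for i, px in pixel_of.items() if points[i][2] <= nearest[px] + tol]

points = [
    (0.0, 0.0, 2.0),   # in front, visible
    (0.0, 0.0, 5.0),   # same pixel but farther away: occluded
    (1.0, 0.0, 2.0),   # different pixel, visible
    (0.0, 0.0, -1.0),  # behind the camera
]
visible = visible_points(points, focal=10.0, width=64, height=64)
```

Only points that pass the test would receive labels from the 2D pixel they project to; the occluded point at depth 5 shares a pixel with a closer point and is skipped.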

Abstract:
Multi-modal fusion techniques, such as radar and images, enable a complementary and cost-effective perception of the surrounding environment regardless of lighting and weather conditions. However, existing fusion methods for surround-view images and radar are challenged by the inherent noise and positional ambiguity of radar, which leads to significant performance losses. To address this limitation effectively, our paper presents a robust, end-to-end fusion framework dubbed SparseInteraction. First, we introduce the Noisy Radar Filter (NRF) module to extract foreground features by creatively using queried semantic features from the image to filter out noisy radar features. Furthermore, we implement the Sparse Cross-Attention Encoder (SCAE) to effectively blend foreground radar features and image features to address positional ambiguity issues at a sparse level. Ultimately, to facilitate model convergence and performance, the foreground prior queries containing position information of the foreground radar are concatenated with predefined queries and fed into the subsequent transformer-based decoder. The experimental results demonstrate that the proposed fusion strategies markedly enhance detection performance and achieve new state-of-the-art results on the nuScenes benchmark. Source code is available at https://github.com/GG-Bonds/SparseInteraction.

Abstract:
The task of text semantic matching focuses on measuring the semantic similarity between two texts and is widely applied in search and ranking scenarios. In recent years, pre-trained foundation models based on the Transformer architecture have demonstrated powerful semantic representation capabilities. The pipeline of fine-tuning pre-trained foundation models on downstream semantic matching tasks has achieved promising results and widespread adoption. However, practical downstream scenarios often face severe challenges in terms of data quality and quantity. Ensuring high-quality and large quantities of samples is often difficult. Current research on enhancing pre-trained models for few-shot text semantic matching tasks is still not advanced enough. Therefore, this paper focuses on providing a general enhancement scheme for few-shot text semantic matching tasks. Specifically, we propose an Enhancing Transformer-based Semantic Matching method for few-shot learning through weakly contrastive pre-training, which is named as SEMFormer. First, starting from the token-level and structural-level perspectives, we design a simple and low-cost data augmentation method to construct weakly supervised samples. Then, we use the global semantic representations to construct a contrastive objective from the relation-aspect perspective. Next, we design a contrastive objective based on the alignment-aspect, aiming to achieve effective semantic matching by optimizing the bidirectional semantic awareness between texts. We conducted comprehensive experiments based on five Chinese and English datasets. The experimental results validated that our proposed weakly contrastive pre-training augmentation method significantly improves model performance. Further experiments confirmed the effectiveness of our design. The source code is available at: https://github.com/llm-ml/SEMFormer.

Abstract:
Text-driven 3D avatar customization has attracted increasing attention in recent years, where precisely editing specific local parts of avatars with only text prompts is particularly challenging. Previous editing methods usually use segmentation or cross-attention masks as constraints for local editing. Although these masks tightly cover existing objects/parts, they may limit the ability of editing methods to create drastic geometry deformations beyond the covered contents. From a different perspective, this paper presents a GPT-guided local avatar editing framework, namely GG-Editor. Specifically, GG-Editor progressively mines more reasonable candidate editing regions by harnessing multimodal large language models, which already organically assimilate common-sense human knowledge. In order to improve the editing quality of the local areas, GG-Editor explicitly decouples the geometry/appearance optimization, and adopts a global-local synergy editing strategy with GPT-generated local prompts. Moreover, to preserve concepts residing in source avatars, GG-Editor proposes an orthogonal denoising score that orthogonally decomposes editing directions and introduces an explicit term for preservation. Comprehensive experiments demonstrate that GG-Editor with only textual prompts achieves realistic and high-fidelity local editing results, significantly surpassing prior works. Project page: https://xuyunqiu.github.io/GG-Editor/.

Abstract:
Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities, including language, visual, auditory, and sensory data. Multimodal large language models (MLLMs) have thus recently garnered growing interest in both academia and industry, showing an unprecedented trend to achieve human-level AI. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on three key areas: MLLM architecture design, instructional learning, and multimodal reasoning of MLLMs. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research. All the resources and materials will be made available online at https://mllm2024.github.io/ACM-MM2024

Abstract:
Multimodal Multi-Label Emotion Recognition (MMER) aims to identify one or more emotion categories expressed by an utterance of a speaker. Despite obtaining promising results, previous studies on MMER represent each emotion category using a one-hot vector and ignore the intrinsic relations between emotions. Moreover, existing works mainly learn the unimodal representation based on the multimodal supervision signal of a single sample, failing to explicitly capture the unique emotional state of each modality as well as its emotional correlation between samples. To overcome these issues, we propose a Unimodal Valence-Arousal driven contrastive learning framework (UniVA) for the MMER task. Specifically, we adopt the valence-arousal (VA) space to represent each emotion category and regard the emotion correlation in the VA space as priors to learn the emotion category representation. Moreover, we employ pre-trained unimodal VA models to obtain the VA scores for each modality of the training samples, and then leverage the VA scores to construct positive and negative samples, followed by applying supervised contrastive learning to learn the VA-aware unimodal representations for multi-label emotion prediction. Experimental results on two benchmark datasets MOSEI and M3ED show that the proposed UniVA framework consistently outperforms a number of existing methods for the MMER task. The source code is publicly released at https://github.com/NUSTM/UniVA.

Abstract:
To address the limitations of current self knowledge distillation, including never fully utilizing the knowledge of shallow exits and neglecting the impact of the auxiliary exits' structure on network performance, this paper proposes LOTH, a novel self knowledge distillation framework based on virtual teacher-students mutual learning. A knowledgeable virtual teacher is constructed from the rich feature maps of each exit to help the learning of each exit. Meanwhile, the logit knowledge of each exit is incorporated to guide the learning of the virtual teacher. They learn mutually through the well-designed loss in LOTH. Moreover, two kinds of auxiliary building blocks are designed to balance the efficiency and effectiveness of the network. Extensive experiments with diverse backbones on CIFAR-100 and Tiny-ImageNet validate the effectiveness of LOTH, which achieves superior performance with fewer resources compared with state-of-the-art distillation methods. The code of LOTH is available on GitHub at https://github.com/cloak-s/LOTH.

Abstract:
Exemplar-based image translation has garnered significant interest from researchers due to its broad applications in multimedia/multimodal processing. Existing methods primarily employ Euclidean-based losses to implicitly establish cross-domain correspondences between exemplar and conditional images, aiming to produce high-fidelity images. However, these methods often suffer from two challenges: 1) Insufficient excavation of domain-invariant features leads to low-quality cross-domain correspondences, and 2) Inaccurate correspondences result in errors propagated during the translation process due to a lack of reliable prior guidance. To tackle these issues, we propose a novel prior-guided diffusion model with global-local contrastive learning (PROMOTE), which is trained in a self-supervised manner. Technically, global-local contrastive learning is designed to align two cross-domain images within hyperbolic space and reduce the gap between their semantic correlation distributions using the Fisher-Rao metric, allowing the visual encoders to extract domain-invariant features more effectively. Moreover, a prior-guided diffusion model is developed that propagates the structural prior to all timesteps in the diffusion process. It is optimized by a novel prior denoising loss, mathematically derived from the transitions modified by prior information in a self-supervised manner, successfully alleviating the impact of inaccurate correspondences on image translation. Extensive experiments conducted across seven datasets demonstrate that our proposed PROMOTE significantly exceeds state-of-the-art performance in diverse exemplar-based image translation tasks. The source code is publicly available at http://github.com/zgj77/PROMOTE.

Abstract:
Stereo matching is a pivotal technique for depth estimation and has been popularly applied in various computer vision tasks. Although many related methods have been reported recently, they still face some challenges such as significant disparity variations at object boundaries, difficult prediction at large disparity regions, and suboptimal generalization when label distribution varies between source and target domains. Therefore, we propose a stereo-matching model (i.e., Eglcr) that utilizes edge structure information and multi-scale matching similarity features for better disparity estimation. First, we use a lightweight network to predict the initial disparity. Then, we develop a multi-scale similarity feature extraction module, incorporating adaptive attention mechanisms, to capture the fusion similarity information of stereo images across various scales. Meanwhile, we introduce an edge structure-aware module that features an iteratively optimized disparity map and a scale attention factor, aimed at accurately delineating edge information in complex scenes. After that, we employ an iterative strategy for disparity estimation, guided by the fusion similarity features across multiple scales and the detailed edge structure information. We conduct abundant experiments on some popular stereo matching datasets including Middlebury, KITTI, ETH3D, and Scene Flow. The results show that our proposed Eglcr achieves state-of-the-art performance both in accuracy and generalization. Our code is available at https://github.com/kangarooCV/Eglcr.

Abstract:
Radiology report generation aims to automatically generate clinical descriptions for radiology images, reducing the workload of radiologists. Compared to general image captioning tasks, the subtle differences in medical images and the specialized, complex nature of medical terminology limit the performance of data-driven radiology report generation. Previous research has attempted to leverage prior knowledge, such as organ-disease graphs, to enhance models' abilities to identify specific diseases and generate corresponding medical terminology. However, these methods cover only a limited number of disease types, focusing solely on disease terms mentioned in reports but ignoring their normal or abnormal attributes, which are critical to generating accurate reports. To address this issue, we propose a Divide-and-Conquer approach, named DCG, which separately constructs disease-free and disease-specific nodes within the knowledge graphs. Specifically, we extracted more comprehensive organ-disease entities from reports than previous methods and constructed disease-free and disease-specific nodes by rigorously distinguishing between normal conditions and specific diseases. This enables our model to consciously focus on abnormal information and mitigate the impact of excessively common diseases on report generation. Subsequently, the constructed graph is utilized to enhance the correlation between visual representations and disease terminology, thereby guiding the decoder in report generation. Extensive experiments conducted on benchmark datasets IU-Xray and MIMIC-CXR demonstrate the superiority of our proposed method. Code is available at https://github.com/ecoxial2007/DCG_Enhanced_distilGPT2.

Abstract:
Martian terrain segmentation plays a crucial role in autonomous navigation and safe driving of Mars rovers as well as global analysis of Martian geological landforms. However, most deep learning-based segmentation models cannot effectively handle the challenges of highly unstructured and unbalanced terrain distribution on the Martian surface, thus leading to inadequate adaptability and generalization ability. In this paper, we propose a novel multi-view Martian Terrain Segmentation framework (MTSNet) by developing an efficient Martian Terrain text-Guided Segment Anything Model (MTG-SAM) and combining it with a tailored Local Terrain Feature Enhancement Network (LTEN) to capture intricate terrain details. Specifically, the proposed MTG-SAM is equipped with a Terrain Context attention Adapter Module (TCAM) to efficiently and effectively unleash the model's adaptability and transferability on the Mars-specific terrain distribution. Then, LTEN is designed to compensate for the limitations of MTG-SAM in capturing the fine-grained local terrain features of the Martian surface. Afterwards, a simple yet efficient Gated Fusion Module (GFM) is introduced to dynamically merge the global contextual features from the MTG-SAM encoder and the locally refined features from the LTEN module for comprehensive terrain feature learning. Moreover, the proposed MTSNet accepts terrain-specific text as prompts, resolving the efficiency issue of existing methods that require costly annotation of bounding boxes or foreground points. Experimental results on the AI4Mars and ConeQuest datasets demonstrate that our proposed MTSNet can effectively learn the unique Martian terrain feature distribution and achieves state-of-the-art performance on multi-view terrain segmentation from the perspectives of both the Mars rover and satellite remote sensing. Code is available at https://github.com/raoxuefeng/mtsnet.
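The idea of dynamically merging a global and a local feature stream can be sketched with a generic sigmoid gate. The abstract does not specify the internal form of the GFM, so the per-channel gate below, the parameter names `gate_weights` and `gate_bias`, and the convex-combination output are assumptions for illustration rather than the paper's module.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def gated_fusion(global_feat, local_feat, gate_weights, gate_bias):
    """Blend two feature vectors channel by channel with a learned gate.
    gate_weights holds one (w_global, w_local) pair per channel.
    Generic sketch, not the paper's GFM."""
    fused = []
    for g, l, (wg, wl), b in zip(global_feat, local_feat, gate_weights, gate_bias):
        gate = sigmoid(wg * g + wl * l + b)        # gate value in (0, 1)
        fused.append(gate * g + (1.0 - gate) * l)  # convex combination of the two streams
    return fused

global_feat = [2.0, 4.0]
local_feat = [0.0, 0.0]
# With zero weights and bias the gate is exactly 0.5, i.e. a plain average.
averaged = gated_fusion(global_feat, local_feat, [(0.0, 0.0)] * 2, [0.0, 0.0])
# A large positive bias pushes the gate towards 1, favouring the global stream.
mostly_global = gated_fusion(global_feat, local_feat, [(0.0, 0.0)] * 2, [20.0, 20.0])
```

In a trained network the gate parameters would be learned, so each channel can lean on whichever stream is more informative for the current input.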

Abstract:
Current Lifelong Person Re-Identification (LReID) methods focus on tackling a clean data stream with accurate labels. When noisy data with incorrect labels are given, their performance is severely degraded since the model inevitably and continually remembers erroneous knowledge induced by the label noise. Moreover, the well-known issue of catastrophic forgetting in LReID is exacerbated by noisy labels, which disrupt the retention of correct knowledge from previous models. Such a practical noisy LReID task is important but challenging, and few works have attempted to handle it. In this paper, we present an initial investigation of noisy LReID and propose a Continual Knowledge Purification (CKP) method to address the catastrophic remembering of erroneous knowledge and catastrophic forgetting of correct knowledge simultaneously. Specifically, a Cluster-aware Data Purification module (CDP) is designed to select clean labels based on clustering-guided label confidence estimation. Besides, an Iterative Label Rectification (ILR) pipeline is proposed to rectify wrong labels by fusing the prediction and label information throughout the training epochs. To handle the catastrophic remembering problem, an Erroneous Knowledge Filtering (EKF) algorithm is proposed to estimate and transfer the correct old knowledge to the new model. Finally, a Noisy LReID benchmark is constructed for performance evaluation, and extensive experimental results demonstrate that our proposed CKP method achieves state-of-the-art performance. Our code is available at https://github.com/zhoujiahuan1991/MM2024-CKP

Abstract:
Due to the limitations of collection devices and unstable scanning processes, point cloud data is usually noisy. This noise deforms the underlying structures of point clouds and inevitably affects downstream tasks such as rendering, reconstruction and classification. In this paper, we propose a Cross-stage Cross-coder Adaptive Edge Graph Convolution Network (C2AENet) to denoise point clouds. Our network uses multiple stages to progressively and iteratively denoise points. To improve the effectiveness, we add connections between two stages and between the encoder and decoder, leading to the cross-stage cross-coder architecture. Additionally, existing graph-based point cloud learning methods tend to capture the local structure. They typically construct a semantic graph based on semantic distance, which may ignore Euclidean neighbors and lead to insufficient geometry perception. Therefore, we introduce a geometric graph and adaptively calculate edge attention based on the local and global structural information of the points. This results in a novel graph convolution module that allows the network to capture richer contextual information and focus on more important parts. Extensive experiments demonstrate that the proposed method is competitive compared with other state-of-the-art methods. The code is available at: https://github.com/chenwuwq/C2AENet.

Abstract:
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of one brief reference audio. The wide variations in emotion, pace, and environment that dubbed speech must exhibit to achieve real alignment make dubbing a complex task. Considering the limited scale of the movie dubbing datasets (due to copyright) and the interference from background noise, directly learning from movie dubbing datasets limits the pronunciation quality of learned models. To address this problem, we propose a two-stage dubbing method that allows the model to first learn pronunciation knowledge before practicing it in movie dubbing. In the first stage, we introduce a multi-task approach to pre-train a phoneme encoder on a large-scale text-speech corpus for learning clear and natural phoneme pronunciations. For the second stage, we devise a prosody consistency learning module to bridge the emotional expression with the phoneme-level dubbing prosody attributes (pitch and energy). Finally, we design a duration consistency reasoning module to align the dubbing duration with the lip movement. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://speaker2dubber.github.io/.

Abstract:
Point clouds represent one of the prevalent formats for 3D content. Distortions introduced at various stages in the point cloud processing pipeline affect the visual quality, altering their geometric composition, texture information, or both. Understanding and quantifying the impact of the distortion domain on visual quality is vital to driving rate optimization and guiding post-processing steps to improve the quality of experience. In this paper, we propose a multi-task guided multi-modality no-reference metric (M3-Unity), which utilizes 4 types of modalities across attributes and dimensionalities to represent point clouds. An attention mechanism establishes inter/intra associations among 3D/2D patches, which can complement each other, yielding local and global features, to fit the highly nonlinear property of the human vision system. A multi-task decoder involving distortion type classification selects the best association among the 4 modalities, aiding the regression task and enabling in-depth analysis of the interplay between geometrical and textural distortions. Furthermore, our framework design and attention strategy enable us to measure the impact of individual attributes and their combinations, providing insights into how these associations contribute, particularly in relation to distortion type. Extensive experiments on 4 datasets show that our metric consistently outperforms state-of-the-art metrics by a large margin. The code is available at https://github.com/cwi-dis/ACMMM2024-Oral.

Abstract:
Multi-modal Unsupervised Domain Adaptation (MM-UDA) for large-scale 3D semantic segmentation involves adapting 2D and 3D models to a target domain without labels, which significantly reduces labor-intensive annotation. Existing MM-UDA methods have often attempted to mitigate the domain discrepancy by aligning features between the source and target data. However, this approach falls short when applied to image perception because images are more susceptible to environmental changes than point clouds. To mitigate this limitation, in this work, we explore the potential of an off-the-shelf Contrastive Language-Image Pre-training (CLIP) model with rich yet heterogeneous knowledge. To make CLIP task-specific, we propose a top-performing method, dubbed CLIP2UDA, which makes a frozen CLIP benefit unsupervised domain adaptation in 3D semantic segmentation. Specifically, CLIP2UDA alternates between two steps during adaptation: (a) Learning task-specific prompts. The 2D feature responses from the visual encoder are employed to initiate the learning of an adaptive text prompt for each domain, and (b) Learning multi-modal domain-invariant representations. These representations interact hierarchically in the shared decoder to obtain unified 2D visual predictions. This enhancement allows for effective alignment between the modality-specific 3D and unified feature space via cross-modal mutual learning. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several widely-recognized adaptation scenarios. Code is available at: https://github.com/Barcaaaa/CLIP2UDA.

Abstract:
Image-to-Video adaptation is proposed to train a model using labeled images and unlabeled videos to facilitate the classification of unlabeled videos. The latest work synthesizes videos using still images to mitigate the modality gap between images and videos. However, the synthesized videos are not realistic because the camera movements are only simulated in 2D space. Therefore, we generate realistic videos by simulating arbitrary camera movements in 3D scenes, and then the model can be trained using the generated source videos. Unfortunately, the optical flows from the generated videos have unexpected negative impacts, resulting in suboptimal performance. To address this issue, we propose the Category-aware Flow Memory Bank, which replaces optical flows in source videos with real target flows, and the newly composed videos are beneficial for training. In addition, we leverage the video pace prediction task to enhance the model's perception of speed. Our method achieves state-of-the-art or comparable performance on three widely used benchmarks. Our code is available at https://github.com/KenanHuang/mm2024_cfb4i2v.

Abstract:
Image-text retrieval stands as a pivotal task within information retrieval, gaining increasing importance with the rapid advancements in Visual-Language Pretraining models. However, current benchmarks for evaluating these models face limitations, exemplified by instances such as BLIP2 achieving near-perfect performance on existing benchmarks. In response, this paper advocates for a more robust evaluation benchmark for image-text retrieval, one that embraces several essential characteristics. Firstly, a comprehensive benchmark should cover a diverse range of tasks in both perception and cognition-based retrieval. Recognizing this need, we introduce ReCoS, a novel benchmark specifically designed for cross-modal image-text retrieval in complex real-life scenarios. Unlike existing benchmarks, ReCoS encompasses 12 retrieval tasks, with a particular focus on three cognition-based tasks, providing a more holistic assessment of model capabilities. To ensure the novelty of the benchmark, we emphasize the use of original data sources, steering clear of reliance on existing publicly available datasets to minimize the risk of data leakage. Additionally, to strike a balance between the complexity of the real world and benchmark usability, ReCoS includes text descriptions that are neither overly detailed, making retrieval overly simplistic, nor under-detailed to the point where retrieval becomes impossible. Our evaluation results shed light on the challenges faced by existing methods, especially in cognition-based retrieval tasks within ReCoS. This underscores the necessity for innovative approaches in addressing the complexities of image-text retrieval in real-world scenarios. Our code and benchmark datasets are available for further research and development in this field https://github.com/Bruce-XJChen/ReCos

Abstract:
Text-driven human motion generation, which creates motion sequences based on textual descriptions, has attracted great attention in the communities of multimedia and artificial intelligence. By parsing and comprehending textual information and converting it into specific human movements, it realizes a direct transformation from human semantics to motion sequences. New text-driven human motion generators are springing up to achieve better performance. However, the absence of well-trained evaluators that can effectively estimate the consistency between the text prompts and motions generated by existing generators remains a challenge. To address this issue, we propose an open-source library with a powerful Contrastive Language-and-Motion (CLaM) pre-training evaluator, which can be employed for evaluating a variety of text-driven human motion generation algorithms. We perform a thorough performance evaluation of the existing algorithms on various metrics, such as R-Precision. As a by-product, we build a large-scale HumanML3D-synthesis dataset consisting of 14,616 motion sequences and 547,102 textual descriptions, which is ten times larger than the widely-used HumanML3D dataset. The source codes and models for CLaM are available at https://github.com/SheldongChen/CLaM/.
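
As a rough illustration of the R-Precision metric mentioned above, the following is a minimal sketch rather than the CLaM library's implementation. It assumes a square similarity matrix in which entry [i][j] scores text i against motion j, with motion i as the ground-truth match for text i; the function name `r_precision` and the toy scores are hypothetical.

```python
def r_precision(similarity, top_k=3):
    """Fraction of texts whose ground-truth motion ranks within the top_k candidates.

    similarity[i][j] is the score between text i and motion j (higher is better);
    the ground-truth motion for text i is assumed to be motion i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate motion indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:top_k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three text/motion pairs (illustrative values only).
scores = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.7],
    [0.2, 0.4, 0.6],
]
top1 = r_precision(scores, top_k=1)
top3 = r_precision(scores, top_k=3)
```

In practice the similarity scores would come from the text and motion encoders of the evaluator, and the candidate pool size is a protocol choice that this sketch does not fix.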

Abstract:
Detecting eye contact is essential for embodied robots to engage in natural interactions with humans, enhancing the intuitiveness and comfort of these exchanges. However, eye contact detection often presents a significant challenge due to a variety of factors, such as low contrast and various forms of occlusions. Existing methods incorporate convolutional neural networks (CNNs) or Transformers to learn discriminative representations, but usually ignore the influence of noisy or less relevant regions in facial images. To address this gap, we propose the deep feature selection and fusion network (FSFNet) for eye contact detection in multi-party conversations. Our proposed method adaptively selects fine-grained visual features and reduces the impact of irrelevant features. Specifically, we present a local feature selection scheme that leverages the attention scores to progressively concentrate on the most informative features. By integrating the carefully selected features into the multi-head self-attention module, we can maintain the superior properties of Transformers while simultaneously reducing the overall computational demands. We evaluate the proposed method on the official eye contact detection datasets, where it achieves promising results of 0.8174 and 0.79 on the validation and test sets, respectively. We have made the source code publicly accessible at https://github.com/ma-hnu/FSFNet.

Abstract:
Inductive link prediction aims to infer missing triples on unseen graphs, which contain unseen entities and relations during training. The performances of existing inductive inference methods were hindered by the limited generalization capability in fully unseen graphs, which is rooted in the neglect of the intrinsic graph structure. In this paper, we aim to enhance the model's generalization ability to unseen graphs and thus propose a novel Hyper-Relation aware multi-views model (HyRel) for learning the global transferable structure of graphs. Distinct from existing studies, we introduce a novel perspective focused on learning the inherent hyper-relation structure consisting of the relation positions and affinity. The hyper-relation structure is independent of specific entities, relations, or features, thus allowing for transferring the learned knowledge to any unseen graphs. We adopt a multi-view approach to model the hyper-relation structure. HyRel incorporates neighborhood learning on each view, capturing nuanced semantics of relative relation position. Meanwhile, dual views contrastive constraints are designed to enforce the robustness of transferable structural knowledge. To the best of our knowledge, our work makes one of the first attempts to generalize the learning of hyper-relation structures, offering high flexibility and ease of use without reliance on any external resources. HyRel demonstrates SOTA performance compared to existing methods under extensive inductive settings, particularly on fully unseen graphs, and validates the efficacy of learning hyper-relation structures for improving generalization. The code is available online at https://github.com/hncps6/HyRel.

Abstract:
Multimedia content's surge on the internet has made multimodal relation extraction vital for applications like intelligent search and knowledge graph construction. As a rich source of image-text data, social media plays a crucial role in populating knowledge bases. However, the noisy information present in social media poses a challenge in multimodal relation extraction. Current methods focus on extracting relevant information from images to improve model performance but often overlook the importance of global image information. In this paper, we propose a novel multimodal relation extraction method FocalMRE, which leverages image focal augmentation, focal attention, and gating mechanisms. FocalMRE enables the model to concentrate on the image's focal regions while effectively utilizing the global information in the image. Through gating mechanisms, FocalMRE optimizes the multimodal fusion strategy, allowing the model to select the most relevant augmented regions for overcoming noise interference in relation extraction. The experimental results on the public MNRE dataset reveal that FocalMRE exhibits robust and significant performance advantages in the multimodal relation extraction task, especially in scenarios with high noise, long-tail distributions, and limited resources. The code is available at https://github.com/NJUNLP/FocalMRE.

Abstract:
Incorporating domain-specific visual information into text poses one of the critical challenges for domain-specific multi-modal neural machine translation (DMNMT). While most existing DMNMT methods often borrow multi-modal fusion frameworks from multi-modal neural machine translation (MNMT) in the general domain, they overlook the domain gaps between general and specific domains. Visual-to-textual interaction in a specific domain frequently exhibits multi-focus characteristics, making it difficult to consistently focus on domain-specific multi-visual details using traditional multi-modal fusion frameworks. This challenge can lead to a decrease in machine translation performance for domain-specific terms. To tackle this problem, this paper presents a virtual visual scene-guided domain-shadow multi-modal fusion mechanism to simultaneously integrate multi-grained domain visual details and text with the guidance of modality-agnostic virtual visual scene, thereby enhancing machine translation performance for DMNMT, especially for domain terms. Specifically, we first adopt a modality-mixing selection-voting strategy to generate modality-mixed domain representations through layer-by-layer intra-modality selection and inter-modality exchanging. Then, we gradually aggregate modality-mixed domain representations and text across modality boundaries with the guidance of a modality-agnostic virtual visual scene to enhance the collaboration between domain characteristics and textual semantics. The experimental results on three benchmark datasets demonstrate that our proposed approach outperforms the state-of-the-art (SOTA) methods in all machine translation tasks. The in-depth analysis further highlights the robustness and generalizability of our approach across various scenarios. Our code is available on https://github.com/HZY2023/VVDF.

Abstract:
Continual learning aims to learn new knowledge from a sequence of tasks without forgetting. Recent studies have found that projecting gradients onto the orthogonal direction of task-specific features is effective. However, these methods mainly focus on mitigating catastrophic forgetting by adopting old features to construct projection spaces, neglecting the potential to enhance plasticity and the valuable information contained in previous gradients. To enhance plasticity and effectively utilize the gradients from old tasks, we propose Gradient Projection in Common Null Space (GPCNS), which projects current gradients into the common null space of final gradients under all preceding tasks. Moreover, to integrate both feature and gradient information, we propose a collaborative framework that allows GPCNS to be utilized in conjunction with existing gradient projection methods as a plug-and-play extension that provides gradient information and better plasticity. Experimental evaluations conducted on three benchmarks demonstrate that GPCNS exhibits superior plasticity compared to conventional gradient projection methods. More importantly, GPCNS can effectively improve the backward transfer and average accuracy for existing gradient projection methods when applied as a plugin, which outperforms all the gradient projection methods without increasing learnable parameters and customized objective functions. The code is available at https://github.com/Hifipsysta/GPCNS.
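
The core operation described above, projecting a current gradient into the common null space of earlier gradients, can be sketched as follows. This is a minimal illustration, not the GPCNS implementation: it treats each stored gradient as a plain vector and removes every component of the current gradient that lies in their span, using Gram-Schmidt orthonormalization. How the paper builds or approximates that space is not stated in the abstract, and all function names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def project_to_common_null_space(grad, old_grads):
    """Remove from grad every component lying in the span of old_grads.

    The result is orthogonal to each stored gradient, i.e. it lies in the
    null space of the matrix whose rows are the old gradients.
    """
    out = list(grad)
    for b in orthonormal_basis(old_grads):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example: two stored gradients from earlier tasks, one current gradient.
old_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
current = [3.0, 2.0, 5.0]
projected = project_to_common_null_space(current, old_grads)
```

Because the projected update is orthogonal to every stored gradient, a small step along it leaves the first-order change along those earlier gradient directions at zero, which is the intuition behind gradient projection methods in general.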

Abstract:
Link prediction aims to infer missing valid triplets to complete knowledge graphs, with recent inclusion of multimodal information to enrich entity representations. Existing methods project multimodal information into a unified embedding space or learn modality-specific features separately for later integration. However, performance in such studies was limited because they neglect the modality compatibility and conflicting semantics carried by entities in valid and invalid triplets. In this paper, we aim at modeling inter-entity modality interactions and thus propose a novel Modality Circular fusion approach (MoCi), which interweaves the multimodal contexts of entities. Firstly, unlike most methods in this task that directly fuse modalities, we design a triplets-prompt modality contrastive pre-training to align modality semantics beforehand. Moreover, we propose a modality circular fusion model using a simple yet efficient multilinear transformation strategy. This allows explicit inter-entity modality interactions, distinguishing it from methods confined to fusion within individual entities. To the best of our knowledge, MoCi is one of the pioneering frameworks tailored to grasp inter-entity modality semantics for better link prediction. Extensive experiments on seven datasets demonstrate that our model yields SOTA performance, confirming the efficacy of MoCi in modeling inter-entity modality interactions. Our code is released at https://github.com/MoCiGitHub/MoCi.

Abstract:
In online chatting, people increasingly prefer using stickers to supplement or replace text for replies, as sticker images can express vivid and varied emotions. The Sticker Response Selection (SRS) task aims to predict the sticker image that is most relevant to the dialogue history. Previous research explores the semantic similarity between context and stickers, overlooking both unimodal and cross-modal emotional information. In this paper, we propose a 'Perceive before Respond' (PBR) training paradigm. PBR perceives sticker emotions through a knowledge distillation module. Varied representations of each emotion category are acquired from a large-scale sticker emotion recognition dataset and distilled into our model to enhance emotion comprehension. We further distinguish stickers with similar subject elements under the same topic. We perform contrastive learning at both inter- and intra-topic levels to obtain discriminative and diverse sticker representations. In addition, we improve the hard negative sampling method for image-text matching based on cross-modal sentiment association, conducting hard sample mining from both semantic similarity and sentiment polarity similarity for sticker-to-dialogue and dialogue-to-sticker. Extensive experiments verify the effectiveness of each proposed component. Ablation experiments on different backbone networks demonstrate the generality of our approach. Our code is released on https://github.com/wuyou-xia/Perceive-before-Respond.

Abstract:
Our paper introduces a novel video dataset specifically for Temporal Intention Localization (TIL), aimed at identifying hidden abnormal intention in densely populated and complex environments. Traditional Temporal Action Localization (TAL) frameworks, focusing on overt actions within constrained temporal intervals, often miss subtle pre-abnormal actions that unfold over extended periods. Our dataset comprises 228 videos with 5790 clips, each annotated to capture fine-grained actions within ambiguous temporal boundaries using the Joint-Linear-Assignment methodology. This approach enables detailed analysis of the evolution of abnormal intention over time. To detect subtle, hidden intention, we developed the Intention-Action Fusion module, a creative approach integrating dynamic feature fusion across 11 behavioral subcategories, significantly enhancing the model's ability to discern nuanced intention. This enhancement has led to performance improvements of up to 139% in specific scenarios, dramatically boosting the model's sensitivity and interpretability, crucial for advancing proactive surveillance systems. By pushing the boundaries of technology, our dataset and methodologies foster proactive surveillance systems capable of preemptively identifying potential threats from nuanced behavioral patterns, encouraging further exploration into the complexities of intention beyond observable actions. The dataset is available at https://github.com/Zzz99999/Hidden_Abnormal_Intention.

Abstract:
Researchers have applied 3D Lookup Tables (LUTs) in cameras, offering new possibilities for enhancing image quality and achieving various tonal effects. However, these approaches often overlook the non-uniformity of color distribution in the original images, which limits the performance of learnable LUTs. To address this issue, we introduce a lightweight end-to-end image enhancement method called Simulated Infrared Fusion Guided Image-adaptive 3D Lookup Tables (SIRLUT). SIRLUT enhances the adaptability of 3D LUTs by reorganizing the color distribution of images through the integration of simulated infrared imagery. Specifically, SIRLUT consists of an efficient Simulated Infrared Fusion (SIF) module and a Simulated Infrared Guided (SIG) refinement module. The SIF module leverages a cross-modal channel attention mechanism to perceive global information and generate dynamic 3D LUTs, while the SIG refinement module blends simulated infrared images to match image consistency features from both structural and color aspects, achieving local feature fusion. Experimental results demonstrate that SIRLUT outperforms state-of-the-art methods on different tasks by 0.88 to 2.25 dB while reducing the number of parameters. Code is available at https://github.com/riversky2025/SIRLUT.git.

Abstract:
In this paper, we develop a progressive local and non-local interactive network with multi-scale cross-content deeply discriminative learning to solve image deraining. The proposed model contains two key techniques: 1) Progressive Local and Non-Local Interactive Network (PLNLIN) and 2) Multi-Scale Cross-Content Deeply Discriminative Learning (MCDDL). The PLNLIN is a U-shaped encoder-decoder network, where the proposed new Progressive Local and Non-Local Interactive Module (PLNLIM) is the basic unit in the encoder-decoder framework. The PLNLIM fully explores local and non-local learning through convolution and Transformer operations, respectively, and the local and non-local content are further interactively learned in a progressive manner. The proposed MCDDL not only discriminates the output of the generator but also receives the deep content from the generator to distinguish real and fake features at each side layer of the discriminator in a multi-scale manner. We show that the proposed MCDDL has fast and stable convergence properties that existing discriminative learning approaches lack. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods on five public synthetic datasets and one real-world dataset. The source codes will be made available at https://github.com/supersupercong/PLNLIN-MCDDL.

Abstract:
Social Virtual Reality is envisioned to transform how individuals communicate remotely, offering a sense of immersion and co-presence within a virtual space. Current platforms enabling remote social interactions rely on synthetic user representations. We address this limitation by enabling realistic human representation through volumetric content capture, encoding and transmission. Specifically, we present an extended version of VR2Gather, now a fully open source Unity package, available at https://github.com/cwi-dis/VR2Gather-acmmm-oss. Our platform is a customisable system to transmit volumetric content in a multi-party real-time environment, easy to integrate into existing applications.

Abstract:
Weakly supervised semantic segmentation (WSSS) using image-level labels is a challenging task that relies on the Class Activation Map (CAM) to derive segmentation supervision. Although many efficient single-stage solutions have been proposed, their performance is hindered by the inherent ambiguity of CAM. This paper introduces a new approach, dubbed ECA, to Exploit the self-supervised Vision Transformer, DINO, inducing the Class-aware semantic Affinity to overcome this limitation. Specifically, we introduce a Semantic Affinity Exploitation module (SAE). It establishes the class-agnostic affinity graph through the self-attention of DINO. Using the highly activated patches on CAMs as 'seeds', we propagate them across the affinity graph and yield the Class-aware Affinity Region Map (CARM) as supplementary semantic guidance. Moreover, the selection of reliable 'seeds' is crucial to the CARM generation. Inspired by the observed CAM inconsistency between the global and local views, we develop a CAM Correspondence Enhancement module (CCE) to encourage dense local-to-global CAM correspondences, advancing high-fidelity CAM for seed selection in SAE. Our experimental results demonstrate that ECA effectively improves the model's object pattern understanding. Remarkably, it outperforms state-of-the-art alternatives on the PASCAL VOC 2012 and MS COCO 2014 datasets, achieving 90.1% of the upper-bound performance set by its fully supervised counterpart. Code is available at https://github.com/Wu0409/ECA.

Abstract:
Gesture recognition plays a crucial role in natural human-computer interaction and sign language recognition. Despite considerable progress in normal daylight, research dedicated to gesture recognition in dark environments is scarce. This is partly due to the lack of sufficient datasets for such a task. We bridge this data gap by collecting a new dataset: a large-scale multimodal video dataset for gesture recognition in darkness (MGR-Dark). MGR-Dark is distinguished from existing gesture datasets by its gesture collection in darkness, multimodal videos (RGB, Depth, and Infrared), and high video quality. To the best of our knowledge, this is the first multimodal dataset dedicated to human gesture action in dark videos of high quality. Building upon this, we propose a Modality Translation and Cross-modal Distillation (MTCD) RGB-IR benchmark framework. Specifically, the modality translator is first used to transfer RGB data to pseudo-Infrared data; a progressive cross-modal feature distillation module is then designed to exploit the underlying relations between RGB, pseudo-Infrared and Infrared modalities to guide RGB feature learning. The experiments suggest that the dataset and benchmark proposed in this paper can advance research in gesture recognition in dark videos. Our dataset and code can be found at https://github.com/Grass-Shi/MGR-DarkMGR-Dark.

Abstract:
Multimodal Large Language Models (MLLMs) have shown significant potential for chart understanding and generation. However, they are still far from achieving the desired effectiveness in practical applications. This could be due to the limitations of the chart data used for training. Existing chart datasets suffer from scarcity of chart types, limited coverage of tasks, and insufficient scalability, making them incapable of effectively enhancing the chart-related capabilities of MLLMs. To tackle these obstacles, we construct NovaChart, a large-scale dataset for chart understanding and generation of MLLMs. NovaChart contains 47K high-resolution chart images and 856K chart-related instructions, covering 18 different chart types and 15 unique tasks of chart understanding and generation. To build NovaChart, we propose a data generation engine for metadata curation, chart visualization and instruction formulation. Chart metadata in NovaChart contains detailed annotations, i.e., data points, visual elements, source data and the visualization code of every chart. This additional information endows NovaChart with considerable scalability, as it can facilitate the extension of chart instruction data to a larger scale and greater diversity. We utilize NovaChart to train several open-source MLLMs. Experimental results demonstrate that NovaChart empowers MLLMs with stronger capabilities in 15 chart understanding and generation tasks by a large margin (35.47%-619.47%), bringing them a step closer to smart chart assistants. Our dataset is now available at https://github.com/Elucidator-V/NovaChart.

Abstract:
Emotion cause analysis has attracted increasing attention in recent years. However, the integration of multimodal information with emotion causes remains underexplored. Existing studies merely extract utterances from conversations as cause evidence, which is too coarse-grained to locate the exact causes from other modalities, especially those that may be reflected only in a specific video frame of an utterance. To address these limitations, we introduce a new task named Multimodal Emotion Cause Generation in Conversations (MECGC), which aims to generate an abstractive summary clearly and intuitively describing the causes that trigger the given emotion based on the multimodal context of conversations. We accordingly construct a dataset named ECGF that contains 1,374 conversations and 7,690 emotion instances from TV series. We further develop a generative framework that first generates emotion-cause aware video captions (Observe) and then facilitates the generation of emotion causes (Generate). The captioning model is trained with examples synthesized by a Multimodal Large Language Model (MLLM). Experimental results demonstrate the effectiveness of our framework and the significance of multimodal information for emotion cause analysis. Our dataset and source codes are available at https://github.com/NUSTM/MECGC.

Abstract:
As a non-contact method, eye-tracking data can be used to diagnose people with Autism Spectrum Disorder (ASD) by comparing the differences in eye movements between ASD and healthy people. However, existing works mainly employ a simple free-viewing paradigm or visual search paradigm with restricted or unnatural stimuli to collect the gaze patterns of adults or children with an average age of 6-to-8 years, hindering the early diagnosis and intervention of preschool children with ASD. In this paper, we propose a novel method for identifying children with ASD that has three distinctive features: First, we design a novel eye-tracking paradigm that records Visual Question Answering (VQA) driven gaze patterns in complex natural scenes as a powerful guide for differentiating children with ASD. Second, we contribute a carefully designed dataset, named VQA4ASD, for collecting VQA-driven eye-tracking data from 2-to-6-year-old ASD and healthy children. To the best of our knowledge, this is the first dataset focusing on the early diagnosis of preschool children, which could help the community understand and explore the visual behaviors of ASD children. Third, we further develop a VQA-guided cooperative ASD screening network (VQA-CASN), in which both task-agnostic and task-specific visual scanpaths are explored simultaneously for ASD screening. Extensive experiments demonstrate that the proposed VQA-CASN achieves competitive performance with the proposed VQA-driven eye-tracking paradigm. The code and dataset are available at: https://github.com/qijiansong/VQA4ASD.

Abstract:
Elliptical Object Detection (EOD) is crucial yet challenging due to complex scenes and varying object characteristics. Existing methods often struggle with parameter configurations and lack adaptability in label-scarce scenarios. To address this, we propose a new semi-supervised teacher-student framework, namely Dual-Teacher Collaborative Guidance (DTCG), comprising a five-parameter teacher detector, a six-parameter teacher detector, and a student detector. DTCG allows the two teachers to specialize in different regression approaches and to co-instruct the student within a unified model, so as to prevent errors and enhance performance. Additionally, a feature correlation module highlights differences between teacher features and employs deformable convolution to select advantageous features for final parameter regression. Furthermore, we devise a collaborative training strategy to update the two teachers asynchronously. Extensive experiments conducted on two widely recognized datasets affirm the superior performance of our DTCG over other leading competitors across various semi-supervised scenarios. Notably, our method achieves a 5.61% higher performance than the second best method when utilizing only 10% annotated data. Code is available at https://github.com/FengLongHan/DTCG.

Abstract:
Recently, many studies have highlighted that training Generative Adversarial Networks (GANs) with limited data suffers from the overfitting of the discriminator (D). Existing studies mitigate the overfitting of D by employing data augmentation, model regularization, or pre-trained models. Despite the success of existing methods in training GANs with limited data, noise injection is another plausible, complementary, yet not well-explored approach to alleviate the overfitting of D. In this paper, we propose a simple yet effective method called Dual Adaptive Noise Injection (DANI), to further improve the training of GANs with limited data. Specifically, DANI consists of two adaptive strategies: adaptive injection probability and adaptive noise strength. For the adaptive injection probability, Gaussian noise is injected into both real and fake images for generator (G) and D with a probability p, respectively, where the probability p is controlled by the overfitting degree of D. For the adaptive noise strength, the Gaussian noise is produced by applying the adaptive forward diffusion process to both real and fake images, respectively. As a result, DANI can effectively increase the overlap between the distributions of real and fake data during training, thus alleviating the overfitting of D. Extensive experiments on several commonly-used datasets with both StyleGAN2 and FastGAN backbones demonstrate that DANI can further improve the training of GANs with limited data and achieve state-of-the-art results compared with other methods. Codes are available at https://github.com/zzhang05/DANI.

Abstract:
3D human pose estimation from multiple cameras with unknown calibration has received less attention than it should. The few existing data-driven solutions do not fully exploit 3D training data that are available on the market, and typically train from scratch for every novel multi-view scene, which impedes both accuracy and efficiency. We show how to exploit 3D training data to the fullest and associate multiple dynamic views efficiently to achieve high precision on novel scenes using a simple yet effective framework, dubbed Multiple Dynamic View Pose estimation (MDVPose). MDVPose uses data from novel scenarios to finetune a single-view pretrained motion encoder in a multi-view setting, aligns an arbitrary number of views in a unified coordinate system via Procrustes alignment, and imposes multi-view consistency. The proposed method achieves 22.1 mm P-MPJPE or 34.2 mm MPJPE on the challenging in-the-wild Ski-Pose PTZ dataset, which outperforms the state-of-the-art method by 24.8% P-MPJPE (-7.3 mm) and 19.0% MPJPE (-8.0 mm). It also outperforms the state-of-the-art methods by a large margin (-18.2 mm P-MPJPE and -28.3 mm MPJPE) on the EgoBody dataset. In addition, MDVPose achieves robust performance on the Human3.6M dataset featuring multiple static cameras. Code is available at https://github.com/iGame-Lab/MDVPose.
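
For readers unfamiliar with the metrics quoted above, the sketch below computes MPJPE (mean per-joint position error) and reproduces the arithmetic behind the reported relative gains on Ski-Pose PTZ. It is an illustration only, not the MDVPose evaluation code. P-MPJPE applies a Procrustes alignment of the prediction to the ground truth before the same computation, and that alignment step is omitted here. The helper names `mpjpe` and `relative_gain` are hypothetical.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over joints; each joint is an (x, y, z) tuple in mm."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


def relative_gain(ours_mm, absolute_gain_mm):
    """Relative improvement over a baseline whose error is ours + absolute gain."""
    baseline_mm = ours_mm + absolute_gain_mm
    return absolute_gain_mm / baseline_mm


# Toy two-joint pose: per-joint errors are 3 mm and 4 mm.
toy_error = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])

# Figures from the abstract: 22.1 mm P-MPJPE (-7.3 mm), 34.2 mm MPJPE (-8.0 mm).
p_mpjpe_gain = relative_gain(22.1, 7.3)  # 7.3 / 29.4
mpjpe_gain = relative_gain(34.2, 8.0)    # 8.0 / 42.2
```

Under this reading, the implied previous best errors are 29.4 mm P-MPJPE and 42.2 mm MPJPE, which is consistent with the 24.8% and 19.0% figures in the abstract.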

Abstract:
Recently, automatic micro-expression (ME) analysis has attracted increasing attention, since ME is a spontaneous facial expression that can truly reflect the emotional state an individual tries to conceal. As a crucial step in ME analysis, Micro- and Macro-expression (MaE) spotting aims to sequentially identify the occurrence intervals of MEs and MaEs within a long video sequence. However, the subtle spatiotemporal movements of MEs and the scarcity of well-labeled data pose great challenges for accurately spotting them. To this end, this paper proposes a novel spotting framework based on Multi-scale Feature Learning Network with Optical Flow Correction. Specifically, we first integrate the pre-trained VideoMAE and customized convolutional layers as a visual feature extraction module to learn the facial motion features in long video sequences. Then, to comprehensively locate and identify the existing ME and MaE segments, we introduce a multi-scale candidate segment generation method based on the ActionFormer. In particular, a multi-start points optical flow filtering method is proposed to improve the precision of expression spotting. Finally, we conduct comprehensive experiments on the MEGC2024 spotting task, and the experimental results demonstrate the effectiveness of our method, which ranks second in this task. The implemented code is also publicly available at https://github.com/zzy188zzy/megc_spotting_code.

Abstract:
Accurately identifying correct correspondences (inliers) among the initial ones is pivotal for robust feature-based point cloud registration. Current methods typically rely on one-shot 3D correspondence classification with a single coherence constraint to obtain inliers. These approaches are either insufficiently accurate or inefficient, often requiring more network parameters. To address this issue, we propose a lightweight network, 3DPCP-Net, for fast and robust registration. Its core design lies in progressive correspondence pruning through mining deep spatial geometric coherence, which can effectively learn pairwise 3D spatial distance and angular features to progressively remove outliers (mismatched correspondences) for accurate pose estimation. Moreover, we propose an efficient feature-based hypothesis proposer that leverages the geometric consistency features to explicitly generate reliable model hypotheses for each reliable correspondence. Extensive experiments on 3DMatch, 3DLoMatch, KITTI and Augmented ICL-NUIM demonstrate the accuracy and efficiency of our method for outlier removal and pose estimation tasks. Furthermore, our method is highly versatile and can be easily integrated into both learning-based and geometry-based frameworks, enabling them to achieve state-of-the-art results. Code is available at https://github.com/jtw220/3DPCP-Net.
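
To illustrate the pairwise spatial distance cue mentioned above, here is a minimal hand-crafted sketch. It is not the learned 3DPCP-Net. It relies on the fact that a rigid transform preserves distances, so two correct correspondences must have equal point-to-point distances on the source and target sides. The threshold `tau`, the `min_support` rule, and all function names are assumptions made for this example.

```python
import math

def length_consistency(corr_a, corr_b):
    """Absolute difference between source-side and target-side distances
    of two correspondences, each given as (source_point, target_point)."""
    (p_a, q_a), (p_b, q_b) = corr_a, corr_b
    return abs(math.dist(p_a, p_b) - math.dist(q_a, q_b))

def compatibility_counts(correspondences, tau=0.1):
    """For each correspondence, count the others that are length-consistent within tau."""
    n = len(correspondences)
    counts = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and length_consistency(correspondences[i], correspondences[j]) <= tau:
                counts[i] += 1
    return counts

def prune(correspondences, tau=0.1, min_support=2):
    """Keep correspondences supported by at least min_support consistent partners."""
    counts = compatibility_counts(correspondences, tau)
    return [c for c, k in zip(correspondences, counts) if k >= min_support]

# Four inliers related by a pure translation of (1, 2, 3), plus one mismatch.
outlier = ((1.0, 1.0, 1.0), (5.0, 5.0, 5.0))
correspondences = [
    ((0.0, 0.0, 0.0), (1.0, 2.0, 3.0)),
    ((1.0, 0.0, 0.0), (2.0, 2.0, 3.0)),
    ((0.0, 1.0, 0.0), (1.0, 3.0, 3.0)),
    ((0.0, 0.0, 1.0), (1.0, 2.0, 4.0)),
    outlier,
]
counts = compatibility_counts(correspondences)
kept = prune(correspondences)
```

In this toy case the mismatched pair has no consistent partner and is dropped, while the four inliers support one another. A progressive scheme could repeat such pruning over several rounds with tighter thresholds.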

Abstract:
With the rapid development of video conferencing and online education applications, screen content image (SCI) compression has become increasingly crucial. Recently, deep learning techniques have made significant progress in compressing natural images, surpassing the performance of traditional standards like versatile video coding. However, directly applying these methods to SCIs is challenging due to the unique characteristics of SCIs. In this paper, we propose a synergistic approach to preserve structural fidelity and text integrity for SCIs. Firstly, external prior guidance is proposed to enhance structural fidelity and text integrity by providing global spatial attention. Then, a structural enhancement module is proposed to improve the preservation of structural information by enhanced spatial feature transform. Finally, the loss function is optimized for better compression efficiency in text regions by weighted mean square error. Experimental results show that the proposed method achieves 13.3% BD-Rate saving compared to the baseline window attention convolutional neural networks (WACNN) on the JPEGAI, SIQAD, SCID, and MLSCID datasets on average. Our code is available at https://github.com/vpaHduGroup/SFTIP_SCC.
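
The weighted mean square error mentioned above can be sketched as follows. The abstract does not give the exact weighting, so the binary text mask, the `text_weight` value, and the normalization by the sum of weights are all assumptions chosen for illustration.

```python
def weighted_mse(original, reconstructed, text_mask, text_weight=2.0):
    """MSE in which pixels flagged as text count text_weight times as much.

    Inputs are 2D lists of equal shape; text_mask holds 1 for text pixels
    and 0 elsewhere. The result is normalized by the total weight.
    """
    total = 0.0
    weight_sum = 0.0
    for row_x, row_y, row_m in zip(original, reconstructed, text_mask):
        for x, y, m in zip(row_x, row_y, row_m):
            w = text_weight if m else 1.0
            total += w * (x - y) ** 2
            weight_sum += w
    return total / weight_sum

# 2x2 toy image: one error falls on a text pixel, one on a background pixel.
original = [[0.0, 0.0], [0.0, 0.0]]
reconstructed = [[1.0, 0.0], [0.0, 1.0]]
text_mask = [[1, 0], [0, 0]]

weighted = weighted_mse(original, reconstructed, text_mask, text_weight=2.0)
plain = weighted_mse(original, reconstructed, text_mask, text_weight=1.0)
```

With `text_weight=1.0` the function reduces to ordinary MSE. Larger weights make errors in text regions contribute more to the loss.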

Abstract:
With the popularity of social media, a growing number of online chats and comments are presented in the form of multimodal dialogues containing stickers. Automatically summarizing these dialogues can effectively reduce content overload and save reading time. However, existing datasets and works address either text dialogue summarization or articles with real photos, for which text summarization and key image extraction are performed separately, and have not considered multimodal dialogue summarization with sticker images in online chat scenarios. To compensate for the lack of datasets and research in this field, we propose a brand-new Multimodal Chat Dialogue Summarization Containing Stickers (MCDSCS) task and dataset. It consists of 5,527 Chinese multimodal chat dialogues and 14,356 different sticker images, with each dialogue interspersed with stickers in the text to reflect the real social media chat scenario. MCDSCS can also contribute to filling the gap in Chinese multimodal dialogue data. We use the most advanced GPT4 model and carefully designed Chain-of-Thought (CoT) prompts, supplemented with manual review, to generate dialogues and extract summaries. We also propose a novel method that integrates the visual information of stickers with the text descriptions of emotions and intentions (TEI). Experiments show that our method can effectively improve the performance of various mainstream summary generation models, performing even better than some other multimodal models, ChatGPT, and Vision Large Language Models (VLMs). Our data and code are publicly available at https://github.com/FakerBoom/MCDSCS.

Abstract:
Simultaneously achieving 3D reconstruction and novel view synthesis for indoor environments has widespread applications but is technically very challenging. State-of-the-art methods based on implicit neural functions can achieve excellent 3D reconstruction results, but their performance on novel view synthesis can be unsatisfactory. The exciting development of neural radiance fields (NeRF) has revolutionized novel view synthesis; however, NeRF-based models can fail to reconstruct clean geometric surfaces. We have developed a dual neural radiance field (Du-NeRF) to simultaneously achieve high-quality geometry reconstruction and view rendering. Du-NeRF contains two geometric fields, one derived from the SDF field to facilitate geometric reconstruction and the other derived from the density field to boost novel view synthesis. One of the innovative features of Du-NeRF is that it decouples a view-independent component from the density field and uses it as a label to supervise the learning process of the SDF field. This reduces shape-radiance ambiguity and enables geometry and color to benefit from each other during the learning process. Extensive experiments demonstrate that Du-NeRF can significantly improve the performance of novel view synthesis and 3D reconstruction for indoor environments, and that it is particularly effective in reconstructing areas containing fine geometries that do not obey multi-view color consistency. Our code is available at: https://github.com/pcl3dv/DuNeRF.

Abstract:
Although current prompt learning methods have successfully been designed to effectively reuse the large pre-trained models without fine-tuning their large number of parameters, they still have limitations to be addressed, i.e., without considering the adverse impact of meaningless patches in every image and without simultaneously considering in-sample generalization and out-of-sample generalization. In this paper, we propose an adaptive multi-modality prompt learning to address the above issues. To do this, we employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization, by first masking meaningless patches and then padding them with the learnable parameters and the information from texts. Moreover, each of the prompts provides auxiliary information to each other, further strengthening these two kinds of generalization. Experimental results on real datasets demonstrate that our method outperforms SOTA methods, in terms of different downstream tasks.

Abstract:
With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.

Abstract:
Estimating the momentary level of participants' engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e., training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate'24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions between three to four people. In this way, MultiMediate'24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate'24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.

Abstract:
The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, etc., coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the first large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks, from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLMs for egocentric video understanding. Our extensive experiments demonstrate EAGLE's superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.

Abstract:
Although the Segment Anything Model (SAM) has achieved impressive results in many segmentation tasks and benchmarks, its performance noticeably deteriorates when applied to high-resolution images for high-precision segmentation, limiting its usage in many real-world applications. In this work, we explore transferring SAM into the domain of high-resolution images and propose Pi-SAM. Compared to the original SAM and its variants, Pi-SAM demonstrates the following advantages. Firstly, Pi-SAM possesses a strong perception capability for the extremely fine details in high-resolution images, enabling it to generate high-precision segmentation masks. As a result, Pi-SAM significantly surpasses previous methods on four high-resolution datasets. Secondly, Pi-SAM supports more precise user interactions. In addition to the native promptable ability of SAM, Pi-SAM allows users to interactively refine the segmentation predictions simply by clicking, whereas the original SAM fails to achieve this on high-resolution images. Thirdly, building upon SAM, Pi-SAM introduces very few additional parameters and computational costs and ensures highly efficient model fine-tuning to achieve the above performance.

Abstract:
Brain-inspired Spiking Neural Networks (SNNs) leverage sparse spikes to represent information and process them in an asynchronous event-driven manner, offering an energy-efficient paradigm for the next generation of machine intelligence. However, the current focus within the SNN community prioritizes accuracy optimization through the development of large-scale models, limiting their viability in resource-constrained and low-power edge devices. To address this challenge, we introduce a lightweight and hardware-friendly Quantized SNN (Q-SNN) that applies quantization to both synaptic weights and membrane potentials. By significantly compressing these two key elements, the proposed Q-SNNs substantially reduce both memory usage and computational complexity. Moreover, to prevent the performance degradation caused by this compression, we present a new Weight-Spike Dual Regulation (WS-DR) method inspired by information entropy theory. Experimental evaluations on various datasets, including static and neuromorphic, demonstrate that our Q-SNNs outperform existing methods in terms of both model size and accuracy. These state-of-the-art results in efficiency and efficacy suggest that the proposed method can significantly improve edge intelligent computing.

Abstract:
Referring image segmentation (RIS) aims to segment a particular region based on a specific expression. Existing one-stage methods have explored various fusion strategies, yet they encounter two significant issues. Primarily, most methods rely on manually selected visual features from the visual encoder layers. Moreover, the direct fusion of word-level features into coarsely aligned features disrupts the established vision-language alignment. In this paper, we introduce an innovative framework for RIS that seeks to overcome these challenges with adaptive alignment of vision and language features, termed the Adaptive Selection with Dual Alignment (ASDA). ASDA innovates in two aspects. Firstly, we design an Adaptive Feature Selection and Fusion (AFSF) module to dynamically select visual features focusing on different regions related to various descriptions. AFSF is equipped with a scale-wise feature aggregator to provide hierarchically coarse features that preserve crucial low-level details. Secondly, a Word Guided Dual-Branch Aligner (WGDA) is leveraged to integrate coarse features with linguistic cues by word-guided attention, which effectively addresses the common issue of vision-language misalignment. Extensive experimental results demonstrate that our ASDA framework surpasses state-of-the-art methods on the RefCOCO, RefCOCO+ and G-Ref benchmarks.

Abstract:
Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a more lightweight and efficient one. In recent years, logit-based KD methods are quickly catching up in performance with their feature-based counterparts. However, previous research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely overconfident teacher and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat the above cruxes. We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to on-going logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins.
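
For readers unfamiliar with logit-based distillation, the sketch below shows the classic temperature-softened KL objective that such frameworks commonly build on. It is not the CRLD method itself: the within-view and cross-view regularisations and the confidence-based label mining are not reproduced. The temperature value and function names are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) between softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl

teacher = [5.0, 1.0, -2.0]
student_good = [5.0, 1.0, -2.0]   # matches the teacher exactly
student_bad = [-2.0, 1.0, 5.0]    # ranks the classes in reverse order

loss_good = kd_loss(student_good, teacher)
loss_bad = kd_loss(student_bad, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. An overconfident teacher yields a sharply peaked target, which is one of the issues the abstract refers to.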

Abstract:
In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to find images of the same identity described by a text sentence from a pool of candidate images. Benefiting from Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), TIReID techniques have achieved remarkable progress recently. However, most existing methods only focus on instance-level matching and ignore identity-level matching, which involves associating multiple images and texts belonging to the same person. In this paper, we propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Our Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by 'initialize, adapt, enrich, then aggregate'. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update prototypes conditioned on intra-modal and inter-modal instances to ensure prototype diversity. Finally, we design an adaptive prototype aggregation module to aggregate these prototypes, generating final identity-enriched prototypes. With identity-enriched prototypes, we diffuse their rich identity information to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot compared to existing TIReID methods.

Abstract:
Homography estimation is the task of determining the transformation from an image pair. Our approach focuses on employing detector-free feature matching methods to address this issue. Previous work has underscored the importance of incorporating semantic information; however, an efficient way to utilize it is still lacking. Previous methods treat semantics as a pre-processing step, which makes the use of semantics overly coarse-grained and limits adaptability across different tasks. In our work, we seek another way to use semantic information, namely a semantic-aware feature representation learning framework. Based on this, we propose SRMatcher, a new detector-free feature matching method, which encourages the network to learn an integrated semantic feature representation. Specifically, to capture precise and rich semantics, we leverage the capabilities of recently popularized vision foundation models (VFMs) trained on extensive datasets. Then, a cross-image Semantic-aware Fusion Block (SFB) is proposed to integrate their fine-grained semantic features into the feature representation space. In this way, by reducing errors stemming from semantic inconsistencies in matching pairs, our proposed SRMatcher is able to deliver more accurate and realistic outcomes. Extensive experiments show that SRMatcher surpasses solid baselines and attains SOTA results on multiple real-world datasets. Compared to the previous SOTA approach GeoFormer, SRMatcher increases the area under the cumulative curve (AUC) by about 11% on HPatches. Additionally, SRMatcher could serve as a plug-and-play framework for other matching methods like LoFTR, yielding substantial precision improvement.

Abstract:
Recent advances in neural camera imaging pipelines have demonstrated notable progress. Nevertheless, the real-world imaging pipeline still faces challenges including the lack of joint optimization in system components, computational redundancies, and optical distortions such as lens shading. In light of this, we propose an end-to-end camera imaging pipeline (RealCamNet) to enhance real-world camera imaging performance. Our methodology diverges from conventional, fragmented multi-stage image signal processing towards end-to-end architecture. This architecture facilitates joint optimization across the full pipeline and the restoration of coordinate-biased distortions. RealCamNet is designed for high-quality conversion from RAW to RGB and compact image compression. Specifically, we deeply analyze coordinate-dependent optical distortions, e.g., vignetting and dark shading, and design a novel Coordinate-Aware Distortion Restoration (CADR) module to restore coordinate-biased distortions. Furthermore, we propose a Coordinate-Independent Mapping Compression (CIMC) module to implement tone mapping and redundant information compression. Existing datasets suffer from misalignment and overly idealized conditions, making them inadequate for training real-world imaging pipelines. Therefore, we collected a real-world imaging dataset. Experiment results show that RealCamNet achieves the best rate-distortion performance with lower inference latency.

Abstract:
Recent developments in diffusion models have demonstrated an exceptional capacity to generate high-quality, prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing one or more objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing many objects in a complex scene in one pass. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions, resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the state-of-the-art (SOTA). We also curate and release a dataset dedicated to multi-object editing, named LoMOE-Bench. Our experiments against existing SOTA demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed.

Abstract:
Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access these enriched color input samples in practical applications, where only sketches are available. Existing sketch-to-3D research is limited in its applicability by the lack of color information and multi-view content. To overcome these limitations, this paper proposes a novel generation paradigm, Sketch3D, to generate realistic 3D assets with the shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in the reference image through the shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize the 3D Gaussians, i.e., structural optimization via a distribution transfer mechanism, color optimization with a straightforward MSE loss, and sketch similarity optimization with a CLIP-based geometric similarity loss. Extensive visual comparisons and quantitative analysis illustrate the advantage of our Sketch3D in generating realistic 3D assets while preserving consistency with the input.

Abstract:
Neural radiance fields (NeRF) have advanced the development of 3D volumetric video technology, but the large data volumes they involve pose significant challenges for storage and transmission. To address these problems, the existing solutions typically compress these NeRF representations after the training stage, leading to a separation between representation training and compression. In this paper, we try to directly learn a compact NeRF representation for volumetric video in the training stage based on the proposed rate-aware compression framework. Specifically, for volumetric video, we use a simple yet effective modeling strategy to reduce temporal redundancy for the NeRF representation. Then, during the training phase, an implicit entropy model is utilized to estimate the bitrate of the NeRF representation. This entropy model is then encoded into the bitstream to assist in the decoding of the NeRF representation. This approach enables precise bitrate estimation, thereby leading to a compact NeRF representation. Furthermore, we propose an adaptive quantization strategy and learn the optimal quantization step for the NeRF representations. Finally, the NeRF representation can be optimized by using the rate-distortion trade-off. Our proposed compression framework can be used for different representations, and experimental results demonstrate that our approach significantly reduces the storage size with marginal distortion and achieves state-of-the-art rate-distortion performance for volumetric video on the HumanRF and ReRF datasets. Compared to the previous state-of-the-art method TeTriRF, we achieve an approximately -80% BD-rate on the HumanRF dataset and -60% BD-rate on the ReRF dataset.

Abstract:
The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By utilizing the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach enhances the model's capability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR surpasses the performance of Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.

Abstract:
Foundational segmentation models, predominantly trained on scenes typical of natural environments, struggle to generalize across varied image domains. Traditional "training-to-adapt" methods rely heavily on extensive data retraining and model architecture modifications. This significantly limits the models' generalization capabilities and efficiency in deployment. In this study, we propose a novel adaptation paradigm, termed "prompting-to-adapt", to tackle the above issue by introducing an innovative image prompter. This prompter generates domain-specific prompts through few-shot image-mask pairs, incorporating diverse image processing techniques to enhance adaptability. To tackle the inherent non-differentiability of image prompts, we further devise an information-estimation-based gradient descent strategy that leverages the information entropy of image processing combinations to optimize the prompter, ensuring effective adaptation. Through extensive experiments across nine datasets spanning seven image domains (i.e., depth, thermal, camouflage, endoscopic, ultrasound, grayscale, and natural) and four scenarios (i.e., common scenes, camouflage objects, medical images, and industrial data), we demonstrate that our approach significantly improves the foundational models' adaptation capabilities. Moreover, the interpretability of the generated prompts provides insights into their image processing mechanisms. Source code is available at: https://github.com/yuema1303/Prompting-to-Adapt-FSM.

Abstract:
There has been an increasing focus on domain-generalized (DG) face anti-spoofing (FAS). However, existing methods aim to project a shared visual space through adversarial training, making exploring the space without losing semantic information challenging. We investigate the DG inadequacies resulting from classifier overfitting to a significantly different domain distribution. To address this issue, we propose a novel Fine-Grained Prompt Learning (FGPL) based on Vision-Language Models (VLMs), such as CLIP, which can adaptively adjust weights for classifiers with text features to mitigate overfitting. Specifically, FGPL first motivates the prompts to learn content and domain semantic information by capturing Domain-Agnostic and Domain-Specific features. Furthermore, our prompts are designed to be category-generalized by diversifying the Domain-Specific prompts. Additionally, we design an Adaptive Convolutional Adapter (AC-adapter), which is implemented through an adaptive combination of Vanilla Convolution and Central Difference Convolution, to be inserted into the image encoder for quickly bridging the gap between general image recognition and FAS task. Extensive experiments demonstrate that the proposed FGPL is effective and outperforms state-of-the-art methods on several cross-domain datasets.

Abstract:
In recent years, multi-view outlier detection (MVOD) methods have advanced significantly, aiming to identify outliers within multi-view datasets. A key point is to better detect class outliers and class-attribute outliers, which only exist in multi-view data. However, existing methods are either unable to reduce the impact of outliers when learning view-consistent information or struggle in cases with varying neighborhood structures. Moreover, most of them do not apply to partial multi-view data in real-world scenarios. To overcome these drawbacks, we propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD). In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency. Specifically, we propose (1) an outlier-aware contrastive loss with a potential outlier memory bank to eliminate their bias, motivated by a theoretical analysis; (2) a neighbor alignment contrastive loss to capture the view-shared local structural correlation; and (3) a spreading regularization loss to prevent the model from overfitting to outliers. With the Cross-view Relation Transfer technique, we can easily impute the missing view samples based on the features of neighbors. Experimental results on four benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art competitors under different settings.

Abstract:
Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts focus on modeling speaker-sensitive context dependencies and multimodal fusion. Despite the progress, the reliability of MERC methods remains largely unexplored. Extensive empirical studies reveal that current methods suffer from unreliable predictive confidence. Specifically, in some cases, the confidence estimated by these models increases when a modality or specific contextual cues are corrupted; we define such cases as uncertain samples. This contradicts the foundational principle in informatics, namely, the elimination of uncertainty. Based on this, we propose a novel calibration framework, CMERC, to calibrate MERC models without altering the model structure. It integrates curriculum learning to guide the model in progressively learning more uncertain samples; hybrid supervised contrastive learning to refine utterance representations by pulling uncertain samples and others apart; and a confidence constraint to penalize the model on uncertain samples. Experimental results on two datasets demonstrate the effectiveness and generalization capabilities of our CMERC across various MERC models, surpassing state-of-the-art methods.

Abstract:
Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs) faces a major difficulty: image information is processed in the LMM as continuous visual embeddings, for which discrete supervised labels for classification cannot be obtained. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual tokens, which map the visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.

Abstract:
Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map a 2D image to 3D Gaussian parameters, but regressing a 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model for image-to-3D generation, which takes as input the initial point cloud produced by a large 3D diffusion model conditioned on a 2D image and generates the Gaussian parameters. The point cloud provides an initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D generation. Moreover, we present the Attention mechanism, Projection mechanism, and Point feature extractor, dubbed the APP block, for fusing the image features with point cloud features. Extensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed approach on the GSO and Objaverse datasets, and show that the proposed method achieves state-of-the-art performance.

Abstract:
Diffusion models have demonstrated remarkable efficacy in generating high-quality samples. Existing diffusion-based image restoration algorithms exploit pre-trained diffusion models to leverage data priors, yet they still preserve elements inherited from the unconditional generation paradigm. These strategies initiate the denoising process with pure white noise and incorporate random noise at each generative step, leading to over-smoothed results. In this paper, we present a refined paradigm for diffusion-based image restoration. Specifically, we opt for a sample consistent with the measurement identity at each generative step, exploiting the sampling selection as an avenue for output stability and enhancement. The number of candidate samples used for selection is adaptively determined based on the signal-to-noise ratio of the timestep. Additionally, we start the restoration process with an initialization combined with the measurement signal, providing supplementary information to better align the generative process. Extensive experimental results and analyses validate that our proposed method significantly enhances image restoration performance while consuming negligible additional computational resources.
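
The selection step described above can be illustrated with a toy sketch. It assumes a known degradation operator (here, keeping every second sample) and picks, from a set of candidate samples, the one whose degraded version is closest to the measurement. The names `degrade` and `select_consistent` and the squared-error criterion are illustrative assumptions, not the paper's implementation. How candidates are drawn, and how their number depends on the signal-to-noise ratio of the timestep, is left to the caller.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def degrade(x: Sequence[float]) -> Vector:
    """Hypothetical degradation operator: keep every second sample."""
    return list(x[::2])


def measurement_error(x: Sequence[float], y: Sequence[float],
                      forward_op: Callable[[Sequence[float]], Vector]) -> float:
    """Squared error between the degraded candidate and the measurement."""
    return sum((a - b) ** 2 for a, b in zip(forward_op(x), y))


def select_consistent(candidates: Sequence[Vector], y: Sequence[float],
                      forward_op: Callable[[Sequence[float]], Vector] = degrade) -> Vector:
    """Return the candidate most consistent with the measurement y."""
    return min(candidates, key=lambda x: measurement_error(x, y, forward_op))


# Three candidate samples for one generative step (made-up numbers).
candidates = [
    [0.0, 9.0, 0.0, 9.0],
    [1.0, 5.0, 2.0, 5.0],
    [1.1, 0.0, 3.0, 0.0],
]
y = [1.0, 2.0]  # measurement observed through `degrade`
best = select_consistent(candidates, y)
```

In this toy setting the second candidate reproduces the measurement exactly, so it is the one selected.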

Abstract:
Prompt learning represents a promising method for adapting pre-trained vision-language models (VLMs) to various downstream tasks by learning a set of text embeddings. One challenge inherent to these methods is the poor generalization performance due to the invalidity of the learned text embeddings for unseen tasks. A straightforward approach to bridge this gap is to freeze the text embeddings in prompts, which results in a lack of capacity to adapt VLMs for downstream tasks. To address this dilemma, we propose a paradigm called EnPrompt with a novel External Layer (EnLa). Specifically, we propose a textual external layer and learnable visual embeddings for adapting VLMs to downstream tasks. The learnable external layer is built upon valid embeddings of pre-trained CLIP. This design considers the balance of learning capabilities between the two branches. To align the textual and visual features, we propose a novel two-pronged approach: i) we introduce optimal transport as the discrepancy metric to align the vision and text modalities, and ii) we introduce a novel strengthening feature to enhance the interaction between these two modalities. Four representative experiments (i.e., base-to-novel generalization, few-shot learning, cross-dataset generalization, and domain-shift generalization) across 15 datasets demonstrate that our method outperforms existing prompt learning methods.

Abstract:
Online action detection aims to identify ongoing actions within untrimmed video streams, with extensive applications in real-life scenarios. However, in practical applications, video frames are received sequentially over time and new action categories continually emerge, giving rise to the challenge of catastrophic forgetting, a problem that remains inadequately explored. Generally, in the field of video understanding, researchers address catastrophic forgetting through class-incremental learning. Nevertheless, online action detection is based solely on historical observations, thus demanding higher temporal modeling capabilities from class-incremental learning methods. In this paper, we conceptualize this task as Class-Incremental Online Action Detection (CIOAD) and propose a novel framework, TS-ILM, to address it. Specifically, TS-ILM consists of two components: a task-level temporal pattern extractor and a temporal-sensitive exemplar selector. The former extracts the temporal patterns of actions in different tasks and saves them, allowing the data to be comprehensively observed on a temporal level before it is input into the backbone. The latter selects a set of frames with the highest causal relevance and minimum information redundancy for subsequent replay, enabling the model to learn the temporal information of previous tasks more effectively. We benchmark our approach against SoTA class-incremental learning methods applied in the image and video domains on the THUMOS'14 and TVSeries datasets. Our method outperforms the previous approaches.

Abstract:
Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representations and introduces twin-attention mechanisms to capture them effectively. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on the ScanNetV2, ScanNet200, and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.

Abstract:
Enzyme design plays a crucial role in both industrial production and biology. However, this field faces challenges due to the lack of comprehensive benchmarks and the complexity of enzyme design tasks, leading to a dearth of systematic research. Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. We begin by employing a cross-modal structure-to-sequence transformation architecture as the feature-driven starting point to obtain an initial robust protein representation. Subsequently, we leverage domain adaptive techniques to generalize to specific enzyme design tasks under low-resource conditions. MetaEnzyme focuses on three fundamental low-resource enzyme redesign tasks: functional design (FuncDesign), mutation design (MutDesign), and sequence generation design (SeqDesign). Through a novel unified paradigm and enhanced representation capabilities, MetaEnzyme demonstrates adaptability to diverse enzyme design tasks, yielding outstanding results. Wet lab experiments further validate these findings, reinforcing the efficacy of the redesign process.

Abstract:
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this temporal stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: an inter-agent interaction constraint and an intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets, including simulated and real-world scenarios, highlight the superior prediction accuracy and generalization capabilities of our model.

Abstract:
Fake audio detection is an emerging active topic. A growing body of literature aims to detect fake utterances, which are mostly generated by text-to-speech (TTS) or voice conversion (VC). However, countermeasures against impersonation remain an underexplored area. Impersonation is a fake type that involves an imitator replicating specific traits and the speech style of a target speaker. Unlike TTS and VC, which often leave digital traces or signal artifacts, impersonation involves live human beings producing entirely natural speech, rendering the detection of impersonation audio a challenging task. Thus, we propose a novel method that integrates speaker profiles into the process of impersonation audio detection. Speaker profiles are inherent characteristics that are challenging for impersonators to mimic accurately, such as the speaker's age and job. We aim to leverage these features to extract discriminative information for detecting impersonation audio. Moreover, no large impersonated speech corpus is available for the quantitative study of impersonation impacts. To address this gap, we further design the first large-scale, diverse-speaker Chinese impersonation dataset, named ImPersonation Audio Detection (IPAD), to advance the community's research on impersonation audio detection. We evaluate several existing fake audio detection methods on our proposed dataset IPAD, demonstrating its necessity and the challenges it poses. Additionally, our findings reveal that incorporating speaker profiles can significantly enhance the model's performance in detecting impersonation audio.

Abstract:
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes, due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot use of advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.

Abstract:
Click-through rate (CTR) prediction is an essential component of industrial multimedia recommendation, and the key to enhancing the accuracy of CTR prediction lies in the effective modeling of feature interactions using rich user profiles, item attributes, and contextual information. Most of the current deep CTR models resort to parallel or stacked structures to break through the performance bottleneck of Multi-Layer Perceptron (MLP). However, we identify two limitations in these models: (1) parallel or stacked structures often treat explicit and implicit components as isolated entities, leading to a loss of mutual information; (2) traditional CTR models, whether in terms of supervision signals or interaction methods, lack the ability to filter out noise information, thereby limiting the effectiveness of the models.

Abstract:
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the XMeCap framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, show a marked improvement in caption generation for both single-image and multi-image memes, as well as for different meme categories. XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.

Abstract:
Federated learning is a promising distributed training paradigm that effectively safeguards data privacy. However, it may involve significant communication costs, which hinder training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters within predefined random noise. For this purpose, we propose Federated Masked Random Noise (FedMRN), a novel framework that enables clients to learn a 1-bit mask for each model parameter and apply masked random noise (i.e., the Hadamard product of random noise and masks) to represent model updates. To make FedMRN feasible, we propose an advanced mask training strategy, called progressive stochastic masking (PSM). After local training, each client only needs to transmit local masks and a random seed to the server. Additionally, we provide theoretical guarantees for the convergence of FedMRN under both strongly convex and non-convex assumptions. Extensive experiments are conducted on four popular datasets. The results show that FedMRN exhibits superior convergence speed and test accuracy compared to relevant baselines, while attaining a similar level of accuracy as FedAvg.
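
The communication pattern described above (a 1-bit mask per parameter plus a random seed) can be sketched as follows. This is a minimal illustration under stated assumptions: the noise distribution (uniform), its scale, and the greedy mask rule are hypothetical stand-ins, and the greedy rule in particular is not the progressive stochastic masking (PSM) strategy of the paper. The sketch only shows that the server can rebuild the masked random noise from the seed and the packed mask bits.

```python
import random
from typing import List


def make_noise(seed: int, n: int, scale: float = 0.01) -> List[float]:
    """Regenerate the same noise vector on client and server from a shared seed."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(n)]


def greedy_mask(target: List[float], noise: List[float]) -> List[int]:
    """Stand-in for mask training (NOT PSM): keep a noise entry only if it is
    closer to the desired update than a zero update would be."""
    return [1 if abs(t - z) < abs(t) else 0 for t, z in zip(target, noise)]


def pack_mask(mask: List[int]) -> bytes:
    """Pack one bit per parameter into bytes for transmission."""
    out = bytearray((len(mask) + 7) // 8)
    for i, bit in enumerate(mask):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_mask(payload: bytes, n: int) -> List[int]:
    """Inverse of pack_mask for the first n bits."""
    return [(payload[i // 8] >> (i % 8)) & 1 for i in range(n)]


def client_encode(target: List[float], seed: int) -> bytes:
    """Client side: only the packed mask (and the seed) leave the device."""
    noise = make_noise(seed, len(target))
    return pack_mask(greedy_mask(target, noise))


def server_decode(payload: bytes, seed: int, n: int) -> List[float]:
    """Server side: update = noise * mask (element-wise, i.e. Hadamard product)."""
    noise = make_noise(seed, n)
    mask = unpack_mask(payload, n)
    return [z * m for z, m in zip(noise, mask)]


# Hypothetical desired local update for a 10-parameter model.
target = [0.004, -0.007, 0.0, 0.009, -0.002, 0.006, -0.01, 0.003, 0.001, -0.005]
seed = 1234
payload = client_encode(target, seed)
update = server_decode(payload, seed, len(target))
```

For the 10 parameters here the payload is 2 bytes plus the seed, instead of 10 floating-point values.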

Abstract:
Video generation and editing, particularly human-centric video editing, has seen a surge of interest in its potential to create immersive and dynamic content. A fundamental challenge is ensuring temporal coherence and visual harmony across frames, especially in handling large-scale human motion and maintaining consistency over long sequences. Previous methods, such as zero-shot text-to-video methods with diffusion models, struggle with flickering and length limitations. In contrast, methods employing Video-2D representations grapple with accurately capturing complex structural relationships in large-scale human motion. In addition, some patterns on the human body appear only intermittently throughout the video, which makes identifying visual correspondence difficult. To address the above problems, we present HeroMaker, a human-centric video editing framework that manipulates the person's appearance within the input video and achieves consistent results across frames. Specifically, we propose to learn the motion priors, which represent the correspondences between dual canonical fields and each video frame, by leveraging body mesh-based human motion warping and neural deformation-based margin refinement in the video reconstruction framework to ensure the semantic correctness of canonical fields. HeroMaker performs human-centric video editing by manipulating the dual canonical fields and combining them with motion priors to synthesize temporally coherent and visually plausible results. Comprehensive experiments demonstrate that our approach surpasses existing methods regarding temporal consistency, visual quality, and semantic coherence.

Abstract:
The posterior estimation of parameters based on Bayesian theory is a crucial technique in Incremental Learning (IL). The estimated posterior is typically utilized to impose loss regularization, which aligns the current training model parameters with the previously learned posterior to mitigate catastrophic forgetting, a major challenge in IL. However, this additional loss regularization can also be detrimental to model learning, preventing the model from reaching the true global optimum. To overcome this limitation, this paper introduces a novel Bayesian IL framework, Robust Parameter Posterior Fusion (RP2F). Unlike traditional methods, RP2F directly estimates the parameter posterior for new data without introducing extra loss regularization, which allows the model to accommodate new knowledge more fully. It then fuses this new posterior with the existing ones based on the Maximum A Posteriori (MAP) principle, ensuring effective knowledge sharing across tasks. Furthermore, RP2F incorporates a common parameter-robustness prior to facilitate seamless integration during posterior fusion. Comprehensive experiments on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets show that RP2F not only effectively mitigates catastrophic forgetting but also achieves backward knowledge transfer.
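
To make the idea of MAP-based posterior fusion concrete, the sketch below works through the simplest possible case. It assumes, purely for illustration, that each task's posterior is an independent Gaussian per parameter and that the posteriors are combined as independent factors; the abstract does not state the form of the posteriors or the fusion rule, so `fuse_gaussian_posteriors` is a hypothetical helper and not RP2F itself. Under these assumptions the MAP estimate of the product is the precision-weighted mean.

```python
from typing import List, Sequence, Tuple


def fuse_gaussian_posteriors(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse per-task diagonal Gaussian posteriors (assumed form).

    For each parameter, the product of Gaussians N(m_k, v_k) is Gaussian with
    precision sum(1 / v_k) and mean sum(m_k / v_k) / sum(1 / v_k); that mean
    is also the MAP estimate of the fused distribution.
    """
    n_params = len(means[0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for j in range(n_params):
        precision = sum(1.0 / v[j] for v in variances)
        weighted = sum(m[j] / v[j] for m, v in zip(means, variances))
        fused_mean.append(weighted / precision)
        fused_var.append(1.0 / precision)
    return fused_mean, fused_var


# Two tasks, two parameters (made-up numbers).
old_mean, old_var = [0.0, 1.0], [1.0, 0.5]
new_mean, new_var = [2.0, 1.0], [1.0, 0.5]
mean, var = fuse_gaussian_posteriors([old_mean, new_mean], [old_var, new_var])
```

With equal variances the fused mean is the midpoint of the two task means, and the fused variance is half of each task's variance, reflecting the combined evidence.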

Abstract:
Knowledge distillation has become widely recognized for its ability to transfer knowledge from a large teacher network to a compact and more streamlined student network. Traditional knowledge distillation methods primarily follow a teacher-oriented paradigm that imposes the task of learning the teacher's complex knowledge onto the student network. However, significant disparities in model capacity and architectural design hinder the student's comprehension of the complex knowledge imparted by the teacher, resulting in sub-optimal performance. This paper introduces a novel student-oriented perspective that refines the teacher's knowledge to better align with the student's needs, thereby improving knowledge transfer effectiveness. Specifically, we present Student-Oriented Knowledge Distillation (SoKD), which incorporates a learnable feature augmentation strategy during training to dynamically refine the teacher's knowledge for the student. Furthermore, we deploy the Distinctive Area Detection Module (DAM) to identify areas of mutual interest between the teacher and student, concentrating knowledge transfer within these critical areas to avoid transferring irrelevant information. This customized module ensures a more focused and effective knowledge distillation process. Our approach, functioning as a plug-in, can be integrated with various knowledge distillation methods. Extensive experimental results demonstrate the efficacy and generalizability of our method.

Abstract:
Knowledge distillation based on a student-teacher network is one of the mainstream solution paradigms for the challenging unsupervised Anomaly Detection task, utilizing the difference in representation capabilities of the teacher and student networks to implement anomaly localization. However, over-generalization of the student network to the teacher network may lead to negligible differences in their representations of anomalies, thus reducing detection effectiveness. Existing methods address the possible over-generalization by using differentiated students and teachers from the structural perspective or by explicitly expanding distilled information from the content perspective, which inevitably increases the likelihood of underfitting of the student network and of poor anomaly detection at the center or edge of an anomaly. In this paper, we propose Dual-Modeling Decouple Distillation (DMDD) for unsupervised Anomaly Detection. In DMDD, a Decouple Student-Teacher Network is proposed to decouple the initial student features into normality and abnormality features. We further introduce Dual-Modeling Distillation based on normal-anomalous image pairs, fitting the normality features of the anomalous image to the teacher features of the corresponding normal image and widening the distance between the abnormality features and the teacher features in anomalous regions. Combining these two distillation ideas, we achieve anomaly detection that focuses on both the edge and the center of an anomaly. Finally, a Multi-perception Segmentation Network is proposed to achieve focused anomaly map fusion based on multiple attention. Experimental results on MVTec AD show that DMDD surpasses the SOTA localization performance of previous knowledge distillation-based methods, reaching 98.85% on pixel-level AUC and 96.13% on PRO.
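
The general student-teacher paradigm that this abstract builds on can be sketched in a few lines: an anomaly map is obtained by comparing teacher and student features at each spatial location. The sketch below uses cosine distance as the discrepancy measure, which is a common choice but an assumption here; it illustrates the generic paradigm only, not the decoupled features or the Multi-perception Segmentation Network of DMDD. The function names and the tiny feature grids are hypothetical.

```python
import math
from typing import List, Sequence

FeatureGrid = List[List[List[float]]]  # rows x cols x channels


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def anomaly_map(teacher: FeatureGrid, student: FeatureGrid) -> List[List[float]]:
    """Per-location discrepancy between teacher and student features."""
    return [
        [cosine_distance(t, s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher, student)
    ]


# One row with two locations: the student matches the teacher at the first
# location and disagrees (orthogonal features) at the second.
teacher_feats = [[[1.0, 0.0], [0.0, 1.0]]]
student_feats = [[[1.0, 0.0], [1.0, 0.0]]]
scores = anomaly_map(teacher_feats, student_feats)
```

Locations where the student reproduces the teacher score near 0, and locations where it fails to do so score higher; over-generalization, as discussed above, corresponds to the student also matching the teacher on anomalous regions and flattening this map.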

Abstract:
The modeling and manipulation of 3D scenes captured from the real world are pivotal in various applications, attracting growing research interest. While previous works on editing have achieved interesting results through manipulating 3D meshes, they often require accurately reconstructed meshes to perform editing, which limits their application in 3D content generation. To address this gap, we introduce a novel single-image-driven 3D scene editing approach based on 3D Gaussian Splatting, enabling intuitive manipulation via directly editing the content on a 2D image plane. Our method learns to optimize the 3D Gaussians to align with an edited version of the image rendered from a user-specified viewpoint of the original scene. To capture long-range object deformation, we introduce positional loss into the optimization process of 3D Gaussian Splatting and enable gradient propagation through reparameterization. To handle occluded 3D Gaussians when rendering from the specified viewpoint, we build an anchor-based structure and employ a coarse-to-fine optimization strategy capable of handling long-range deformation while maintaining structural stability. Furthermore, we design a novel masking strategy to adaptively identify non-rigid deformation regions for fine-scale modeling. Extensive experiments show the effectiveness of our method in handling geometric details, long-range, and non-rigid deformation, demonstrating superior editing flexibility and quality compared to previous approaches.

Abstract:
With the rise of immersive media applications such as digital museums, virtual reality, and interactive exhibitions, point clouds, as a three-dimensional data storage format, have gained increasingly widespread attention. The massive data volume of point clouds imposes extremely high requirements on transmission bandwidth in the above applications, gradually becoming a bottleneck for immersive media applications. Although existing learning-based point cloud compression methods have achieved specific successes in compression efficiency by mining the spatial redundancy of their local structural features, these methods often overlook the intrinsic connections between point cloud data and other modality data (such as image modality), thereby limiting further improvements in compression efficiency. To address the limitation, we innovatively propose a view-guided learned point cloud geometry compression scheme, namely ViewPCGC. We adopt a novel self-attention mechanism and cross-modality attention mechanism based on sparse convolution to align the modality features of the point cloud and the view image, removing view redundancy through Modality Redundancy Removal Module (MRRM). Simultaneously, side information of the view image is introduced into the Conditional Checkboard Entropy Model (CCEM), significantly enhancing the accuracy of the probability density function estimation for point cloud geometry. In addition, we design a View-Guided Quality Enhancement Module (VG-QEM) in the decoder, utilizing the contour information of the point cloud in the view image to supplement reconstruction details. The superior experimental performance demonstrates the effectiveness of our method. Compared to the state-of-the-art point cloud geometry compression methods, ViewPCGC exhibits an average performance gain exceeding 10% on D1-PSNR metric.

Abstract:
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take anti-forensic capabilities into consideration and propose the first end-to-end training framework for anti-forensic image inpainting, named SafePaint. Specifically, we formulate image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we design a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show that our approach achieves results comparable to existing image inpainting methods while offering anti-forensic capabilities not available in other methods.

Abstract:
Streaming videos from resource-constrained front-end devices over networks to resource-rich cloud servers has long been a common practice for surveillance and analytics. Most existing live video analytics (LVA) systems, however, have been built over terrestrial networks, limiting their applications during natural disasters and in remote areas that desperately call for real-time visual data delivery and scene analysis. With the recent advent of space networking, in particular, Low Earth Orbit (LEO) satellite constellations such as Starlink, high-speed truly global Internet access is becoming available and affordable. This paper examines the challenges and potentials of LVA over modern LEO satellite networking (LSN). Using Starlink as the testbed, we have carried out extensive in-the-wild measurements to gain insights into its achievable performance for LVA. The results reveal that the uplink bottleneck in today's LSN, together with the volatile network conditions, can significantly affect the service quality of LVA and necessitate prompt adaptation. We accordingly develop StarStream, a novel LSN-adaptive streaming framework for LVA. At its core, StarStream is empowered by a Transformer-based network performance predictor tailored for LSN and a content-aware configuration optimizer. We discuss a series of key design and implementation issues of StarStream and demonstrate its effectiveness and superiority through trace-driven experiments with real-world network and video processing data.

Abstract:
Recent studies reveal that even highly biased dense networks can contain an invariant substructure with superior out-of-distribution (OOD) generalization. While existing works commonly seek these substructures using global sparsity constraints, the uniform imposition of sparse penalties across samples with diverse levels of spurious contents renders such methods suboptimal. The precise adaptation of model sparsity, specifically tailored for spurious features, remains a significant challenge. Motivated by the insight that in-distribution (ID) data containing spurious features may exhibit lower experiential risk, we propose a novel Spurious Feature-targeted Pruning framework, dubbed SFP, to induce the authentic invariant substructures without referring to the above concerns. Specifically, SFP distinguishes spurious features within ID instances during training by a theoretically validated threshold. It then penalizes the corresponding feature projections onto the model space, steering the optimization towards subspaces spanned by those invariant factors. Moreover, we also conduct detailed theoretical analysis to provide a rationality guarantee and a proof framework for OOD structures based on model sparsity. Experiments on various OOD datasets show that SFP can significantly outperform both structure-based and non-structure-based OOD generalization state-of-the-art (SOTA) methods by large margins.

Abstract:
Medical report generation (MRG) has emerged as a pivotal research topic in the medical multi-modal field, given its potential to alleviate the heavy workloads of radiologists. Recently, advancements have been made with MRG systems that leverage large multimodal models (LMMs) to generate high-quality reports. To address the challenge of collecting large amounts of paired medical image-report data for training, this paper proposes a zero-shot report generation model based on in-context learning, which we call MCVGen. Departing from traditional in-context learning approaches that directly feed all demonstrations to a pre-trained large model, this work employs a multi-modal contextual vector (MCV) to represent the contextual information of demonstrations. Initially, we pre-train a medical large multi-modal model (Med-LMM) and obtain the last hidden state of each demonstration through a forward pass in Med-LMM. Benefiting from the auto-regressive mechanism, the last hidden state gathers information critical to the targeted scenarios. Subsequently, we average the multiple MCVs and integrate them with the first hidden state of the new query, thereby shifting the latent states and guiding the model toward acquiring previously unlearned multi-modal contextual information. This approach has the advantage of regulating the number of prompts, thus reducing computational costs. We tested our model on the publicly available IU X-ray and MIMIC datasets, demonstrating its exceptional zero-shot capability on both cross-center and cross-disease evaluations. We hope it can be a viable solution for practical clinical applications.
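
The averaging-and-shifting step described above can be sketched on plain vectors. The sketch assumes that "integrating" the averaged MCV with the query's first hidden state means adding it with a scaling factor; the abstract does not specify the operation, so the additive form, the `alpha` parameter, and the function names are illustrative assumptions rather than the MCVGen implementation.

```python
from typing import List, Sequence


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-demonstration contextual vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shift_hidden_state(query_state: Sequence[float], mcv: Sequence[float],
                       alpha: float = 1.0) -> List[float]:
    """Assumed integration rule: add the scaled averaged MCV to the query state."""
    return [h + alpha * c for h, c in zip(query_state, mcv)]


# Last hidden states of two demonstrations (made-up 2-dimensional values).
demo_states = [[1.0, 2.0], [3.0, 4.0]]
mcv = average_vectors(demo_states)
shifted = shift_hidden_state([0.0, 0.0], mcv, alpha=0.5)
```

Because the demonstrations are condensed into one vector, the query is processed once with a shifted state instead of with every demonstration prepended to the prompt, which is the source of the computational saving mentioned above.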

Abstract:
Dynamic sequential recommendation (DSR) can generate model parameters based on user behavior to improve the personalization of sequential recommendation under various user preferences. However, it faces the challenges of a large parameter search space and sparse, noisy user-item interactions, which reduce the applicability of the generated model parameters. The Semantic Codebook Learning for Dynamic Recommendation Models (SOLID) framework presents a significant advancement in DSR by effectively tackling these challenges. By transforming item sequences into semantic sequences and employing a dual parameter model, SOLID compresses the parameter generation search space and leverages homogeneity within the recommendation system. The introduction of the semantic metacode and semantic codebook, which stores disentangled item representations, ensures robust and accurate parameter generation. Extensive experiments demonstrate that SOLID consistently outperforms existing DSR methods, delivering more accurate, stable, and robust recommendations.

Abstract:
Recommendation algorithms predict user preferences by correlating user and item representations derived from historical interaction patterns. In pursuit of enhanced performance, many methods focus on learning robust and independent representations by disentangling the intricate factors within interaction data across various modalities in an unsupervised manner. However, such an approach obfuscates the discernment of how specific factors (e.g., category or brand) influence the outcomes, making it challenging to regulate their effects. In response to this challenge, we introduce a novel method called Attribute-Driven Disentangled Representation Learning (short for AD-DRL), which explicitly incorporates attributes from different modalities into the disentangled representation learning process. By assigning a specific attribute to each factor in multimodal features, AD-DRL can disentangle the factors at both attribute and attribute-value levels. To obtain robust and independent representations for each factor associated with a specific attribute, we first disentangle the representations of features both within and across different modalities. Moreover, we further enhance the robustness of the representations by fusing the multimodal features of the same factor. Empirical evaluations conducted on three public real-world datasets substantiate the effectiveness of AD-DRL, as well as its interpretability and controllability.

Abstract:
Recent advances in large language models (LLMs) have blurred the boundary of high-quality text generation between humans and machines, which is favorable for generative text steganography. Currently, advanced steganographic mapping is not suitable for LLMs since most users are restricted to accessing only the black-box API or user interface of the LLMs, thereby lacking access to the training vocabulary and its sampling probabilities. In this paper, we explore a black-box generative text steganographic method based on the user interfaces of large language models, which is called LLM-Stega. The main goal of LLM-Stega is to ensure secure covert communication between Alice (sender) and Bob (receiver) by using the user interfaces of LLMs. Specifically, we first construct a keyword set and design a new encrypted steganographic mapping to embed secret messages. Furthermore, an optimization mechanism based on rejection sampling is proposed to guarantee accurate extraction of secret messages and rich semantics of generated stego texts. Comprehensive experiments demonstrate that the proposed LLM-Stega outperforms current state-of-the-art methods.

Abstract:
Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects in various contexts. However, finetuning-based methods often suffer from model overfitting, leading to reduced generative diversity, particularly when the provided subject images are limited. To address this issue, we introduce Pick-and-Draw, a training-free semantic guidance approach that enhances identity consistency and generative diversity. Our method comprises two key components: appearance-picking guidance and layout-drawing guidance. In the appearance-picking phase, we create an appearance palette from visual features of the reference image, selecting local patterns to maintain consistent subject identity. In the layout-drawing phase, we use a generative template from the base diffusion model to sketch the subject shape and scene outline, leveraging its strong image prior to produce diverse contexts based on various text prompts. Pick-and-Draw can be seamlessly integrated with any personalized diffusion model and requires only a single reference image. Both qualitative and quantitative evaluations demonstrate that our approach significantly improves identity consistency and generative diversity, establishing a new Pareto frontier in the balance between subject fidelity and image-text alignment.

Abstract:
Recently, video diffusion models (VDMs) have garnered significant attention due to their notable advancements in generating coherent and realistic video content. However, processing multiple frame features concurrently, coupled with the considerable model size, results in high latency and extensive memory consumption, hindering their broader application. Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency. Unlike in image diffusion, we observe that the temporal features, which are integrated into all frame features, exhibit pronounced skewness. Furthermore, we observe significant inter-channel disparities and asymmetries in the activations of video diffusion models, resulting in low coverage of quantization levels by individual channels and increasing the challenge of quantization. To address these issues, we introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. Specifically, we propose the High Temporal Discriminability Quantization (HTDQ) method, designed for temporal features, which retains the high discriminability of quantized features, providing precise temporal guidance for all video frames. In addition, we present the Scattered Channel Range Integration (SCRI) method, which aims to improve the coverage of quantization levels across individual channels. Experimental validations across various models, datasets, and bit-width settings demonstrate the effectiveness of our QVD in terms of diverse metrics. In particular, we achieve near-lossless performance on W8A8, outperforming the current methods by 205.12 in FVD.
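
The "low coverage of quantization levels by individual channels" mentioned above can be made concrete with a small sketch. It uses plain uniform quantization, not HTDQ or SCRI, and the two channels and the helper `quantize_levels_used` are hypothetical; the point is only that a narrow-range channel occupies very few levels when it shares a quantization range with a wide-range channel.

```python
def quantize_levels_used(values, lo, hi, bits=8):
    """Uniformly quantize `values` over [lo, hi]; return the set of level indices hit."""
    n_levels = 2 ** bits
    scale = (hi - lo) / (n_levels - 1)
    used = set()
    for v in values:
        clipped = min(max(v, lo), hi)
        used.add(round((clipped - lo) / scale))
    return used


# Two hypothetical activation channels with very different ranges.
channel_a = [i / 100 for i in range(-50, 51)]   # values in [-0.5, 0.5]
channel_b = [i / 2 for i in range(-100, 101)]   # values in [-50, 50]

# Shared (per-tensor) range: channel_a is squeezed into a handful of levels.
shared_lo = min(channel_a + channel_b)
shared_hi = max(channel_a + channel_b)
levels_shared = len(quantize_levels_used(channel_a, shared_lo, shared_hi))

# Own (per-channel) range: channel_a spreads over many more levels.
levels_own = len(quantize_levels_used(channel_a, min(channel_a), max(channel_a)))
```

With 8 bits, `levels_shared` is far smaller than `levels_own`, which is the disparity a range-integration method would need to counteract.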

Abstract:
The success of deep face recognition (FR) systems has raised serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Previous studies proposed introducing imperceptible adversarial noises into face images to deceive those face recognition models, thus achieving the goal of enhancing facial privacy protection. Nevertheless, they heavily rely on user-chosen references to guide the generation of adversarial noises, and cannot simultaneously construct natural and highly transferable adversarial face images in black-box scenarios. In light of this, we present a novel face privacy protection scheme with improved transferability while maintaining high visual quality. We propose shaping the entire face space directly instead of exploiting one kind of facial characteristic, such as makeup information, to integrate adversarial noises. To achieve this goal, we first exploit global adversarial latent search to traverse the latent space of the generative model, thereby creating natural adversarial face images with high transferability. We then introduce a key landmark regularization module to preserve the visual identity information. Finally, we investigate the impacts of various kinds of latent spaces and find that the F latent space benefits the trade-off between visual naturalness and adversarial transferability. Extensive experiments over two datasets demonstrate that our approach significantly enhances attack transferability while maintaining high visual quality, outperforming state-of-the-art methods by an average improvement of 25% on deep FR models and 10% on commercial FR APIs.

Abstract:
The GAN-based image editing task aims at manipulating image attributes in the latent space of generative models. Most previous 2D and 3D-aware approaches focus on editing attributes in images with ambiguous semantics or regions from a reference image, and thus fail to achieve photographic semantic attribute transfer, such as transferring the beard from a photo of a man. In this paper, we propose an image-driven Semantic Attribute Transfer method in 3D (SAT3D) that edits semantic attributes from a reference image. In the proposed method, the exploration is conducted in the style space of a pre-trained 3D-aware StyleGAN-based generator by learning the correlations between semantic attributes and style code channels. For guidance, we associate each attribute with a set of phrase-based descriptor groups, and develop a Quantitative Measurement Module (QMM) to quantitatively describe the attribute characteristics in images based on descriptor groups, which leverages the image-text comprehension capability of CLIP. During the training process, the QMM is incorporated into attribute losses to calculate attribute similarity between images, guiding the transfer of target semantics and the preservation of irrelevant semantics. We present our 3D-aware attribute transfer results across multiple domains and also conduct comparisons with classical 2D image editing methods, demonstrating the effectiveness and customizability of our SAT3D.

Abstract:
4D facial expression synthesis is a critical problem in the fields of computer vision and graphics. Current methods lack flexibility and smoothness when simulating the inter-frame motion of expression sequences. In this paper, we propose a frequency-controlled 4D facial expression synthesis method, FC-4DFS. Specifically, we introduce a frequency-controlled LSTM network to generate 4D facial expression sequences frame by frame from a given neutral landmark with a given length. Meanwhile, we propose a temporal coherence loss to enhance the perception of temporal sequence motion and improve the accuracy of relative displacements. Furthermore, we design a Multi-level Identity-Aware Displacement Network based on a cross-attention mechanism to reconstruct the 4D facial expression sequences from landmark sequences. Finally, our FC-4DFS achieves flexible and SOTA generation results of 4D facial expression sequences with different lengths on the CoMA and Florence4D datasets. The code will be available on GitHub.

Abstract:
This paper presents a novel model protection paradigm, Model Locking, that locks the performance of a finetuned model on private data to make it unusable or unextractable without the right key. Specifically, we propose a diffusion-based framework dubbed ModelLock that explores text-guided image editing to transform the private finetuning data into unique styles or blend new objects into the background. A model finetuned on this edited dataset will be locked and can only be unlocked by the key prompt, i.e., the same text prompt used to edit the data. We conduct extensive experiments on both image classification and segmentation tasks and show that 1) ModelLock can effectively lock finetuned models without significantly reducing their unlocked performance, and more importantly, 2) the locked model cannot be easily unlocked without knowing both the key prompt and the diffusion model. Our work opens up a new direction for intellectual property protection of private models.

Abstract:
The SMP Challenge is an annual challenge that seeks top research teams to develop innovative forecasting methods that can enhance social and business applications. We define and introduce the Social Media Popularity Prediction (SMPP) task, which predicts the future popularity of a post made by a specific user at a given time on social media. This task is pivotal in various applications and scenarios, such as online advertising, social recommendations, post ranking, and demand forecasting. To motivate diverse perspectives in social media prediction research, we built a large-scale benchmark, the Social Media Prediction Dataset (SMPD), that includes approximately 500K posts, along with 756 associated tags, visual-language data, and spatial-temporal information, sourced from around 70K users and their profiles.

Abstract:
Recent research has offered insights into the extraordinary capabilities of Large Multimodal Models (LMMs) in various general vision and language tasks. There is growing interest in how LMMs perform in more specialized domains. Social media content, inherently multimodal, blends text, images, videos, and sometimes audio. To effectively understand such content, models need to interpret the intricate interactions between these diverse communication modalities and their impact on the conveyed message. Understanding social multimedia content remains a challenging problem for contemporary machine learning frameworks. In this talk, we evaluate GPT-4V(ision)'s capabilities for social multimedia analysis. We select five representative tasks, including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection, to evaluate GPT-4V. Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content. GPT-4V demonstrates remarkable efficacy in these tasks, showcasing strengths such as joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. In addition to the known hallucination problem, notable challenges remain as GPT-4V struggles with tasks involving multilingual social multimedia comprehension and has difficulties in generalizing to the latest trends in social media. We further present several attempts to improve the performance on some tasks. The insights gleaned from our findings underscore a promising future for LMMs in enhancing our comprehension of social media content and its users through the analysis of multimodal information.

Abstract:
While generative models have proved successful in many domains, they may pose a privacy leakage risk in practical deployment. To address this issue, differentially private generative model learning has emerged as a solution to train private generative models for different downstream tasks. However, existing private generative modeling approaches face significant challenges in generating high-dimensional data due to the inherent complexity involved in modeling such data. In this work, we present a new private generative modeling approach where samples are generated via Hamiltonian dynamics with gradients of the private dataset estimated by a well-trained network. In this approach, we achieve differential privacy by perturbing the projection vectors in the estimation of gradients with sliced score matching. In addition, we enhance the reconstruction ability of the model by incorporating a residual enhancement module during the score matching. For sampling, we perform Hamiltonian dynamics with gradients estimated by the well-trained network, allowing the sampled data to approach the private dataset's manifold step by step. In this way, our model is able to generate data with a resolution of 256×256. Extensive experiments and analysis clearly demonstrate the effectiveness and rationality of the proposed approach.

Abstract:
Existing unpaired image deraining approaches face challenges in accurately capturing the distinguishing characteristics between the rainy and clean domains, resulting in residual degradation and color distortion within the reconstructed images. To this end, we propose an energy-informed diffusion model for unpaired photo-realistic image deraining (UPID-EDM). Initially, we delve into the intricate visual-language priors embedded within the contrastive language-image pre-training model (CLIP), and demonstrate that the CLIP priors aid in the discrimination of rainy and clean images. Furthermore, we introduce a dual-consistent energy function (DEF) that retains the rain-irrelevant characteristics while eliminating the rain-relevant features. This energy function is trained on non-corresponding rainy and clean images. In addition, we employ the rain-relevance discarding energy function (RDEF) and the rain-irrelevance preserving energy function (RPEF) to direct the reverse sampling procedure of a pre-trained diffusion model, effectively removing the rain streaks while preserving the image contents. Extensive experiments demonstrate that our energy-informed model surpasses the existing unpaired learning approaches in terms of both supervised and no-reference metrics.

Abstract:
Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. The essence of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during caption generation, which is neglected by traditional video captioning. Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with the video features to guide the emotional caption generation, which neglects two characteristics of the EVC task. Firstly, these methods neglect the dynamic subtle changes in the intrinsic emotions of the video, which makes it difficult to meet the needs of common scenes with diverse and changeable emotions. Secondly, as these methods incorporate emotional cues into each step, the guidance role of emotion is overemphasized, which causes factual content to be more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions by collaborative learning. The two paths promote each other and significantly improve the generation performance. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects the emotional cues that need to be re-composed at each stage and re-composes them to achieve emotion evolution by dynamically enhancing or suppressing the semantics of subspaces at different granularities. Besides, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion adaptive decoder, which first estimates emotion intensity via the alignment of emotional features and historical caption features at each generation step, and then adaptively incorporates emotional guidance into the caption generation based on the emotional intensity. Thus, our method can generate emotion-related words at the necessary time step, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module.

Abstract:
With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods often struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, Motion-Aware latent Diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames and the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.

Abstract:
Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL due to the confusion of numerous irrelevant tokens during interaction. To address these aforementioned issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA only selects the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as the context prompt to provide the global context in three cascaded stages. As a result, irrelevant tokens can be progressively suppressed. Secondly, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, augmented visual features and class-aware prompts are interacted via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representation in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate significant gains over the state-of-the-art methods, especially for the 1-shot task with 2.28% improvement on average due to semantically enhanced visual representations.
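
The token-selection idea behind KCA can be sketched as a top-K restricted attention. This is a simplified, single-head illustration in plain Python, not the KTPP implementation: it omits learned projections, the mean-token context prompt, and the three cascaded stages, and the function name `knn_attention` is hypothetical.

```python
import math


def knn_attention(queries, keys, values, k):
    """For each query, attend only to the k keys with the highest scaled dot-product score."""
    outputs = []
    for q in queries:
        scale = math.sqrt(len(q))
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key in keys]
        # Indices of the k most relevant tokens for this query.
        top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the selected scores only (shifted by the max for stability).
        m = max(scores[j] for j in top)
        weights = [math.exp(scores[j] - m) for j in top]
        z = sum(weights)
        dim = len(values[0])
        outputs.append(
            [sum(w / z * values[j][d] for w, j in zip(weights, top)) for d in range(dim)]
        )
    return outputs


# Three toy 2-d tokens; the third points in the opposite direction of the first two.
tokens = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
out_k1 = knn_attention(tokens, tokens, tokens, k=1)
out_full = knn_attention(tokens, tokens, tokens, k=len(tokens))
```

With `k=1` each token copies only its single best match, so the dissimilar third token never contaminates the first two; with `k=len(tokens)` the function reduces to ordinary softmax attention.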

Abstract:
Multi-view clustering (MVC) methods based on non-negative matrix factorization (NMF) have gained popularity owing to their ability to provide interpretable clustering results. However, these NMF-based MVC methods generally process each view independently and thus ignore the potential relationship between views. Besides, they are limited in their ability to capture nonlinear data structures. To overcome these weaknesses, and inspired by deep learning, we propose a multi-view clustering method based on deep non-negative tensor factorization (MVC-DNTF). With deep tensor factorization, our method can effectively exploit the spatial structure of the original data and is capable of extracting deeper, nonlinear features embedded in different views. To further extract the complementary information of different views, we adopt the weighted tensor Schatten p-norm regularization term. An optimization algorithm is developed to effectively solve the MVC-DNTF objective. Extensive experiments are performed to demonstrate the effectiveness and superiority of our method.

Abstract:
Multi-view clustering is an important machine learning task for multi-media data, encompassing various domains such as images, videos, and texts. Moreover, with the growing abundance of graph data, the significance of multi-view graph clustering (MVGC) has become evident. Most existing methods focus on graph neural networks (GNNs) to extract information from both graph structure and feature data to learn distinguishable node representations. However, traditional GNNs are designed with the assumption of homophilous graphs, making them unsuitable for widely prevalent heterophilous graphs. Several techniques have been introduced to enhance GNNs for heterophilous graphs. While these methods partially mitigate the heterophilous graph issue, they often neglect the advantages of traditional GNNs, such as their simplicity, interpretability, and efficiency. In this paper, we propose a novel multi-view graph clustering method based on dual-optimized adaptive graph reconstruction, named DOAGC. It mainly aims to reconstruct the graph structure adapted to traditional GNNs to deal with heterophilous graph issues while maintaining the advantages of traditional GNNs. Specifically, we first develop an adaptive graph reconstruction mechanism that accounts for node correlation and original structural information. To further optimize the reconstruction graph, we design a dual optimization strategy and demonstrate the feasibility of our optimization strategy through mutual information theory. Numerous experiments demonstrate that DOAGC effectively mitigates the heterophilous graph problem.

Abstract:
Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the pre-trained teacher network without original training data. Most of the existing DFKD methods rely heavily on additional generation modules to synthesize the substitution data, resulting in high computational costs and ignoring the massive amounts of easily accessible, low-cost, unlabeled open-world data. Meanwhile, existing methods ignore the domain shift between the substitution data and the original data; as a result, knowledge from the teacher is not always trustworthy, and structured knowledge from the data becomes a crucial supplement. To tackle these issues, we propose a novel Open-world Data Sampling Distillation (ODSD) method for the DFKD task without the redundant generation process. First, we sample open-world data close to the original data's distribution with an adaptive sampling module and introduce a low-noise representation to alleviate the domain shift issue. Then, we build structured relationships of multiple data examples to exploit data knowledge through the student model itself and the teacher's structured representation. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance with lower FLOPs and fewer parameters. In particular, we improve accuracy by 1.50%-9.59% on the ImageNet dataset and avoid training a separate generator for each class.

Abstract:
Localizing text in low-light environments is challenging due to visual degradations. Although a straightforward solution involves a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by detection, LLE is primarily designed for human vision rather than machine vision and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing text in the dark that circumvents the need for LLE. We introduce a constrained learning module as an auxiliary mechanism during the training stage of the text detector. This module is designed to guide the text detector in preserving textual spatial features amidst feature map resizing, thus minimizing the loss of spatial information in texts under low-light visual degradations. Specifically, we incorporate spatial reconstruction and spatial semantic constraints within this module to ensure the text detector acquires essential positional and contextual range knowledge. Our approach enhances the original text detector's ability to identify text's local topological features using a dynamic snake feature pyramid network and adopts a bottom-up contour shaping strategy with a novel rectangular accumulation technique for accurate delineation of streamlined text features. In addition, we present a comprehensive low-light dataset for arbitrary-shaped text, encompassing diverse scenes and languages. Notably, our method achieves state-of-the-art results on this low-light dataset and exhibits comparable performance on standard normal light datasets. The code and dataset will be released.

Abstract:
The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment. Utilizing information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view using a pre-trained 3D scene reconstruction model. Additionally, we introduce a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. Moreover, the efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, with 1.04× to 2.90× in HOTA, showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.

Abstract:
The limited data availability and the low signal-to-noise ratio of fMRI signals lead to the challenging task of fMRI-to-image retrieval. State-of-the-art MindEye remarkably improves fMRI-to-image retrieval performance by leveraging a large model, i.e., a 996M MLP Backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's Vision Transformer (ViT). However, significant individual variations exist among subjects, even under identical experimental setups, mandating the training of large subject-specific models. The substantial parameters pose significant challenges in deploying fMRI decoding on practical devices. To this end, we propose Lite-Mind, a lightweight, efficient, and robust brain representation learning paradigm based on Discrete Fourier Transform (DFT), which efficiently aligns fMRI voxels to fine-grained information of CLIP. We elaborately design a DFT backbone with Spectrum Compression and Frequency Projector modules to learn informative and robust voxel embeddings. Our experiments demonstrate that Lite-Mind achieves an impressive 94.6% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind is also proven to be able to be migrated to smaller fMRI datasets and establishes a new state-of-the-art for zero-shot classification on the GOD dataset.

Abstract:
Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.

Abstract:
Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, for example, audio-language retrieval, audio captioning, and zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.

Abstract:
In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams, such as images, video, and sensor data, is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy across multiple tasks. To tackle these challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components' sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task's robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.
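
As a rough illustration of per-component soft thresholding, the sketch below applies the standard soft-threshold operator, sign(w) * max(|w| - t, 0), with a separate threshold for a shared backbone and two task heads. The component names, weights, and threshold values are hypothetical, and the thresholds are fixed here rather than learned jointly with the weights as the abstract describes; this is not the AdapMTL implementation.

```python
def soft_threshold(weights, threshold):
    """Shrink each weight toward zero by `threshold`; weights with |w| <= threshold become 0."""
    pruned = []
    for w in weights:
        magnitude = abs(w) - threshold
        if magnitude <= 0:
            pruned.append(0.0)
        else:
            pruned.append(magnitude if w > 0 else -magnitude)
    return pruned


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical components: (weights, threshold). Values are illustrative only.
components = {
    "backbone": ([0.9, -0.05, 0.3, 0.02, -0.6, 0.01], 0.1),
    "head_seg": ([0.4, -0.2, 0.03, 0.7], 0.05),
    "head_depth": ([0.08, -0.5, 0.12, -0.02], 0.15),
}

pruned_components = {
    name: soft_threshold(weights, t) for name, (weights, t) in components.items()
}
sparsity_report = {name: sparsity(w) for name, w in pruned_components.items()}
```

With one threshold per component, each part of the model can end up at a different sparsity level, which is the property the abstract attributes to independently assigned thresholds.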

Abstract:
The rise of social media and the exponential growth of multimodal communication necessitate advanced techniques for Multimodal Information Extraction (MIE). However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. In line with this paradigm, we propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs. Shap-CA first applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts, and images towards the total semantic and modality overlaps. Following this quantitative evaluation, a contrastive learning strategy is employed to enhance the interactive contribution within context-text/image pairs, while minimizing the influence across these pairs. Furthermore, we design an adaptive fusion module for selective cross-modal fusion. Extensive experiments across four MIE datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.
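
For readers unfamiliar with the Shapley value, the following minimal sketch computes it exactly for a three-player game by averaging each player's marginal contribution over all orderings. The players ("context", "text", "image") and the overlap scores in `OVERLAP` are hypothetical values chosen for illustration; how Shap-CA defines its value function, and whether it uses an exact or approximate computation, is not stated in the abstract.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical overlap score for every coalition (illustrative numbers only).
OVERLAP = {
    frozenset(): 0.0,
    frozenset({"context"}): 0.2,
    frozenset({"text"}): 0.3,
    frozenset({"image"}): 0.1,
    frozenset({"context", "text"}): 0.7,
    frozenset({"context", "image"}): 0.5,
    frozenset({"text", "image"}): 0.4,
    frozenset({"context", "text", "image"}): 1.0,
}

phi = shapley_values(["context", "text", "image"], lambda s: OVERLAP[frozenset(s)])
```

The values in `phi` sum to the score of the full coalition, so each element receives a share of the total overlap. Exact enumeration grows factorially with the number of players, so it is only practical for small sets.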

Abstract:
Blind image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images with unknown degradation modes. Most existing methods model the image degradation process using blur kernels. However, this explicit modeling approach struggles to cover the complex and varied degradation processes encountered in the real world, such as high-order combinations of JPEG compression, blur, and noise. Implicit modeling of the degradation process can effectively overcome this issue, but a key challenge of implicit modeling is the lack of accurate ground truth labels for the degradation process to conduct supervised training. To overcome this limitation inherent in implicit modeling, we propose an Uncertainty-based degradation representation for blind Super-Resolution framework (USR). By suppressing the uncertainties of local degradation representations in images, USR facilitates self-supervised learning of degradation representations. USR consists of two components: Adaptive Uncertainty-Aware Degradation Extraction (AUDE) and a feature extraction network composed of Variable Depth Dynamic Convolution (VDDC) blocks. To extract the Uncertainty-based Degradation Representation from LR images, AUDE utilizes the Self-supervised Uncertainty Contrast module with an Uncertainty Suppression Loss to suppress the inherent model uncertainty of the Degradation Extractor. Furthermore, the VDDC block integrates degradation information through dynamic convolution. The VDDC block also employs an Adaptive Intensity Scaling operation that adaptively adjusts the degradation representation according to the network hierarchy, thereby facilitating the effective integration of degradation information. Quantitative and qualitative experiments affirm the superiority of our approach.

Abstract:
In recent years, the attention towards One-Shot Federated Learning (OSFL) has been driven by its capacity to minimize communication. With the development of the diffusion model (DM), several methods employ the DM for OSFL, utilizing model parameters, image features, or textual prompts as mediums to transfer the local client knowledge to the server. However, these mediums often require public datasets or a uniform feature extractor, significantly limiting their practicality. In this paper, we propose FedDEO, a Description-Enhanced One-Shot Federated Learning Method with DMs, offering a novel exploration of utilizing the DM in OSFL. The core idea of our method involves training local descriptions on the clients, serving as the medium to transfer the knowledge of the distributed clients to the server. First, we train local descriptions on the client data to capture the characteristics of client distributions, which are then uploaded to the server. On the server, the descriptions are used as conditions to guide the DM in generating synthetic datasets that comply with the distributions of various clients, enabling the training of the aggregated model. Theoretical analyses and extensive quantitative and visualization experiments on three large-scale real-world datasets demonstrate that through the training of local descriptions, the server is capable of generating synthetic datasets with high quality and diversity. Consequently, with advantages in communication and privacy protection, the aggregated model outperforms the compared FL and diffusion-based OSFL methods and, on some clients, exceeds the performance ceiling of centralized training.

Abstract:
Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions serving as a simpler learning signal than complex audio-visual grounding, which leads the model to disregard important modality information. Such anomalous phenomena are often complex and cannot be directly observed in a systematic way. In this study, we make a pioneering effort, using suitable synthetic data, to categorize and analyze these phenomena as two types, "audio priming bias" and "visual prior", according to the source of the anomalies. For audio priming bias, to enhance audio sensitivity to different intensities and semantics, a perception module dedicated to audio perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism for these active queries in the transformer decoder is customized to regulate interaction among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without changing the structure of the model. Our experimental observations demonstrate the presence and impact of these biases in existing models. Finally, through experimental evaluation on AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of biases, achieving competitive performance across all three subsets.

Abstract:
This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require either training auxiliary networks or fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of latent features to enforce the consistency of the content and stability of the foreground objects in the latent space, while aligning both foreground and background with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on the COCO and LAION-5B datasets demonstrate that our method can surpass representative baselines by large margins.

Abstract:
Visual programming, a modular paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it performs visual processing and inference in an unsupervised manner. Current visual programming methods generate programs in a single pass and lack the ability to evaluate and optimize based on feedback, which consequently limits their effectiveness for complex, multi-step problems. Inspired by Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent offers fresh prospects for the field of agent research.

Abstract:
Volumetric video based on Neural Radiance Field (NeRF) holds vast potential for various 3D applications, but its substantial data volume poses significant challenges for compression and transmission. Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hierarchical progressive volumetric video coding framework achieving variable bitrate using a single model. Specifically, HPC introduces a hierarchical representation with a multi-resolution residual radiance field to reduce temporal redundancy in long-duration sequences while simultaneously generating various levels of detail. Then, we propose an end-to-end progressive learning approach with a multi-rate-distortion loss function to jointly optimize both hierarchical representation and compression. Our HPC trained only once can realize multiple compression levels, while the current methods need to train multiple fixed-bitrate models for different rate-distortion (RD) tradeoffs. Extensive experiments demonstrate that HPC achieves flexible quality levels with variable bitrate by a single model and exhibits competitive RD performance, even outperforming fixed-bitrate models across various datasets.

Abstract:
The rapid development of Vision Foundation Models (VFMs) brings inherent out-of-domain generalization to a variety of downstream tasks. Among them, domain generalized semantic segmentation (DGSS) poses unique challenges, as cross-domain images share common pixel-wise content information but vary greatly in style. In this paper, we present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier. Going beyond the existing paradigm of fine-tuning tokens on a frozen backbone, the proposed SET focuses on how to learn style-invariant features from these learnable tokens. In particular, the frozen VFM features are first decomposed into phase and amplitude components in the frequency space, which mainly contain content and style information, respectively, and are then separately processed by learnable tokens for task-specific information extraction. After the decomposition, style variation primarily impacts the token-based feature enhancement within the amplitude branch. To address this issue, we further develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference. Extensive cross-domain experiments show its state-of-the-art performance.
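
To make the phase/amplitude split concrete, here is a minimal sketch using a naive 1D discrete Fourier transform from the Python standard library. The `feature` vector is a hypothetical stand-in for a feature map; the abstract does not specify the transform dimensionality or how the two branches are processed, so this only illustrates that amplitude and phase together fully determine the signal and can be manipulated separately.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def decompose(signal):
    """Split a signal's spectrum into amplitude and phase components."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def recompose(amplitude, phase):
    """Rebuild the (real) signal from amplitude and phase."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [c.real for c in idft(spectrum)]


# Hypothetical 1D feature vector (illustrative values only).
feature = [0.5, 1.0, -0.25, 2.0, 0.0, -1.5, 0.75, 0.3]
amplitude, phase = decompose(feature)
restored = recompose(amplitude, phase)

# Doubling every amplitude while keeping the phase doubles the signal's scale.
rescaled = recompose([2.0 * a for a in amplitude], phase)
```

Here `restored` matches `feature` up to floating-point error, and `rescaled` is `feature` scaled by two, showing that an amplitude-only change leaves the phase structure intact.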

Abstract:
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision. The task remains difficult because of complex video scenes involving, for example, video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone generation. Conversely, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, the performance of existing methods for pose estimation falls short due to their inability to leverage both local joint (heatmap) information and global motion (feature) dynamics.

Abstract:
Point cloud analysis is challenging due to its unique characteristics of unorderedness, sparsity, and irregularity. Prior works attempt to capture local relationships by convolution operations or attention mechanisms, exploiting geometric information from coordinates implicitly. These methods, however, are insufficient to describe the explicit local geometry, e.g., curvature and orientation. In this paper, we propose On-the-fly Point Feature Representation (OPFR), which captures abundant geometric information explicitly through a Curve Feature Generator module. This is inspired by the Point Feature Histogram (PFH) from the computer vision community. However, vanilla PFH encounters great difficulties when applied to large datasets and dense point clouds, as it demands considerable time for feature generation. In contrast, we introduce the Local Reference Constructor module, which approximates the local coordinate systems based on triangle sets. Owing to this, our OPFR requires only an extra 1.56 ms for inference (65× faster than vanilla PFH) and 0.012M more parameters, and it can serve as a versatile plug-and-play module for various backbones, particularly the MLP-based and Transformer-based backbones examined in this study. Additionally, we introduce the novel Hierarchical Sampling module aimed at enhancing the quality of triangle sets, thereby ensuring the robustness of the obtained geometric features. Building upon the PointNet++ backbone, our proposed method improves overall accuracy (OA) on ModelNet40 from 90.7% to 94.5% (+3.8%) for classification, and OA on S3DIS Area-5 from 86.4% to 90.0% (+3.6%) for semantic segmentation. When integrated with the Point Transformer backbone, we achieve state-of-the-art results on both tasks: 94.8% OA on ModelNet40 and 91.7% OA on S3DIS Area-5.

Abstract:
Neural Radiance Fields (NeRF) achieves impressive 3D representation learning and novel view synthesis results with high-quality multi-view images as input. However, motion blur in images often occurs in low-light and high-speed motion scenes, which significantly degrades the reconstruction quality of NeRF. Previous deblurring NeRF methods struggle to estimate pose and lighting changes during the exposure time, making them unable to accurately model the motion blur. The bio-inspired event camera, which measures intensity changes with high temporal resolution, makes up for this information deficiency. In this paper, we propose Event-driven Bundle Adjustment for Deblurring Neural Radiance Fields (EBAD-NeRF) to jointly optimize the learnable poses and NeRF parameters by leveraging the hybrid event-RGB data. An intensity-change-metric event loss and a photo-metric blur loss are introduced to strengthen the explicit modeling of camera motion blur. Experiments on both synthetic and real-captured data demonstrate that EBAD-NeRF can obtain an accurate camera trajectory during the exposure time and learn sharper 3D representations compared to prior works.

Abstract:
Product bundling has been a prevailing marketing strategy that is beneficial in the online shopping scenario. Effective product bundling methods depend on high-quality item representations capturing both the individual items' semantics and cross-item relations. However, previous item representation learning methods, whether based on feature fusion or graph learning, suffer from inadequate cross-modal alignment and struggle to capture the cross-item relations for cold-start items. Multimodal pre-trained models could be a potential solution given their promising performance on various multimodal downstream tasks. However, cross-item relations have been under-explored in current multimodal pre-trained models.

Abstract:
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, because isolated frames provide no dynamic information, performance is hampered when the person is occluded or when variable motion details are missed. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, termed the Text-to-Video Person Reidentification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages the cross-modal text-video representations to provide strong text-visual and text-motion matching information to tackle uncertain occlusion conflicts and variable motion details. Specifically, we establish two potential cross-modal spaces for text and video feature collaborative learning to progressively reduce the semantic difference between text and video. To evaluate the effectiveness of the proposed MFGF, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, MFGF is the first successful attempt to use video for the text-based person retrieval task and has achieved state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.

Abstract:
Event extraction (EE) is a critical direction in the field of information extraction, laying an important foundation for the construction of structured knowledge bases. EE from text has received ample research attention for years, yet numerous real-world applications require direct information acquisition from speech signals, such as online meeting minutes, interview summaries, and press releases. While EE from speech has remained under-explored, this paper fills the gap by pioneering SpeechEE, defined as detecting the event predicates and arguments from a given audio speech. To benchmark the SpeechEE task, we first construct a large-scale high-quality dataset. Based on textual EE datasets under the sentence, document, and dialogue scenarios, we convert texts into speech through both manual real-person narration and automatic synthesis, empowering the data with diverse scenarios, languages, domains, ambiences, and speaker styles. Further, to effectively address the key challenges in the task, we tailor an E2E SpeechEE system based on the encoder-decoder architecture, where a novel Shrinking Unit module and a retrieval-aided decoding mechanism are devised. Extensive experimental results on all SpeechEE subsets demonstrate the efficacy of the proposed model, offering a strong baseline for the task. Finally, as the first work on this topic, we shed light on key directions for future research. Our code and the benchmark datasets are open at https://SpeechEE.github.io/

Abstract:
Recently, a vast number of image generation models have been proposed, which raises concerns regarding the misuse of these artificial intelligence (AI) techniques for generating fake images. To attribute AI-generated images, existing schemes usually design and train deep neural networks (DNNs) to learn the model fingerprints, which usually requires a large amount of data for effective learning. In this paper, we aim to answer the following two questions for AI-generated image attribution: 1) is it possible to design useful handcrafted filters to facilitate the fingerprint learning? and 2) how can we reduce the amount of training data after we incorporate the handcrafted filters? We first propose a set of Multi-Directional High-Pass Filters (MHFs) which are capable of extracting the subtle fingerprints from various directions. Then, we propose a Directional Enhanced Feature Learning network (DEFL) to take both the MHFs and randomly-initialized filters into consideration. The output of the DEFL is fused with the semantic features to produce a compact fingerprint. To make the compact fingerprint discriminative among different models, we propose a Dual-Margin Contrastive (DMC) loss to tune our DEFL. Finally, we propose a reference-based fingerprint classification scheme for image attribution. Experimental results demonstrate that it is indeed helpful to use our MHFs for attributing AI-generated images. The performance of our proposed method is significantly better than the state-of-the-art for both closed-set and open-set image attribution, where only a small number of images are required for training.
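
The idea of high-pass filtering along several directions can be sketched with a few 3x3 kernels. The kernels in `KERNELS` below are hypothetical second-difference filters chosen for illustration; the abstract does not give the actual MHF coefficients, sizes, or number of directions. Each kernel sums to zero, so flat image regions produce a zero response and only directional variation survives.

```python
# Hypothetical 3x3 directional high-pass kernels; each sums to zero.
KERNELS = {
    "horizontal": [[0, 0, 0], [-1, 2, -1], [0, 0, 0]],
    "vertical": [[0, -1, 0], [0, 2, 0], [0, -1, 0]],
    "diagonal": [[-1, 0, 0], [0, 2, 0], [0, 0, -1]],
    "anti_diagonal": [[0, 0, -1], [0, 2, 0], [-1, 0, 0]],
}


def filter_valid(image, kernel):
    """Cross-correlate `image` with a 3x3 `kernel`, keeping only fully covered pixels."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out_row.append(acc)
        out.append(out_row)
    return out


def directional_residuals(image):
    """Apply every directional kernel and return one residual map per direction."""
    return {name: filter_valid(image, k) for name, k in KERNELS.items()}


# A constant image, and an image whose columns alternate 0, 1, 0, 1, 0
# (variation along the horizontal axis only).
flat = [[7] * 5 for _ in range(5)]
stripes = [[c % 2 for c in range(5)] for _ in range(5)]

flat_residuals = directional_residuals(flat)
stripe_residuals = directional_residuals(stripes)
```

On the striped image the vertical kernel responds with zeros while the horizontal kernel does not, which is the kind of direction-specific residual a downstream network could consume alongside learned filters.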

Abstract:
Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining desired generation outcomes often necessitates repetitive trials of manipulating text prompts, just like casting spells on a magic mirror, and the reason behind that is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the input text prompt with a pre-trained encoder structure, which is usually trained on a limited amount of image-caption pairs. State-of-the-art large language models (LLMs) based on the decoder-only structure have shown very powerful semantic understanding capability, as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from LLMs, resulting in a simple yet effective adapter that allows diffusion models to be compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.

Abstract:
Content-adaptive compression is crucial for enhancing the adaptability of a pre-trained neural codec to various contents. However, its application in neural video compression (NVC) is still limited for two main reasons: 1) video compression relies heavily on temporal redundancy, so updating just one or a few frames can lead to significant errors accumulating over time; 2) NVC frameworks are generally complex, with many components that are not trivial to update quickly during the encoding procedure. To address these challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-efficient Updating (GPU). First, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves a patch-based Group of Pictures (GoP) updating strategy that segments a video into patch-based GoPs, which are updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several light-weight adapters into each encoding component using both serial and parallel configurations. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing the updating cost during the encoding stage. We incorporate our GPU into the latest NVC framework and conduct extensive experiments, whose results showcase outstanding video compression efficiency across six video compression benchmarks and adaptability on one medical volumetric image compression benchmark.

Abstract:
In numerous medical scenarios, segmenting clinical targets is highly subjective, influenced by the doctors' expertise and preferences, which results in significant multi-rater variability. This inherent annotation ambiguity poses a challenge for the practical deployment of data-driven techniques and raises concerns about the reliability of automatic predictions by medical artificial intelligence (AI) systems. To address this issue, we host a grand challenge (MMIS-2024) at ACM MM '24 to explore the problem of multi-rater medical image segmentation. First, we have released two datasets publicly, one on nasopharyngeal carcinoma (NPC) and the other on glioblastoma (GBM). For NPC, one challenge track encourages participants to develop models that utilize the four expert-provided labels per sample. The second GBM track explores the one-sample-one-label setting in the context of multi-rater segmentation. Here, different experts annotated different GBM samples for training. Finally, to assess the submissions, we employ two distinct sets of metrics, designed to evaluate prediction diversity and personalization, respectively. By exploring the two tasks with different metrics, the MMIS-2024 challenge aims to establish a global benchmark for multi-rater medical image segmentation, facilitating clinical AI deployments.

Abstract:
Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a Large Language-and-Vision Assistant for Visual Spatial Description, named LLaVA-VSD, which is designed for the classification, description, and open-ended description of visual spatial relationships. Specifically, the model first constructs a visual spatial instruction-following dataset using given figure-caption pairs for the three tasks. It then employs LoRA to fine-tune a Large Language and Vision Assistant for VSD, which has 13 billion parameters and supports high-resolution images. Finally, a large language model is used to refine the generated sentences, enhancing their diversity and accuracy. LLaVA-VSD demonstrates excellent multimodal conversational capabilities and can follow open-ended instructions to assist with inquiries about object relationships in images.

Abstract:
In recent years, diffusion models have achieved tremendous success in the field of video generation, with controllable video generation receiving significant attention. However, existing control methods still face two limitations. First, control conditions (such as depth maps and 3D meshes) are difficult for ordinary users to obtain directly. Second, it is challenging to drive multiple objects through complex motions with multiple trajectories simultaneously. In this paper, we introduce DragEntity, a video generation model that utilizes entity representation for controlling the motion of multiple objects. Compared to previous methods, DragEntity offers two main advantages: 1) Our method is more user-friendly for interaction because it allows users to drag entities within the image rather than individual pixels. 2) We use entity representation to represent any object in the image, and multiple objects can maintain relative spatial relationships. Therefore, we allow multiple trajectories to control multiple objects in the image with different levels of complexity simultaneously. Our experiments validate the effectiveness of DragEntity, demonstrating its excellent performance in fine-grained control in video generation.

Abstract:
In the medical field, managing massive high-dimensional medical imaging data and performing reliable medical analysis on it is a critical challenge, especially in resource-limited environments such as remote medical facilities and mobile devices. This necessitates effective dataset compression techniques to reduce storage, transmission, and computational cost. However, existing coreset selection methods are primarily designed for natural image datasets, and exhibit doubtful effectiveness when applied to medical image datasets due to challenges such as intra-class variation and inter-class similarity. In this paper, we propose a novel coreset selection strategy termed Evolution-aware VAriance (EVA), which captures the evolutionary process of model training through a dual-window approach and reflects the fluctuation of sample importance more precisely through variance measurement. Extensive experiments on medical image datasets demonstrate the effectiveness of our strategy over previous SOTA methods, especially at high compression rates. EVA achieves 98.27% accuracy with only 10% of the training data, compared to 97.20% for the full training set. None of the compared baseline methods can exceed Random at a 5% selection rate, while EVA outperforms Random by 5.61%, showcasing its potential for efficient medical image analysis.

Abstract:
We introduce a new task, novel view synthesis for LiDAR sensors. While traditional model-based LiDAR simulators with style-transfer neural networks can be applied to render novel views, they fall short of producing accurate and realistic LiDAR patterns because the renderers rely on explicit 3D reconstruction and exploit game engines that ignore important attributes of LiDAR points. We address this challenge by formulating, to the best of our knowledge, the first differentiable end-to-end LiDAR rendering framework, LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint learning of geometry and the attributes of 3D points. However, simply employing NeRF cannot achieve satisfactory results, as it only focuses on learning individual pixels while ignoring local information, especially in low-texture areas, resulting in poor geometry. To address this issue, we introduce a structural regularization method to preserve local structural details. To evaluate the effectiveness of our approach, we establish an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains observations of objects from 9 categories seen from 360-degree viewpoints captured with multiple LiDAR sensors. Our extensive experiments on the scene-level KITTI-360 dataset and on our object-level NeRF-MVL show that our LiDAR-NeRF surpasses the model-based algorithms significantly.

Abstract:
Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) heavily rely on manually annotated detection boxes in training and inference, hindering further practical deployment; or 2) directly employ normal detectors to detect multiple persons with varying sizes and spatial occlusion in panoramic scenes, blocking the performance gain of PAR. To this end, we consider learning a detector that adapts to occluded persons of varying size and is optimized along with the recognition module in an all-in-one framework. We therefore propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework to jointly recognize individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as the pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, achieving the size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through original detections. In addition, to mitigate information loss due to inaccurate individual localizations, we introduce a bi-propagation prototyper that promotes closed-loop interaction and informative consistency across different granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the significant performance of AdaFPP and emphasize its powerful applicability for PAR.

Abstract:
The remarkable performance of Multimodal Large Language Models (MLLMs) has demonstrated their proficient understanding capabilities in handling various visual tasks. Nevertheless, the opaque nature of black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate reasoning tasks is also constrained, culminating in stagnation of progression. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code guaranteeing faithfulness. Through a series of operations including pruning, merging, and bridging, the rationale enhances its conciseness. Furthermore, we filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Empirical evidence from experiments demonstrates the superiority of Fact across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability and reducing hallucinations owing to its high correlation between images and text.

Abstract:
The existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves navigation performance significantly compared to direct training using low-level actions. However, the VLN-CE agents are still far from the real robots since there are gaps between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within the low-level action movements. Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passibility, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. Moreover, we enhance the current waypoint predictor by utilizing visual representations containing rich semantic information and explicitly masking obstacles based on humans' prior knowledge about the feasibility of actions. Empirically, our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.

Abstract:
Existing plant disease classification models have achieved remarkable performance in recognizing in-laboratory diseased images. However, their performance often significantly degrades in classifying in-the-wild images. Furthermore, we observed that in-the-wild plant images may exhibit similar appearances across various diseases (i.e., small inter-class discrepancy) while the same diseases may look quite different (i.e., large intra-class variance). Motivated by this observation, we propose an in-the-wild multimodal plant disease recognition dataset that contains the largest number of disease classes but also text-based descriptions for each disease. Particularly, the newly provided text descriptions are introduced to provide rich information in textual modality and facilitate in-the-wild disease classification with small inter-class discrepancy and large intra-class variance issues. Therefore, our proposed dataset can be regarded as an ideal testbed for evaluating disease recognition methods in the real world. In addition, we further present a strong yet versatile baseline that models text descriptions and visual data through multiple prototypes for a given class. By fusing the contributions of multimodal prototypes in classification, our baseline can effectively address the small inter-class discrepancy and large intra-class variance issues. Remarkably, our baseline model can not only classify diseases but also recognize diseases in few-shot or training-free scenarios. Extensive benchmarking results demonstrate that our proposed in-the-wild multimodal dataset sets many new challenges to the plant disease recognition task and there is a large space to improve for future works.

Abstract:
Graph Neural Networks have demonstrated great success in various fields of multimedia. However, the distribution shift between the training and test data challenges the effectiveness of GNNs. To mitigate this challenge, Test-Time Training (TTT) has been proposed as a promising approach. Traditional TTT methods require a demanding unsupervised training strategy to capture the information from test to benefit the main task. Inspired by the great annotation ability of Large Language Models (LLMs) on Text-Attributed Graphs (TAGs), we propose to enhance the test-time training on graphs with LLMs as annotators. In this paper, we design a novel Test-Time Training pipeline, LLMTTT, which conducts the test-time adaptation under the annotations by LLMs on a carefully-selected node set. Specifically, LLMTTT introduces a hybrid active node selection strategy that considers not only node diversity and representativeness, but also prediction signals from the pre-trained model. Given annotations from LLMs, a two-stage training strategy is designed to tailor the test-time model with the limited and noisy labels. A theoretical analysis ensures the validity of our method and extensive experiments demonstrate that the proposed LLMTTT can achieve a significant performance improvement compared to existing Out-of-Distribution (OOD) generalization methods.

Abstract:
Face Recognition (FR) systems can be easily deceived by adversarial examples that manipulate benign face images through imperceptible perturbations. Adversarial attacks on FR encompass two types: impersonation (targeted) attacks and dodging (untargeted) attacks. Previous methods often achieve a successful impersonation attack on FR; however, this does not necessarily guarantee a successful dodging attack on FR in the black-box setting. In this paper, our key insight is that the generation of adversarial examples should perform both impersonation and dodging attacks simultaneously. To this end, we propose a novel attack method termed Adversarial Pruning (Adv-Pruning), which fine-tunes existing adversarial examples to enhance their dodging capabilities while preserving their impersonation capabilities. Adv-Pruning consists of Priming, Pruning, and Restoration stages. Concretely, we propose Adversarial Priority Quantification to measure the region-wise priority of original adversarial perturbations, identifying and releasing those with minimal impact on absolute model output variances. Then, Biased Gradient Adaptation is presented to adapt the adversarial examples to traverse the decision boundaries of both the attacker and victim by adding perturbations favoring dodging attacks on the vacated regions, preserving the prioritized features of the original perturbations while boosting dodging performance. As a result, we can maintain the impersonation capabilities of original adversarial examples while effectively enhancing dodging capabilities. Comprehensive experiments demonstrate the superiority of our method compared with state-of-the-art adversarial attack methods.

Abstract:
Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures predictive relationships between each other. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and representation efficacy. Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality. Our code will be released on GitHub for further research.

Abstract:
Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate talking segmentation. Then we disentangle semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize the video frame. In this way, most textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures (e.g., hair, lip, eyebrows) from a given reference image, our approach enables seamless facial editing when generating talking face video. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.

Abstract:
With the development of diffusion-based customization methods like DreamBooth, individuals now have access to train the models that can generate their personalized images. Despite the convenience, malicious users have misused these techniques to create fake images, thereby triggering a privacy security crisis. In light of this, proactive adversarial attacks are proposed to protect users against customization. The adversarial examples are trained to distort the customization model's outputs and thus block the misuse. In this paper, we propose DisDiff (Disrupting Diffusion), a novel adversarial attack method to disrupt the diffusion model outputs. We first delve into the intrinsic image-text relationships, well-known as cross-attention, and empirically find that the subject-identifier token plays an important role in guiding image generation. Thus, we propose the Cross-Attention Erasure module to explicitly "erase" the indicated attention maps and disrupt the text guidance. Besides, we analyze the influence of the sampling process of the diffusion model on Projected Gradient Descent (PGD) attack and introduce a novel Merit Sampling Scheduler to adaptively modulate the perturbation updating amplitude in a step-aware manner. Our DisDiff outperforms the state-of-the-art methods by 12.75% of FDFR scores and 7.25% of ISM scores across two facial benchmarks and two commonly used prompts on average.
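For readers unfamiliar with Projected Gradient Descent, the following is a minimal sketch of a generic L-infinity PGD loop in which the step size may differ at every iteration. It is not the Merit Sampling Scheduler, whose rule is not given in the abstract; the `step_sizes` list, the `grad_fn` callback, and the `[lo, hi]` pixel range are assumptions made for illustration.

```python
def pgd_linf(x0, grad_fn, eps, step_sizes, lo=0.0, hi=1.0):
    """Generic L-infinity PGD with a per-step step size.

    x0         : list of floats, the clean input.
    grad_fn    : callable returning the loss gradient w.r.t. the input
                 (hypothetical; any differentiable objective would do).
    eps        : perturbation budget around x0.
    step_sizes : one step size per iteration; a step-aware scheduler
                 would produce this list (assumed here, not derived).
    """
    x = list(x0)
    for alpha in step_sizes:
        grad = grad_fn(x)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            # Ascent step in the direction of the gradient sign.
            x[i] += alpha * sign
            # Project back into the eps-ball around the clean input.
            x[i] = min(max(x[i], x0[i] - eps), x0[i] + eps)
            # Keep the value inside the valid input range.
            x[i] = min(max(x[i], lo), hi)
    return x
```

Passing a non-constant `step_sizes` list is the only hook this sketch offers for step-aware modulation; how those values should be chosen is left open.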

Abstract:
Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.

Abstract:
The rapid progress in generative models has given rise to the critical task of AI-Generated Content Stealth (AIGC-S), which aims to create AI-generated images that can evade both forensic detectors and human inspection. This task is crucial for understanding the vulnerabilities of existing detection methods and developing more robust techniques. However, current adversarial attacks often introduce visible noise, have poor transferability, and fail to address spectral differences between AI-generated and genuine images. To address this, we propose StealthDiffusion, a framework based on stable diffusion that modifies AI-generated images into high-quality, imperceptible adversarial examples capable of evading state-of-the-art forensic detectors. StealthDiffusion comprises two main components: Latent Adversarial Optimization, which generates adversarial perturbations in the latent space of stable diffusion, and Control-VAE, a module that reduces spectral differences between the generated adversarial images and genuine images without affecting the original diffusion model's generation process. Extensive experiments show that StealthDiffusion is effective in both white-box and black-box settings, transforming AI-generated images into high-quality adversarial forgeries with frequency spectra similar to genuine images. These forgeries are classified as genuine by advanced forensic classifiers and are difficult for humans to distinguish.

Abstract:
The scarcity of parallel data is the key challenge in the accent conversion (AC) problem, in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which the conversion operates only at the semantic token level and the speech is synthesized, conditioned on the converted semantic tokens, by a speech generative model in the target accent domain. The decoupled design enables the "speaking" module to use massive amounts of target-accent speech and reduces the parallel data required for the "conversion" module. Conversion through the bridge of semantic tokens also removes the requirement for data with text transcriptions and unlocks the use of language pre-training technology to further reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a single-stage AR generative model is designed to achieve good quality at lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data, which is not constrained to the same speaker. Extensive experimentation with diverse accent types suggests that this framework possesses a high degree of adaptability, making it readily scalable to accommodate other accents with low-resource data. Audio samples are available at https://www.microsoft.com/en-us/research/project/convert-and-speak-zero-shot-accent-conversion-with-minimumsupervision/.

Abstract:
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between multi-modal knowledge graphs (MMKGs), where entities can be associated with related images. Most existing studies rely heavily on the automatically learned multi-modal fusion modules, which may allow redundant information such as misleading clues in the generated entity representations, impeding the feature consistency of equivalent entities. To this end, we propose a variational framework for MMEA via information bottleneck, termed as IBMEA, by emphasizing alignment-relevant information while suppressing alignment-irrelevant information in entity representations. Specifically, we first develop multi-modal variational encoders that represent modal-specific features as probability distributions. Then, we propose four modal-specific information bottleneck regularizers to limit the misleading clues in the modal-specific entity representations. Finally, we propose a modal-hybrid information contrastive regularizer to integrate modal-specific representations and ensure the similarity of equivalent entities between MMKGs to achieve MMEA. We conduct extensive experiments on 2 cross-KG and 3 bilingual MMEA datasets. Experimental results demonstrate that our model consistently outperforms previous state-of-the-art methods, and also shows promising and robust performance especially in the low-resource and high-noise data scenarios.

Abstract:
Graph contrastive learning has achieved great success in pre-training graph neural networks without ground-truth labels. Leading graph contrastive learning follows the classical scheme of contrastive learning, forcing model to identify the essential information from augmented views. However, general augmented views are produced via random corruption or learning, which inevitably leads to semantics alteration. Although domain knowledge guided augmentations alleviate this issue, the generated views are domain specific and undermine the generalization. In this work, motivated by the firm representation ability of sparse model from pruning, we reformulate the problem of graph contrastive learning via contrasting different model versions rather than augmented views. We first theoretically reveal the superiority of model pruning in contrast to data augmentations. In practice, we take original graph as input and dynamically generate a perturbed graph encoder to contrast with the original encoder by pruning its transformation weights. Furthermore, considering the integrity of node embedding in our method, we are capable of developing a local contrastive loss to tackle the hard negative samples that disturb the model training. We extensively validate our method on various benchmarks regarding graph classification via unsupervised and transfer learning. Compared to the state-of-the-art (SOTA) works, better performance can always be obtained by the proposed method.

Abstract:
Deep networks have shown impressive performance in the image restoration tasks, such as image colorization. However, we find that previous approaches rely on the digital representation from single color model with a specific mapping function, a.k.a., color space, during the colorization pipeline. In this paper, we first investigate the modeling of different color spaces, and find each of them exhibiting distinctive characteristics with unique distribution of colors. The complementarity among multiple color spaces leads to benefits for the image colorization task.

Abstract:
Virtual Reality (VR) has become increasingly popular for remote collaboration, but video conferencing poses challenges when the user's face is covered by the headset. Existing solutions have limitations in terms of accessibility. In this paper, we propose HeadsetOff, a novel system that achieves photorealistic video conferencing on economical VR headsets by leveraging voice-driven face reconstruction. HeadsetOff consists of three main components: a multimodal predictor, a generator, and an adaptive controller. The predictor effectively predicts user future behavior based on different modalities. The generator employs voice, head motion, and eye blink to animate the human face. The adaptive controller dynamically selects the appropriate generator model based on the trade-off between video quality and delay. Experimental results demonstrate the effectiveness of HeadsetOff in achieving high-quality, low-latency video conferencing on economical VR headsets.

Abstract:
Multimodal Large Language Models (MLLMs) have recently garnered attention as a prominent research focus. By harnessing powerful LLMs, they facilitate a transition of conversational generative AI from unimodal text to multimodal tasks. This boom has begun to significantly impact the medical field. However, general visual language models (VLMs) lack sophisticated comprehension for medical visual question answering (Med-VQA). Even models specifically tailored for the medical domain tend to produce vague answers with weak visual relevance. In this paper, we propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics. We then note that data redundancy, which is common in medical scenes, is ignored in most prior works. In cases of a single text paired with multiple figures, we utilize weighted scoring with knowledge distillation to adaptively screen valid images that mirror the text descriptions. For execution, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from the hospital. We create instruction-following data based on text from professional doctors, which ensures effective tuning. With the enhanced model and quality data, our Large Chinese Language and Vision Assistant for Ultrasound (LLaVA-Ultra) shows strong capability and robustness in medical scenarios. On three Med-VQA datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.

Abstract:
Visual question answering aims to provide responses to questions given visual input. Recently, visual programmatic models (VPMs), which generate programs to answer questions through large language models (LLMs), have attracted attention. However, they often require long input prompts to provide the LLM with sufficient API usage details to generate relevant code. To address this limitation, we propose AdaCoder, an adaptive prompt compression framework for VPMs. AdaCoder operates in two phases: a compression phase and an inference phase. In the compression phase, given a preprompt that describes all API definitions with example code snippets, a set of compressed preprompts is generated, each depending on a specific question type. In the inference phase, AdaCoder predicts the question type and chooses the appropriate corresponding compressed preprompt to generate code to answer the question. In experiments, we apply AdaCoder to ViperGPT and demonstrate that it reduces token length by 71.1%, while maintaining or even improving the performance of visual question answering.
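The two-phase flow described above can be illustrated with a minimal sketch. Everything concrete here is a stand-in: the API names, the question types, the keyword-based type predictor, and the idea of compressing by keeping only selected API entries are hypothetical simplifications, and the actual compression and prediction steps in AdaCoder are not specified in the abstract.

```python
def build_compressed_preprompts(api_docs, apis_by_type):
    """Compression-phase stand-in: one shortened preprompt per question type.

    api_docs     : dict mapping a (hypothetical) API name to its description.
    apis_by_type : dict mapping a question type to the API names it needs.
    """
    return {
        qtype: "\n".join(api_docs[name] for name in names)
        for qtype, names in apis_by_type.items()
    }


def predict_question_type(question):
    """Toy keyword rule standing in for a learned question-type predictor."""
    q = question.lower()
    if q.startswith("how many"):
        return "counting"
    if q.startswith(("is ", "are ", "does ", "do ")):
        return "verification"
    return "query"


def build_prompt(question, compressed):
    """Inference phase: pick the preprompt that matches the predicted type."""
    qtype = predict_question_type(question)
    return qtype, compressed[qtype] + "\n# Question: " + question
```

The design point the sketch tries to capture is that the expensive step (producing the compressed preprompts) runs once per question type, while each individual question only pays for a type prediction and a dictionary lookup.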

Abstract:
Multimodal Emotion Recognition in Conversations (ERC) is a typical multimodal learning task in exploiting various data modalities concurrently. Prior studies on effective multimodal ERC encounter challenges in addressing modality imbalances and optimizing learning across modalities. Dealing with these problems, we present a novel framework named Ada2I, which consists of two inseparable modules namely Adaptive Feature Weighting (AFW) and Adaptive Modality Weighting (AMW) for feature-level and modality-level balancing respectively via leveraging both Inter- and Intra-modal interactions. Additionally, we introduce a refined disparity ratio as part of our training optimization strategy, a simple yet effective measure to assess the overall discrepancy of the model's learning process when handling multiple modalities simultaneously. Experimental results validate the effectiveness of Ada2I with state-of-the-art performance compared to baselines on three benchmark datasets, particularly in addressing modality imbalances.

Abstract:
A growing number of end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods utilize a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and actual objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to solve the problem of bipartite graph matching instability in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting, as text spotting requires irregular-shape detection and text recognition, which is more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that the output of the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize noised content queries, thereby assisting in the alignment of text content and position. Additionally, to improve the model's perception of the background, we further utilize an additional loss function for background character classification in the denoising training part. DNTextSpotter outperforms state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), most notably achieving an 11.3% improvement over the best approach on Inverse-Text.
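The matching instability mentioned in this abstract can be made concrete with a small sketch. The function below finds the minimum-cost one-to-one assignment by brute force, which is only practical for tiny examples; real detectors typically use the Hungarian algorithm instead. The cost values in the usage note are invented for illustration.

```python
from itertools import permutations


def optimal_matching(cost):
    """Minimum-cost one-to-one matching between predictions and targets.

    cost[p][t] is the matching cost between prediction p and target t.
    Assumes there are at least as many predictions as targets.
    Returns a dict mapping each target index to its matched prediction.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda perm: sum(cost[perm[t]][t] for t in range(n_gt)),
    )
    return dict(enumerate(best))
```

With `cost = [[0.50, 0.52], [0.51, 0.50]]` the matching pairs target 0 with prediction 0 and target 1 with prediction 1. After a small change to `[[0.50, 0.47], [0.46, 0.50]]`, as might happen between two training iterations, the assignment swaps, so each query is suddenly optimized toward a different target.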

Abstract:
Deepfake facial manipulation has garnered significant public attention due to its impacts on enhancing human experiences and posing privacy threats. Despite numerous passive algorithms that have been attempted to thwart malicious Deepfake attacks, they mostly struggle with the generalizability challenge when confronted with hyper-realistic synthetic facial images. To tackle the problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. We first analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e. facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that imperceptibly and robustly embeds and extracts watermarks concerning the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.

Abstract:
Image-based 3D Virtual Try-ON (VTON) aims to sculpt the 3D human according to person and clothes images, which is data-efficient (i.e., getting rid of expensive 3D data) but challenging. Recent text-to-3D methods achieve remarkable improvement in high-fidelity 3D human generation, demonstrating its potential for 3D virtual try-on. Inspired by the impressive success of personalized diffusion models (e.g., Dreambooth and LoRA) for 2D VTON, it is straightforward to achieve 3D VTON by integrating the personalization technique into the diffusion-based text-to-3D framework. However, employing the personalized module in a pre-trained diffusion model (e.g., StableDiffusion (SD)) would degrade the model's capability for multi-view or multi-domain synthesis, which is detrimental to the geometry and texture optimization guided by Score Distillation Sampling (SDS) loss. In this work, we propose a novel customizing 3D human try-on model, named DreamVTON, to separately optimize the geometry and texture of the 3D human. Specifically, a personalized SD with multi-concept LoRA is proposed to provide the generative prior about the specific person and clothes, while a Densepose-guided ControlNet is exploited to guarantee consistent prior about body pose across various camera views. Besides, to avoid the inconsistent multi-view priors from the personalized SD dominating the optimization, DreamVTON introduces a template-based optimization mechanism, which employs mask templates for geometry shape learning and normal/RGB templates for geometry/texture details learning. Furthermore, for the geometry optimization phase, DreamVTON integrates a normal-style LoRA into personalized SD to enhance normal map generative prior, facilitating smooth geometry modeling. Extensive experiments show that DreamVTON can generate high-quality 3D Humans with the input person, clothes images, and text prompt, outperforming existing methods.

Abstract:
Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into a single or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space serves to bridge the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant quality and efficiency gains. Specifically, our method achieves a comprehensive improvement on all evaluation metrics and a faster training and sampling speed on Landscape and AIST++ datasets. Moreover, we explore its performance on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks for a comprehensive evaluation, where our MM-LDM demonstrates exciting adaptability and generalization ability.

Abstract:
Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Similar to popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the token sequence along with an additional decoder. The numerically optimized (auxiliary) token allows subsequent tokens for chart parsing to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we have devised a self-evaluation mechanism that enables the model to gauge the reliability of its chart parsing results by providing confidence scores for the generated content. Compared to current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, ChartAst, OneChart achieves significantly higher Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite having only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings 10%+ accuracy gains for the popular LVLM (LLaVA-1.6) in the downstream ChartQA benchmark. Our code, model, and datasets are available at https://onechartt.github.io.

Abstract:
Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting 2D generative prior to the 3D space. However, such a 2D generative image prior bakes the effect of illumination and shadow into the texture. As a result, material maps optimized by SDS inevitably involve spurious correlated components. The absence of precise material definition makes it infeasible to relight the generated assets reasonably in novel scenes, which limits their application in downstream scenarios. In contrast, humans can effortlessly circumvent this ambiguity by deducing the material of the object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework to infer underlying material from the 2D semantic prior. Based on such a prior model, we devise a mechanism to parse material in 3D space. We maintain a UV stack, each map of which is unprojected from a specific viewpoint. After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then employ region unification to ensure the coherence of the object parts. To fuel the learning of semantics prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.
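
The fusion step described above (a stack of per-view UV maps combined by weighted voting) can be illustrated with a minimal sketch. It assumes each view contributes a UV label map plus a per-texel weight map; the function name `fuse_uv_stack`, the `unseen` marker, and the choice of weights are hypothetical, and the region unification step is not covered.

```python
from collections import defaultdict

def fuse_uv_stack(uv_stack, weights, unseen=-1):
    """Fuse per-view UV label maps into one map by weighted voting per texel.

    uv_stack: list of 2D lists of material labels, one map per viewpoint.
    weights:  list of 2D lists of non-negative vote weights, same shapes.
    Texels that no view observed keep the `unseen` label (an assumption here).
    """
    height, width = len(uv_stack[0]), len(uv_stack[0][0])
    fused = [[unseen] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            votes = defaultdict(float)
            for label_map, weight_map in zip(uv_stack, weights):
                label = label_map[r][c]
                if label != unseen:
                    votes[label] += weight_map[r][c]
            if votes:
                # The label with the largest accumulated weight wins.
                fused[r][c] = max(votes, key=votes.get)
    return fused

# Three hypothetical views of a 1x2 UV map.
labels = [[[0, -1]], [[1, 2]], [[1, -1]]]
vote_weights = [[[0.9, 0.0]], [[0.3, 0.5]], [[0.4, 0.0]]]
fused_map = fuse_uv_stack(labels, vote_weights)
```

In practice the weights might reflect viewing angle or prediction confidence; the abstract does not say which, so that choice is left open here.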

Abstract:
Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from both the source domain and target domain by simply copying and pasting pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios. Real-world scenes have an inherent layout, and unrealistic mixing confuses the model when it predicts on the target domain. For instance, it is not reasonable to directly paste the near "pedestrian" pixels into the remote "sky" area. Based on this observation, we propose a depth-aware framework that explicitly leverages depth estimation to mix categories and facilitates two complementary tasks, i.e., segmentation and depth learning, in an end-to-end manner. Besides, several public datasets do not provide depth annotation. Therefore, we leverage an off-the-shelf depth estimation network to obtain pseudo depth. Extensive experiments show that our methods, even with pseudo depth, achieve competitive performance, i.e., 77.7 mIoU on GTA → Cityscapes and 69.3 mIoU on Synthia → Cityscapes.
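
To make the idea of depth-aware mixing concrete, here is a minimal sketch of one possible filter. It pastes a masked source pixel only when its depth is close to the depth of the target pixel it would replace, so a near "pedestrian" is not pasted onto remote "sky". The function name, the absolute-difference rule, and the `tolerance` parameter are illustrative assumptions, not the mechanism used in the paper.

```python
def depth_aware_mix(src_img, src_depth, src_mask, tgt_img, tgt_depth, tolerance):
    """Copy masked source pixels into the target where the depths roughly agree.

    All image-like inputs are equally sized 2D lists. A source pixel is pasted
    only if |source depth - target depth| <= tolerance (a hypothetical rule).
    Returns the mixed image and a boolean map of the pixels actually pasted.
    """
    mixed = [row[:] for row in tgt_img]
    pasted = [[False] * len(row) for row in tgt_img]
    for r, mask_row in enumerate(src_mask):
        for c, selected in enumerate(mask_row):
            if selected and abs(src_depth[r][c] - tgt_depth[r][c]) <= tolerance:
                mixed[r][c] = src_img[r][c]
                pasted[r][c] = True
    return mixed, pasted

# A source object at about 5 m is kept over a 6 m target pixel,
# but rejected over an 80 m (sky-like) target pixel.
mixed_img, pasted_map = depth_aware_mix(
    src_img=[[1, 1]],
    src_depth=[[5.0, 5.0]],
    src_mask=[[True, True]],
    tgt_img=[[0, 0]],
    tgt_depth=[[6.0, 80.0]],
    tolerance=3.0,
)
```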

Abstract:
Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.
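
For readers unfamiliar with the diffusion-DPO objective mentioned above, the following is a minimal scalar sketch of its commonly cited form for one winner/loser pair. It assumes the four denoising errors (squared noise-prediction errors of the fine-tuned model and of the frozen reference model, on the winner and on the loser) have already been computed; the function names are hypothetical, and `beta` here absorbs the timestep-dependent constants.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def diffusion_dpo_loss(win_err, win_err_ref, lose_err, lose_err_ref, beta=1.0):
    """Preference loss for one (winner, loser) pair.

    win_err / lose_err:         denoising error of the model being fine-tuned.
    win_err_ref / lose_err_ref: denoising error of the frozen reference model.
    The loss falls when the model improves on the winner more than on the
    loser, relative to the reference.
    """
    margin = (win_err - win_err_ref) - (lose_err - lose_err_ref)
    return -log_sigmoid(-beta * margin)

neutral = diffusion_dpo_loss(1.0, 1.0, 1.0, 1.0)    # model equals reference
improved = diffusion_dpo_loss(0.5, 1.0, 1.5, 1.0)   # better on winner, worse on loser
regressed = diffusion_dpo_loss(1.5, 1.0, 0.5, 1.0)  # the opposite
```

When the fine-tuned model matches the reference, the loss equals log 2; it drops below that value as the model fits winners better than losers.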

Abstract:
Skeleton-based zero-shot action recognition aims to recognize unknown human actions based on the learned priors of the known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works mainly focus on establishing the bridges between the known skeleton representation space and semantic descriptions space at the coarse-grained level for recognizing unknown action categories, ignoring the fine-grained alignment of these two spaces, resulting in suboptimal performance in distinguishing high-similarity action categories. To address these challenges, we propose a novel method via Side information and dual-prompTs learning for skeleton-based zero-shot Action Recognition (STAR) at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology structure and introduce the side information concerning multi-part descriptions of human body movements for alignment between the skeleton and the semantic space at the fine-grained level; 2) we design the visual-attribute and semantic-part prompts to improve the intra-class compactness within the skeleton space and inter-class separability within the semantic space, respectively, to distinguish the high-similarity actions. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.

Abstract:
Speech-language multi-modal learning presents a significant challenge due to the fine-grained, nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech audios are processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/template-based annotation frameworks with limited information and diversity, our system provides an in-depth understanding of speech style through tailored natural language descriptions, thereby enabling accurate and voluminous data generation for large model training. With this system, we create SpeechCraft, a fine-grained bilingual expressive speech dataset. It is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips. Extensive experiments demonstrate that the proposed dataset significantly boosts speech-language task performance in stylist speech synthesis and speech style understanding.

Abstract:
Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.

Abstract:
Deep convolutional neural networks have made significant breakthroughs in medical image classification, under the assumption that training samples from all classes are simultaneously available. However, in real-world medical scenarios, there's a common need to continuously learn about new diseases, leading to the emerging field of class incremental learning (CIL) in the medical domain. Typically, CIL suffers from catastrophic forgetting when trained on new classes. This phenomenon is mainly caused by the imbalance between old and new classes, and it becomes even more challenging with imbalanced medical datasets. In this work, we introduce two simple yet effective plug-in methods to mitigate the adverse effects of the imbalance. First, we propose a CIL-balanced classification loss to mitigate the classifier bias toward majority classes via logit adjustment. Second, we propose a distribution margin loss that not only alleviates the inter-class overlap in embedding space but also enforces the intra-class compactness. We evaluate the effectiveness of our method with extensive experiments on three benchmark datasets (CCH5000, HAM10000, and EyePACS). The results demonstrate that our approach outperforms state-of-the-art methods.
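
As background for the first loss, the sketch below shows the standard logit-adjustment idea for imbalanced classification: a log class-prior term is added to each logit before the softmax cross-entropy. This is the generic technique only, under the assumption that class counts are known; it is not the paper's CIL-balanced loss, and the function name and `tau` parameter are illustrative.

```python
import math

def logit_adjusted_cross_entropy(logits, target, class_counts, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior).

    Rare classes receive a more negative shift, so the model must produce a
    larger raw logit for them, which counteracts bias toward majority classes.
    """
    total = sum(class_counts)
    adjusted = [z + tau * math.log(n / total) for z, n in zip(logits, class_counts)]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(adjusted)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_norm - adjusted[target]

balanced = logit_adjusted_cross_entropy([0.0, 0.0], 1, [50, 50])
minority = logit_adjusted_cross_entropy([0.0, 0.0], 1, [90, 10])
majority = logit_adjusted_cross_entropy([0.0, 0.0], 0, [90, 10])
```

With equal raw logits, the loss on the minority class (log 10) is larger than on the majority class, so training pushes harder on minority samples.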

Abstract:
Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds secret messages within semantic features during the video editing process. Although existing traditional video steganography methods excel in balancing security and capacity, they lack adequate robustness against common distortions in online social networks (OSNs). In this paper, we propose an end-to-end robust generative video steganography network (RoGVSN), which achieves visual editing by modifying the semantic features of videos to embed secret messages. We use the face-swapping scenario as an example to demonstrate the visual editing effects. Specifically, we devise an adaptive scheme to seamlessly embed secret messages into the semantic features of videos through fusion blocks. Extensive experiments demonstrate the superiority of our method in terms of robustness, extraction accuracy, visual quality, and capacity.

Abstract:
Even in the era of large models, one of the well-known issues in continual learning (CL) is catastrophic forgetting, which is especially challenging when the continual data stream exhibits a long-tailed distribution, a setting termed Long-Tailed Continual Learning (LTCL). Existing LTCL solutions generally require the label distribution of the data stream to achieve re-balance training. However, obtaining such prior information is often infeasible in real scenarios since the model should learn without pre-identifying the majority and minority classes. To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from a long-tailed data stream with less forgetting. Concretely, motivated by our experimental finding that the minority classes are more likely to be forgotten due to their higher uncertainty, we design an uncertainty-guided reservoir sampling strategy to prioritize rehearsing minority data without using any prior information, which is based on the mutual dependence between the model and samples. Additionally, we incorporate two prior-free components to further reduce the forgetting issue: (1) a boundary constraint that preserves uncertain boundary-supporting samples for continually re-estimating task boundaries, and (2) a prototype constraint that maintains the consistency of learned class prototypes during training. Our approach is evaluated on three standard long-tailed benchmarks, demonstrating superior performance to existing CL methods and the previous SOTA LTCL approach in both task- and class-incremental learning settings, as well as ordered- and shuffled-LTCL settings.
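
One way to picture uncertainty-guided reservoir sampling is as weighted reservoir sampling in which more uncertain samples are more likely to stay in the replay buffer. The sketch below uses the classic key-based weighted reservoir scheme (key = u^(1/w)) with predictive entropy as the weight. Both choices are stand-in assumptions: the abstract bases its uncertainty on the mutual dependence between model and samples, and all names here are hypothetical.

```python
import heapq
import math
import random

def predictive_entropy(probs):
    """Entropy of a predicted class distribution, used here as the weight."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty_weighted_reservoir(stream, buffer_size, rng):
    """Keep `buffer_size` samples from a stream of (sample, weight) pairs.

    Each sample draws key = u ** (1 / weight) with u uniform in [0, 1); the
    samples with the largest keys are kept, so higher-weight (more uncertain)
    samples are retained more often. Zero-weight samples are skipped.
    """
    heap = []  # min-heap of (key, arrival index, sample)
    for index, (sample, weight) in enumerate(stream):
        if weight <= 0.0:
            continue
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, sample)
        if len(heap) < buffer_size:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [sample for _, _, sample in heap]

# Hypothetical stream: every tenth sample is highly uncertain.
confident = predictive_entropy([0.99, 0.01])
uncertain = predictive_entropy([0.5, 0.5])
stream = [(i, uncertain if i % 10 == 0 else confident) for i in range(200)]
replay_buffer = uncertainty_weighted_reservoir(stream, 10, random.Random(0))
```

No label distribution is needed: the only input besides the sample is a per-sample uncertainty score computed from the model's own prediction.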

Abstract:
Proprietary large language models (LLMs) have been widely applied in various scenarios. Additionally, deploying LLMs on edge devices is trending for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security challenges: edge-deployed models are exposed as white-box accessible to users, enabling adversaries to conduct model stealing (MS) attacks. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify four critical protection properties that existing methods fail to simultaneously satisfy: (1) maintaining protection after a model is physically copied; (2) authorizing model access at request level; (3) safeguarding runtime reverse engineering; (4) achieving high security with negligible runtime overhead. To address the above issues, we propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices. The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment, e.g., TEE, which can freshly authorize each request based on its input. Extensive experiments show that TransLinkGuard achieves the same security as the black-box guarantees with negligible overhead.

Abstract:
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing research in MMQA focuses on only two modalities, such as image-text QA, table-text QA, and chart-text QA, and studies that jointly analyze text, tables, and charts remain scarce. In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a great test for the capability of the model to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.

Abstract:
Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction was completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglect the crucial role of the agent's spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on the grounding of outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. We introduce a novel framework, Locating before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform the spatial localization before planning a decision action based on corresponding guidance, which comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, we propose to learn a position predictor that measures how far the agent is from the next intersection for reflecting its position, which is achieved by the BAL module. After the locating process, we propose the SAP module to incorporate spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms the SOTA methods.

Abstract:
Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models' latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.

Abstract:
Recent advances in query-based multi-camera 3D object detection are characterized by initializing object queries in the 3D space and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS.

Abstract:
Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, causing existing close-up detectors to be inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
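
For context on the suppression step, the sketch below implements standard greedy non-maximum suppression (NMS) on axis-aligned boxes. It is the baseline procedure only, not the Cross-slice variant (C-NMS) proposed in the paper, whose details are not given in the abstract; the box format `(x1, y1, x2, y2)` and function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop heavy overlaps.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

When an image is processed in windows or slices, the same object can be detected in several of them, which is the duplicate situation an NMS-style step is meant to resolve.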

Abstract:
Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is firstly utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results of both simulated and real events demonstrate that our methods outperform event-based state-of-the-art methods in terms of both accuracy and robustness.

Abstract:
The rise of generative AI is transforming the landscape of digital imagery, and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. They also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet, this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.

Abstract:
Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. Our key insight is to embed motion style into a cross-modality latent space and perceive the cross-structure skeleton topologies, allowing for motion stylization within a canonical motion space. Specifically, the large-scale Contrastive-Language-Image-Pre-training (CLIP) model is leveraged to construct the cross-modality latent space, enabling flexible style representation within it. Additionally, two topology-encoded tokens are learned to capture the canonical and specific skeleton topologies, facilitating cross-structure topology shifting. Subsequently, the topology-shifted stylization diffusion is designed to generate motion content for the particular skeleton and stylize it in the shifted canonical motion space using multi-modality style descriptions. Through an extensive set of examples, we demonstrate the flexibility and generalizability of our pipeline across various characters and style descriptions. Qualitative and quantitative comparisons show the superiority of our pipeline over state-of-the-art methods, consistently delivering high-quality stylized motion across a broad spectrum of skeletal structures.

Abstract:
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association of different modalities and failing to deal with salient gestures. In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. Specifically, we first learn a joint manifold space for the individual representation of audio and body pose to exploit the inherent semantic association between two modalities, and propose to enforce semantic consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify salient postures, and reweighting the consistency loss to focus more on learning the correspondence between salient postures and the high-level semantics of speech content. In addition, we propose to extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experimental results demonstrate the superiority of our method over the state-of-the-art approaches.

Abstract:
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding specific weighted layer penalties and redistribution, offering an economical and effective solution without additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.

Abstract:
The rapid evolution of multimedia and computer vision technologies requires adaptive visual model deployment strategies to effectively handle diverse tasks and varying environments. This work introduces AxiomVision, a novel framework that can guarantee accuracy by leveraging edge computing to dynamically select the most efficient visual models for video analytics under diverse scenarios. Utilizing a tiered edge-cloud architecture, AxiomVision enables the deployment of a broad spectrum of visual models, from lightweight to complex DNNs, that can be tailored to specific scenarios while considering camera source impacts. In addition, AxiomVision provides three core innovations: (1) a dynamic visual model selection mechanism utilizing continual online learning, (2) an online method that efficiently accounts for the influence of the camera's perspective, and (3) a topology-driven grouping approach that accelerates the model selection process. With rigorous theoretical guarantees, these advancements provide a scalable and effective solution for visual tasks inherent to multimedia systems, such as object detection, classification, and counting. Empirically, AxiomVision achieves a 25.7% improvement in accuracy.

Abstract:
Continual graph learning (CGL) is an important and challenging task that aims to extend static GNNs to dynamic task flow scenarios. As one of the mainstream CGL methods, the experience replay (ER) method receives widespread attention due to its superior performance. However, existing ER methods focus on identifying samples by feature significance or topological relevance, which limits their utilization of comprehensive graph data. In addition, the topology-based ER methods only consider local topological information and add neighboring nodes to the buffer, which ignores the global topological information and increases memory overhead. To bridge these gaps, we propose a novel method called Feature-Topology Fusion-based Experience Replay (FTF-ER) to effectively mitigate the catastrophic forgetting issue with enhanced efficiency. Specifically, from an overall perspective to maximize the utilization of the entire graph data, we propose a highly complementary approach including both feature and global topological information, which can significantly improve the effectiveness of the sampled nodes. Moreover, to further utilize global topological information, we propose Hodge Potential Score (HPS) as a novel module to calculate the topological importance of nodes. HPS derives a global node ranking via Hodge decomposition on graphs, providing more accurate global topological information compared to neighbor sampling. By excluding neighbor sampling, HPS significantly reduces buffer storage costs for acquiring topological information and simultaneously decreases training time. Compared with state-of-the-art methods, FTF-ER achieves a significant improvement of 3.6% in AA and 7.1% in AF on the OGB-Arxiv dataset, demonstrating its superior performance in the class-incremental learning setting.

Abstract:
Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and design sophisticated temporal alignment modules at the feature level. However, simply fully fine-tuning the pre-trained model could cause overfitting due to the scarcity of video samples. Additionally, we argue that the exploration of task-specific information is insufficient when relying solely on well-extracted abstract features. In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. By introducing the proposed Task-Adapter into the last several layers of the backbone and keeping the parameters of the original pre-trained model frozen, we mitigate the overfitting problem caused by full fine-tuning and advance the task-specific mechanism into the process of feature extraction. In each Task-Adapter, we reuse the frozen self-attention layer to perform task-specific self-attention across different videos within the given task to capture both distinctive information among classes and shared information within classes, which facilitates task-specific adaptation and enhances subsequent metric measurement between the query feature and support prototypes. Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets. In particular, on the temporally challenging SSv2 dataset, our method outperforms the state-of-the-art methods by a large margin.

Abstract:
The weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than across entire video frames, which implies that existing works based on frame-level features may be misled by the dominant background information and lack an interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.

Abstract:
With the rise of Machine Learning as a Service (MLaaS) platforms, safeguarding the intellectual property of deep learning models is becoming paramount. Among various protective measures, trigger set watermarking has emerged as a flexible and effective strategy for preventing unauthorized model distribution. However, this paper identifies an inherent flaw in the current paradigm of trigger set watermarking: evasion adversaries can readily exploit the shortcuts created by models memorizing watermark samples that deviate from the main task distribution, significantly impairing their generalization in adversarial settings. To counteract this, we leverage diffusion models to synthesize unrestricted adversarial examples as trigger sets. By training the model to accurately recognize them, unique watermark behaviors are promoted through knowledge injection rather than error memorization, thus avoiding exploitable shortcuts. Furthermore, we uncover that the resistance of current trigger set watermarking against removal attacks primarily relies on significantly damaging the decision boundaries during embedding, intertwining unremovability with adverse impacts. By optimizing the knowledge transfer properties of protected models, our approach conveys watermark behaviors to extraction surrogates without aggressive decision boundary perturbation. Experimental results on CIFAR-10/100 and Imagenette datasets demonstrate the effectiveness of our method, showing not only improved robustness against evasion adversaries but also superior resistance to watermark removal attacks compared to state-of-the-art solutions.

Abstract:
In digital security, Reversible Adversarial Examples (RAE) blend adversarial attacks with Reversible Data Hiding (RDH) within images to thwart unauthorized access. Traditional RAE methods, however, compromise attack efficiency for the sake of perturbation concealment, diminishing the protective capacity of valuable perturbations and limiting applications to white-box scenarios. This paper proposes a novel Dual-Phase merging Reversible Adversarial Example (DP-RAE) generation framework, combining a heuristic black-box attack and RDH with Grayscale Invariance (RDH-GI) technology. This dual strategy not only evaluates and harnesses the adversarial potential of past perturbations more effectively but also guarantees flawless embedding of perturbation information and complete recovery of the original image. Experimental validation reveals our method's superiority: it secures a 96.9% success rate and a 100% recovery rate in compromising black-box models. In particular, it achieves a 90% misdirection rate against commercial models under a constrained number of queries. This marks the first successful attempt at targeted black-box reversible adversarial attacks for commercial recognition models. This achievement highlights our framework's capability to enhance security measures without sacrificing attack performance. Moreover, our attack framework is flexible, allowing the interchangeable use of different attack and RDH modules to meet advanced technological requirements.

Abstract:
Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render and are not suitable for multi-human scenes with complex shadows. To reduce this cost, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time (1/60), less GPU memory (1/4), and faster rendering speed (7x). Our method can be easily extended to multi-human scenes and achieves comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.
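
The deformation of skinned Gaussians from canonical to posed space can be illustrated with linear blend skinning. This is a minimal sketch under the assumption that each Gaussian center is moved by a weighted sum of per-bone rigid transforms; the abstract does not state the exact skinning formulation, and the names `apply_rigid` and `skin_center` are hypothetical. Rotation of the Gaussian covariance is omitted.

```python
def apply_rigid(transform, point):
    """Apply a 3x4 [R | t] bone transform to a 3D point."""
    return [
        sum(transform[r][c] * point[c] for c in range(3)) + transform[r][3]
        for r in range(3)
    ]


def skin_center(center, weights, bone_transforms):
    """Move one canonical Gaussian center to posed space.

    Assumed linear blend skinning: the posed center is the
    weight-blended result of applying every bone transform
    to the canonical center. Weights are expected to sum to 1.
    """
    posed = [0.0, 0.0, 0.0]
    for weight, transform in zip(weights, bone_transforms):
        moved = apply_rigid(transform, center)
        posed = [acc + weight * value for acc, value in zip(posed, moved)]
    return posed


# Two bones: one stays at rest, one translates by +2 along x.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
posed_center = skin_center([1.0, 1.0, 1.0], [0.5, 0.5], [identity, shift_x])
```

A center bound equally to both bones ends up halfway between the two transformed positions, which is the behavior that lets one set of canonical Gaussians follow an articulated skeleton.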

Abstract:
Transformer-based deep models for single image super-resolution (SISR) have greatly improved the performance of lightweight SISR tasks in recent years. However, they often suffer from heavy computational burden and slow inference due to the complex calculation of multi-head self-attention (MSA), seriously hindering their practical application and deployment. In this work, we present an efficient SR model to mitigate the dilemma between model efficiency and SR performance, dubbed Entropy Attention and Receptive Field Augmentation network (EARFA), which is composed of a novel entropy attention (EA) and a shifting large kernel attention (SLKA). From the perspective of information theory, EA increases the entropy of intermediate features conditioned on a Gaussian distribution, providing more informative input for subsequent reasoning. On the other hand, SLKA extends the receptive field of SR models with the assistance of channel shifting, which also helps boost the diversity of hierarchical features. Since the implementation of EA and SLKA does not involve complex computations (such as extensive matrix multiplications), the proposed method can achieve faster nonlinear inference than Transformer-based SR models while maintaining better SR performance. Extensive experiments show that the proposed model can significantly reduce the delay of model inference while achieving SR performance comparable to that of other advanced models.

Abstract:
In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant contributes more to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research works in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need representative validation data to measure the model accuracy, and such data is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants' contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to only count the most important part of model parameters through weight quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
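
The two ideas named above (keeping only the most important parameters through quantization, and scoring a participant against global updates from later rounds) can be sketched as follows. This is an illustrative sketch, not the CoAst implementation: the ternary sign quantization, the `keep_ratio` parameter, the use of cosine similarity, and the averaging over rounds are all assumptions, and the function names are hypothetical.

```python
import math


def ternarize(update, keep_ratio=0.2):
    """Keep only the sign of the largest-magnitude entries (assumed scheme).

    Entries below the magnitude threshold become 0. Ties at the
    threshold are all kept, so slightly more than keep_ratio of the
    entries may survive.
    """
    if not update:
        return []
    k = max(1, int(len(update) * keep_ratio))
    threshold = sorted((abs(v) for v in update), reverse=True)[k - 1]
    return [
        (1 if v > 0 else -1) if v != 0 and abs(v) >= threshold else 0
        for v in update
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_round_score(local_update, later_global_updates, keep_ratio=0.2):
    """Average similarity between one quantized local update and the
    quantized global updates of several subsequent rounds."""
    q_local = ternarize(local_update, keep_ratio)
    sims = [
        cosine(q_local, ternarize(g, keep_ratio))
        for g in later_global_updates
    ]
    return sum(sims) / len(sims) if sims else 0.0


local = [0.9, -0.8, 0.01, 0.02, 0.0]
aligned = cross_round_score(local, [[1.0, -1.0, 0.0, 0.1, 0.0]], keep_ratio=0.4)
opposed = cross_round_score(local, [[-1.0, 1.0, 0.0, 0.1, 0.0]], keep_ratio=0.4)
```

Under these assumptions, a participant whose dominant parameter changes point the same way as later global updates receives a score near 1, and one that points the opposite way receives a score near -1, without any validation data being consulted.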

Abstract:
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain visual transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely used fine-grained datasets in closed-set and open-set settings.

Abstract:
Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on the nuScenes dataset show that, compared with previous state-of-the-art MTL methods, MaskBEV achieves a 1.3 NDS improvement in 3D object detection and a 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly faster inference speed.

Abstract:
This paper investigates the challenging problem of learned image compression (LIC) at extremely low bitrates. Previous LIC methods based on transmitting quantized continuous features often yield blurry and noisy reconstruction due to the severe quantization loss, while previous LIC methods based on learned codebooks that discretize visual space usually give poor-fidelity reconstruction due to the insufficient representation power of limited codewords in capturing faithful details. We propose a novel dual-stream framework, HybridFlow, which combines the continuous-feature-based and codebook-based streams to achieve both high perceptual quality and high fidelity at extremely low bitrates. The codebook-based stream benefits from the high-quality learned codebook priors to provide high quality and clarity in reconstructed images. The continuous feature stream targets the preservation of fidelity details. To achieve the ultra-low bitrate, a masked token-based transformer is further proposed, where we only transmit a masked portion of codeword indices and recover the missing indices through token generation guided by information from the continuous feature stream. We also develop a bridging correction network to merge the two streams in pixel decoding for final image reconstruction, where the continuous stream features rectify biases of the codebook-based pixel decoder to restore fidelity details in the reconstruction. Experimental results demonstrate superior performance across several datasets at extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.

Abstract:
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms that map volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples the anatomical structure information from CT scans and the style information from unpaired real X-ray images and digitally reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity of computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset and achieved 97.8350, 0.0842 and 3.0938 in terms of FID, KID and defined user-scored X-ray similarity, respectively. In comparison with 3D-aware methods (π-GAN, EG3D), CT2X-GAN achieves superior synthesis quality and greater realism relative to real X-ray images.

Abstract:
In recent years, there has been a significant focus on research related to text-guided image inpainting. However, the task remains challenging due to several constraints, such as ensuring alignment between the image and the text, and maintaining consistency in distribution between corrupted and uncorrupted regions. In this paper, we therefore propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency for text-guided inpainting. DAFT-GAN integrates two affine transformation networks to combine text and image features gradually for each decoding block. Moreover, we minimize information leakage of uncorrupted features for fine-grained image generation by encoding corrupted and uncorrupted regions of the masked image separately. Our proposed model outperforms the existing GAN-based models in both qualitative and quantitative assessments on three benchmark datasets (MS-COCO, CUB, and Oxford) for text-guided image inpainting.

Abstract:
Pixel-level fine-grained image editing remains an open challenge. Previous works fail to achieve an ideal trade-off between control granularity and inference speed. They either fail to achieve pixel-level fine-grained control, or their inference speed requires optimization. To address this, this paper for the first time employs a regression-based network to learn the variation patterns of StyleGAN latent codes during the image dragging process. This method enables pixel-level precision in dragging editing with little time cost. Users can specify handle points and their corresponding target points on any GAN-generated images, and our method will move each handle point to its corresponding target point. Through experimental analysis, we discover that a short movement distance from handle points to target points yields a high-fidelity edited image, as the model only needs to predict the movement of a small portion of pixels. To achieve this, we decompose the entire movement process into multiple sub-processes. Specifically, we develop a transformer encoder-decoder based network named 'Latent Predictor' to predict the latent code motion trajectories from handle points to target points in an autoregressive manner. Moreover, to enhance the prediction stability, we introduce a component named 'Latent Regularizer', aimed at constraining the latent code motion within the distribution of natural images. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) inference speed and image editing performance at the pixel-level granularity.

Abstract:
Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which are plagued by poor generation quality and the average facial shape problem. Although diffusion models show impressive generative ability, their exploration in talking head generation remains unsatisfactory. This is because they either solely use the diffusion model to obtain an intermediate representation and then employ another pre-trained renderer, or they overlook the feature decoupling of complex facial details, such as expressions, head poses and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial details through multiple stages. Specifically, we separate facial details into motion and appearance. In the initial phase, we design the Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, making them easier for the network to learn compared to high-dimensional RGB images. Subsequently, in the second phase, we encode the reference image to capture appearance textures. The predicted facial and head motions and encoded appearance then serve as the conditions for the Diffusion UNet, guiding the frame generation. Benefiting from decoupling facial details and fully leveraging diffusion models, our approach, as extensive experiments substantiate, excels in enhancing image quality and generating more accurate and diverse results compared to previous state-of-the-art methods.

Abstract:
Efficient visual perception using mobile systems is crucial, particularly in unknown environments such as search and rescue operations, where swift and comprehensive perception of objects of interest is essential. In such real-world applications, objects of interest are often situated in complex settings, making the selection of the 'Next Best' view based solely on maximizing visibility gain suboptimal. We argue that incorporating semantics, which provide a higher-level interpretation of perception, can significantly contribute to the selection of viewpoints for various perception tasks. In this study, we formulate a novel information gain that integrates both visibility and semantic gain in a unified form to select the semantic-aware Next-Best-View. We also design an adaptive strategy with a termination criterion to facilitate the two-stage search-and-acquisition manoeuvre on multiple objects of interest aided by a multi-degree-of-freedom (Multi-DoF) mobile system. To evaluate our approach, we introduce several semantically relevant reconstruction metrics, including perspective directivity and the region of interest (ROI)-to-full reconstruction volume ratio. Simulation experiments demonstrate that our approach outperforms the existing methods by up to 27.46% in the ROI-to-full reconstruction volume ratio and 0.88234 in average perspective directivity. Furthermore, the planned motion trajectory exhibits better perceiving coverage toward the target.

Abstract:
The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall rankings of relevant videos at the top of the list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and the evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) the current similarity measure and AP-based loss are suboptimal for video retrieval; b) the noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for multimedia applications.
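
To make the idea of aggregating frame-level matches into one video-level similarity concrete, the sketch below contrasts the standard Chamfer similarity with one possible top-K variant. The abstract does not define TopK-Chamfer Similarity, so the variant here (averaging only the K strongest per-frame matches) is an assumption for illustration, and both function names are hypothetical.

```python
def chamfer_similarity(frame_sims):
    """Standard Chamfer similarity between two videos.

    frame_sims[i][j] is the similarity between query frame i and
    reference frame j. Each query frame is matched to its best
    reference frame, and the matches are averaged.
    """
    best_matches = [max(row) for row in frame_sims]
    return sum(best_matches) / len(best_matches)


def topk_chamfer_similarity(frame_sims, k):
    """Assumed top-K variant: average only the k strongest matches,
    so weakly matched (possibly noisy) query frames are ignored."""
    best_matches = sorted((max(row) for row in frame_sims), reverse=True)
    top = best_matches[:k]
    return sum(top) / len(top)


# Three query frames against two reference frames; the third query
# frame has no good match in the reference video.
sims = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.0],
]
plain = chamfer_similarity(sims)
top2 = topk_chamfer_similarity(sims, k=2)
```

In this toy case the unmatched third frame pulls the plain Chamfer score down, while the assumed top-K variant scores the pair by its two well-matched frames only.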

Abstract:
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.

Abstract:
Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework between pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader maintains consistency from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement is even more significant on challenging datasets. The results demonstrate that a vision-language reconstructor can serve as an effective scene text recognizer.

Abstract:
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity to the threshold value. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within multimodal loss distribution. Then in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance the semantic compactness and meanwhile alleviate over-fitting to noisy labels, where a self-correction strategy is introduced to improve the quality of representation. Moreover, to evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, where DAC outperforms state-of-the-art models by a large margin (i.e., with a +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).

Abstract:
We propose Precomputed Radiance Transfer of Gaussian Splats (PRTGS), a real-time high-quality relighting method for Gaussian splats in low-frequency lighting environments that captures soft shadows and interreflections by precomputing 3D Gaussian splats' radiance transfer. Existing studies have demonstrated that 3D Gaussian splatting (3DGS) outperforms neural fields in efficiency for dynamic lighting scenarios. However, current relighting methods based on 3DGS still struggle to compute high-quality shadows and indirect illumination in real time for dynamic light, leading to unrealistic rendering results. We solve this problem by precomputing the expensive transport simulations required for complex transfer functions like shadowing; the resulting transfer functions are represented as dense sets of vectors or matrices for every Gaussian splat. We introduce distinct precomputing methods tailored for training and rendering stages, along with unique ray tracing and indirect lighting precomputation techniques for 3D Gaussian splats to accelerate training speed and compute accurate indirect lighting related to environment light. Experimental analyses demonstrate that our approach achieves state-of-the-art visual quality while maintaining competitive training times and, importantly, allows high-quality real-time (30+ fps) relighting for dynamic light and relatively complex scenes at 1080p resolution.
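
In classic precomputed radiance transfer, relighting with a per-element transfer vector reduces to a dot product with the lighting coefficients expressed in the same low-frequency basis (typically spherical harmonics). The sketch below shows that step for a list of splats. It is a generic single-channel, diffuse-style illustration under that assumption, not the PRTGS pipeline, and `relight_splats` is a hypothetical name.

```python
def relight_splats(transfer_vectors, light_coeffs):
    """Relight every splat with one dot product per splat.

    transfer_vectors: one precomputed coefficient vector per splat,
        encoding how incoming light (including shadowing) becomes
        outgoing radiance at that splat.
    light_coeffs: the environment lighting projected onto the same
        basis; changing the lighting only changes this vector.
    """
    radiances = []
    for transfer in transfer_vectors:
        if len(transfer) != len(light_coeffs):
            raise ValueError("transfer and lighting must share one basis size")
        radiances.append(sum(t * l for t, l in zip(transfer, light_coeffs)))
    return radiances


# Two splats with 4 coefficients each (for example, SH bands 0 and 1).
transfer = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
radiance = relight_splats(transfer, [2.0, 4.0, 0.0, 0.0])
```

Because the expensive transport simulation is baked into the transfer vectors ahead of time, a lighting change at run time costs only these dot products, which is what makes real-time relighting feasible.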

Abstract:
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.

Abstract:
As physical adversarial attacks become extensively applied in unearthing the potential risk of security-critical scenarios, especially in dynamic scenarios, their vulnerability to environmental variations has also been brought to light. The non-robust nature of physical adversarial attack methods consequently leads to unstable performance. Although methods such as Expectation over Transformation (EOT) have enhanced the robustness of traditional contact attacks like adversarial patches, they fall short in practicality and concealment within dynamic environments such as traffic scenarios. Meanwhile, non-contact laser attacks, while offering enhanced adaptability, face constraints due to a limited optimization space for their attributes, rendering EOT less effective. This limitation underscores the necessity for developing a new strategy to augment the robustness of such practices. To address these issues, this paper introduces the Embodied Laser Attack (ELA), a novel framework that leverages the embodied intelligence paradigm of Perception-Decision-Control to dynamically tailor non-contact laser attacks. For the perception module, given the challenge of simulating the victim's view by full-image transformation, ELA develops a local perspective transformation network that is based on the intrinsic prior knowledge of traffic scenes and enables effective and efficient estimation. For the decision and control module, ELA trains an attack agent with data-driven reinforcement learning instead of adopting time-consuming heuristic algorithms, making it capable of instantaneously determining a valid attack strategy from the perceived information through well-designed rewards; the strategy is then executed by a controllable laser emitter. Experimentally, we apply our framework to diverse traffic scenarios in both the digital and physical worlds, verifying the effectiveness of our method under dynamic successive scenes.

Abstract:
Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with one dominant object in simple compositions. However, localized editing in images containing multiple objects and intricate compositions has not been well-studied in the literature, despite its growing real-world demands. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region, causing noticeable discordance with their complex surroundings. Meanwhile, attention-based methods such as Prompt-to-Prompt (P2P) often exhibit editing leakage and misalignment in more complex compositions. In this work, we propose MAG-Edit, a plug-and-play, inference-stage optimization method, that empowers attention-based editing approaches, such as P2P, to enhance localized image editing in intricate scenarios. In particular, MAG-Edit optimizes the noise latent feature by encouraging two mask-based cross-attention ratios of the edit token, which in turn gradually enhances the local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.

Abstract:
Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Model (LLM) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground-truth. Compared to the conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.
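
The progressive update described above can be illustrated with a minimal sketch. It is not the authors' implementation: the word-overlap similarity, the record format, and the names `retrieve`, `progressive_rounds`, and `toy_act` are all hypothetical stand-ins, and the agent's interaction with the environment is reduced to a callback.

```python
from typing import Callable, List, Tuple

# A record pairs a task description with a summary of one past interaction.
Record = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity (an assumption, not the paper's retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(database: List[Record], task: str, k: int = 2) -> List[Record]:
    """Return the k stored records whose task text is most similar to `task`."""
    ranked = sorted(database, key=lambda r: jaccard(r[0], task), reverse=True)
    return ranked[:k]


def progressive_rounds(
    tasks: List[str],
    act: Callable[[str, List[Record]], str],
    rounds: int,
) -> List[Record]:
    """Run several rounds; each round reads the database built by earlier rounds."""
    database: List[Record] = []
    for _ in range(rounds):
        new_records: List[Record] = []
        for task in tasks:
            references = retrieve(database, task)
            outcome = act(task, references)  # hypothetical agent interaction
            new_records.append((task, outcome))
        # The database is updated only after the round, without any ground truth.
        database = database + new_records
    return database


def toy_act(task: str, references: List[Record]) -> str:
    """Stand-in for the LLM agent: only reports how many references it received."""
    return f"used {len(references)} references"


db = progressive_rounds(["heat an egg", "cool an egg"], toy_act, rounds=2)
```

In the first round the database is empty, so the agent acts without references; from the second round onward it receives records from earlier interactions, which is the one-shot versus iterative distinction the abstract draws.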

Abstract:
Text-to-image diffusion models sometimes depict blended concepts in the generated images. One promising use case of this effect would be the nonword-to-image generation task which attempts to generate images intuitively imaginable from a non-existing word (nonword). To realize nonword-to-image generation, an existing study focused on associating nonwords with similar-sounding words. Since each nonword can have multiple similar-sounding words, generating images containing their blended concepts would increase intuitiveness, facilitating creative activities and promoting computational psycholinguistics. Nevertheless, no existing study has quantitatively evaluated this effect in either diffusion models or the nonword-to-image generation paradigm. Therefore, this paper first analyzes the conceptual blending in a pretrained diffusion model, Stable Diffusion. The analysis reveals that a high percentage of generated images depict blended concepts when inputting an embedding interpolating between the text embeddings of two text prompts referring to different concepts. Next, this paper explores the best text embedding space conversion method of an existing nonword-to-image generation framework to ensure both the occurrence of conceptual blending and image generation quality. We compare the conventional direct prediction approach with the proposed method that combines k-nearest neighbor search and linear regression. Evaluation reveals that the enhanced accuracy of the embedding space conversion by the proposed method improves the image generation quality, while the emergence of conceptual blending could be attributed mainly to the specific dimensions of the high-dimensional text embedding space.
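
As a rough illustration of feeding a diffusion model an embedding that lies between two prompt embeddings, the sketch below assumes plain linear interpolation. The abstract does not state which interpolation scheme is used, and the function name `interpolate` and the toy vectors are hypothetical.

```python
from typing import List


def interpolate(a: List[float], b: List[float], alpha: float) -> List[float]:
    """Linear interpolation: alpha=0 returns a, alpha=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


# Toy 4-dimensional stand-ins for two prompt embeddings (made-up values).
emb_a = [1.0, 0.0, 2.0, -1.0]
emb_b = [0.0, 1.0, 0.0, 1.0]
midpoint = interpolate(emb_a, emb_b, 0.5)
```

Real text embeddings have far more dimensions and typically one vector per token, so in practice the same operation would be applied element-wise over the full embedding tensor.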

Abstract:
Recently, a novel form of audio partial forgery has posed challenges to its forensics, requiring advanced countermeasures to detect subtle forgery manipulations within long-duration audio. However, existing countermeasures still serve a classification purpose and fail to perform meaningful analysis of the start and end timestamps of partial forgery segments. To address this challenge, we introduce a novel coarse-to-fine proposal refinement framework (CFPRF) that incorporates a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization. Specifically, the FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions. The PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN. To learn robust discriminative features, we devise a difference-aware feature learning (DAFL) module guided by contrastive representation learning to enlarge the sensitive differences between different frames induced by minor manipulations. We further design a boundary-aware feature enhancement (BAFE) module to capture the contextual information of multiple transition boundaries and guide the interaction between boundary information and temporal features via a cross-attention mechanism. Extensive experiments show that our CFPRF achieves state-of-the-art performance on various datasets, including LAV-DF, ASVS2019PS, and HAD.
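
The coarse-to-fine idea can be sketched as follows, under stated assumptions: frame-level fake scores are thresholded into contiguous coarse proposals, and each proposal is then shifted by a pair of regression offsets. The threshold, the half-open frame-index convention, and the functions `coarse_proposals` and `refine` are hypothetical; in the paper the scores and offsets come from the FDN and PRN, which are not modeled here.

```python
from typing import List, Tuple


def coarse_proposals(frame_scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames scoring >= threshold into half-open [start, end) spans."""
    proposals: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a suspected forged region begins
        elif score < threshold and start is not None:
            proposals.append((start, i))  # the region ended at the previous frame
            start = None
    if start is not None:
        proposals.append((start, len(frame_scores)))
    return proposals


def refine(
    proposals: List[Tuple[int, int]],
    offsets: List[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Shift each proposal's start and end by its (start, end) offset pair."""
    return [(s + ds, e + de) for (s, e), (ds, de) in zip(proposals, offsets)]


# Made-up frame-level fake probabilities and made-up regression offsets.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.9]
coarse = coarse_proposals(scores)
fine = refine(coarse, [(-0.5, 0.25), (0.0, -0.5)])
```

Multiplying the refined frame positions by the frame hop duration would give start and end timestamps in seconds.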

Abstract:
Modern perception systems for autonomous flight are sensitive to occlusion and have limited long-range capability, which is a key bottleneck in improving low-altitude economic task performance. Recent research has shown that the UAV-to-UAV (U2U) cooperative perception system has great potential to revolutionize the autonomous flight industry. However, the lack of a large-scale dataset is hindering progress in this area. This paper presents U2UData, the first large-scale cooperative perception dataset for swarm UAVs autonomous flight. The dataset was collected by three UAVs flying autonomously in the U2USim, covering a 9 km^2 flight area. It comprises 315K LiDAR frames, 945K RGB and depth frames, and 2.41M annotated 3D bounding boxes for 3 classes. It also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. U2USim is the first real-world mapping swarm UAVs simulation environment. It takes Yunnan Province as the prototype and includes 4 terrains, 7 weather conditions, and 8 sensor types. U2UData introduces two perception tasks: cooperative 3D object detection and cooperative 3D object tracking. This paper provides comprehensive benchmarks of recent cooperative perception algorithms on these tasks.

Abstract:
Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation, i.e., focusing only on the accuracy of AU recognition and overlooking explanations of corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and distinguishable global face description when optimising FAU recognition. Through this, the global facial representation and its local AU representations will achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is utilised to improve AU individual attention-aware representation capabilities based on multi-scale combined facial stem feature. Extensive experiments on DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over the state-of-the-art methods on most of the metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretability language descriptions with the AUs' predictions.

Abstract:
Large multimodal models (LMMs) have shown remarkable performance in the visual commonsense reasoning (VCR) task, which aims to answer a multiple-choice question based on visual commonsense within an image. However, the ability of LMMs to correct potential visual commonsense errors in the distractors when they occur remains under-explored. Drawing inspiration from how a human teacher crafts challenging distractors to test students' comprehension of concepts or skills and helps them identify and correct errors on the way to the answer, we present the first study in which LMMs simulate this error correction process. To this end, we employ GPT-4 as a "teacher" to collect the explainable feedback dataset VCR-DF for error correction, which serves as a benchmark to evaluate the ability of LMMs to identify misconceptions and clarify the reasons behind errors in VCR distractors on the way to the final answers. In addition, we propose an LMM-based Pedagogical Expert Instructed Feedback Generation (PEIFG) model to incorporate learnable expert prompts and multimodal instruction as guidance for feedback generation. Experimental results show that our PEIFG significantly outperforms existing LMMs. We believe that our benchmark provides a new direction for evaluating the capabilities of LMMs.

Abstract:
Automated Machine Learning (AutoML) offers a promising approach to streamline the training of machine learning models. However, existing AutoML frameworks are often limited to unimodal scenarios and require extensive manual configuration. Recent advancements in Large Language Models (LLMs) have showcased their exceptional abilities in reasoning, interaction, and code generation, presenting an opportunity to develop a more automated and user-friendly framework. To this end, we introduce AutoM3L, an innovative Automated Multimodal Machine Learning framework that leverages LLMs as controllers to automatically construct multimodal training pipelines. AutoM3L comprehends data modalities and selects appropriate models based on user requirements, providing automation and interactivity. By eliminating the need for manual feature engineering and hyperparameter optimization, our framework simplifies user engagement and enables customization through directives, addressing the limitations of previous rule-based AutoML approaches. We evaluate the performance of AutoM3L on six diverse multimodal datasets spanning classification, regression, and retrieval tasks, as well as a comprehensive set of unimodal datasets. The results demonstrate that AutoM3L achieves competitive or superior performance compared to traditional rule-based AutoML methods. Furthermore, a user study highlights the user-friendliness and usability of our framework, compared to the rule-based AutoML methods.

Abstract:
Scene Graph Generation (SGG) is a scene understanding task that aims at identifying object entities and reasoning about their relationships within a given image. In contrast to prevailing two-stage methods based on a large object detector (e.g., Faster R-CNN), one-stage methods integrate a fixed-size set of learnable queries to jointly reason about relational triplets. This paradigm demonstrates robust performance with significantly reduced parameters and computational overhead. However, the challenge in one-stage methods stems from the issue of weak entanglement, wherein entities involved in relationships require both coupled features shared within triplets and decoupled visual features. Previous methods either adopt a single decoder for coupled triplet feature modeling or multiple decoders for separate visual feature extraction but fail to consider both. In this paper, we introduce UniQ, a Unified decoder with task-specific Queries architecture, where task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, and a unified decoder enables coupled feature modeling within relational triplets. Experimental results on the Visual Genome dataset demonstrate that UniQ has superior performance to both one-stage and two-stage methods.

Abstract:
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.

Abstract:
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to address the limitations of current video tampering forensics, such as poor generalizability, singular function, and single modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, crucial for the sustainable development of video editing in the AIGC video era.

Abstract:
With the rapid advancement of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a specific image was used as a member of a model's training set, Membership Inference Attack (MIA) has been proposed as a tool for privacy protection. Current MIA methods predominantly utilize pixel-wise comparisons as distinguishing clues, considering the pixel-level memorization characteristic of diffusion models. However, it is practically impossible for text-to-image models to memorize all the pixel-level information in massive training sets. Therefore, we move to the more advanced structure-level memorization. Observations on the diffusion process show that the structures of members are better preserved than those of nonmembers, indicating that diffusion models are able to remember the structures of member images from training sets. Drawing on these insights, we propose a simple yet effective MIA method tailored for text-to-image diffusion models. Extensive experimental results validate the efficacy of our approach. Compared to current pixel-level baselines, our approach not only achieves state-of-the-art performance but also demonstrates remarkable robustness against various distortions.

Abstract:
Social media popularity (SMP) prediction is a complex task involving multi-modal data integration. While pre-trained vision-language models (VLMs) like CLIP have been widely adopted for this task, their effectiveness in capturing the unique characteristics of social media content remains unexplored. This paper critically examines the applicability of CLIP-based features in SMP prediction, focusing on the overlooked phenomenon of semantic inconsistency between images and text in social media posts. Through extensive analysis, we demonstrate that this inconsistency increases with post popularity, challenging the conventional use of VLM features. We provide a comprehensive investigation of semantic inconsistency across different popularity intervals and analyze the impact of VLM feature adaptation on SMP tasks. Our experiments reveal that incorporating inconsistency measures and adapted text features significantly improves model performance, achieving an SRC of 0.729 and an MAE of 1.227. These findings not only enhance SMP prediction accuracy but also provide crucial insights for developing more targeted approaches in social media analysis.
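
One simple way to quantify image-text semantic inconsistency is one minus the cosine similarity of the image and text embeddings. The abstract does not give the exact measure, so this definition is an assumption, and the functions `cosine_similarity` and `inconsistency` and the toy vectors below are hypothetical stand-ins for CLIP features.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def inconsistency(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Assumed measure: 0 when the embeddings align, growing as they diverge."""
    return 1.0 - cosine_similarity(image_emb, text_emb)


# Toy embeddings (made-up values standing in for image and text features).
aligned = inconsistency([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
unrelated = inconsistency([1.0, 0.0], [0.0, 1.0])
```

Such a score could be averaged within each popularity interval to examine how inconsistency varies with popularity, or appended to the feature vector as an extra input.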

Abstract:
Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA compared to state-of-the-art zero-shot VAD approaches.

Abstract:
We present TimeNeRF, a generalizable neural rendering approach for rendering novel views at arbitrary viewpoints and at arbitrary times, even with few input views. For real-world applications, it is expensive to collect multiple views and inefficient to re-optimize for unseen scenes. Moreover, as the digital realm, particularly the metaverse, strives for increasingly immersive experiences, the ability to model 3D environments that naturally transition between day and night becomes paramount. While current techniques based on Neural Radiance Fields (NeRF) have shown remarkable proficiency in synthesizing novel views, the exploration of NeRF's potential for temporal 3D scene modeling remains limited, with no dedicated datasets available for this purpose. To this end, our approach harnesses the strengths of multi-view stereo, neural radiance fields, and disentanglement strategies across diverse datasets. This equips our model with the capability for generalizability in a few-shot setting, allows us to construct an implicit content radiance field for scene representation, and further enables the building of neural radiance fields at any arbitrary time. Finally, we synthesize novel views of that time via volume rendering. Experiments show that TimeNeRF can render novel views in a few-shot setting without per-scene optimization. Most notably, it excels in creating realistic novel views that transition smoothly across different times, adeptly capturing intricate natural scene changes from dawn to dusk.

Abstract:
Image deblurring aims to restore a high-quality image from its blurred counterpart. The emergence of CNNs and Transformers has enabled significant progress. However, these methods often face a dilemma between eliminating long-range degradation perturbations and maintaining computational efficiency. While the selective state space model (SSM) shows promise in modeling long-range dependencies with linear complexity, it also encounters challenges such as local pixel forgetting and channel redundancy. To address these issues, we propose an efficient image deblurring network that leverages a selective state space model to aggregate enriched and accurate features. Specifically, we introduce an aggregate local and global information block (ALGBlock) designed to effectively capture and integrate both local invariant properties and non-local information. The ALGBlock comprises two primary modules: a module for capturing local and global features (CLGF), and a feature aggregation module (FA). The CLGF module is composed of two branches: the global branch captures long-range dependency features via a selective state space model, while the local branch employs simplified channel attention to model local connectivity, thereby reducing local pixel forgetting and channel redundancy. In addition, we design an FA module to accentuate the local part by recalibrating the weight during the aggregation of the two branches for restoration. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches on widely used benchmarks.

Abstract:
With the progressive advancements in deep graph learning, out-of-distribution (OOD) detection for graph data has emerged as a critical challenge. While the efficacy of auxiliary datasets in enhancing OOD detection has been extensively studied for image and text data, such approaches have not yet been explored for graph data. Unlike Euclidean data, graph data exhibits greater diversity but lower robustness to perturbations, complicating the integration of outliers. To tackle these challenges, we propose the introduction of Hybrid External and Internal Graph Outlier Exposure (HGOE) to improve graph OOD detection performance. Our framework involves using realistic external graph data from various domains and synthesizing internal outliers within ID subgroups to address the poor robustness and presence of OOD samples within the ID class. Furthermore, we develop a boundary-aware OE loss that adaptively assigns weights to outliers, maximizing the use of high-quality OOD samples while minimizing the impact of low-quality ones. Our proposed HGOE framework is model-agnostic and designed to enhance the effectiveness of existing graph OOD detection models. Experimental results demonstrate that our HGOE framework can significantly improve the performance of existing OOD detection models across all 8 real datasets.

Abstract:
Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and localization of regions of interest (ROIs) can assist pathologists in diagnosis. The gigapixel resolution of WSIs and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention incurs quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from the agent-aggregated value, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL's superiority over state-of-the-art methods.

Abstract:
We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task, which focuses on flawlessly incorporating user-specified objects into designated visual contexts regardless of domain disparity. Previous methods often involve either training auxiliary networks or finetuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pre-trained diffusion models. Some recent works attempt to break the barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield desired compositional outcomes. These approaches can only retain some semantic information and usually fall short in preserving identity characteristics of input objects or exhibit limited background-object style adaptation in generated images. In contrast, TALE is a novel method that operates directly in latent space to provide explicit and effective guidance for the composition process to resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, and the latter exploits designated energy functions to further optimize intermediate latents conforming to specific conditions that complement the former to generate desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.

Abstract:
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, our Arondight achieves an average attack success rate of 84.5% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.

Abstract:
Diffusion-based models for story visualization have shown promise in generating content-coherent images for storytelling tasks. However, how to effectively integrate new characters into existing narratives while maintaining character consistency remains an open problem, particularly with limited data. Two major limitations hinder progress: (1) the absence of a suitable benchmark due to potential character leakage and inconsistent text labeling, and (2) the challenge of distinguishing between new and old characters, leading to ambiguous results. To address these challenges, we introduce the NewEpisode benchmark, comprising refined datasets designed to evaluate generative models' adaptability in generating new stories with fresh characters using just a single example story. The refined dataset involves refined text prompts and eliminates character leakage. Additionally, to mitigate the character confusion of generated results, we propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics. EpicEvo introduces a novel adversarial character alignment module to progressively align the generated images with exemplar images of new characters during the diffusion process, while applying knowledge distillation to prevent forgetting of characters and background details. Our evaluation quantitatively demonstrates that EpicEvo outperforms existing baselines on the NewEpisode benchmark, and qualitative studies confirm its superior customization of visual story generation in diffusion models. In summary, EpicEvo provides an effective way to incorporate new characters using only one example story, unlocking new possibilities for applications such as serialized cartoons.

Abstract:
Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias within image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel evaluation framework that can accurately, automatically, and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender-, race-, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on significant changes related to gender, race, and age. BiasPainter adopts the key insight that these characteristics should not be modified when subjected to neutral prompts. Built upon this design, BiasPainter can trigger social bias and evaluate the fairness of image generation models. We use BiasPainter to evaluate six widely used image generation models, such as Stable Diffusion and Midjourney. Experimental results show that BiasPainter can successfully trigger social bias in image generation models. According to our human evaluation, BiasPainter can achieve 90.8% accuracy on automatic bias detection, which is significantly higher than the results reported in previous work.

Abstract:
Tracking by detection has been the prevailing paradigm in the field of Multi-object Tracking (MOT). These methods typically rely on the Kalman Filter to estimate the future locations of objects, assuming linear object motion. However, they fall short when tracking objects exhibiting nonlinear and diverse motion in scenarios like dancing and sports. In addition, there has been limited focus on utilizing learning-based motion predictors in MOT. To address these challenges, we explore data-driven motion prediction methods. Inspired by the promise of state space models (SSMs), such as Mamba, in long-term sequence modeling with near-linear complexity, we introduce a Mamba-based motion model named Mamba moTion Predictor (MTP). MTP is designed to model the complex motion patterns of objects like dancers and athletes. Specifically, MTP takes the spatial-temporal location dynamics of objects as input, captures the motion pattern using a bi-Mamba encoding layer, and predicts the next motion. In real-world scenarios, objects may be missed due to occlusion or motion blur, leading to premature termination of their trajectories. To tackle this challenge, we further expand the application of MTP. We employ it in an autoregressive way to compensate for missing observations by utilizing its own predictions as inputs, thereby contributing to more consistent trajectories. Our proposed tracker, MambaTrack, demonstrates advanced performance on benchmarks such as DanceTrack and SportsMOT, which are characterized by complex motion and severe occlusion.
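
The autoregressive use of a motion predictor described above can be illustrated with a minimal sketch. This is not MambaTrack itself: `predict_next` is a hypothetical constant-velocity stand-in for the learned MTP model, and `fill_gaps` is an assumed helper name. The only point shown is that, when a detection is missing, the predictor's own output is appended to the history and feeds the next prediction.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def predict_next(history: List[Point]) -> Point:
    """Hypothetical stand-in for a learned predictor: constant-velocity extrapolation."""
    if len(history) < 2:
        return history[-1]  # not enough history to estimate a velocity
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def fill_gaps(observations: List[Optional[Point]]) -> List[Point]:
    """Replace missing detections (None) with the predictor's own outputs."""
    track: List[Point] = []
    for obs in observations:
        if obs is not None:
            track.append(obs)
        elif track:
            # Autoregressive step: the prediction joins the history and
            # therefore becomes an input to the next prediction.
            track.append(predict_next(track))
        else:
            raise ValueError("a track must start with an observed position")
    return track


if __name__ == "__main__":
    print(fill_gaps([(0.0, 0.0), (1.0, 1.0), None, None, (4.0, 4.0)]))
```

A learned predictor would replace `predict_next` without changing the gap-filling loop.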

Abstract:
While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers within a singular process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers, respectively. Additionally, our model can utilize speakers with complete audio-visual information to mitigate other visual-deficient speakers, thereby enhancing its resilience to missing visual cues. We also conduct experiments where visual information for specific speakers is entirely absent or visual frames are partially missing. The results demonstrate that our model consistently outperforms others, exhibiting the smallest performance drop across all settings involving 2, 3, 4, and 5 speakers.

Abstract:
The recognition of named entities in visually-rich documents (VrD-NER) plays a critical role in various real-world scenarios and applications. However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. The UNER head considers the VrD-NER task as a combination of sequence labeling and reading order prediction, effectively addressing the issues of discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on various VrD-NER datasets to enhance the document transformer backbones and exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-linguistic scenarios and exhibits zero-shot entity extraction abilities.

Abstract:
Live streaming has experienced significant growth recently. Yet this rise in popularity contrasts with the reality that a substantial segment of the global population still lacks Internet access. The emergence of Low Earth orbit Satellite Networks (LSNs), such as SpaceX's Starlink and Amazon's Project Kuiper, presents a promising solution to fill this gap. Nevertheless, our measurement study reveals that existing live streaming platforms may not be able to deliver a smooth viewing experience on LSNs due to frequent satellite handovers, which lead to frequent video rebuffering events. Current state-of-the-art learning-based Adaptive Bitrate (ABR) algorithms, even when trained on LSNs' network traces, fail to manage the abrupt network variations associated with satellite handovers effectively. To address these challenges, for the first time, we introduce Satellite-Aware Rate Adaptation (SARA), a versatile and lightweight middleware that can seamlessly integrate with various ABR algorithms to enhance the performance of live streaming over LSNs. SARA intelligently modulates video playback speed and furnishes ABR algorithms with insights derived from the distinctive network characteristics of LSNs, thereby aiding ABR algorithms in making informed bitrate selections and effectively minimizing rebuffering events that occur during satellite handovers. Our extensive evaluation shows that SARA can effectively reduce the rebuffering time by an average of 39.41% and slightly improve latency by 0.65% while only introducing an overall loss in bitrate by 0.13%.

Abstract:
The preservation of cultural heritage, as mandated by the United Nations Sustainable Development Goals (SDGs), is integral to sustainable urban development. This paper focuses on the Dragon Boat Festival, a prominent event in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR) to enhance its preservation and accessibility. Traditionally, participation in the festival's dragon boat races was limited to elite athletes, excluding broader demographics. Our proposed solution, named MetaDragonBoat, enables virtual participation in dragon boat racing, offering immersive experiences that replicate physical exertion through a cultural journey. To this end, we build a digital twin of a university campus located in a region with a rich dragon boat racing tradition. Coupled with three paddling techniques that are enabled by either commercial controllers or physical paddle controllers with haptic feedback, diverse users can engage in realistic rowing experiences. Our results demonstrate that by integrating resistance into the paddle controls, users could simulate the physical effort of dragon boat racing, promoting a deeper understanding and appreciation of this cultural heritage.

Abstract:
Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods.

Abstract:
Embodied intelligence empowers agents with a profound sense of perception, enabling them to respond in a manner closely aligned with real-world situations. Large Language Models (LLMs) delve into language instructions with depth, serving a crucial role in generating plans for intricate tasks. Thus, LLM-based embodied models further enhance the agent's capacity to comprehend and process information. However, this amalgamation also ushers in new challenges in the pursuit of heightened intelligence. Specifically, attackers can manipulate LLMs to produce irrelevant or even malicious outputs by altering their prompts. Confronted with this challenge, we observe a notable absence of multi-modal datasets essential for comprehensively evaluating the robustness of LLM-based embodied models. Consequently, we construct the Embodied Intelligent Robot Attack Dataset (EIRAD), tailored specifically for robustness evaluation. Additionally, two attack strategies are devised, including untargeted attacks and targeted attacks, to effectively simulate a range of diverse attack scenarios. At the same time, during the attack process, to more accurately ascertain whether our method is successful in attacking the LLM-based embodied model, we devise a new attack success evaluation method utilizing the BLIP2 model. Recognizing the time and cost-intensive nature of the GCG algorithm in attacks, we devise a scheme for prompt suffix initialization based on various target tasks, thus expediting the convergence process. Experimental results demonstrate that our method exhibits a superior attack success rate when targeting LLM-based embodied models, indicating a lower level of decision-level robustness in these models.

Abstract:
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, (2) cross-modal alignment loss to learn the fine-grained correlation between query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art approaches in VTG benchmarks, indicating that holistic text understanding guides the model to focus on the semantically important parts within the video.
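
A frame-level gate driven by a sentence-level embedding can be sketched as follows. The abstract does not give the gate's exact form, so everything here is an assumption for illustration: the gate is taken to be a sigmoid of the dot product between each frame feature and a single holistic sentence vector, and `gate_frames` is a hypothetical name.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gate_frames(frames: List[List[float]], sentence: List[float]) -> List[List[float]]:
    """Scale each frame feature by a gate in (0, 1) computed from the whole-sentence vector."""
    gated = []
    for frame in frames:
        # Assumed gate: similarity between the frame and the sentence embedding.
        score = sum(f * s for f, s in zip(frame, sentence))
        g = sigmoid(score)
        gated.append([g * f for f in frame])
    return gated


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]
    sentence = [0.0, 5.0]  # toy sentence embedding aligned with the second frame
    print(gate_frames(frames, sentence))
```

In this toy input the second frame, which aligns with the sentence vector, keeps almost all of its magnitude, while the unrelated first frame is halved.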

Abstract:
Existing studies for gait recognition primarily utilize sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To combine the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Finally, comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions like occlusions and cloth changes.

Abstract:
Despite the growing prevalence of black-box pre-trained models (PTMs) such as prediction API services, there remains a significant challenge in directly applying general models to real-world scenarios due to the data distribution gap. Considering a data deficiency and constrained computational resource scenario, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two novel training techniques. First, we align the input space (i.e., image) of PTMs to the target data distribution by generating visual prompts of spatial and frequency domain. Along with the novel spatial-frequency hybrid visual prompter, we design a novel training technique based on probabilistic clusters, which can enhance class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in a few-shot transfer learning setting across extensive visual recognition datasets, surpassing state-of-the-art baselines. Additionally, we show that the proposed method efficiently reduces computational costs for training and inference phases.

Abstract:
DNN-based watermarking methods are rapidly developing and delivering impressive performance. Recent advances achieve resolution-agnostic image watermarking by reducing the variable-resolution watermarking problem to a fixed-resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework that watermarks the implicit neural representation (INR) of an image. Unlike previous methods, our method does not rely on the reduction process: it directly watermarks the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with various resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled image with arbitrary resolution. By directly watermarking the INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: bit accuracy improves by 7% to 29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g., JPEG, Crop, or Resize), while ours is robust against all watermarking attacks.
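
The property that an INR can be sampled at any resolution can be shown with a small sketch. A fitted INR is a network mapping coordinates to pixel values; here `toy_inr` is a hypothetical analytic stand-in for such a network, and `sample_image` is an assumed helper that evaluates it at pixel centers. No watermarking is performed; the sketch only illustrates resolution-agnostic sampling.

```python
import math
from typing import Callable, List


def toy_inr(x: float, y: float) -> float:
    """Hypothetical stand-in for a fitted INR: (x, y) in [0, 1]^2 -> intensity in [0, 1]."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)


def sample_image(inr: Callable[[float, float], float], width: int, height: int) -> List[List[float]]:
    """Evaluate the continuous signal at the pixel centers of a width x height grid."""
    return [
        [inr((col + 0.5) / width, (row + 0.5) / height) for col in range(width)]
        for row in range(height)
    ]


if __name__ == "__main__":
    small = sample_image(toy_inr, 4, 3)    # 4 x 3 rendering
    large = sample_image(toy_inr, 64, 32)  # same signal, 64 x 32 rendering
    print(len(small), len(small[0]), len(large), len(large[0]))
```

Because both images come from the same continuous function, any change made to that function would appear in every sampled resolution.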

Abstract:
This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.

Abstract:
With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacking two modalities, FGA-T achieves superior attack effects against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.

Abstract:
Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations", i.e., inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.

Abstract:
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel efficient training strategy, processing with visual speech units. Through analysis, we confirm that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we propose to pre-train a VSR model to predict corresponding text outputs on multilingual data. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, we can greatly improve the training efficiency compared to the standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. In addition, to stabilize the training, we apply curriculum learning where the inputs of the system begin with audio-visual speech units and gradually transition to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models, with a single trained model.

Abstract:
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models. Despite the success of some conditional methods, previous methods may neglect the importance of a sufficiently formulated task-specific condition strategy, resulting in suboptimal visual outcomes. In this study, we propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as additional pre-processing conditions to regulate the generating capabilities of the diffusion model. We first leverage a pre-trained decomposition network to generate the Retinex prior, which is refined by an adjustment network and integrated into a refinement network to implement Retinex-based conditional generation at both the feature and image levels. Moreover, the semantic prior is extracted from the input image with an off-the-shelf semantic segmentation model and incorporated through semantic attention layers. By treating Retinex- and semantic-based priors as the condition, JoReS-Diff presents a unique perspective for establishing a diffusion model for LLIE and similar image enhancement tasks. Extensive experiments validate the rationality and superiority of our approach.

Abstract:
Social media has become ubiquitous for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging due to the implicit and commonsense nature of these intentions, the need for cross-modality understanding of both text and images, and the presence of noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, our approach uses an MLLM to interpret the image, an LLM to extract key information from the text, and another LLM to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. Moreover, we conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation. We further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have constituted a significant leap forward in the field, particularly in the processing of videos, which involves inherent challenges such as spatiotemporal relationships. However, existing MLLMs are predominantly focused on the comprehension of video inputs, with limited capabilities in generating video content. In this paper, we present GPT4Video, a unified framework that seamlessly and lightly integrates with LLMs, visual feature extractors, and stable diffusion generative models for cohesive video understanding and generation. Moreover, we explore a text-only finetuning approach to equip models for instruction-following and safeguarding in multimodal conversations, enhancing training efficiency and generalization capabilities. Additionally, we construct multi-turn and caption-interleaved datasets for finetuning and benchmarking MLLMs, which serve as solid resources for advancing this field. Through quantitative and qualitative assessments, GPT4Video demonstrates the following advantages: 1) The framework incorporates video generation ability without adding extra training parameters, ensuring seamless compatibility with various video generators. 2) The model achieves superior performance across a variety of benchmarks. For instance, it outperforms Valley by 11.8% on video question answering, and surpasses NExT-GPT by 2.3% on text-to-video generation. 3) As safety pioneers in open-source MLLMs, we developed finetuning and evaluation datasets, securing an F1 score exceeding 80% in blocking harmful content during video understanding and generation. In general, GPT4Video shows potential to function as a real-life assistant, marked by its effectiveness, adaptability, and safety.

Abstract:
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding, they handle VideoQA inadequately by simply taking uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, there are no human annotations for question-critical timestamps in existing VideoQA datasets. In light of this, we propose a weakly supervised framework that forces LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we fuse the question and answer pairs as event descriptions to find multiple keyframes as target moments and pseudo-labels. With these pseudo-labeled keyframes as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian masks to characterize the temporal structure of the video, and samples question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.
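
The idea of using Gaussian masks over time to pick question-critical frames can be sketched as follows. This is an illustration under stated assumptions, not the GCG module: the Gaussian centers and widths are fixed hypothetical values rather than learned parameters, frames are chosen by top-k weight rather than by the paper's sampling procedure, and the contrastive objective is omitted. The names `gaussian_mask` and `select_frames` are invented for the example.

```python
import math
from typing import List, Tuple


def gaussian_mask(num_frames: int, center: float, width: float) -> List[float]:
    """Weight each frame by a Gaussian over normalized time in [0, 1]."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [
        math.exp(-0.5 * ((t / (num_frames - 1) - center) / width) ** 2)
        for t in range(num_frames)
    ]


def select_frames(num_frames: int, gaussians: List[Tuple[float, float]], k: int) -> List[int]:
    """Sum several Gaussian masks and return the k highest-weighted frame indices."""
    combined = [0.0] * num_frames
    for center, width in gaussians:
        for t, w in enumerate(gaussian_mask(num_frames, center, width)):
            combined[t] += w
    top = sorted(range(num_frames), key=lambda t: combined[t], reverse=True)[:k]
    return sorted(top)  # return indices in temporal order


if __name__ == "__main__":
    # Two assumed moments of interest, near 20% and 80% of the video.
    print(select_frames(11, [(0.2, 0.05), (0.8, 0.05)], k=2))
```

Using several narrow Gaussians lets the selection cover multiple separate moments instead of one contiguous window.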

Abstract:
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive results in simultaneously harnessing image and text information, they do not consider possible low-quality and missing modalities. In real-world applications, these issues might occur frequently, creating an urgent need for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the consistent improvements of DRF compared to SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.
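
A per-modality feature queue that approximates a feature distribution can be sketched with a bounded deque. The abstract does not specify how quality is computed from the distribution, so the `deviation` score below (mean absolute z-score) is purely an assumed proxy, and `FeatureQueue` is a hypothetical class name.

```python
from collections import deque
from statistics import fmean, pstdev
from typing import List, Sequence


class FeatureQueue:
    """Keeps the most recent features of one modality and summarizes their distribution."""

    def __init__(self, maxlen: int) -> None:
        self._items: deque = deque(maxlen=maxlen)  # oldest entries drop out automatically

    def push(self, feature: Sequence[float]) -> None:
        self._items.append(list(feature))

    def __len__(self) -> int:
        return len(self._items)

    def mean(self) -> List[float]:
        return [fmean(dim) for dim in zip(*self._items)]

    def std(self) -> List[float]:
        return [pstdev(dim) for dim in zip(*self._items)]

    def deviation(self, feature: Sequence[float], eps: float = 1e-8) -> float:
        """Assumed quality proxy: mean absolute z-score under the queue statistics."""
        mu, sigma = self.mean(), self.std()
        return fmean(abs(f - m) / (s + eps) for f, m, s in zip(feature, mu, sigma))


if __name__ == "__main__":
    queue = FeatureQueue(maxlen=3)
    for value in (0.0, 2.0, 4.0, 6.0):
        queue.push([value, value])
    print(queue.mean(), queue.deviation([5.0, 5.0]), queue.deviation([10.0, 10.0]))
```

A feature far from the queue's statistics gets a larger deviation, which a fusion step could use to down-weight that modality.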

Abstract:
In this work, we focus on exploring explicit fine-grained control of generative facial image editing while generating faithful facial appearances and consistent semantic details, which is quite challenging and has not been extensively explored, especially under a one-shot scenario. We identify the key challenge as the exploration of disentangled conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework, named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE to generate spatial-wise explicit control conditions based on estimated 3DMM parameters. Different from current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability and accordingly propose a random semantic masking (RSM) strategy to effectively achieve independent training of Exp-FaceNet. This setting endows the model with disentangled face control while reducing semantic information shift in editing. Our model can be trained using 2D in-the-wild portrait images without requiring 3D or video data and can perform robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace can generate realistic facial images with better editing accuracy and identity preservation than SOTA methods.

Abstract:
Hate speech is a pressing issue in modern society, with significant effects both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents MultiHateClip, a novel multilingual dataset created through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, including content in both English and Chinese. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as VLM, GPT-4V and Qwen-VL, on MultiHateClip highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. MultiHateClip represents a foundational advance in enhancing hateful video detection by underscoring the necessity of a multimodal and culturally sensitive approach in combating online hate speech.

Abstract:
Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Therefore, existing methods have tried to perform these two tasks simultaneously, so as to leverage the mutually beneficial relationship between both tasks. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and joint learning optimization. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework and can use variant models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which addresses the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate the great generality of MAJL in adapting to different model architectures.

Abstract:
We present ObjBlur, a novel curriculum learning approach to improve layout-to-image generation models, where the task is to produce realistic images from layouts composed of boxes and labels. Our method, based on progressive object-level blurring, effectively stabilizes training and enhances the quality of generated images. This strategy systematically applies varying degrees of blurring to individual objects or the background during training, starting with strong blurring and progressing to cleaner images. Our findings reveal significant performance improvements, stabilized training, smoother convergence, and reduced variance between multiple runs. Moreover, our technique is compatible with generative adversarial networks and diffusion models, highlighting its versatility across generative modeling paradigms. We reach new state-of-the-art results on the complex COCO and Visual Genome datasets.
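
The idea of progressive object-level blurring can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear schedule, the box blur, the function names `blur_radius` and `box_blur_region`, and the `(top, left, bottom, right)` box format are hypothetical and are not taken from the paper.

```python
def blur_radius(step, total_steps, max_radius=4):
    """Linearly decay the blur radius from max_radius to 0 over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return round(max_radius * (1.0 - progress))


def box_blur_region(image, box, radius):
    """Box-blur only the pixels inside box = (top, left, bottom, right).

    image is a list of rows of floats; bottom and right are exclusive.
    Pixels outside the box are returned unchanged.
    """
    out = [row[:] for row in image]
    if radius <= 0:
        return out
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    for y in range(top, bottom):
        for x in range(left, right):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out
```

In a training loop, one would call `blur_radius(step, total_steps)` at each step and apply `box_blur_region` to the chosen object boxes, so early batches contain heavily blurred objects and late batches contain clean ones.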

Abstract:
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs). This task presents substantial challenges stemming from the unpredictable nature of traffic accidents, their long-tail distribution, the intricacies of traffic scene dynamics, and the inherently constrained field of vision of onboard cameras. To address these challenges, this study introduces a novel accident anticipation framework for AVs, termed CRASH. It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion. Specifically, we develop the object-aware module to prioritize high-risk objects in complex and ambiguous environments by calculating the spatial-temporal relationships between traffic agents. In parallel, the context-aware module is devised to extend global visual information from the temporal to the frequency domain using the Fast Fourier Transform (FFT) and capture fine-grained visual features of potential objects and broader context cues within traffic scenes. To capture a wider range of visual cues, we further propose a multi-layer fusion that dynamically computes the temporal dependencies between different scenes and iteratively updates the correlations between different visual features for accurate and timely accident prediction. Evaluated on three real-world datasets, namely the Dashcam Accident Dataset (DAD), the Car Crash Dataset (CCD), and the AnAn Accident Detection (A3D) dataset, our model surpasses existing top baselines in critical evaluation metrics like Average Precision (AP) and mean Time-To-Accident (mTTA). Importantly, its robustness and adaptability are particularly evident in challenging driving scenarios with missing or limited training data, demonstrating significant potential for application in real-world autonomous driving systems.

Abstract:
Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities. However, the visual module introduces new challenges in terms of robustness for LVLMs, as attackers can craft adversarial images that are visually clean but may mislead the model to generate incorrect answers. In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively. Therefore, we are curious about one question: Can LVLMs still generate correct responses when the encoded visual tokens are attacked and the visual information is disrupted? To this end, we propose a non-targeted attack method referred to as VT-Attack (Visual Tokens Attack), which constructs adversarial examples from multiple perspectives, with the goal of comprehensively disrupting feature representations and inherent relationships as well as the semantic properties of visual tokens output by image encoders. Using only access to the image encoder in the proposed attack, the generated adversarial examples exhibit transferability across diverse LVLMs utilizing the same image encoder and generality across different tasks. Extensive experiments validate the superior attack performance of the VT-Attack over baseline methods, demonstrating its effectiveness in attacking LVLMs with image encoders, which in turn can provide guidance on the robustness of LVLMs, particularly in terms of the stability of the visual feature space.

Abstract:
As short-form video-sharing platforms become a significant channel for news consumption, fake news in short videos has emerged as a serious threat in the online information ecosystem, making developing detection methods for this new scenario an urgent need. Compared with that in text and image formats, fake news on short video platforms contains rich but heterogeneous information in various modalities, posing a challenge to effective feature utilization. Unlike existing works mostly focusing on analyzing what is presented, we introduce a novel perspective that considers how it might be created. Through the lens of the creative process behind news video production, our empirical analysis uncovers the unique characteristics of fake news videos in material selection and editing. Based on the obtained insights, we design FakingRecipe, a creative process-aware model for detecting fake news short videos. It captures the fake news preferences in material selection from sentimental and semantic aspects and considers the traits of material editing from spatial and temporal aspects. To improve evaluation comprehensiveness, we first construct FakeTT, an English dataset for this task, and conduct experiments on both FakeTT and the existing Chinese FakeSV dataset. The results show FakingRecipe's superiority in detecting fake news on short video platforms.

Abstract:
Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of them. However, the scarcity of datasets and the lack of a modern baseline hinder progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, excelling in accurately estimating and decomposing facial flow into head and expression components. Comprehensive experiments demonstrate that FFN significantly enhances the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in Endpoint Error (EPE) (from 3.91 to 3.48). Moreover, DecFlow, when coupled with FFN, outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis. The decomposed expression flow achieves a substantial accuracy improvement of 18% (from 69.1% to 82.1%) in micro-expression recognition. These contributions represent a significant advancement in facial motion analysis and optical flow estimation. Codes and datasets can be found here.
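
As a reading aid, the sketch below shows the standard definition of Endpoint Error (the mean Euclidean distance between predicted and ground-truth flow vectors) and how the percentages quoted in this abstract relate to the raw numbers. The function names are hypothetical and the code is not taken from the paper.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    Each flow is a list of (u, v) displacement tuples, one per pixel.
    """
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    dists = [
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    ]
    return sum(dists) / len(dists)


def relative_change(before, after):
    """Signed relative change from before to after."""
    return (after - before) / before
```

With these helpers, `relative_change(3.91, 3.48)` is about -0.11, matching the stated 11% EPE reduction, and `relative_change(69.1, 82.1)` is about 0.188, so the accuracy gain quoted as 18% is a relative figure corresponding to 13 percentage points.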

Abstract:
Facial Expression Recognition (FER) holds significant importance in human-computer interactions. Existing cross-domain FER methods often transfer knowledge solely from a single labeled source domain to an unlabeled target domain, neglecting the comprehensive information across multiple sources. Nevertheless, cross-multidomain FER (CMFER) is very challenging for (i) the inherent inter-domain shifts across multiple domains and (ii) the intra-domain shifts stemming from the ambiguous expressions and low inter-class distinctions. In this paper, we propose a novel Learning with Alignments CMFER framework, named LA-CMFER, to handle both inter- and intra-domain shifts. Specifically, LA-CMFER is constructed with a global branch and a local branch to extract features from the full images and local subtle expressions, respectively. Based on this, LA-CMFER presents a dual-level inter-domain alignment method to force the model to prioritize hard-to-align samples in knowledge transfer at a sample level while gradually generating a well-clustered feature space with the guidance of class attributes at a cluster level, thus narrowing the inter-domain shifts. To address the intra-domain shifts, LA-CMFER introduces a multi-view intra-domain alignment method with a multi-view clustering consistency constraint where a prediction similarity matrix is built to pursue consistency between the global and local views, thus refining pseudo labels and eliminating latent noise. Extensive experiments on six benchmark datasets have validated the superiority of our LA-CMFER.

Abstract:
Many contrastive learning based models have achieved advanced performance in image-text matching tasks. The key of these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities are from different models or modules, and there is a significant modality gap. Directly interacting such embeddings lacks rationality and may capture inaccurate correlation. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure the correlation calculation is based on interactions of similar information. (2) The spatial constraints of inter- and intra-modality unmatched pairs are introduced to ensure the effectiveness of semantic alignment of the model. Besides, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlation. Extensive experiments demonstrate the superiority of DIAS, achieving 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.

Abstract:
The fusion of images from dual camera systems featuring a wide-angle and a telephoto camera has become a hotspot problem recently. By integrating simultaneously captured wide-angle and telephoto images from these systems, the resulting fused image achieves a wide field of view (FOV) coupled with high-definition quality. Existing approaches are mostly deep learning methods, and predominantly rely on supervised learning, where the training dataset plays a pivotal role. However, current datasets typically adopt a data synthesis approach, where the wide-angle inputs are synthesized rather than captured using real wide-angle cameras, and the ground-truth image is captured by wide-angle cameras whose quality is substantially lower than that of input telephoto images captured by telephoto cameras. To address these limitations, we introduce a novel hardware setup utilizing a beam splitter to simultaneously capture three images, i.e. input pairs and ground-truth images, from two authentic cellphones equipped with wide-angle and telephoto dual cameras. Specifically, the wide-angle and telephoto images captured by cellphone 2 serve as the input pair, while the telephoto image captured by cellphone 1, which is calibrated to match the optical path of the wide-angle image from cellphone 2, serves as the ground-truth image, maintaining quality on par with the input telephoto image. Experiments validate the efficacy of our newly introduced dataset, named ReWiTe, which can significantly enhance the performance of various existing methods for the real-world wide-angle and telephoto dual image fusion task.

Abstract:
ControlNet excels at creating content that closely matches precise contours in user-provided masks. However, when these masks contain noise, as frequently happens with non-expert users, the output includes unwanted artifacts. This paper first highlights the crucial role of controlling the impact of these inexplicit masks with diverse deterioration levels through in-depth analysis. Subsequently, to enhance controllability with inexplicit masks, an advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised. The deterioration estimator assesses the deterioration factor of the provided masks. Then this factor is used in a modulation block to adaptively adjust the model's contour-following ability, which helps it dismiss the noise part in the inexplicit masks. Extensive experiments prove its effectiveness in encouraging ControlNet to interpret inaccurate spatial conditions robustly rather than blindly following the given contours, suitable for diverse kinds of conditions. We showcase application scenarios like modifying shape priors and composable shape-controllable generation. Codes are available at github.

Abstract:
Recently, stable diffusion (SD) models have flourished in the field of image synthesis and personalized editing, with a range of photorealistic and unprecedented images being successfully generated. As a result, widespread interest has been ignited to develop and use various SD-based tools for visual content creation. However, the exposure of AI-created content on public platforms could raise both legal and ethical risks. In this regard, the traditional methods of adding watermarks to the already generated images (i.e., post-processing) may face a dilemma (e.g., being erased or modified) in terms of copyright protection and content monitoring, since the powerful image inversion and text-to-image editing techniques have been widely explored in SD-based methods. In this work, we propose a Safe and high-traceable Stable Diffusion framework (namely Safe-SD) to adaptively implant the graphical watermarks (e.g., QR code) into the imperceptible structure-related pixels during the generative diffusion process for supporting text-driven invisible watermarking and detection. Different from the previous high-cost injection-then-detection training framework, we design a simple and unified architecture, which makes it possible to simultaneously train watermark injection and detection in a single network, greatly improving the efficiency and convenience of use. Moreover, to further support text-driven generative watermarking and deeply explore its robustness and high-traceability, we elaborately design a λ-sampling and λ-encryption algorithm to fine-tune a latent diffuser wrapped by a VAE for balancing high-fidelity image synthesis and high-traceable watermark detection. We present our quantitative and qualitative results on the representative datasets LSUN, COCO, and FFHQ, demonstrating state-of-the-art performance of Safe-SD and showing it significantly outperforms the previous approaches.

Abstract:
For interacting with mobile objects in unfamiliar environments, simultaneously locating, mapping, and tracking the 3D poses of multiple objects is crucial. This paper proposes a Tracklet Graph and Query Graph-based framework, i.e., GSLAMOT, to address this challenge. GSLAMOT utilizes camera and LiDAR multimodal information as inputs and divides the representation of the dynamic scene into a semantic map for representing the static environment, a trajectory of the ego-agent, and an online maintained Tracklet Graph (TG) for tracking and predicting the 3D poses of the detected mobile objects. A Query Graph (QG) is constructed in each frame by object detection to query and update TG. For accurate object association, a Multi-criteria Star Graph Association (MSGA) method is proposed to find matched objects between the detections in QG and the predicted tracklets in TG. Then, an Object-centric Graph Optimization (OGO) method is proposed to simultaneously optimize the TG, the semantic map, and the agent trajectory. It triangulates the detected objects into the map to enrich the map's semantic information. We address the efficiency issues to handle the three tightly coupled tasks in parallel. Experiments are conducted on KITTI, Waymo, and an emulated Traffic Congestion dataset that highlights challenging scenarios. Experiments show that GSLAMOT enables accurate crowded object tracking while conducting SLAM accurately in challenging scenarios, outperforming the state-of-the-art methods. The code and dataset are at https://gslamot.github.io.

Abstract:
The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: While AI-generated contents have played a crucial role in a wide range of AI models, the potential hidden risks they introduce have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC hallucination bias: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even though these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations on Q-former and Linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.

Abstract:
This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained either in a single source domain or multiple source domains using our learning strategy. This strategy starts with learning domain invariant features from the text, followed by learning sparse domain-agnostic features from videos, assisted by the selected features learned in text. Our experimental results demonstrate that our model achieves significantly better performance than the state-of-the-art approaches on average in both single-source and multi-source settings. Our feature selection procedure favors the features that are independent of each other and are strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be publicly available upon acceptance.

Abstract:
Deep Video Quality Assessment (VQA) methods have shown impressive high-performance capabilities. Notably, no-reference (NR) VQA methods play a vital role in situations where obtaining reference videos is restricted or not feasible. Nevertheless, as more streaming videos are being created in ultra-high definition (e.g., 4K) to enrich viewers' experiences, the current deep VQA methods face unacceptable computational costs. Furthermore, the resizing, cropping, and local sampling techniques employed in these methods can compromise the details and content of original 4K videos, thereby negatively impacting quality assessment. In this paper, we propose a highly efficient and novel NR 4K VQA technology. Specifically, first, a novel data sampling and training strategy is proposed to tackle the problem of excessive resolution. This strategy allows the VQA Swin Transformer-based model to effectively train and make inferences using the full data of 4K videos on standard consumer-grade GPUs without compromising content or details. Second, a weighting and scoring scheme is developed to mimic the human subjective perception mode, which is achieved by considering the distinct impact of each sub-region within a 4K frame on the overall perception. Third, we incorporate the frequency domain information of video frames to better capture the details that affect video quality, consequently further improving the model's generalizability. To our knowledge, this is the first technology for the NR 4K VQA task. Thorough empirical studies demonstrate it not only significantly outperforms existing methods on a specialized 4K VQA dataset but also achieves state-of-the-art performance across multiple open-source NR video quality datasets.

Abstract:
Video Quality Assessment (VQA) aims to simulate the process of perceiving video quality by the Human Visual System (HVS). Although subjective studies have shown that the judgments of HVS are strongly influenced by human feelings, it remains unclear how video content relates to human feelings. The recent rapid development of Vision-Language pre-trained models (VLM) has established a solid link between language and vision. Since human feelings can be accurately described by language, VLM can extract information related to human feelings from visual content with linguistic prompts. In this paper, we propose CLiF-VQA, which innovatively utilizes the visual linguistic capabilities of VLM to introduce human feelings features based on traditional spatio-temporal features to more accurately simulate the perceptual process of HVS. In order to efficiently extract features related to human feelings from videos, we pioneer the exploration of the consistency between Contrastive Language-Image Pre-training (CLIP) and human feelings in video perception. In addition, we design effective prompts, i.e., a variety of objective and subjective descriptions closely related to human feelings. Extensive experiments show that the proposed CLiF-VQA exhibits excellent performance on several VQA datasets. The results show that introducing human feelings features on top of spatio-temporal features is an effective way to obtain better performance.

Abstract:
As autonomous driving systems increasingly become part of daily transportation, the ability to accurately anticipate and mitigate potential traffic accidents is paramount. Traditional accident anticipation models primarily utilizing dashcam videos are adept at predicting when an accident may occur but fall short in localizing the incident and identifying involved entities. Addressing this gap, this study introduces a novel framework that integrates Large Language Models (LLMs) to enhance predictive capabilities across multiple dimensions: what, when, and where accidents might occur. We develop an innovative chain-based attention mechanism that dynamically adjusts to prioritize high-risk elements within complex driving scenes. This mechanism is complemented by a three-stage model that processes outputs from smaller models into detailed multimodal inputs for LLMs, thus enabling a more nuanced understanding of traffic dynamics. Empirical validation on the DAD, CCD, and A3D datasets demonstrates superior performance in Average Precision (AP) and Mean Time-To-Accident (mTTA), establishing new benchmarks for accident prediction technology. Our approach not only advances the technological framework for autonomous driving safety but also enhances human-AI interaction, making predictive insights generated by autonomous systems more intuitive and actionable.

Abstract:
Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters. Despite their success, the existing PETL methods in CV can be computationally expensive and require large amounts of memory and time cost during training, which limits low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter (E3VA) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters which eliminates the need for expensive backpropagation through the frozen pre-trained model, resulting in substantial savings of training memory and training time. Furthermore, we optimise the E3VA structure for CV tasks to promote model performance. Extensive experiments on COCO, ADE20K, and Pascal VOC benchmarks show that E3VA can save up to 62.2% training memory and 26.2% training time on average, while achieving comparable performance to full fine-tuning and better performance than most PETL methods. Note that we can even train the Swin-Large-based Cascade Mask RCNN on GTX 1080Ti GPUs with less than 1.5% trainable parameters.

Abstract:
Recently, the 3D Gaussian Splatting (3D-GS) method has achieved great success in novel view synthesis, providing real-time rendering while ensuring high-quality rendering results. However, this method faces challenges in modeling specular reflections and handling anisotropic appearance components, especially in dealing with view-dependent color under complex lighting conditions. Additionally, 3D-GS uses spherical harmonics to learn the color representation, which has limited ability to represent complex scenes. To overcome these challenges, we introduce Lantent-SpecGS, an approach that utilizes a universal latent neural descriptor within each 3D Gaussian. This enables a more effective representation of 3D feature fields, including appearance and geometry. Moreover, two parallel CNNs are designed to decode the splatting feature maps into diffuse color and specular color separately. A mask that depends on the viewpoint is learned to merge these two colors, resulting in the final rendered image. Experimental results demonstrate that our method obtains competitive performance in novel view synthesis and extends the ability of 3D-GS to handle intricate scenarios with specular reflections.

Abstract:
With the rise of large-scale language models (LLMs), it is currently popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that the current methods of multimodal multi-hop question answering still mainly face two challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance because irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, which makes it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLMs-based approach that does not rely heavily on them, given the LLM's potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework with a focus on facilitating common knowledge sharing across interpretability and prediction tasks while preventing task-specific errors from interfering with each other via mixture of experts. Afterward, we design an iterative feedback mechanism to further enhance both tasks by feeding back the results of the joint training to the LLM for regenerating entailment trees, aiming to iteratively refine the potential answer. Notably, our method has ranked first on the official leaderboard of WebQA (since April 10, 2024), and achieves competitive results on MultimodalQA.

Abstract:
Hypothesis inference, a sophisticated cognitive process that allows humans to construct plausible explanations for incomplete observations, is paramount to our ability to make sense of the world around us. Despite the universality of this skill, it remains under-explored within the context of multi-modal AI, which necessitates analyzing observation, recalling information in the mind, and generating explanations. In this work, we propose the Cross-modal Observation hypothesIs iNference task (COIN). Given a textual description of a partially observed event, COIN strives to recall the most probable event from the visual mind (video pool), and infer the subsequent action flow connecting the visual mind event and the observed textual event. To advance the development of this field, we propose a large-scale text-video dataset, Tex-COIN, that contains 39,796 meticulously annotated hypothesis inference examples and auxiliary commonsense knowledge (appearance, clothing, action, etc.) for key video characters. Based on the proposed Tex-COIN dataset, we design a strong baseline, COINNet, which features two perspectives: 1) aligning temporally displaced textual observations with target videos via transformer-based multi-task learning, and 2) inferring the action flow with non-parametric graph-based inference grounded in graph theory. Extensive experiments on the Tex-COIN dataset validate the effectiveness of our COINNet by significantly outperforming the state-of-the-art methods.

Abstract:
Video-to-audio generation, which aims to generate high-quality audio for silent videos with semantic similarity and temporal synchronization, is crucial for autonomous video editing and post-processing. However, most existing methods mainly focus on matching the semantics of the visual and acoustic modalities while merely considering their temporal alignment in a coarse granularity, thus failing to achieve precise synchronization. In this study, we propose a novel time-aligned video-to-audio framework, called TiVA, to achieve semantic matching and temporal synchronization jointly when generating audio. Given a silent video, our method encodes its visual semantics and predicts an audio layout separately. Then, leveraging the semantic latent embeddings and the predicted audio layout as conditions, it learns a latent diffusion-based audio generator. Comprehensive objective and subjective experiments demonstrate that our method consistently outperforms state-of-the-art methods on semantic matching and temporal synchronization.

Abstract:
Event cameras are novel bio-inspired cameras that record asynchronous events with high temporal resolution and dynamic range. Leveraging the auxiliary temporal information recorded by event cameras holds great promise for the task of video super-resolution (VSR). However, existing event-guided VSR methods assume that the event and RGB cameras are strictly calibrated (e.g., pixel-level sensor designs in DAVIS 240/346). This assumption proves limiting in emerging high-resolution devices, such as dual-lens smartphones and unmanned aerial vehicles, where such precise calibration is typically unavailable. To unlock more event-guided application scenarios, we perform the task of asymmetric event-guided VSR for the first time, and we propose an Asymmetric Event-guided VSR Network (AsEVSRN) for this new task. AsEVSRN incorporates two specialized designs for leveraging the asymmetric event stream in VSR. Firstly, the content hallucination module dynamically enhances event and RGB information by exploiting their complementary nature, thereby adaptively boosting representational capacity. Secondly, the event-enhanced bidirectional recurrent cells align and propagate temporal features fused with features from content-hallucinated frames. Within the bidirectional recurrent cells, event-enhanced flow is employed to simultaneously utilize and fuse temporal information at both the feature and pixel levels. Comprehensive experimental results affirm that our method consistently generates superior quantitative and qualitative results.

Abstract:
The effective denoising demonstrated by the latent diffusion model poses a new threat to image watermarking, as attackers can erase the watermark by performing a forward diffusion, followed by backward denoising. While such denoising might introduce large distortion in the pixel domain, the image semantics remain similar. Unfortunately, most existing robust watermarking methods fail to tackle such an erasure attack since they are primarily designed for traditional channel distortions. To address this issue, this paper proposes DERO, a diffusion-model-erasure robust watermarking framework. Based on the frequency domain analysis of the diffusion model's denoising process, we design a destruction and compensation noise layer (DCNL) to approximate the distortion effects caused by latent diffusion model erasure (LDE). In detail, DCNL consists of a multi-scale low-pass filtering and a white noise compensation process, where the high-frequency components of the image are first obliterated, and then full-frequency components are enriched with white noise. Such a process broadly simulates the LDE distortions. Besides, on the extraction side, we cascade a pre-trained variational autoencoder before the decoder to extract the watermark in the latent domain, which closely adapts to the operation domain of the LDE process. Meanwhile, to improve the robustness of the decoder, we also design a latent feature augmentation (LFA) operation on the latent feature. Through end-to-end training with the DCNL and LFA, DERO can successfully achieve robustness against LDE. Our experimental results demonstrate the effectiveness and the generalizability of the proposed framework. The LDE robustness is significantly improved from 75% with SOTA methods to an impressive 96% with DERO.
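
The two-step structure described for DCNL (low-pass filtering at several scales, then adding white noise) can be sketched as follows. This is only an illustration under stated assumptions: a spatial box filter stands in for the low-pass step, the scales are averaged, and the names `box_low_pass` and `destroy_and_compensate` together with the parameters `radii` and `noise_std` are hypothetical. The paper's actual filters and noise levels are not given in the abstract.

```python
import random


def box_low_pass(image, radius):
    """Low-pass a 2D image (list of rows of floats) with a box filter."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


def destroy_and_compensate(image, radii=(1, 2), noise_std=0.05, seed=None):
    """Average several low-pass scales, then add Gaussian white noise."""
    rng = random.Random(seed)
    filtered = [box_low_pass(image, r) for r in radii]
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Destruction: high frequencies are suppressed by the filters.
            low = sum(f[y][x] for f in filtered) / len(filtered)
            # Compensation: white noise spreads energy over all frequencies.
            row.append(low + rng.gauss(0.0, noise_std))
        out.append(row)
    return out
```

Used as a training-time distortion between a watermark encoder and decoder, a layer of this kind exposes the decoder to images whose fine detail has been removed and replaced by noise, which is the effect the abstract attributes to diffusion-based erasure.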

Abstract:
The metaverse is a developing field that crosses several areas of multimedia research. In this paper, we introduce the 256-MetaverseRecords dataset, a novel and extensive collection of annotated screen recordings in the form of videos from various virtual worlds of the metaverse. We describe the process of creating the dataset, the quality criteria for the annotations, and the exploration of the dataset. We also show four experiments to evaluate the performance of different feature extraction methods for Metaverse Recordings (MVRs): MVR segmentation, audio event detection, and object and interaction detection based on this dataset. Our results demonstrate that existing methods have limitations and leave challenges in dealing with the diversity and complexity of metaverse data, and that more research is needed to develop metaverse-specific techniques. Our dataset can serve as a valuable resource for the research community and foster the development of new applications and solutions for the metaverse.

Abstract:
Multi-view clustering has garnered attention for its effectiveness in addressing heterogeneous data by revealing, without supervision, underlying correlations between different views. As a mainstream method, multi-view graph clustering has attracted increasing attention in recent years. Despite its success, it still has some limitations. Notably, many methods construct the similarity graph without considering the local geometric structure and exploit coarse-grained complementary and consensus information from different views at the view level. To address these shortcomings, we focus on local structure consistency and fine-grained representations across multiple views. Specifically, each view's local consistency similarity graph is obtained through adaptive neighbors. Subsequently, the multi-view similarity tensor is rotated and sliced into fine-grained instance-wise slices. Finally, these slices are fused into the final similarity matrix. Consequently, cross-view consistency can be captured by exploring the intersections of multiple views in an instance-wise manner. We design a collaborative framework with the augmented Lagrangian method to refine all subtasks towards optimal solutions iteratively. Extensive experiments on several multi-view datasets confirm the significant enhancement in clustering accuracy achieved by our method.

Abstract:
Image-to-image translation is defined as the process of learning a mapping between images from a source domain and images from a target domain. The probabilistic structure that maps a fixed initial state to a pinned terminal state through a standard Wiener process is a Brownian bridge. In this paper, we propose a score-based Stochastic Differential Equation (SDE) approach via the Brownian bridges, termed the Amenable Brownian Bridges (A-Bridges), to image-to-image translation tasks as an unconditional diffusion model. Our framework embraces a large family of Brownian bridge models, while the discretization of the linear A-Bridge has the advantage of admitting an explicit closed-form solution, which facilitates model training. Following the guidance of its SDE structure, our model enables accelerated sampling and has achieved record-breaking performance in sample quality and diversity on benchmark datasets.
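
As background for the bridge construction mentioned above, here is a minimal sketch of the standard Brownian bridge only, not the A-Bridge family proposed in the paper. It relies on the textbook fact that a bridge pinned at x0 at time 0 and at xT at time T has a Gaussian marginal at time t with mean x0 + (t/T)(xT - x0) and variance t(T - t)/T. The function name and arguments are hypothetical.

```python
import numpy as np

def brownian_bridge_marginal(x0, xT, t, T=1.0, rng=None):
    """Sample X_t of a standard Brownian bridge pinned at x0 (time 0) and xT (time T)."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0, dtype=float)
    xT = np.asarray(xT, dtype=float)
    mean = x0 + (t / T) * (xT - x0)   # linear interpolation between the endpoints
    var = t * (T - t) / T             # vanishes at t = 0 and t = T
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)
```

Because the variance vanishes at both endpoints, samples at t = 0 and t = T coincide with the source and target, which is what makes the bridge a natural fit for mapping one image domain onto another.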

Abstract:
In today's multimedia-rich environment, the rapid growth of data poses significant challenges for developing efficient multi-modal retrieval systems essential for retrieving text, images, audio, and video. As data expands, newer, scalable, and high-performance retrieval systems are increasingly necessary. Embedding-based deep neural networks (DNNs) have become key solutions, transforming high-dimensional data into lower-dimensional embeddings for easy comparison and retrieval. However, updating DNNs changes the internal feature representations, necessitating the extraction of new feature vectors for all gallery data, which is costly, especially with gallery sets comprising billions of items. Learning backward-compatible representations addresses this by allowing new representations to be matched with old gallery data without recalculating features. This tutorial aims to equip participants with the knowledge and tools to apply backward-compatible representations, enhancing multimedia retrieval systems' efficiency and scalability. Participants will learn the importance of compatible representations, basic methods and techniques, and explore challenging open questions that are becoming increasingly relevant to multimedia and cross-modal retrieval.

Abstract:
Traditional dense video captioning predominantly focuses on edited exocentric footage. These videos are filmed from an external perspective and generally feature distinct transitions between different events, as exemplified in edited instructional videos. However, such videos do not genuinely reflect the way we perceive our real lives. Instead, we observe the world from an egocentric viewpoint and witness only continuous unedited footage. To facilitate further research, we introduce a new topic: Egocentric Vehicle Dense Video Captioning, in classic vehicle driving scenarios. This is a multi-modal, multi-task endeavor aimed at a comprehensive understanding of untrimmed, egocentric driving videos. It consists of three sub-tasks that concentrate on event localization, captioning, and vehicle state estimation separately. To accomplish these tasks, it is necessary to deal with at least three challenges: extracting ego-motion relevant information, describing driving behavior and analyzing the underlying rationale, as well as resolving the boundary ambiguity problem. In response, we devise corresponding solutions, including a vehicle ego-motion learning strategy and a novel adjacent contrastive learning strategy, which effectively address the aforementioned issues. We validate our method by conducting extensive experiments on the BDD-X dataset, which show promising results and new state-of-the-art performance on most metrics, demonstrating the effectiveness of our approach.

Abstract:
Facing the increasing heterogeneity of data in the real world, multi-view learning has become a crucial area of research. Graph Convolutional Networks (GCNs) are powerful for modeling both graph structures and features, making them a focal point in multi-view learning research. However, these methods typically only account for static data dependencies within each view separately when constructing the topology necessary for GCNs, overlooking potential relationships across views in multi-view data. Furthermore, there is a notable absence of theoretical guidance for constructing multi-view data topologies, leading to uncertainty regarding the progression of graph embeddings toward a consistent state. To tackle these challenges, we introduce a framework named energy-constrained multi-view graph diffusion. This approach establishes a mathematical correspondence between multi-view data and GCNs via graph diffusion. It treats multi-view data as a unified entity and devises a feature propagation process with inter-view awareness by considering both inter-view and intra-view feature flow across the entire system. Additionally, an energy function is introduced to guide the inter- and intra-view diffusion, ensuring that the representations converge towards global consistency. The empirical research on several benchmark datasets substantiates the benefits of the proposed method.

Abstract:
Multi-view clustering is an important task in multimedia and machine learning. Within multi-view clustering, multi-view spectral clustering is among the most popular and effective approaches. However, existing multi-view spectral clustering ignores the fairness in the clustering result, which may cause discrimination. To tackle this problem, in this paper, we propose an innovative Fair Multi-view Spectral Clustering (FMSC) method. Firstly, we provide a new perspective of fairness from the graph theory viewpoint, which constructs a relation between fairness and the average degree in graph theory. Secondly, based on this relation, we design a novel fairness-aware regularized term, which has the same form as the ratio cut in spectral clustering. Thirdly, we seamlessly plug this fairness-aware regularized term into the multi-view spectral clustering, leading to our one-stage FMSC, which can directly obtain the final clustering result without any post-processing. We also conduct extensive experiments compared with state-of-the-art fair clustering and multi-view clustering methods, which show that our method can achieve better fairness.
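
For readers unfamiliar with the ratio cut that the fairness-aware term is said to resemble, the sketch below computes the ordinary ratio cut of a partition in two equivalent ways. It does not reproduce the paper's fairness term. The convention assumed here is the sum over clusters of (edge weight leaving the cluster) / (cluster size), without the factor 1/2 that some texts include, and the function names are hypothetical.

```python
import numpy as np

def ratio_cut(W, labels):
    """Sum over clusters of (weight of edges leaving the cluster) / (cluster size)."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        cut_weight = W[np.ix_(inside, ~inside)].sum()
        total += cut_weight / inside.sum()
    return total

def ratio_cut_trace(W, labels):
    """Same quantity written as Tr(H^T L H) with scaled cluster indicators H."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    clusters = np.unique(labels)
    H = np.zeros((len(labels), len(clusters)))
    for j, c in enumerate(clusters):
        inside = labels == c
        H[inside, j] = 1.0 / np.sqrt(inside.sum())
    return float(np.trace(H.T @ L @ H))
```

The trace form is what makes such a term easy to add to a spectral clustering objective: any regularizer with the same structure can be combined with the Laplacian term and optimized jointly.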

Abstract:
Referring video object segmentation (RVOS) is a cross-modal task that aims to segment the target object described by language expressions. A video typically consists of multiple frames and existing works conduct segmentation at either the clip-level or the frame-level. Clip-level methods process a clip at once and segment in parallel, lacking explicit inter-frame interactions. In contrast, frame-level methods facilitate direct interactions between frames by processing videos frame by frame, but they are prone to error accumulation. In this paper, we propose a novel tracking-forced framework, introducing high-quality tracking information and forcing the model to achieve accurate segmentation. Concretely, we utilize the ground-truth segmentation of previous frames as accurate inter-frame interactions, providing high-quality tracking references for segmentation in the next frame. This decouples the current input from the previous output, which enables our model to concentrate on accurately segmenting just based on given tracking information, improving training efficiency and preventing error accumulation. For the inference stage without ground-truth masks, we carefully select the beginning frame to construct tracking information, aiming to ensure accurate tracking-based frame-by-frame object segmentation. With these designs, our tracking-forced method significantly outperforms existing methods on 4 widely used benchmarks by at least 3%. Especially, our method achieves 88.3% P@0.5 accuracy and 87.6 overall IoU score on the JHMDB-Sentences dataset, surpassing previous best methods by 5.0% and 8.0, respectively.

Abstract:
Cross-modal hashing encodes different modalities of multi-modal data into a low-dimensional Hamming space for fast cross-modal retrieval. Most existing cross-modal hashing methods heavily rely on label semantics to boost retrieval performance; however, semantics are expensive to collect in real applications. To mitigate the heavy reliance on semantics, this work proposes a new semi-supervised deep cross-modal hashing method, namely, Graph Convolutional Semi-Supervised Cross-Modal Hashing (GCSCH), which is trained with limited label supervision. The proposed GCSCH first generates pseudo-multi-labels of the unlabeled samples using the simple yet effective idea of consistency regularization and pseudo-labeling. GCSCH designs a fusion network that merges the two modalities and employs Graph Convolutional Network (GCN) to capture semantic information among ground-truth-labeled and pseudo-labeled multi-modal data. Using the idea of knowledge distillation, GCSCH employs a teacher-student learning scheme that can successfully transfer knowledge from the fusion module to the image and text hashing networks. Empirical studies on three multi-modal benchmark datasets demonstrate the superiority of the proposed GCSCH over state-of-the-art cross-modal hashing methods with limited label supervision.
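
To illustrate why retrieval in Hamming space is fast, the following sketch ranks gallery items by Hamming distance to a query code. It covers only the retrieval step, not GCSCH itself; the function name and the assumption that codes are stored as 0/1 arrays are illustrative.

```python
import numpy as np

def hamming_rank(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query (smallest first)."""
    query_code = np.asarray(query_code, dtype=np.uint8)
    gallery_codes = np.asarray(gallery_codes, dtype=np.uint8)
    # Hamming distance = number of bit positions where the codes differ.
    distances = np.count_nonzero(gallery_codes != query_code, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances
```

In a cross-modal setting the query code might come from a text hashing network and the gallery codes from an image hashing network; as long as both map into the same Hamming space, the same comparison applies.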

Abstract:
Landscapes, recognized for their indispensable role in large-scale scenes, are experiencing growing demand. However, the manual modeling of such content is labor-intensive and lacks efficiency. Procedural Content Generation (PCG) techniques enable the rapid generation of diverse landscape elements. Nevertheless, ordinary users may encounter difficulties controlling these methods for desired results. In this paper, we introduce a controllable framework for procedurally generating landscapes. We integrate state-of-the-art Large Language Models (LLMs) to enhance user accessibility and control. By converting plain text inputs into parameters through LLMs, our framework allows ordinary users to generate a batch of plausible landscapes tailored to their specifications. A parameter-controlled PCG procedure is designed to leverage optimization techniques and employ rule-based refinements. It achieves harmonious layering in terrains, zoning, and roads while enabling aesthetic arrangement of vegetation and artificial elements. Extensive experiments demonstrate our framework's effectiveness in generating landscapes comparable to those crafted by experienced architects. Our framework has the potential to enhance the productivity of landscape designers significantly.

Abstract:
Multimodal dialogue agents are often required to respond to conversation history using both textual and visual content. Even though current dialogue studies predominantly strive to generate natural texts or images, they fall short in considering the relevance of multimodal responses within a dialogue context, consequently preventing agents from making prudent choices based on multiple alternatives and their associated relevance scores. In this paper, we present a bidirectional multimodal dialogue framework that combines the forward generation of multiple text and image response candidates with reverse selection guided by relevance scores evaluated on dialogue context, helping agents select the most suitable multimodal responses. Specifically, the forward generation aspect of our framework leverages a stage-wise approach, first producing textual replies and composite visual descriptions from the dialogue context, followed by the generation of visual responses aligned with the descriptions. In the reverse selection process, visual responses are translated into tangible descriptive texts that, in conjunction with textual responses, are inversely tied back to the dialogue context for relevance assessment, assigning a reference score to each multimodal response candidate to assist the intelligent agent in making informed decisions. Experimental outcomes demonstrate that our proposed bidirectional dialogue response framework markedly elevates performance in both automatic and human evaluations, yielding a range of contextually fitting multimodal responses for selection.

Abstract:
Fair multi-view clustering aims to achieve both satisfactory clustering performance and non-discriminatory outcomes with respect to sensitive attributes. Existing fair multi-view clustering methods impose a constraint that requires the distribution of sensitive attributes to be uniform within each cluster. However, this constraint can lead to misallocation of samples with sensitive attributes. To solve this problem, we propose a novel Deep Fair Multi-View Clustering (DFMVC) method that learns a consistent and discriminative representation instructed by a fairness constraint constructed from the cluster distribution. Specifically, we incorporate contrastive constraints on semantic features from different views to obtain consistent and discriminative representations for each view. Additionally, we align the distribution of sensitive attributes with the target cluster distribution to achieve optimal fairness in clustering results. Experimental results on four datasets with sensitive attributes demonstrate that our method improves fairness and clustering performance compared with state-of-the-art multi-view clustering methods.

Abstract:
Synthesizing binaural audio according to personalized requirements is crucial for building immersive artificial spaces. Previous methods employ the visual modality to guide audio spatialization since it can provide spatial information about objects. However, the paradigm is dependent on object visibility and strict audiovisual correspondence, which makes it tough to satisfy personalized requirements. In addition, the visual counterpart to the audio may be crippled or even non-existent, which greatly limits the development of the field. To this end, we advocate exploring a novel task known as Text-guided Audio Spatialization (TAS), in which the goal is to convert mono audio into spatial audio based on text prompts. This approach circumvents harsh audiovisual conditions and allows for more flexible individualization. To facilitate this research, we construct the first TASBench dataset. The dataset provides a dense frame-level description of the spatial location of sounding objects in audio, enabling fine-grained spatial control. Since text prompts contain multiple sounding objects and spatial locations, the core issue of TAS is to establish the mapping relationship between text semantic information and audio objects. To tackle this issue, we design a Semantic-Aware Fusion (SAF) module to capture text-aware audio features and propose a text-guided diffusion model to learn the spatialization of audio, which can generate spatial audio consistent with text prompts. Extensive experiments on TASBench compare the proposed method with several methods from related tasks, demonstrating that our method is promising to achieve the personalized generation of spatial sense of audio under text prompts.

Abstract:
The Versatile Video Coding (VVC/H.266) standard is the emerging successor to the widespread High Efficiency Video Coding (HEVC/H.265). This work introduces the latest version of our academic open-source VVC intra encoder called uvg266. It has been developed from our well-known Kvazaar HEVC encoder by introducing new VVC coding tools into the carefully optimized and parallelized coding flow of Kvazaar. This paper outlines the design methodology and implementation aspects of the all-intra (AI) configuration of uvg266. The experimental results show that single-threaded uvg266 is more than twice as fast as the state-of-the-art VVenC encoder in all our test cases. In speed-optimized coding, the coding overhead of uvg266 is 21.7% but the gap narrows down to 2.6% in the rate distortion optimized coding case. uvg266 has almost linear speedup with core count up to 32 threads. The better scalability of uvg266 quadruples the speed over VVenC. Furthermore, single-threaded uvg266 is up to 380× as fast as VVC reference software VTM and the gap rises to over 11,000× with 32 threads. To the best of our knowledge, uvg266 is currently the fastest available open-source VVC intra software encoder.

Abstract:
The way we create, consume and interact with multimedia content has changed significantly in recent years with the advent of affordable recording devices and easy sharing and access in the form of mobile phones. With the imminent wave of affordable devices that enable mixed reality experiences and the large variety of devices on the market, interaction with multimedia content is expected to continue to evolve rapidly. This will also drastically affect the entire area of multimedia information retrieval in eXtended Reality (XR), for instance by novel ways to express user needs in VR, result presentation that takes the specific capabilities of XR devices into account, and/or result feedback. This tutorial on Multimedia Retrieval in XR discusses and demonstrates existing solutions and highlights key challenges in this evolving field.

Abstract:
Emotion and sentiment analysis (ESA) assists machines to serve humans more intelligently. However, collecting large-scale high-quality datasets for training ESA models in a supervised manner is expensive, time-consuming, and difficult in practice. This tutorial focuses on the label-efficient ESA (LeESA) learning methods. Specifically, we first introduce the stimuli and characteristics of emotion and then illustrate seven typical training paradigms, followed by applications and future directions of LeESA.

Abstract:
With the rising interest in multi-camera cross-spectral systems, cross-spectral images have been widely used in computer vision and image processing. Therefore, an effective super-resolution (SR) method provides high-resolution (HR) cross-spectral images for different research and applications. However, existing SR methods rarely consider utilizing cross-spectral information to assist the SR of visible images. They cannot handle complex degradation (noise, high brightness, low light) and misalignment problems in low-resolution (LR) cross-spectral images. Here, we first explore the potential of using near-infrared (NIR) image guidance for better SR, based on the observation that NIR images can preserve valuable information for recovering adequate image details. To take full advantage of the cross-spectral prior, we propose a novel Cross-Spectral Prior guided image SR approach (CSPSR). The cross-view matching (CVM) module and the dynamic multi-modal fusion (DMF) module can enhance the spatial correlation between cross-spectral images and bridge the multi-modal feature gap, respectively. Extensive experiments demonstrate the effectiveness of our CSPSR.

Abstract:
Conversation is a common form of human communication that includes extensive emotional interaction. Traditional approaches focused on studying emotions and their underlying causes in conversations. They try to address two issues: what emotions are present in the dialogue and what causes these emotions. However, these works often overlook the bidirectional nature of emotional interaction in dialogue: utterances can evoke emotions (cause), and emotions can also lead to certain utterances (consequence). Therefore, we propose a new issue: what consequences arise from these emotions? This leads to the introduction of a new task called Emotion Consequence Forecasting in CONversations (ECFCON). In this work, we first propose a corresponding dialogue-level dataset. Specifically, we select 2,780 video dialogues for annotation, totaling 39,950 utterances. Out of these, 12,391 utterances contain emotions, and 8,810 of these have discernible consequences. Then, we benchmark this task by conducting experiments from the perspectives of traditional methods, generalized LLMs prompting methods, and clue-driven hybrid methods. Both our dataset and benchmark codes are openly accessible to the public.

Abstract:
Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing methods are limited by their ability to understand and connect different modalities, resulting in increased difficulty in retrievals. In this paper, amid the rapid evolution of Large Language Models, we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval. Specifically, we first design a fine-tuned large vision-language model that leverages the knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. It also incorporates an inductive reasoning mechanism, which focuses on incorporating important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts, considering their contribution to improving retrieval performance. Experimental results show that our approach achieves excellent performance on two benchmark datasets compared to its competitors.

Abstract:
In this paper, we are interested in identifying denser and finer animal joints. The lack of standardized joint definitions across various animal pose estimation (APE) datasets, e.g., AnimalPose with 20 joints, AP-10k with 17 joints, and TigDog with 19 joints, presents a significant challenge yet offers an opportunity to fully utilize annotation data. This paper tackles this new non-standardized annotation problem, aiming to learn fine-grained (e.g., 24 or more joints) pose estimators in datasets that lack complete annotations. To combat the unannotated joints, we propose FreeNet, comprising a base network and an adaptation network connected through a circuit feedback learning paradigm. FreeNet enhances the adaptation network's tolerance to unannotated joints via body part-aware learning, optimizing the sampling frequency of joints based on joint detection difficulty, and improves the base network's predictions for unannotated joints using feedback learning. This leverages the cognitive differences of the adaptation network between non-standardized labeled and large-scale unlabeled data. Experimental results on three non-standard datasets demonstrate the effectiveness of our method for fine-grained APE.

Abstract:
Miscalibrated models tend to be unreliable and insecure for downstream applications. In this work, we attempt to highlight and remedy miscalibration in current scene graph generation (SGG) models, which has been overlooked by previous works. We discover that obtaining well-calibrated models for SGG is more challenging than conventional calibration settings, as long-tailed SGG training data exacerbates miscalibration with overconfidence in head classes and underconfidence in tail classes. We further analyze which components are explicitly impacted by the long-tailed data during optimization, thereby exacerbating miscalibration and unbalanced learning, including biased parameters, deviated boundaries, and distorted target distribution. To address the above issues, we propose the Compositional Optimization Calibration (COC) method, comprising three modules: i. A parameter calibration module that utilizes a hyperspherical classifier to eliminate the bias introduced by biased parameters. ii. A boundary calibration module that disperses features of majority classes to consolidate the decision boundaries of minority classes and mitigate deviated boundaries. iii. A target distribution calibration module that addresses distorted target distribution, leverages within-triplet prior to guide confidence-aware and label-aware target calibration, and applies curriculum regulation to constrain learning focus from easy to hard classes. Extensive evaluation on popular benchmarks demonstrates the effectiveness of our proposed method in improving model calibration and resolving unbalanced learning for long-tailed SGG. Finally, our proposed method performs best on model calibration compared to different types of calibration methods and achieves state-of-the-art trade-off performance on balanced SGG learning.

Abstract:
Person re-identification (ReID) is crucial in video surveillance, aiming to match individuals across different camera views while cloth-changing person re-identification (CC-ReID) focuses on pedestrians changing attire. Many existing CC-ReID methods overlook generalization, crucial for universality across cloth-consistent and cloth-changing scenarios. This paper pioneers exploring the cloth-generalized person re-identification (CG-ReID) task and introduces the Cloth-aware Augmentation (CaAug) strategy. Comprising domain augmentation and feature augmentation, CaAug aims to learn identity-relevant features adaptable to both scenarios. Domain augmentation involves creating diverse fictitious domains and simulating various clothing scenarios. Supervising features from different cloth domains enhances robustness and generalization against clothing changes. Additionally, for feature augmentation, element exchange introduces diversity concerning clothing changes. Regularizing the model with these augmented features strengthens resilience against clothing change uncertainty. Extensive experiments on cloth-changing datasets demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods.

Abstract:
Zero-shot learning (ZSL) correlates visual samples and shared semantic information to transfer knowledge from seen classes to unseen classes. Existing methods typically establish visual-semantic correlation by aligning visual and semantic features, which are extracted from visual samples and semantic information, respectively. However, instance-level images, owing to singular observation perspectives and diverse individuals, cannot exactly match the comprehensive semantic information defined at the class level. Direct feature alignment imposes correlation between mismatched vision and semantics, resulting in spurious visual-semantic correlation. To address this, we propose a novel method termed Causal Visual-semantic Correlation (CVsC) to learn substantive visual-semantic correlation for ZSL. Specifically, we utilize a Visual Semantic Attention module to facilitate interaction between vision and semantics, thereby identifying attribute-related visual features. Furthermore, we design a Conditional Correlation Loss to properly utilize semantic information as supervision for establishing visual-semantic correlation. Moreover, we introduce counterfactual intervention applied to attribute-related visual features, and maximize their impact on semantic and target predictions to enhance substantive visual-semantic correlation. Extensive experiments conducted on three benchmark datasets (i.e., CUB, SUN, and AWA2) demonstrate that our CVsC outperforms existing state-of-the-art methods.

Abstract:
Collecting large-scale multi-label data with full labels is difficult for real-world scenarios. Many existing studies have tried to address the issue of missing labels caused by annotation but ignored the difficulties encountered during the annotation process. We find that the high annotation workload can be attributed to two reasons: (1) Annotators are required to identify labels on widely varying visual concepts. (2) Exhaustively annotating the entire dataset with all the labels becomes notably difficult and time-consuming. In this paper, we propose a new setting, i.e. block diagonal labels, to reduce the workload on both sides. The numerous categories can be divided into different subsets based on semantics and relevance. Each annotator can only focus on its own subset of labels so that only a small set of highly relevant labels are required to be annotated per image. To deal with the issue of such missing labels, we introduce a simple yet effective method that does not require any prior knowledge of the dataset. In practice, we propose an Adaptive Pseudo-Labeling method to predict the unknown labels with less noise. Formal analysis is conducted to evaluate the superiority of our setting. Extensive experiments are conducted to verify the effectiveness of our method on multiple widely used benchmarks.

Abstract:
Text-motion retrieval (TMR) is a significant cross-modal task that retrieves motion sequences semantically similar to a given query text. Existing TMR methods primarily utilize single embeddings to represent and align text and motion sequences. However, real-world motion sequences typically contain multiple atomic motions with complex semantics, which are hard to capture precisely with single embeddings. Additionally, the common co-occurrence and coupling of atomic motions further pose significant challenges for effectively modeling and aligning text and motion sequences. In this paper, we regard TMR as a Multi-Instance Multi-Label (MIML) learning problem, where the motion sequence is viewed as a bag of atomic motions and the text is the bag of corresponding phrases. To address the MIML problem, we propose a novel Multi-Granularity Semantics Interaction (MGSI) approach, which effectively captures and aligns the semantics of text and motion sequences across various levels. Specifically, the MGSI approach initially decomposes both the query and motion sequences into three hierarchical levels: token, instance, and bag. Then, we utilize graph neural networks to explicitly model their semantics correlation and perform semantics interaction at these respective levels, precisely capturing the semantics at multiple granularities. To identify and model co-occurring atomic motions, we measure the frame-wise semantic consistency between motions and then fuse and interact the accordant ones to refine their representations. Finally, we exploit token, instance, and bag-wise semantics interaction to comprehensively align text and motion sequences. We evaluated our method on two widely-used benchmark datasets, HumanML3D and KIT-ML. The proposed method achieves significant improvements, outperforming the state-of-the-art with a 23.09% increase in Rsum on HumanML3D and a 21.84% increase on KIT-ML.

Abstract:
Recent advances in multimodal artificial intelligence have greatly improved the integration of vision-language-audio cues to enrich the content creation process. Inspired by these developments, in this paper, we are the first to integrate audio into the face inpainting task to facilitate identity manipulation. Our main insight is that a person's voice carries distinct identity markers, such as age and gender, which provide an essential supplement for identity-aware face inpainting. By extracting identity information from audio as guidance, our method can naturally support tasks of identity preservation and identity swapping in face inpainting. Specifically, we introduce a dual-stream network architecture comprising a face branch and an audio branch. The face branch is tasked with extracting deterministic information from the visible parts of the input masked face, while the audio branch is designed to capture heuristic identity priors from the speaker's voice. The identity codes from the two streams are integrated using a multi-layer perceptron (MLP) to create a virtual unified identity embedding that represents comprehensive identity features. In addition, to explicitly exploit the information from audio, we introduce an audio-face generator to generate a 'fake' audio face directly from audio and fuse the multi-scale intermediate features from the audio-face generator into the face inpainting network through an audio-visual feature fusion (AVFF) module. Extensive experiments demonstrate the positive impact of extracting identity information from audio on the face inpainting task, especially in identity preservation.

Abstract:
For AI systems to be safely and reliably grounded in the real world, they should possess the ability of physical commonsense reasoning. Physical commonsense reasoning is essentially a multisensory task as physical properties of objects are manifested through multiple perception modalities, including both visual and auditory. In this study, we constructed two new benchmarks, called PACS-Reason and PACS-Reason+, for explainable physical audiovisual commonsense reasoning (EPACS), in which each datapoint is accompanied by a golden detailed rationale (intermediate reasoning path) to explain the answer selection. Moreover, we present PAVC-Reasoner, a multimodal large language model (LLM) designed to reason about physical commonsense attributes. The model aligns different modalities with the language modality by integrating three different perceivers for cross-modal pretraining and instruction finetuning at multiple granularities. It utilizes an LLM as a cognitive engine to process multimodal inputs and output convincing intermediate reasoning paths as justification for inferring answers. Numerous experiments have demonstrated the effectiveness and superiority of PAVC-Reasoner as a baseline model for studying EPACS. Most attractively, PAVC-Reasoner is capable of reasoning and obtaining strong interpretable explicit reasoning paths, signifying a significant stride towards real-world physical commonsense reasoning.

Abstract:
Humans understand digital 3D scenes by observing them from reasonably placed virtual cameras. Selecting camera views is fundamental for 3D scene applications but is typically manual. Existing literature on selecting views is based on regular or polygonal room shapes without focusing on the objects in the scene, resulting in poorly composed views concerning objects. This paper introduces ScenePhotographer, an object-oriented framework for automatic view selection in residential scenes. Potential object-oriented views are yielded by a learning-based method, which clusters objects into groups according to objects' functional and spatial relationships. We propose four criteria to evaluate the views and recommend the best batch, including room information, visibility, composition balance, and line dynamics. Each criterion measures the view according to its corresponding photography rule. Experiments on various room types and layouts demonstrate that our method can generate views focusing on coherent objects while preserving aesthetics, leading to more visually pleasing results.

Abstract:
Multi-object tracking (MOT) is a pivotal task for media interpretation, where reliable motion and appearance cues are essential for cross-frame identity preservation. However, limited by the inherent perspective properties of 2D space, the crowd density and frequent occlusions in real-world scenes expose the fragility of these cues. We observe the natural advantage of objects being well-separated in high-dimensional space and propose a novel 2D MOT framework, "Detecting-Lifting-Tracking" (DLT). Initially, a pre-trained detector is employed to capture 2D object information. Secondly, we introduce a Mamba Distance Estimator to obtain the distances of objects to a monocular camera with temporal consistency, achieving object-level pseudo-3D lifting. Finally, we thoroughly explore distance-aware tracking via pseudo-3D information. Specifically, we introduce a Score-Distance Hierarchical Matching and Short-Long Terms Association to enhance accurate and robust association capability. Even without appearance cues, our DLT achieves state-of-the-art performance on MOT17, MOT20, and DanceTrack, demonstrating its potential to address occlusion challenges.

Abstract:
The inherent variability and unpredictability in open multi-view learning scenarios infuse considerable ambiguity into the learning and decision-making processes of predictors. This demands that predictors not only recognize familiar patterns but also adaptively interpret unknown ones out of training scope. To address this challenge, we propose an Ambiguity-Aware Multi-view Learning Framework, which integrates four synergistic modules into an end-to-end framework to achieve generalizability and reliability beyond the known. By introducing the mixed samples to broaden the learning sample space, accompanied by corresponding soft labels to encapsulate their inherent uncertainty, the proposed method adapts to the distribution of potentially unknown samples in advance. Furthermore, an instance-level sparse inference is implemented to learn sparse approximated points in the multiple view embedding space, and individual view representations are gated by view-level confidence mappings. Finally, a multi-view consistent representation is obtained by dynamically assigning weights based on the degree of cluster-level dispersion. Extensive experiments demonstrate that our approach is effective and stable compared with other state-of-the-art methods in open-world recognition situations.
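One common way to realize "mixed samples" with "corresponding soft labels" is a mixup-style convex combination of two samples and their one-hot labels. The sketch below shows that idea only; whether the framework above mixes samples this way is an assumption, and the function name and mixing coefficient `lam` are hypothetical.

```python
def mix_samples(x_a, y_a, x_b, y_b, lam):
    """Blend two feature vectors and their label vectors with weight lam.

    Assumption: a mixup-style convex combination; the resulting label
    vector is a soft label that encodes how much each class contributes.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


if __name__ == "__main__":
    # Two toy samples from different classes, with one-hot labels.
    x, y = mix_samples([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
    print("mixed features:", x)
    print("soft label:", y)
```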

Abstract:
Subject-driven image generation, aimed at customizing user-specified subjects, has experienced rapid progress. However, most existing methods focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task, that is, capturing the interaction between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. Intrinsically, the interaction between subjects is diverse and difficult to describe in only a few words. In addition, typical exemplar images depict interactions between humans, which further intensifies the challenge of interaction-driven image generation with various categories of subjects. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework begins by learning a pseudo-word for a single pose of each subject in the interaction. This is then employed to promote the learning of the concept for the interaction. In addition, language prior and cross-attention loss are incorporated into the optimization process to encourage the modeling of interaction. Extensive experiments demonstrate that the proposed method is able to effectively invert the interactive pose from exemplar images and apply it to customized generation with user-specified interaction.

Abstract:
This paper introduces a novel method for enhancing image composition guidance in photography. It utilizes advanced composition rules to guide a Real-Time Detection Transformer (RT-DETR) model in predicting aesthetically pleasing compositions for photographs. Unlike traditional methods constrained by original image boundaries, our approach allows the predicted framing to extend beyond these limits, offering dynamic, real-time guidance for image composition in photography. The system integrates multi-label composition classification and compositional element annotation, using YOLOv8 for key object detection and an enhanced Deep Hough Transform for compositional lines to guide photographers. It provides photographers with real-time guidance for optimal camera adjustments, transforming traditional post-processing tasks into an intuitive, interactive process. This method significantly enhances photographers' flexibility and effectiveness in capturing visually superior photographs.

Abstract:
The tutorial "Large Vision-Language Model in the Society" aims to provide a comprehensive overview of state-of-the-art techniques and applications of large vision-language models (LVLMs), which integrate visual and textual data to transform multimedia research and applications. LVLMs are poised to revolutionize domains such as content creation, social media analysis, education, healthcare, and entertainment by enabling sophisticated content analysis, retrieval, and generation. This tutorial will cover the fundamentals of vision-language integration, state-of-the-art models, training techniques, applications, ethical considerations, and future directions. It is designed to be educational and instructive, providing an in-depth introduction rather than a cursory survey. Attendees will gain practical skills and insights into the latest research, and will engage in interactive sessions to reinforce learning. By addressing both technical and societal aspects, the tutorial will significantly benefit the multimedia community, driving innovation and progress in the field.

Abstract:
This is the overview paper for the Micro-Action Analysis Grand Challenge hosted at ACM Multimedia 2024. In recent years, a growing trend towards deeper understanding of human emotional states has led to a gradual shift in the attention of multimedia and computer vision researchers from macro facial expressions to whole-body micro-actions. Micro-actions are spontaneous body movements that indicate a person's true feelings and potential intentions. Yet, recognizing, distinguishing, and understanding micro-actions is challenging because they are subtle compared to normal actions. This grand challenge aims to foster innovative research in micro-action analysis and provide benchmark evaluations to advance the technology in the human-centric action understanding community.

Abstract:
Automated diagnosis of depression is crucial for early detection and timely intervention. Previous research has largely concentrated on visual information, often neglecting the value of leveraging a variety of data types. Although some studies have attempted to employ multiple modalities, they typically fall short in investigating the complex dynamics between features from various modalities over time. To address this challenge, we present an innovative Multi-modal Dual-Attention aggregation architecture for Depression Recognition (MDDR). This framework leverages multi-modal pre-trained features and introduces two attention aggregation mechanisms: the Feature Alignment and Aggregation (FAA) module and the Sequence Encoding and Aggregation (SEA) module. The FAA module is designed to dynamically evaluate the relevance of multi-modal features for each instance, facilitating a dynamic integration of these features over time. Following this, the SEA module determines the importance of the amalgamated features for each frame, ensuring that aggregation is conducted based on their significance, to extract the most relevant features for accurately diagnosing depression. Moreover, we propose a unique loss calculation method specifically designed for depression assessment, named DR Loss. Our approach, evaluated on the AVEC2013 and AVEC2014 depression audiovisual datasets, achieves unparalleled performance.

Abstract:
We devise an articulatory representation-based text-to-speech (TTS) model, ArtSpeech, an explainable and effective network for human-like speech synthesis, by revisiting the sound production system. Current deep TTS models learn acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulation movement. ArtSpeech, on the contrary, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are utilized to represent airflow forced by articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multi-dimensional style mapping network to extract speaking styles from the articulatory representations, guided by which variation predictors can predict the final mel spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses using widely recognized speech corpora, such as the LJSpeech and LibriTTS datasets, yielding a promising improvement in similarity between the generated results and the target speaker's voice and prosody.

Abstract:
With the popularity and advancement of the Internet and video-sharing platforms, video affective content analysis has greatly developed. Temporal information is crucial for this task. Nevertheless, existing methods often overlook the fact that there is substantial irrelevant information in videos and that the importance of modalities is uneven for emotional tasks. This could result in noise from both temporal fragments and modalities, reducing the model's ability to identify crucial temporal fragments and recognize emotions. To tackle the above issues, we propose a Temporal Enhancement (TE) method in this paper. Specifically, we utilize three encoders for extracting features at various levels and employ temporal sampling to enhance the temporal data, thereby enriching video representation and improving the model's robustness to noise. Subsequently, we design a cross-modal temporal enhancement module to enhance temporal information for every modal feature. This module interacts with multiple modalities simultaneously to emphasize critical temporal fragments while suppressing irrelevant ones. The experimental results on four benchmark datasets show that the proposed temporal enhancement method achieves state-of-the-art video affective content analysis performance. Moreover, the effectiveness of each module is confirmed through ablation experiments.

Abstract:
Multimodal Large Language Models (MLLMs) have showcased remarkable advances in handling various vision-language tasks. These models typically consist of a Large Language Model (LLM), a vision encoder, and a connector structure, which is used to bridge the modality gap between vision and language. It is challenging for the connector to filter the right visual information for the LLM according to the task at hand. Most previous connectors, such as light-weight projection and Q-former, treat visual information for diverse tasks uniformly, and therefore lack task-specific visual information extraction capabilities. To address this issue, this paper proposes Q-MoE, a query-based connector with Mixture-of-Experts (MoE) to extract task-specific information with text-driven routing. Furthermore, an optimal-path-based training strategy is proposed to find an optimal expert combination. Extensive experiments on two popular open-source LLMs and several different visual-language tasks demonstrate the effectiveness of the Q-MoE connector.
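To make the idea of routing in a Mixture-of-Experts connector concrete, the sketch below shows a generic top-k gate: gate logits (which, in a text-driven router, would be computed from the instruction text) are turned into weights over a few selected experts, and the expert outputs are blended. This is a generic MoE pattern and not the Q-MoE implementation; the softmax gate, the top-k selection, and all names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Select the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix_experts(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [
        sum(w * expert_outputs[i][d] for i, w in weights.items())
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical gate logits for four experts; experts 0 and 2 tie for the top.
    weights = route_top_k([2.0, 0.0, 2.0, -1.0], k=2)
    outputs = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [9.0, 9.0]]
    print(weights)
    print(mix_experts(outputs, weights))
```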

Abstract:
In the evolving landscape of federated learning (FL), the integration of multimodal data presents both unprecedented opportunities and significant challenges. Existing works fall short of meeting the growing demand for systems that can efficiently handle diverse tasks and modalities in rapidly changing environments. We propose a meta learning strategy tailored for Multimodal Federated Learning in a multitask setting, which harmonizes intra-modal and inter-modal feature spaces through the Cross-Modal Meta Consensus. This innovative approach enables seamless integration and transfer of knowledge across different data types, enhancing task personalization within modalities and facilitating effective cross-modality knowledge sharing. Additionally, we introduce Gradient Consistency-based Clustering for multimodal convergence, specifically designed to resolve conflicts at meta-initialization points arising from diverse modality distributions, supported by theoretical guarantees. Our approach, evaluated as M3Fed on five federated datasets, with at most four modalities and four downstream tasks, demonstrates strong performance across diverse data distributions, affirming its effectiveness in Multimodal Federated Learning.

Abstract:
Graph Neural Networks (GNNs) have proven effective in various scenarios. A key strategy involves pre-training on existing graphs to extract knowledge that can be transferred to improve performance on downstream tasks, reducing the need for extensive labeled data. However, previous works commonly assumed that pre-training and fine-tuning occur in the same or closely related domains. A limitation is that for each individual graph without accessible pre-training data, a GNN must be trained from scratch, imposing high training overhead and hindering generalization. In this paper, we address the GNN multi-domain pre-training problem, which intends to pre-train a transferable GNN model from heterogeneous multi-source graph domains and then apply it in an unseen one with minor fine-tuning costs. To this end, we propose a scaLAble Multi-source Pre-training (LAMP) method. For pre-training, LAMP presents a graph dual-distillation approach to distill massive knowledge from various graph domains to form synthetic homogeneous graphs. Simultaneously, high-level meta-knowledge from the synthetic graphs is extracted to train the GNN model, whose capability can be adjusted according to target graph contexts through a co-training modulation architecture. For fine-tuning, LAMP respectively aligns the target graph distribution, graph context, and graph task with the pretext so that the downstream task in the unseen domain can be reshaped to leverage the transferable knowledge efficiently. Extensive experiments on four different graph domain datasets show the superiority of LAMP.

Abstract:
3DQA has gained considerable attention due to its enhanced spatial understanding capabilities compared to image-based VQA. However, existing 3DQA methods have explicitly focused on integrating text and color-coded point cloud features, thereby overlooking the rich high-level semantic relationships among objects. In this paper, we propose a novel graph-based 3DQA method termed 3DGraphQA, which leverages scene graph reasoning to enhance the ability to handle complex reasoning tasks in 3DQA and offers stronger interpretability. Specifically, our method first adaptively constructs dynamic scene graphs for the 3DQA task. Then we inject both the situation and the question inputs into the scene graph, forming the situation-graph and the question-graph, respectively. Based on the constructed graphs, we finally perform intra- and inter-graph feature propagation for efficient graph inference: intra-graph feature propagation is performed based on Graph Transformer in each graph to realize single-modal contextual interaction and high-order contextual interaction; inter-graph feature propagation is performed among graphs based on bilinear graph networks to realize the interaction between different contexts of situations and questions. Drawing on this intra- and inter-graph feature propagation, our approach is poised to better grasp the intricate semantic and spatial relationships among objects within the scene and their relations to the questions, thereby facilitating reasoning over complex and compositional questions. We validate the effectiveness of our approach on the SQA3D and ScanQA datasets, and expand the SQA3D dataset to SQA3D Pro with multi-view information, making it more suitable for our approach. Experimental results demonstrate that our 3DGraphQA outperforms existing methods.

Abstract:
Despite significant advances in image-text medical visual language modeling, the high cost of fine-grained annotation of images to align radiology reports has led current approaches to focus primarily on semantic alignment between the image and the full report, neglecting the critical diagnostic information contained in the text. This is insufficient in medical scenarios demanding high explainability. To address this problem, in this paper, we introduce radiology reports as images in prompt learning. Specifically, we extract key clinical concepts, lesion locations, and positive labels from easily accessible radiology reports and combine them with an external medical knowledge base to form fine-grained self-supervised signals. Moreover, we propose a novel Report-Concept Textual-Prompt Learning (RC-TPL), which aligns radiology reports at multiple levels. In the inference phase, the report-level and concept-level prompts provide rich global and local semantic understanding for X-ray images. Extensive experiments on X-ray image datasets demonstrate the superior performance of our approach with respect to various baselines, especially in the presence of scarce imaging data. Our study not only significantly improves the accuracy of data-constrained medical X-ray diagnosis, but also demonstrates how the integration of domain-specific conceptual knowledge can enhance the explainability of medical image analysis.

Abstract:
Video hashing is a technique of encoding videos into binary vectors, facilitating efficient video storage and high-speed computation. Current approaches to video hashing predominantly utilize sequential frame images to produce semantic binary codes. However, videos encompass not only visual but also audio signals. Therefore, we propose a tri-level Transformer-based audio-visual hashing technique for video retrieval, named AVHash. It first processes audio and visual signals separately using pre-trained AST and ViT large models, and then projects temporal audio and keyframes into a shared latent semantic space using a Transformer encoder. Subsequently, a gated attention mechanism is designed to fuse the paired audio-visual signals in the video, followed by another Transformer encoder leading to the final video representation. The training of this AVHash model is directed by a video-based contrastive loss as well as a semantic alignment regularization term for audio-visual signals. Experimental results show that AVHash significantly outperforms existing video hashing methods in video retrieval tasks. Furthermore, ablation studies reveal that while video hashing based solely on visual signals achieves commendable mAP scores, the incorporation of audio signals can further boost its performance for video retrieval.
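The retrieval step that binary video codes enable can be sketched in a few lines: real-valued embeddings are binarized, and database videos are ranked by Hamming distance to the query code. The sign-threshold binarization and the toy codes below are assumptions for illustration and are not taken from AVHash.

```python
def binarize(embedding):
    """Map a real-valued embedding to a binary code by thresholding at zero."""
    return [1 if value >= 0 else 0 for value in embedding]


def hamming(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database):
    """Return video ids sorted from nearest to farthest in Hamming distance."""
    return sorted(database, key=lambda vid: hamming(query_code, database[vid]))


if __name__ == "__main__":
    # Hypothetical 4-bit codes for three videos.
    database = {"a": [1, 1, 1, 0], "b": [1, 0, 1, 1], "c": [0, 1, 0, 0]}
    query = binarize([0.3, -1.2, 0.0, 2.0])
    print("query code:", query)
    print("ranking:", rank_by_hamming(query, database))
```

Because the comparison reduces to counting differing bits, it can be carried out with XOR and popcount on packed integers in a production system, which is where the storage and speed benefits come from.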

Abstract:
Audio super-resolution aims to improve the quality of acoustic signals by reconstructing high-resolution acoustic signals from the corresponding low-resolution ones. However, acoustic signals take two forms, time-domain acoustic waves and frequency-domain spectrograms, and most existing research focuses on data enhancement in a single domain, which can only obtain partial or local features of the audio signal and thus limits data analysis. Therefore, this paper proposes a time-frequency domain fusion enhanced audio super-resolution method to mine the complementarity of the two representations of acoustic signals. Specifically, we propose an end-to-end audio super-resolution network that includes a variational autoencoder based sound wave super-resolution module, a U-Net-based spectrogram super-resolution module, and an attention-based time-frequency domain fusion module. The first two modules can generate more high-frequency and low-frequency components for audio, respectively. As a critical component of our method, the time-frequency domain fusion module performs weighted fusion on the above two outputs to obtain a super-resolution audio signal. Compared with other methods, experimental results on the VCTK and Piano datasets in natural scenes show that the time-frequency domain fusion audio super-resolution model has a state-of-the-art bandwidth expansion effect. Furthermore, we perform super-resolution on the ShipsEar dataset containing underwater acoustic signals. The super-resolution results are used to test ship target recognition, and the accuracy is improved by 12.66%. Therefore, the proposed super-resolution method has excellent signal enhancement effect and generalization ability.

Abstract:
Although deep learning-based methods have made significant advances in the field of image restoration (IR), they often suffer from excessive model parameters. To tackle this problem, this work proposes a compact Transformer (Compacter) for lightweight image restoration by making several key designs. We employ the concepts of projection sharing, adaptive interaction, and heterogeneous aggregation to develop a novel Compact Adaptive Self-Attention (CASA). Specifically, CASA utilizes shared projection to generate Query, Key, and Value to simultaneously model spatial and channel-wise self-attention. The adaptive interaction process is then used to propagate and integrate global information from two different dimensions, thus enabling omnidirectional relational interaction. Finally, a depth-wise convolution is incorporated on Value to complement heterogeneous local information, enabling global-local coupling. Moreover, we propose a Dual Selective Gated Module (DSGM) to dynamically encapsulate the globality into each pixel for context-adaptive aggregation. Extensive experiments demonstrate that our Compacter achieves state-of-the-art performance for a variety of lightweight IR tasks with approximately 400K parameters.

Abstract:
In this paper, we introduce a new challenging task called Zero-Shot Controllable Image-to-Video Animation, where the goal is to animate an image based on motion trajectories defined by the user, without fine-tuning the base model. Primary challenges include maintaining consistency of background, consistency of object in motion, faithfulness to the user-defined trajectory, and quality of motion animation. We also introduce a novel approach for this task, leveraging diffusion models called Img2VidAnim-Zero (IVA0). IVA0 tackles our controllable Image-to-Video (I2V) task by decomposing it into two subtasks: 'out-of-place' and 'in-place' motion animation. Due to this decomposition, IVA0 can leverage existing work on layout-conditioned image generation for out-of-place motion generation, and existing text-conditioned video generation methods for in-place motion animation, thus facilitating zero-shot generation. Our model also addresses key challenges for controllable animation, such as Layout Conditioning via Spatio-Temporal Masking to incorporate user guidance and Motion Afterimage Suppression (MAS) scheme to reduce object ghosting during out-of-place animation. Finally, we design a novel controllable I2V benchmark featuring diverse local- and global-level metrics. Results show IVA0 as a new state-of-the-art, establishing a new standard for the zero-shot controllable I2V task. Our method highlights the simplicity and effectiveness of task decomposition and modularization for this novel task for future studies. Our code and visualizations are available at https://img2vidanim-0.github.io/

Abstract:
In practical object detection scenarios, distributed data and stringent privacy protections significantly limit the feasibility of traditional centralized training methods. Federated learning (FL) emerges as a promising solution to this dilemma. Nonetheless, the issue of data heterogeneity introduces distinct challenges to federated object detection, evident in diminished object perception, classification and localization abilities. In response, we introduce a task-driven federated learning methodology, dubbed Adaptive Hierarchical Aggregation (FedAHA), tailored to overcome these obstacles. Our algorithm unfolds in two strategic phases from shallow-to-deep layers: (1) Structure-aware Aggregation (SAA) aligns feature extractors during the aggregation phase, thus bolstering the global model's object perception capabilities; (2) Convex Semantic Calibration (CSC) leverages convex function theory to average semantic features instead of model parameters, enhancing the global model's classification and localization precision. We demonstrate experimentally and theoretically the effectiveness of the proposed two modules respectively. Our method consistently outperforms the state-of-the-art methods across multiple valuable application scenarios by 2.26% to 7.61%. Moreover, we build a real FL system using Raspberry Pis to demonstrate that our approach achieves a good trade-off between performance and efficiency.

Abstract:
In the mobile internet era, short videos are inundating people's lives. However, research on visual language models specifically designed for short videos has not yet received sufficient attention. Short videos are not just videos of limited duration. The prominent visual details and high information density of short videos differentiate them from long videos. In this paper, we propose the SpatioTemporal Fine-grained Description (STFVD) task, which emphasizes the uniqueness of short videos and entails capturing the intricate details of the main subject and fine-grained movements. To this end, we create a comprehensive Short Video Advertisements Description (SVAD) dataset, comprising 34,930 clips from 5,046 videos. The dataset covers a range of topics, including 191 sub-industries, 649 popular products, and 470 trending games. Various efforts have been made in the data annotation process to ensure the inclusion of fine-grained spatiotemporal information, resulting in 34,930 high-quality annotations. Compared to existing datasets, samples in SVAD exhibit a superior text information density, suggesting that SVAD is more appropriate for the analysis of short videos. Based on the SVAD dataset, we develop a visual language model (SVAD-VLM) to generate spatiotemporal fine-grained descriptions for short videos. We use a prompt-guided keyword generation task to efficiently learn key visual information. We also utilize dual visual alignment to exploit the advantage of mixed-dataset training. Experiments on the SVAD dataset demonstrate the challenge of STFVD and the competitive performance of the proposed method compared to previous ones.

Abstract:
Change captioning involves describing the subtle changes between a pair of similar images. Although existing efforts have achieved compelling success, they overlook the potential of multimodal large language models (MLLMs) in tackling this challenging task. In this work, we aim to empower MLLMs with the capability to perceive subtle differences between paired images and enhance their performance in generating change captions. Specifically, we present a diFferentIal-perceptive aNd rEtRieval-augmented MLLM (FINER-MLLM) tailored for this task. In particular, FINER-MLLM leverages the MLLM's LoRA fine-tuned image encoder to extract image patch features, enabling the capture of detailed image information. Subsequently, within the MLLM's feature extraction module, typically a Q-Former, FINER-MLLM incorporates dual constraints: the intra-image feature independence constraint and the inter-image feature alignment constraint. These constraints ensure that the features can comprehensively extract subtle visual information within each image and that corresponding features across images align effectively. Finally, we introduce retrieval augmentation, which first retrieves the relevant corpus to help the MLLM's decoder, i.e., the LLM, generate accurate change captions. Extensive experiments on three benchmark datasets, i.e., CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the superiority of our proposed method.

Abstract:
Parameter-Efficient Fine Tuning (PEFT) has been demonstrated to be effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps alleviate substantial computational costs for retraining multi-modal models. However, existing approaches primarily focus on multi-modal fusion, while neglecting the modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient Collaborative Prompt Learning (CoPL) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored for fine-tuning each modality, while the modal-interaction prompts are customized to explore inter-modality association. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while utilizing approximately 1% extra trainable parameters.

Abstract:
Image descriptions provide precious information for a myriad of visual media management tasks ranging from image classification to image search. The value of such curated collections comes from their diverse content and their accompanying extensive annotations. Such annotations are typically supplied by communities, where users (often volunteers) curate labels and/or descriptions of images. Supporting users in their quest to increase (overall) description completeness where possible is, therefore, of utmost importance.

Abstract:
Medical report generation aims at automating the synthesis of accurate and comprehensive diagnostic reports from radiological images. The task can significantly enhance clinical decision-making and alleviate the workload on radiologists. Existing works normally generate reports from single chest radiographs, although historical examination data also serve as crucial references for radiologists in real-world clinical settings. To address this constraint, we introduce a novel framework that mimics the workflow of radiologists. This framework compares past and present patient images to monitor disease progression and incorporates prior diagnostic reports as references for generating current personalized reports. We tackle the textual diversity challenge in cross-modal tasks by promoting style-agnostic discrete report representation learning and token generation. Furthermore, we propose a novel multi-granularity spatio-temporal fusion method that fuses textual and visual features by disentangling the differences between current and historical data. We also tackle token generation biases, which arise from long-tail frequency distributions, by proposing a novel feature normalization technique. This technique ensures unbiased generation for tokens, whether they are frequent or infrequent, improving the robustness of report generation for rare diseases. Experimental results on two public datasets demonstrate that our proposed model outperforms state-of-the-art baselines.

Abstract:
Most previous camouflaged object detection methods lean heavily upon large-scale manually labeled training samples, which are notoriously difficult to obtain. Even worse, the reliability of labels is compromised by the inherent challenges in accurately annotating concealed targets that exhibit high similarities with their surroundings. To overcome these shortcomings, this paper develops the first semi-supervised camouflaged object detection framework, which requires merely a small number of samples, even ones with noisy or incorrect annotations. Specifically, on the one hand, we introduce an innovative pixel-level loss re-weighting technique to reduce possible negative impacts from imperfect labels, through a window-based voting strategy. On the other hand, we take advantage of ensemble learning to explore features that are robust against noise and outliers, thereby generating relatively reliable pseudo labels for unlabelled images. Extensive experiments have been conducted on four benchmark datasets.

Abstract:
The interactive segmentation (IS) task aims to account for user preferences on top of general semantic segmentation in order to obtain the specific target of interest. Most related algorithms generate only a single mask, whose robustness may be limited by the diversity of user intention in the early interaction stage, namely the vague selection among an object part, the whole object, or an adherent object, especially when there is only one click. To handle this, we propose a novel framework called Diversified Interactive Segmentation Network (DISNet), in which we revisit the peculiarity of the first click: given an input image, DISNet outputs multiple candidate masks under the guidance of the first click only, and then a Dual-attentional Mask Correction (DAMC) module is used to measure the complex mutual effect among the first click, all clicks, and image features. Moreover, we design a new sampling strategy to generate GT masks with rich semantic relations. Performance analysis and ablation studies demonstrate the efficacy of our method, which further exemplifies the decisive role of the first click in IS.

Abstract:
Recently, large vision language models (LVLMs) have advanced AI by integrating visual and linguistic data for tasks like visual conversation, image captioning, and visual question answering. Current LVLM research either scales up model size for performance or reduces parameters for limited computational resources. We believe both large and tiny models have unique strengths and that collaborative training yields better results than independent training. We propose Collaborative Training of Tiny-Large Vision Language Models (CTVLMs), a framework connecting large and tiny models via a projection layer and leveraging a synergistic training strategy. Our framework improves training efficiency by strengthening the interconnection between large and tiny models. Using the parameter efficiency of tiny models, we effectively align image-text features, then apply knowledge distillation to help large models better align cross-modal information. During fine-tuning, the large model's extensive knowledge enhances the tiny model's performance. This collaborative approach allows models to adapt to various computational resources and outperforms existing methods in vision-language tasks.
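
The knowledge distillation step mentioned above can be illustrated with a generic, temperature-softened KL objective. This is only a minimal sketch of standard distillation, not the CTVLMs training objective, which the abstract does not specify; the function names, the temperature value, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures (an assumption of this generic sketch).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits: the loss is zero when the two models agree and positive otherwise.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(same, different)
```

Which model plays the teacher is a design choice; in the abstract's description the tiny model's aligned features guide the large model at this stage.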

Abstract:
Low-light image enhancement has been researched for several years. However, current image restoration methods predominantly focus on recovering images from RGB images, overlooking the potential of incorporating more modalities. With the advancements in personal handheld devices, we can now easily capture images with depth information using devices such as mobile phones. The integration of depth information into image restoration is a research question worthy of exploration. Therefore, in this paper, we propose a multimodal low-light image enhancement task based on depth information and establish a dataset named LED (Low-light Image Enhanced with Depth Map), consisting of 1,365 samples. Each sample in our dataset includes a low-light image, a normal-light image, and the corresponding depth map. Moreover, for the LED dataset, we design a corresponding multimodal method, which can process the input images and depth map information simultaneously to generate the predicted normal-light images. Experimental results and detailed ablation studies prove the efficiency of our method, which exceeds previous single-modal state-of-the-art methods in the relevant field.

Abstract:
Many consumer cameras with rolling shutter (RS) CMOS sensors suffer from undesired distortion and artifacts, particularly when objects experience fast motion. The neuromorphic event camera, with its high-temporal-resolution events, can bring much benefit to the RS correction process. In this work, we explore the characteristics of RS images and event data for the design of the rolling shutter correction (RSC) model. Specifically, the relationship between RS images and event data is modeled by incorporating time encoding into the computation of cross-attention in the transformer encoder to achieve time-aware multi-modal information fusion. Features from RS images enhanced by event data are adopted as keys and values in the transformer decoder, providing the source for appearance, while features from event data enhanced by RS images are adopted as queries, providing spatial transition information. By embedding the time information of the desired global shutter (GS) image into the query, the transformer with deformable attention is capable of producing the target GS image. To enhance the model's generalization ability, we propose to further self-supervise the model by cycling between time coordinate systems corresponding to RS images and GS images. Extensive evaluations over both synthetic and real datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches.

Abstract:
The domain of image restoration encompasses a wide array of highly effective models (e.g., SwinIR, CODE, DnCNN), each exhibiting distinct advantages in either efficiency or performance. Selecting and deploying these models necessitates careful consideration of resource limitations. While some studies have explored dynamic restoration through the integration of an auxiliary network within a unified framework, these approaches often fall short in practical applications due to the complexities involved in training, retraining, and hyperparameter adjustment, as well as the limitations of being entirely controlled by the auxiliary network and biased by the training data. To address these challenges, we introduce FlexIR: a flexible and manipulable framework for image restoration. FlexIR is distinguished by three components: a meticulously designed hierarchical branch network enabling dynamic output, an innovative progressive self-distillation process, and a channel-wise evaluation method to enhance knowledge distillation efficiency. Additionally, we propose two novel inference methodologies to fully leverage FlexIR, catering to diverse user needs and deployment contexts. Through this framework, FlexIR achieves unparalleled performance across all branches, allowing users to navigate the trade-offs between quality, cost, and efficiency during the inference phase. Crucially, FlexIR employs a dynamic mechanism powered by a non-learning metric independent of training data, ensuring that FlexIR is entirely under the direct control of the user. Comprehensive experimental evaluations validate FlexIR's flexibility, manipulability, and cost-effectiveness, showcasing its potential for straightforward adjustments and quick adaptations across a range of scenarios.

Abstract:
Video frame interpolation is a critical component of video streaming, a vibrant research area dealing with requests of both service providers and users. However, existing methods cannot handle changing video resolutions while improving user perceptual quality. We aim to unleash the multifaceted knowledge yielded by the hierarchical views at multiple scales in a pyramid network. Specifically, we build a dual-view pyramid network by introducing pyramidal dual-view correspondence matching. It compels each scale to actively seek knowledge in view of both the current scale and a coarser scale, conducting robust correspondence matching by considering neighboring scales. Meanwhile, an auxiliary multi-scale collaborative supervision is devised to enforce the exchange of knowledge among scales and thus reduce error propagation from coarse to fine scales. Based on the robust capture of video dynamics via pyramidal dual-view correspondence matching, we further construct a pyramidal refinement module that formulates frame refinement as progressive latent representation generations by developing flow-guided cross-scale attention for feature fusion among frames. The proposed method is able to improve the perceptual quality on several benchmarks of varying video resolutions, while keeping low distortion and a compact model size.

Abstract:
Image-to-image (i2i) translation has achieved notable success, yet remains challenging in scenarios like real-to-illustrative style transfer of fashion. Existing methods focus on enhancing the diversity of the generative model while lacking ID-preserved domain translation. This paper introduces a novel model named Uni-DlLoRA to address this limitation. The proposed model combines the original images within a pretrained diffusion-based model using the proposed Uni-adapter extractors, while adopting the proposed Dual-LoRA module to provide distinct style guidance. This approach optimizes generative capabilities and reduces the number of additional parameters required. In addition, a new multimodal dataset featuring higher-quality images with captions, built upon an existing real-to-illustration dataset, is proposed. Experiments validate the effectiveness of our proposed method.

Abstract:
We present MegaSurf, a Neural Surface Reconstruction (NSR) framework to reconstruct 3D models of large scenes from aerial images. Many methods utilize geometric cues to overcome the shape-radiance ambiguity, which would otherwise produce large geometric errors. However, directly using inevitably imprecise geometric cues leads to degradation in the reconstruction results, especially on large-scale scenes. To address this phenomenon, we propose a Learnable Geometric Guider (LG Guider) to learn a sampling field from reliable geometric cues. The LG Guider decides which positions should fit the input radiance and can be continuously refined by the rendering loss. Our MegaSurf uses a Divide-and-Conquer training strategy to address the synchronization issue between the Guider and the lagging NSR's radiance field. This strategy enables the Guider to transmit the information it carries to the radiance field without being disrupted by the gradients back-propagated from the lagging rendering loss at the early stage of training. Furthermore, we propose a Fast PatchMatch MVS module to derive the geometric cues in planar regions that help overcome ambiguity. Experiments on several aerial datasets show that MegaSurf can overcome ambiguity while preserving high-fidelity details. Compared to SOTA methods, MegaSurf achieves superior reconstruction accuracy on large scenes and speeds up the acquisition of geometric cues by more than four times.

Abstract:
Camouflaged instance segmentation is a challenging task due to the various aspects such as color, structure, lighting, etc., of object instances embedded in complex backgrounds. Although the current DETR-based scheme simplifies the pipeline, it suffers from a large number of object queries, leading to many false positive instances. To address this issue, we propose an adaptive query selection mechanism. Our research reveals that a large number of redundant queries scatter the extracted features of the camouflaged instances. To remove these redundant queries with weak correlation, we evaluate the importance of the object query from the perspectives of information entropy and volatility. Moreover, we observed that occlusion and overlapping instances significantly impact the accuracy of the selection mechanism. Therefore, we design a boundary location embedding mechanism that incorporates fake instance boundaries to obtain better location information for more accurate query instance matching. We conducted extensive experiments on two challenging camouflaged instance segmentation datasets, namely COD10K and NC4K, and demonstrated the effectiveness of our proposed model. Compared with the OSFormer, our model significantly improves the performance by 3.8% AP and 5.6% AP with less computational cost, achieving the state-of-the-art of 44.8 AP and 48.1 AP with ResNet-50 on the COD10K and NC4K test-dev sets, respectively.

Abstract:
The issue of face privacy protection has aroused wide social concern along with the increasing applications of face images. The latest methods focus on achieving a good privacy-utility tradeoff so that the protected results can still be used to support downstream computer vision tasks. However, they may suffer from limited flexibility in manipulating this tradeoff because the practical requirements may vary under different scenarios. In this paper, we present a novel recurrent latent representation reorganization (LReOrg) framework to deal with the problem. LReOrg relies on two key modules to deal with the privacy-utility tradeoff: the first is responsible for anonymizing the privacy-sensitive information, and the other is responsible for recovering the destroyed useful insensitive information according to user requirements. LReOrg is advantageous in: (a) enabling users to recurrently process fine-grained attributes; (b) providing flexible control over the privacy-utility tradeoff by manipulating which attributes to anonymize or preserve using cross-modal keywords; and (c) eliminating the need for data annotations for network training. The experimental results on benchmark datasets demonstrate the superior ability of our approach to provide flexible protection of facial information.

Abstract:
Recent advances have been witnessed in audio-language joint learning, with models such as CLAP showing much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be ignored, making such models ill-suited to explainability and fine-grained tasks, which may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Based on it, a locality-aware block is introduced to purify local patterns, and a hard-negative guided loss is devised to boost alignment. Experiments on eleven zero-shot coarse- and fine-grained tasks suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works.
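
The coarse-grained alignment that this abstract describes as the common baseline (pool local features into a global one, then apply a contrastive loss) can be sketched as follows. This is a minimal illustration of a generic symmetric contrastive objective, not the proposed model: mean pooling, the temperature of 0.07, and the toy two-dimensional features are assumptions, and the shared codebook, locality-aware block, and hard-negative guided loss are not shown.

```python
import math

def mean_pool(local_features):
    """Aggregate frame (or word) features into one global vector by averaging."""
    n = len(local_features)
    return [sum(col) / n for col in zip(*local_features)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(audio_globals, text_globals, temperature=0.07):
    """Average of audio-to-text and text-to-audio losses; pair i matches pair i."""
    a = [l2_normalize(v) for v in audio_globals]
    t = [l2_normalize(v) for v in text_globals]
    n = len(a)
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
           for i in range(n)]
    a2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2a = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (a2t + t2a)

# Toy batch of two clips, each with two frame features (hypothetical values).
audio = [mean_pool([[1.0, 0.0], [0.8, 0.2]]), mean_pool([[0.0, 1.0], [0.1, 0.9]])]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(audio, matched_text)
loss_swapped = symmetric_contrastive_loss(audio, swapped_text)
print(loss_matched, loss_swapped)
```

Because the loss only sees pooled vectors, two clips with different frame orderings but the same average are indistinguishable to it, which is the frame-level blind spot the abstract points out.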

Abstract:
We introduce the Self-Exemplar Illumination Equalization Network, designed specifically for effective portrait shadow removal. The core idea of our method is that partially shadowed portraits can find ideal exemplars within their non-shadowed facial regions. Rather than directly fusing two distinct classes of facial features, our approach utilizes non-shadowed regions as an illumination indicator to equalize the shadowed regions, generating deshadowed results without boundary-merging artifacts. Our network comprises cascaded Self-Exemplar Illumination Equalization Blocks (SExmBlock), each containing two modules: a self-exemplar feature matching module and a feature-level illumination rectification module. The former identifies and applies internal illumination exemplars to shadowed areas, producing illumination-corrected features, while the latter adjusts shadow illumination by reapplying the illumination factors from these features to the input face. Applying this series of SExmBlocks to shadowed portraits incrementally eliminates shadows and preserves clear, accurate facial details. The effectiveness of our method is demonstrated through evaluations on two public shadow portrait datasets, where it surpasses existing state-of-the-art methods in both qualitative and quantitative assessments.

Affiliations: State Key Laboratory of Media Convergence and Communication, Guangxi Zhuang Autonomous Region Information Center, Communication University of China, China ; Key Laboratory of Grain Information Processing and Control (Henan University of Technology), Ministry of Education, China ; School of Data Science and Media Intelligence, China ; School of Computer Science, Beijing University of Posts and Telecommunications & Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, China ; Chongqing Research Institute, Beijing University of Technology, China ; School of Information Science, North China University of Technology

Abstract:
Knowledge Graphs (KGs) serve as valuable auxiliary information to improve the accuracy of recommendation systems. Previous methods have leveraged the knowledge graph to enhance item representation and thus achieve excellent performance. However, these approaches heavily rely on high-quality knowledge graphs and learn enhanced representations with the assistance of carefully designed triplets. Furthermore, the emergence of knowledge graphs has led to models that ignore the inherent relationships between items and entities. To address these challenges, we propose a Self-Derived Knowledge Graph Contrastive Learning framework (CL-SDKG) to enhance recommendation systems. Specifically, we employ the variational graph reconstruction technique to estimate the Gaussian distribution of user-item nodes corresponding to the graph neural network aggregation layer. This process generates multiple KGs, referred to as self-derived KGs. The self-derived KG acquires more robust perceptual representations through the consistency of the estimated structure. Besides, the self-derived KG allows models to focus on user-item interactions and reduce the negative impact of miscellaneous dependencies introduced by conventional KGs. Finally, we apply contrastive learning to the self-derived KG to further improve the robustness of CL-SDKG through the traditional KG contrast-enhanced process. We conducted comprehensive experiments on three public datasets, and the results demonstrate that our CL-SDKG outperforms state-of-the-art baselines.

Abstract:
When people use agent characters to travel through different spaces (such as virtual scenes and real scenes, or different game spaces), it is important to reasonably position the characters in the new scene according to their personal characteristics. In this paper, we propose a novel pipeline for relocating virtual agents in new scenarios based on their personal characteristics. We first extract the characteristics of the characters (including figure, posture, and social distance). Then a cost function, which consists of a spatial term and a personalized term, is designed to evaluate the agent's position in the scene. Finally, a Markov Chain Monte Carlo optimization method is applied to search for the optimized solution. The results generated by our approach are evaluated through extensive user study experiments, verifying the effectiveness of our approach compared with alternative approaches.
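
The search step above can be pictured with a small Metropolis-style sampler over 2D positions. This is only a sketch under stated assumptions: the abstract does not define the cost terms or the sampler, so the spatial term (distance to a target spot), the personalized term (deviation from a preferred social distance), the weights, the Gaussian proposal, and the temperature are all hypothetical.

```python
import math
import random

def cost(pos, target, others, preferred_distance, w_personal=1.0):
    """Hypothetical cost: a spatial term plus a weighted personalized term."""
    spatial = math.dist(pos, target)
    personal = sum((math.dist(pos, o) - preferred_distance) ** 2 for o in others)
    return spatial + w_personal * personal

def metropolis_search(cost_fn, start, steps=5000, step_size=0.3,
                      temperature=0.05, seed=0):
    """Random-walk Metropolis sampling that remembers the lowest-cost position seen."""
    rng = random.Random(seed)
    current, current_cost = start, cost_fn(start)
    best, best_cost = current, current_cost
    for _ in range(steps):
        proposal = (current[0] + rng.gauss(0.0, step_size),
                    current[1] + rng.gauss(0.0, step_size))
        proposal_cost = cost_fn(proposal)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if (proposal_cost <= current_cost or
                rng.random() < math.exp((current_cost - proposal_cost) / temperature)):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Toy scene: one target spot and two other characters (hypothetical coordinates).
target = (2.0, 1.0)
others = [(0.0, 0.0), (3.0, 3.0)]

def scene_cost(pos):
    return cost(pos, target, others, preferred_distance=1.5)

start = (5.0, 5.0)
start_cost = scene_cost(start)
best, best_cost = metropolis_search(scene_cost, start)
print(best, best_cost)
```

Accepting occasional uphill moves is what lets the sampler leave local minima of a multi-term cost, which a plain greedy descent would not do.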

Abstract:
Bokeh is a wide-aperture optical effect that creates aesthetic blurring in photography. However, achieving this effect typically demands expensive professional equipment and expertise. To make such cinematic techniques more accessible, bokeh rendering aims to generate the desired bokeh effects from all-in-focus inputs captured by smartphones. Previous efforts in bokeh rendering primarily focus on static images. However, when extended to video inputs, these methods exhibit flicker and artifacts due to a lack of temporal consistency modeling. Meanwhile, they cannot utilize information from adjacent frames, such as occluded objects, which is necessary for bokeh rendering. Moreover, the difficulty of capturing all-in-focus and bokeh video pairs results in a shortage of data for training video bokeh models. To tackle these challenges, we propose the Video Bokeh Renderer (VBR), a model designed specifically for video bokeh rendering. VBR leverages implicit feature space alignment and aggregation to model temporal consistency and exploit complementary information from adjacent frames. On the data front, we introduce the first Synthetic Video Bokeh (SVB) dataset, synthesizing authentic bokeh effects using ray-tracing techniques. Furthermore, to improve the robustness of the model to inaccurate disparity maps, we employ a set of augmentation strategies to simulate corrupted disparity inputs during training. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our method.

Abstract:
Facial landmark detection forms the foundation for numerous face-related tasks. Recently, this field has gained substantial attention and made significant advancements. Nonetheless, detecting facial landmarks for stylized characters still remains a challenge. Existing approaches, which are mostly trained on real-human face datasets, struggle to perform well due to the structural variations between real and stylized characters. Additionally, a comprehensive dataset for analyzing stylized characters' facial features is lacking. This study proposes a novel dataset, the Facial Landmark Dataset for Stylized Characters (FLSC), which contains 2674 images and 4086 faces selected from 16 cartoon video clips, together with 98 landmarks per image, labeled by professionals. Besides, we propose StylizedFacePoint: a deep-learning-based method for stylized facial landmark detection that outperforms the existing approaches. This method has also proven to work well for characters with styles outside the training domain. Moreover, we outline two primary types of applications for our dataset and method. For each, we provide a detailed illustrative example.

Abstract:
The rapid advancement of deepfake technology poses significant threats to social trust. Although recent deepfake detectors have exhibited promising results on deepfakes of the same type as those present in training, their effectiveness degrades significantly on novel deepfakes crafted by unseen algorithms due to the gap in forgery patterns. Some studies have enhanced detectors by adapting to the continuously emerging deepfakes through incremental learning. Despite the progress, they overlooked the scarcity of novel samples that can easily lead to insufficient learning of forgery patterns. To mitigate this issue, we introduce the Dynamic Mixed-Prototype (DMP) model, which dynamically increases prototypes to adapt to novel deepfakes efficiently. Specifically, the DMP model adopts multiple prototypes to represent both real and fake classes, enabling learning novel patterns by expanding prototypes and jointly retaining knowledge learned in previous prototypes. Furthermore, we propose the Prototype-Guided Replay strategy and Prototype Representation Distillation loss, both of which effectively prevent forgetting learned knowledge based on the prototypical representation of samples. Our method surpasses existing incremental deepfake detectors across four datasets and can generalize to novel deepfakes by learning limited deepfake samples.

Abstract:
In this work, borrowing a solution from the large-scale vision-language models (VLMs) instead of directly removing modality-specific signals from visual features, we propose a novel Flexible Modal CLIP (FM-CLIP) for flexible modal FAS, that can utilize text features to dynamically adjust visual features to be modality independent. In the visual branch, considering the huge visual differences of the same attack in different modalities, which makes it difficult for classifiers to flexibly identify subtle spoofing clues in different test modalities, we propose Cross-Modal Spoofing Enhancer (CMS-Enhancer). It includes a Frequency Extractor (FE) and Cross-Modal Interactor (CMI), aiming to map different modal attacks in a shared frequency space to reduce interference from modality-specific signals and enhance spoofing clues by leveraging cross-modal learning from the shared frequency space. In the text branch, we introduce a Language-Guided Patch Alignment (LGPA) based on prompt learning, which further guides the image encoder to focus on patch-level spoofing representations through dynamic weighting by text features. Thus, our FM-CLIP can flexibly test different modal samples by identifying and enhancing modality-agnostic spoofing cues. Finally, extensive experiments show that FM-CLIP is effective and outperforms state-of-the-art methods on multiple multi-modal datasets.

Abstract:
Video harmonization aims to address the discrepancy in color and lighting between foreground and background elements within video compositions, thereby enhancing the innate coherence of composite video content. Nevertheless, existing methods struggle to effectively handle video composite tasks with excessively large-scale foregrounds. In this paper, we propose Video Harmonization Masked Autoencoders (VHMAE), a simple yet powerful end-to-end video harmonization method designed to tackle this challenge. Unlike typical MAE-based methods that employ random or tube masking strategies, we innovatively treat all foregrounds in each frame that require harmonization as prediction regions, which are designated as masked tokens and fed into our network to produce the final refined video. To this end, the network is optimized to prioritize the harmonization task, proficiently reconstructing the masked region despite the limited background information. Specifically, we introduce the Pattern Alignment Module (PAM) to extract content information from the extensive masked foreground region, aligning the latent semantic features of the masked foreground content with the background context while disregarding the impact of various colors or illumination. Moreover, we propose the Patch Balancing Loss, which effectively mitigates the undesirable grid-like artifacts commonly observed in MAE-based approaches for image generation, thereby ensuring consistency between the predicted foreground and the visible background. Additionally, we introduce a real-composited video harmonization dataset named RCVH, which serves as a valuable benchmark for assessing the efficacy of video harmonization techniques across different real video sources. Comprehensive experiments demonstrate that our VHMAE outperforms state-of-the-art techniques on both the RCVH and HYouTube datasets.

Abstract:
Whole-body motion imitation has gained wide attention in recent years as it can enhance the locomotive capabilities of humanoid robots. In this task, non-intrusive human motion capture with RGB cameras is commonly used for its low cost, efficiency, portability, and user-friendliness. However, RGB-based methods always face the problem of depth ambiguity, leading to inaccurate and unstable imitation. Accordingly, we propose to introduce pressure sensors into the non-intrusive humanoid motion imitation system for two reasons: first, pressure can be used to estimate the contact relationship and interaction force between the human and the ground, which play a key role in balancing and stabilizing motion; second, pressure can be measured in an almost non-intrusive manner, which preserves the experience of the human demonstrator. In this paper, we establish an RGB-Pressure (RGB-P) based humanoid imitation system, achieving accurate and stable end-to-end mapping from human body models to robot control parameters. Specifically, we use an RGB camera to capture human posture and pressure insoles to measure the underfoot pressure during the movements of the human demonstrator. Then, a constraint relationship between pressure and pose is studied to refine the estimated pose according to the support modes and balance mechanism, thereby enhancing consistency between human and robot motions. Experimental results demonstrate that fusing RGB and pressure can enhance overall robot motion execution performance by improving stability while maintaining imitation similarity.

Abstract:
In this paper, we propose a simple but effective illumination distribution prior (IDP) for images to illuminate the darkness. The illumination distribution prior is the product of a statistical approach to low-light images. It is based on a key factor - the mean value and standard deviation of images are positively correlated with the illumination. Using IDP in combination with the dual-domain feature fusion network (DFFN), we can obtain images that are more consistent with the ground truth distribution. DFFN inserts the discrete wavelet transform (DWT) into the transformer architecture, aiming to recover the detailed texture of the image through local high-frequency information and global spatial information. We have conducted extensive experiments on five widely used low-light image enhancement datasets and the experimental results show the superior performance of our proposed network (IDP-Net) compared to other state-of-the-art methods.
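The statistical observation behind the prior can be illustrated with a toy example. The sketch below is not the authors' IDP or DFFN; it assumes a deliberately simple illumination model in which darkening multiplies every pixel intensity by a constant gain, and the helper names `image_stats` and `scale_illumination` are hypothetical. Under that assumption, both the mean and the standard deviation shrink with the illumination.

```python
import statistics

def image_stats(pixels):
    """Return (mean, population standard deviation) of a flat list of intensities."""
    return statistics.fmean(pixels), statistics.pstdev(pixels)

def scale_illumination(pixels, gain):
    """Toy illumination model (an assumption): multiply every intensity by a gain."""
    return [p * gain for p in pixels]

# Hypothetical 8-pixel "image" with intensities in [0, 1].
well_lit = [0.10, 0.35, 0.50, 0.62, 0.70, 0.81, 0.90, 0.95]
low_light = scale_illumination(well_lit, 0.2)

mean_hi, std_hi = image_stats(well_lit)
mean_lo, std_lo = image_stats(low_light)

# Under the multiplicative model, both statistics scale with the gain.
print(f"well lit : mean={mean_hi:.3f} std={std_hi:.3f}")
print(f"low light: mean={mean_lo:.3f} std={std_lo:.3f}")
```

Real low-light images also contain noise and non-uniform lighting, so the relationship is a statistical tendency rather than the exact scaling shown here.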

Abstract:
News Captioning involves generating descriptions for news images based on the detailed content of related news articles. Given that these articles often contain extensive information not directly related to the image, captions may end up misaligned with the visual content. To mitigate this issue, we propose a novel cross-modal coherence-enhanced feedback prompting method to clarify the crucial elements that align closely with the visual content for news captioning. Specifically, we first adapt CLIP to develop a news-specific image-text matching module, enriched with insights from the language model MPNet using a matching-score comparative loss, which facilitates effective cross-modal knowledge distillation. This module enhances the coherence between images and each news sentence via rating confidence. Then, we design confidence-aware prompts to fine-tune the LLaVA model with the LoRA strategy, focusing on essential details in extensive articles. Lastly, we evaluate the generated news captions with the refined CLIP, constructing confidence-feedback prompts to further enhance LLaVA through feedback learning, which iteratively refines captions to improve their accuracy. Extensive experiments conducted on two public datasets, GoodNews and NYTimes800k, have validated the effectiveness of our method.

Abstract:
Multi-view clustering has emerged as an important unsupervised method to process unlabelled multi-view data that provides a comprehensive description of an object. Existing multi-view clustering methods focus on centralized settings but ignore the fact that real-world multi-view data may be distributed across different entities. The sensitive information embedded in multi-view data hinders the cooperative training of multi-view clustering, since data of different views cannot be directly shared, leading to a great challenge to cooperatively exploit the consistent and complementary information of different views. To validate the multi-view clustering in distributed scenarios, in this paper, we propose a novel federated multi-view method named Federated Multi-View Fuzzy C-means with Schatten-p Norm Minimization (FMVFCMSP) which is based on fuzzy C-means and tensor Schatten p-norm. Specifically, we utilize the membership degrees to replace conventional hard clustering assignment in K-means, enabling improved uncertainty handling and less information loss. Moreover, we introduce a tensor Schatten p-norm-based regularizer to fully explore the inter-view complementary information and global spatial structure. We also develop a federated optimization algorithm enabling clients to collaboratively learn the clustering results. Extensive experiments on several datasets demonstrate that our proposed method exhibits superior performance in federated multi-view clustering.
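The difference between membership degrees and hard K-means assignments can be sketched with the textbook fuzzy C-means membership update. This is only the standard single-view rule, not the federated FMVFCMSP algorithm or its Schatten p-norm regularizer; the function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the sample points are assumptions made for illustration.

```python
import math

def fcm_memberships(points, centers, m=2.0, eps=1e-12):
    """Standard fuzzy C-means membership update.

    Returns one row per point; each row holds that point's membership
    degree in every cluster, and each row sums to 1.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        # Euclidean distance to each center, floored at eps to avoid division by zero.
        dists = [max(math.dist(p, c), eps) for c in centers]
        row = [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
        memberships.append(row)
    return memberships

# Hypothetical 2-D points and two cluster centers.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
u = fcm_memberships(points, centers)

for p, row in zip(points, u):
    print(p, [round(v, 3) for v in row])
```

The point midway between the two centers receives equal membership in both clusters, which is the kind of uncertainty a hard assignment would discard.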

Abstract:
Polygonal meshes are widely used to represent complex geometries. However, the increasing complexity of models often leads to large meshes with millions of triangles, raising significant challenges for storage, transmission, and computation. Mesh simplification, a process of reducing the number of triangles in a mesh while preserving its overall shape and important features, has emerged as an indispensable technique to address these challenges. In this work, we focus on the problem of obtaining a visually consistent ultra-low-polygon mesh for complex meshes. Unlike previous methods, we design a robust simplification framework, SimpliGuard, to handle any meshes in the wild. Firstly, a reconstruction module is used to construct a low-polygon mesh with a similar shape but a manifold topology. Then, a texture initialization module is employed to quickly initialize the entire texture map. After that, a differentiable rendering module is utilized to optimize the overall structure and texture details, ensuring high-quality results. For meshes with skeletons, the correctness of motion can be preserved with our designed motion post-processing module. Experimental results demonstrate that SimpliGuard significantly outperforms previous methods and various featured software, including Blender and Simplygon.

Abstract:
Sound Effect (SFX) generation primarily aims to automatically produce sound waves for sounding visual objects in images or videos. Rather than learning an automatic solution to this task alone, we propose a much broader system, AutoSFX, designed to automate sound design for videos in a more efficient and applicable manner. AutoSFX aggregates multimodal representations via cross-attention and leverages a diffusion model to generate sound with visual information embedded. AutoSFX also optimizes the generated sounds to render the entire soundtrack for the input video, leading to a more immersive and engaging multimedia experience by performing seamless transitions between sound clips and harmoniously mixing sounds that play simultaneously. We have developed a user-friendly interface for AutoSFX that enables users to interactively engage in SFX generation for their videos according to their particular needs. To validate the capability of our vision-to-sound generation, we conducted comprehensive experiments and analyses using the widely recognized VEGAS and VGGSound test sets, yielding promising results. We also conducted a user study to evaluate the performance of the optimized soundtrack and the usability of the interface. Overall, the results revealed that our AutoSFX provides a viable sound landscape solution for making attractive videos.

Abstract:
Video Individual Counting (VIC), which focuses on accurately tallying the total number of individuals in a video without duplication, is crucial for urban public space management and the planning of densely populated areas. Existing methods are limited by expensive manual annotation and by the efficiency of localization or detection algorithms. In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both the similarity and the difference of adjacent frames into account to achieve accurate counting in an end-to-end regression manner. Specifically, we first design a multi-receptive field feature fusion module to acquire initial comprehensive representations. Subsequently, the dynamic prototype generation module memorizes consistent representations of similar information to generate prototypes. Additionally, to further extract the shared and private features from different frames, a prototype cross-guided decoder and a privacy-decoupling module are designed. Extensive experiments conducted on two existing VIC datasets consistently demonstrate the superiority of PDTR over state-of-the-art baselines.

Abstract:
Subjective experiments driven by human visual perception for images aid in the development of related technologies such as analysis, compression, and transmission, and have garnered substantial attention and research interest. However, general-purpose subjective experiment platforms are scarce, and verifying the reliability of annotation data is often difficult. In response to these challenges, an open source subjective experiment platform, namely OpenSEP, is proposed in this paper. Specifically, OpenSEP mainly includes a contrast mode sub-platform that displays dual stimuli, allowing for the simultaneous display of both the source stimulus and the distorted stimulus for subjective testing. Moreover, a scoring mode sub-platform that displays a single stimulus is also provided in OpenSEP. In this mode, subjects can only score each distorted stimulus individually after sequentially viewing the source stimulus and all the distorted stimuli. In addition, OpenSEP provides a cross-validation sub-platform that integrates mainstream Just Noticeable Distortion (JND) algorithms. Within this sub-platform, the reliability of subjective annotation data can be verified based on existing JND algorithms, and the effectiveness of newly proposed modeling algorithms can also be validated. The open source library for OpenSEP is available at https://openi.pcl.ac.cn/OpenDatasets/OpenSEP

Abstract:
In this work, we present a video editing chatbot (VEC) that performs intelligent multimedia editing through natural language dialogue. VEC comprises three modules: instruction analysis, multimedia resources retrieval, and multimedia resources editing. It analyzes user instructions to retrieve relevant multimedia resources from the multimedia database (MMDB), and then applies appropriate editing methods from the multimedia toolbase (MMTB) automatically. To enhance user experience and simplify operation, VEC uses a multi-turn dialogue mechanism to handle complex editing tasks.

Abstract:
We propose an interactive and intelligent hybrid teleconferencing system compatible with Virtual Reality devices. Our system understands meeting contexts and leverages user interactions to improve system configuration. Employing interactive scene graphs [11], the system extracts and transmits essential meeting context to users while relaying user interactions back to the streaming systems for user-involved adaptive streaming and foveated rendering. We demonstrate the system's real-time performance and compatibility with commercial VR devices such as the Meta Quest 3.

Abstract:
Micro-actions convey the emotions of characters in daily communication and offer richer semantic information compared to conventional actions. Accurate detection of these micro-actions is essential for video understanding. Due to their short duration, low intensity, and high overlap, micro-actions require more detailed video features, presenting a significant challenge for accurate detection. To address these challenges, we propose the 3D-SENet Adapter, which aggregates spatio-temporal information and enables end-to-end online video feature learning. We also find that incorporating background information significantly enhances the detection of small-scale micro-actions. Thus, we develop the Cross-Attention Aggregation Detection Head, which integrates multi-scale features within the feature pyramid, thereby improving the detection accuracy of micro-actions occupying small regions in video frames. Our approach achieves first place in Multi-label Micro-Action Detection (MMAD) and second place in Micro-Action Recognition (MAR) at the Micro-Action Analysis Grand Challenge.

Abstract:
Micro-actions are spontaneous body movements that indicate a person's true feelings and potential intentions, and micro-action recognition is important in human behavior analysis. Yet, recognizing micro-actions is challenging because they are subtle and appear for a very short time compared to normal actions. In this paper, we propose a micro-action recognition framework based on Hierarchical Fusion and Inference (HiFI) to capture subtle multimodal information. Specifically, we first hierarchically integrate multimodal local and global information, including the 2D key-points of faces, hands and bodies, the depth information, and the RGB image sequences. Afterward, both 3D-CNNs and Transformers are used to effectively capture local and long-range dependence. Finally, we propose a novel from-fine-to-coarse (F2C) inference strategy, based on hybrid ensemble of multi-branches, to boost the accuracy and credibility of coarse action recognition. Our solution ranked 4th in the MAC Challenge Track 1.

Abstract:
Recent years have seen a revolution in the creation of synthetic multimedia content. Algorithms can now generate truly convincing images, videos, text and audio capable of fooling any human being. Alongside the possible beneficial uses of this type of technology, we must highlight the danger of its misuse for criminal or fraudulent activities. Deepfakes stand out as an example of a potentially dangerous use of these technologies, since they facilitate identity theft and the generation of misinformation. Current solutions are not capable of detecting this type of fake content with sufficient reliability. Therefore, it is crucial to develop new algorithms that solve this problem. This paper presents two methods focusing on the classification and localization of deepfake videos, taking into account audio and visual information. These methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the highest score in the temporal localization task and a top-five ranking in the classification task.

Abstract:
Social media popularity prediction aims to automatically forecast the future popularity of posts by leveraging vast amounts of social media data. This data encompasses diverse visual and textual content, including photos, categories, custom tags, temporal information, and geographical data. Existing methods have explored multiple feature types to enhance popularity prediction. Despite their success, visual and textual features, both crucial pieces of information, are often simply concatenated after extraction, ignoring the divergence between these two feature spaces. In this paper, we propose a method to project visual and language information into an aligned semantic representation, thereby uncovering intricate associations between these two modalities. Specifically, we leverage the BLIP-2 model to understand and generate visual description text that encapsulates the content of photos. Semantic embeddings are then extracted from all available visual and textual information. Additionally, we deeply exploit user-related behavior and characteristic information to extract features, uncovering hidden clues for post popularity prediction. Leveraging these improvements, we conduct extensive experiments to demonstrate the effectiveness of our proposed method.

Abstract:
AI assistants have been in our lives since the 1990s. Building AI agents from LLMs has changed the way we do research, development and deployment of assistants as agents. Agents are AI systems with various degrees of reasoning and planning that can anticipate user needs, autonomously figure out the steps to fulfil such needs, and access knowledge and tools external to the LLM to accomplish user tasks. They are the new AI assistants in the LLM era. In this talk, I will give an overview of the evolution of AI assistants to AI agents, and outline the challenges we still face today and the opportunities ahead for developing a Family of Agents that spans different modalities and interfaces. As different model architectures emerge in the future, LLMs will remain an effective tool for building agents, replacing manual design.

Abstract:
Icons are ubiquitous visual elements in graphic design, yet their creation is often complex and time-consuming. To resolve this problem, we draw inspiration from the booming text-to-image field and propose Text-Guided Icon Set Expansion, a novel task that helps users design high-quality icons using textual descriptions. Besides, users can control the style consistency of the created icons by inputting a few hand-crafted icons as style reference. Despite its practicality, the task poses two unique challenges. (i) Abstract Concept Visualization. Abstract concepts like technology and health are frequently encountered in icon creation, but their visualization is not straightforward and requires a grounding process that translates them into physical, easy-to-depict objects. (ii) Fine-grained Style Transfer. Unlike ordinary images, icons exhibit richer fine-grained stylistic elements, including tones, line widths, shapes, shadow effects, etc., which puts higher demands on capturing and preserving detailed styles during icon generation.

Abstract:
Face anti-spoofing (FAS) based on domain generalization (DG) has attracted increasing attention from researchers. The reason for the poor generalization is that the model is overfitted to salient liveness-irrelevant signals. However, the previous methods alleviate the overfitting by mapping the images from multiple domains into a common feature space or promoting the separation of image features from domain-specific features and task-related features. If the text features of vision-language pre-trained (VLP) models (e.g., CLIP) are used to dynamically adjust the image features to gain a better generalization, we can not only explore a wider feature space but also avoid the potential degradation of semantic information. Specifically, we propose a FAS method of Style-Conditional Prompt Token Learning (S-CPTL), which aims to generate generalized text features by training the introduced prompt tokens to carry visual styles and use them as weights for classifiers to improve the model's generalization. Compared to the inherently static prompt token, we propose the dynamic prompt token, which can adaptively capture live-irrelevant signals from the instance-specific styles and increase their diversity through mixed feature statistics to further reduce the overfitting of the model. Thorough experimental analysis demonstrates that S-CPTL exceeds current top-performing methods in four distinct cross-dataset benchmarks.

Abstract:
Speaker extraction aims to selectively extract the target speaker from the multi-talker environment under the guidance of auxiliary reference. Recent studies have shown that the attended speaker's information can be decoded by the auditory attention decoding from the listener's brain activity. However, how to more effectively utilize the common information about the target speaker contained in both electroencephalography (EEG) and speech is still an unresolved problem. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to make full use of the speech information, the mixed speech is encoded with multiple time scales so that the multi-scale embeddings are acquired. In addition, to effectively extract the non-Euclidean data of EEG, the graph convolutional networks are used as the EEG encoder. Finally, these multi-scale embeddings are separately fused with the EEG features. To facilitate research related to auditory attention decoding and further validate the effectiveness of the proposed method, we also construct the AVED dataset, a new EEG-Audio dataset. Experimental results on both the public Cocktail Party dataset and the newly proposed AVED dataset in this paper show that our MSFNet model significantly outperforms the state-of-the-art method in certain objective evaluation metrics.

Abstract:
Emotion recognition based on electroencephalogram (EEG) has garnered increasing attention in recent years due to the non-invasiveness and high reliability of EEG measurements. Despite the promising performance achieved by numerous existing methods, several challenges persist. Firstly, there is the challenge of emotional label noise, stemming from the assumption that emotions remain consistently evoked and stable throughout the entirety of video observation. Such an assumption proves difficult to uphold in practical experimental settings, leading to discrepancies between EEG signals and anticipated emotional states. In addition, there is a need to comprehensively capture the temporal-spatial-spectral characteristics of EEG signals and to cope with their low signal-to-noise ratio (SNR). To tackle these challenges, we propose a comprehensive pipeline named REmoNet, which leverages novel self-supervised techniques and multi-regularized co-learning. Two self-supervised methods, masked channel modeling via temporal-spectral transformation and emotion contrastive learning, are introduced to facilitate the comprehensive understanding and extraction of emotion-relevant EEG representations during pre-training. Additionally, fine-tuning with multi-regularized co-learning exploits feature-dependent information through intrinsic similarity, thereby mitigating emotional label noise. Experimental evaluations on two public datasets demonstrate that our proposed approach, REmoNet, surpasses existing state-of-the-art methods, showcasing its effectiveness in simultaneously addressing raw EEG signals and noisy emotional labels.

Abstract:
Multi-modality physiological signal-based emotion recognition has attracted increasing attention owing to its capacity to capture human affective states comprehensively. Due to multi-modality heterogeneity and cross-subject divergence, practical applications struggle with generalizing models across individuals. Effectively addressing both issues requires mitigating the gap between multimodal signals while acquiring generalizable representations across subjects. However, existing approaches often handle these dual challenges separately, resulting in suboptimal generalization. This study introduces a novel framework, termed Correlation-Driven Multi-Modality Graph Decomposition (CMMGD). The proposed CMMGD initially captures adaptive cross-modal correlations. It connects each unimodal graph to a multimodal mixed graph. To simultaneously address the dual challenges, it incorporates a correlation-driven graph decomposition module that decomposes the mixed graph into concordant and discrepant subgraphs based on the correlations. The decomposed concordant subgraph encompasses consistently activated features across modalities and subjects during emotion elicitation, unveiling a generalizable subspace. Additionally, we design a Multi-Modality Graph Regularized Transformer (MGRT) backbone specifically tailored for multimodal physiological signals. The MGRT can alleviate the over-smoothing issue and mitigate over-reliance on any single modality. Extensive experiments demonstrate that CMMGD outperforms the state-of-the-art methods by 1.79% and 2.65% on the DEAP and MAHNOB-HCI datasets, respectively, under the leave-one-subject-out cross-validation strategy.

Abstract:
With the increasing popularity of online social applications, stickers have become common in online chats. Teaching a model to select the appropriate sticker from a set of candidate stickers based on dialogue context is important for optimizing the user experience. Existing methods have proposed leveraging emotional information to facilitate the selection of appropriate stickers. However, considering the frequent co-occurrence among sticker images, words with emotional preference in the dialogue and emotion labels, these methods tend to over-rely on such dataset bias, inducing spurious correlations during training. As a result, these methods may select inappropriate stickers that do not match users' intended expression. In this paper, we introduce a causal graph to explicitly identify the spurious correlations in the sticker selection task. Building upon the analysis, we propose a Causal Knowledge-Enhanced Sticker Selection model to mitigate spurious correlations. Specifically, we design a knowledge-enhanced emotional utterance extractor to identify emotional information within dialogues. Then an interventional visual feature extractor is employed to obtain unbiased visual features, aligning them with the emotional utterances representation. Finally, a standard transformer encoder fuses the multimodal information for emotion recognition and sticker selection. Extensive experiments on the MOD dataset show that our CKS model significantly outperforms the baseline models.

Abstract:
Accurate segmentation of cerebrovascular structures from Time-of-flight magnetic resonance angiography is vital for treating cerebrovascular diseases. However, existing methods rely on voxel categorization, leading to discontinuities in fine vessel locations. We propose a connectivity-based cerebrovascular segmentation method that considers inter-voxel relationships to alleviate this limitation. By modeling connectivity, we convert voxel classification into inter-voxel connectivity prediction. Given the sparse and widely distributed nature of cerebrovascular structures, we employ a sparse 3D Bi-level routing attention to effectively capture cerebrovascular features. To extract inter-voxel directional information, we utilize a 3D direction excitation block. Additionally, a 3D direction interactive block continuously enhances the directional information within the feature map. This directional enhanced feature map is then concatenated with the feature map from the encoder layer and fed into the decoder layer. We compare our method with current state-of-the-art cerebrovascular segmentation techniques and general medical image segmentation methods. Our method achieved a Dice Similarity Coefficient of 92.413% and 83.197% on the clinical and open cerebrovascular datasets respectively, outperforming existing approaches.
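For reference, the Dice Similarity Coefficient reported above is the standard overlap measure 2|P ∩ T| / (|P| + |T|) between a predicted mask P and a ground-truth mask T. The sketch below computes it for flattened binary masks; the function name `dice_coefficient`, the toy masks, and the convention for two empty masks are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical flattened masks: 1 marks a vessel voxel, 0 marks background.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(f"Dice = {score:.4f}")
```

Here two voxels overlap and each mask contains three vessel voxels, so the score is 4/6.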

Abstract:
In recent years, diffusion models have dominated the field of image generation with their outstanding generation quality. However, pre-trained large-scale diffusion models are generally trained using fixed-size images, and fail to maintain their performance at different aspect ratios. Existing methods for generating arbitrary-size images based on diffusion models face several issues, including the requirement for extensive finetuning or training, sluggish sampling speed, and noticeable edge artifacts. This paper presents the InstantAS method for arbitrary-size image generation. This method performs non-overlapping minimum coverage segmentation on the target image, minimizing the generation of redundant information and significantly improving sampling speed. To maintain the consistency of the generated image, we also propose the Inter-Domain Distribution Bridging method to integrate the distribution of the entire image and suppress the separation of diffusion paths in different regions of the image. Furthermore, we propose the dynamic semantic guided cross-attention method, allowing for the control of different regions using different semantics. Experimental results show that InstantAS has better fusion capabilities than previous arbitrary-size image generation methods and is far faster in sampling.

Abstract:
Cross-resolution person re-identification (CR-ReID) aims to match images of the same person with different resolutions in different scenarios. Existing CR-ReID methods achieve promising performance by relying on large-scale manually annotated identity labels. However, acquiring manual labels requires considerable human effort, greatly limiting the flexibility of existing CR-ReID methods. To address this issue, we propose a dual-resolution fusion modeling (DRFM) framework to tackle the CR-ReID problem in an unsupervised manner. Firstly, we design a cross-resolution pseudo-label generation (CPG) method, which initially clusters high-resolution images and then obtains reliable identity pseudo-labels by fusing class vectors in both resolution spaces. Subsequently, we develop a cross-resolution feature fusion (CRFF) module to fuse features from both high-resolution and low-resolution spaces. The fusion features have the potential to serve as a new form of resolution-invariant features. Finally, we introduce cross-resolution contrastive loss and probability sharpening loss in DRFM to facilitate resolution-invariant learning and effectively utilize ambiguous samples for optimization. Experimental results on multiple CR-ReID datasets demonstrate that the proposed DRFM not only outperforms existing unsupervised methods but also approaches the performance of early supervised methods.

Abstract:
Recently, there has been significant progress in leveraging human feedback to enhance diffusion-based image generation, garnering considerable interest and attention. However, existing methods fail to achieve a fine-grained performance boost because of two challenges: i) an insufficient amount of fine-grained feedback data; and ii) the lack of an effective fine-grained feedback learning framework. To tackle these challenges, we present TreeReward to facilitate fine-grained feedback optimization for diffusion models. Specifically, to address the limitation of the fine-grained feedback data, we first design a novel "AI + Expert" feedback data construction pipeline, yielding a high-quality feedback dataset of about 2.2M entries encompassing six fine-grained dimensions at a relatively low cost. Built upon this dataset, we introduce a tree-structure reward model to exploit the fine-grained feedback data efficiently and provide tailored optimization during feedback learning. We validate the feedback learning performance of our method across different fine-grained dimensions and various downstream tasks. Extensive experiments on both Stable Diffusion v1.5 (SD1.5) and Stable Diffusion XL (SDXL) demonstrate the effectiveness of our method in enhancing general and fine-grained generation as well as generalization to downstream tasks.

Abstract:
Semantic edge detection (SED) is pivotal for the precise demarcation of object boundaries, yet it faces ongoing challenges due to the prevalence of low-quality labels in current methods. In this paper, we present a novel solution to bolster SED through the encoding of both language and image data. Distinct from antecedent language-driven techniques, which predominantly utilize static elements such as dataset labels, our method taps into the dynamic language content that details the objects in each image and their interrelations. By encoding this varied input, we generate integrated features that utilize semantic insights to refine the high-level image features and the ultimate mask representations. This advancement improves the quality of these features and elevates SED performance. Experimental evaluation on benchmark datasets, including SBD and Cityscape, showcases the efficacy of our method, achieving leading ODS F-scores of 79.0 and 76.0, respectively. Our approach signifies a notable advancement in SED technology by seamlessly integrating multimodal textual information, embracing both static and dynamic aspects.

Abstract:
Insufficient labeled training samples pose a critical challenge in multi-label classification, potentially leading to overfitting of the model. This paper delineates a criterion for establishing a common domain among different datasets, whereby datasets sharing analogous object descriptions and label structures are considered part of the 'same field'. Integrating samples from disparate datasets within this shared field for training purposes effectively mitigates overfitting and enhances model accuracy. Motivated by this approach, we introduce a novel method for multi-label classification termed Non-Overlapped Multi-View Weak-Label Learning Guided by Multiple Correlations (NOMWM). Our method strategically amalgamates samples from diverse datasets within the shared field to enrich the training dataset. Furthermore, we project samples from various datasets onto a unified subspace to facilitate learning in a consistent latent space. Additionally, we address the challenge of weak labels stemming from incomplete label overlaps across datasets. Leveraging weak-label indicator matrices and label correlation mining techniques, we effectively mitigate the impact of weak labels. Extensive experimentation on multiple benchmark datasets validates the efficacy of our method, demonstrating clear improvements over existing state-of-the-art approaches.

Abstract:
Simultaneously mapping and exploring a complex unknown scene is an NP-hard problem that remains challenging despite the rapid development of deep learning techniques. We present CSO, a deep reinforcement learning-based framework for efficient active scene mapping. Constraint-guided space optimization is adopted for both the state and critic spaces to reduce the difficulty of finding the globally optimal exploration path and to avoid long-distance round trips while exploring. We first feed the frontier-based entropy, together with the raw observation, into the network as an input constraint, which guides training to start by imitating local greedy search. However, entropy-based optimization can easily get stuck in a few local optima or cause inefficient round trips, since the entropy space and the real world do not share the same metric. Inspired by constrained reinforcement learning, we then introduce an action mask-based optimization constraint to align the metrics of these two spaces. Exploration optimization in the aligned spaces avoids long-distance round trips more effectively. We evaluate our method with a ground robot in 29 complex indoor scenes of different scales. On average, our method achieves 19.16% higher exploration efficiency and 3.12% higher exploration completeness than state-of-the-art alternatives. We also deploy our method in real-world scenes, where it efficiently explores an area of 649 m^2. The experiment video can be found in the supplementary material.

Abstract:
Partially View-aligned Clustering (PVC) presents a challenge as it requires a comprehensive exploration of complementary and consistent information in the presence of partial alignment of view data. Existing PVC methods typically learn view correspondence based on latent features that are expected to contain common semantic information. However, latent features obtained from heterogeneous spaces, along with the enforcement of alignment into the same feature dimension, can introduce cross-view discrepancies. In particular, partially view-aligned data lacks sufficient shared correspondences for the critical common semantic feature learning, resulting in inaccuracies in establishing meaningful correspondences between latent features across different views. While feature representations may differ across views, instance relationships within each view could potentially encode consistent common semantics across views. Motivated by this, our aim is to learn view correspondence based on graph distribution metrics that capture semantic view-invariant instance relationships. To achieve this, we utilize similarity graphs to depict instance relationships and learn view correspondence by aligning semantic similarity graphs through optimal transport with graph distribution. This facilitates the precise learning of view alignments, even in the presence of heterogeneous view-specific feature distortions. Furthermore, leveraging well-established cross-view correspondence, we introduce a cross-view contrastive learning to learn semantic features by exploiting consistency information. The resulting meaningful semantic features effectively isolate shared latent patterns, avoiding the inclusion of irrelevant private information. We conduct extensive experiments on several real datasets, demonstrating the effectiveness of our proposed method for the PVC task.
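
The idea of recovering cross-view correspondence through optimal transport can be illustrated with a minimal sketch. The code below is not the authors' method: it runs plain entropic-regularized optimal transport (Sinkhorn iterations) on a hypothetical cost matrix whose entry (i, j) stands for the dissimilarity between instance i in one view and instance j in another. How that cost is derived from similarity graphs is not specified in the abstract, so the numbers here are made up for illustration.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropic optimal transport between two uniform distributions.

    cost[i][j] is an assumed dissimilarity between instance i of view A
    and instance j of view B. Returns the transport plan as a nested list.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on view A instances
    b = [1.0 / m] * m  # uniform mass on view B instances
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost: the true alignment is 0 -> 0, 1 -> 2, 2 -> 1.
cost = [[0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0]]
plan = sinkhorn(cost)
# Read off a hard correspondence by taking the largest entry in each row.
alignment = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

Taking the row-wise maximum of the plan is only one simple way to turn a soft transport plan into a hard alignment.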

Abstract:
Long-tailed recognition (LTR) aims to learn balanced models from extremely unbalanced training data. Fine-tuning pretrained foundation models has recently emerged as a promising research direction for LTR. However, we observe that the fine-tuning process tends to degrade the intrinsic representation capability of pretrained models and lead to model bias towards certain classes, thereby hindering the overall recognition performance. To unleash the intrinsic representation capability of pretrained foundation models, in this work, we propose a new Parameter-Efficient Complementary Expert Learning (PECEL) for LTR. Specifically, PECEL consists of multiple experts, where individual experts are trained via Parameter-Efficient Fine-Tuning (PEFT) and encouraged to learn different expertise on complementary sub-categories via the proposed sample-aware logit adjustment loss. By aggregating the predictions of different experts, PECEL effectively achieves a balanced performance on long-tailed classes. Nevertheless, learning multiple experts generally introduces extra trainable parameters. To ensure parameter efficiency, we further propose a parameter sharing strategy which decomposes and shares the parameters in each expert. Extensive experiments on 4 LTR benchmarks show that the proposed PECEL can effectively learn multiple complementary experts without increasing the trainable parameters and achieve new state-of-the-art performance.
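
For readers unfamiliar with logit adjustment, the following is a minimal sketch of the standard class-prior version, in which each logit is shifted by the log of its class frequency before the softmax cross-entropy is computed. The sample-aware variant proposed in the abstract is not described there and is not reproduced here; the function names and the `tau` parameter are illustrative assumptions.

```python
import math

def plain_cross_entropy(logits, label):
    """Standard softmax cross-entropy for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def logit_adjusted_cross_entropy(logits, label, class_counts, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior).

    class_counts holds the number of training samples per class; rare
    classes receive a larger loss for the same raw logits.
    """
    total = sum(class_counts)
    adjusted = [z + tau * math.log(n / total)
                for z, n in zip(logits, class_counts)]
    return plain_cross_entropy(adjusted, label)

# Two classes with a 9:1 imbalance and identical raw logits.
head_loss = logit_adjusted_cross_entropy([0.0, 0.0], 0, [900, 100])
tail_loss = logit_adjusted_cross_entropy([0.0, 0.0], 1, [900, 100])
```

With balanced class counts the shift is the same for every class, so the adjusted loss reduces to the plain cross-entropy.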

Abstract:
Composed image retrieval is a challenging task that aims to identify the matching image given a multi-modal query consisting of a reference image and a textual modifier. Most existing methods are devoted to composing unified query representations from the query images and texts, yet they neglect the distribution gaps between the hybrid-modal query representations and the visual target representations. However, directly incorporating target features into the query may cause ambiguous rankings and poor robustness, owing to insufficient exploration of distinctions and to overfitting. To address these concerns, we propose a novel framework termed SemAntic Distillation from Neighborhood (SADN) for composed image retrieval. To mitigate the distribution divergences, we sample a neighborhood from the target domain for each query and aggregate the neighborhood features with adaptive weights to restructure the query representations. Specifically, the adaptive weights are determined jointly by two modules: correspondence-induced adaption and divergence-based correction. Correspondence-induced adaption captures the correlation alignments from neighbor features under the guidance of the positive representations, while divergence-based correction regulates the weights based on the embedding distances between hard negatives and the query in the latent space. Extensive results and ablation studies on CIRR and FashionIQ validate that the proposed semantic distillation from neighborhood significantly outperforms baseline methods.

Abstract:
Camouflaged instance segmentation (CIS) aims to detect and segment objects that blend with their surroundings. Existing CIS methods rely heavily on fully-supervised training with massive amounts of precisely annotated data, which consumes considerable annotation effort, yet they still struggle to segment highly camouflaged objects accurately. Despite their visual similarity to the background, camouflaged objects differ from it semantically. Since text associated with images offers explicit semantic cues that underscore this difference, we propose TPNet, the first Text-Prompt based weakly-supervised camouflaged instance segmentation method, which leverages semantic distinctions for effective segmentation. TPNet operates in two stages: pseudo mask generation and self-training. In the first stage, we align text prompts with images using a language-image model to obtain region proposals containing camouflaged instances. A Semantic-Spatial Iterative Fusion module is designed to combine spatial information with semantic insights, iteratively refining the pseudo masks. In the second stage, Graduated Camouflage Learning, a self-training strategy, orders training from simple to complex images based on camouflage levels, facilitating an effective easy-to-hard learning progression. Through the collaboration of the two stages, our method achieves a significant advancement in comprehensive experiments on two common benchmarks, delivering a novel solution that bridges the gap between weak supervision and highly camouflaged instance segmentation.

Abstract:
As social networks grow exponentially, there is an increasing demand for video retrieval using natural language. Cross-modal hashing, which encodes multi-modal data using compact hash codes, has been widely used in large-scale image-text retrieval, primarily due to its computation and storage efficiency. When applied to video-text retrieval, existing unsupervised cross-modal hashing methods extract frame- or word-level features individually and thus ignore long-term dependencies. In addition, effectively exploiting the multi-modal structure is a considerable challenge owing to the complex nature of video and text. To address these issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. SPTCH encodes video and text with a bidirectional transformer encoder that exploits their long-term dependencies. SPTCH constructs a multi-modal collaborative graph to model correlations among multi-modal data, and applies semantic aggregation by employing a Graph Convolutional Network (GCN) on this graph. SPTCH designs an unsupervised multi-modal contrastive loss and a neighborhood reconstruction loss to effectively leverage the inter- and intra-modal similarity structure among videos and texts. The empirical results on three video benchmark datasets illustrate that the proposed SPTCH generally outperforms state-of-the-art methods in video-text retrieval.
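
The storage and computation benefit of hashing comes from comparing short binary codes with Hamming distance. The sketch below shows only that generic retrieval step, assuming real-valued embeddings are binarized by sign; the embeddings are invented for illustration and the learned SPTCH encoders and losses are not modeled.

```python
def binarize(embedding):
    """Map a real-valued embedding to a tuple of bits by sign."""
    return tuple(1 if x >= 0 else 0 for x in embedding)

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def rank_by_hamming(query_code, database_codes):
    """Database indices sorted by Hamming distance to the query."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming(query_code, database_codes[i]))

# Hypothetical 4-dimensional embeddings from a text and a video encoder.
text_code = binarize([0.7, -0.2, 0.1, -0.9])
video_codes = [binarize(e) for e in ([-0.5, 0.3, -0.1, 0.8],
                                     [0.6, -0.1, 0.4, -0.3],
                                     [0.2, 0.5, 0.9, -0.4])]
ranking = rank_by_hamming(text_code, video_codes)  # [1, 2, 0]
```

Real systems pack the bits into machine words and use XOR plus popcount, but the ranking logic is the same.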

Abstract:
The Conversational Recommendation System (CRS) aims to capture user dynamic preferences and provide item recommendations based on multi-turn conversations. However, effectively modeling these dynamic preferences faces challenges due to conversational limitations, which mainly manifests as limited turns in a conversation (quantity aspect) and low compliance with queries (quality aspect). Previous studies often address these challenges in isolation, overlooking their interconnected nature. The fundamental issue underlying both problems lies in the potential abrupt changes in user preferences, to which CRS may not respond promptly. We acknowledge that user preferences are influenced by temporal factors, serving as a bridge between conversation quantity and quality. Therefore, we propose a more comprehensive CRS framework called Time-aware User-preference Tracking for Conversational Recommendation System (TUT4CRS), leveraging time dynamics to tackle both issues simultaneously. Specifically, we construct a global time interaction graph to incorporate rich external information and establish a local time-aware weight graph based on this information to adeptly select queries and effectively model user dynamic preferences. Extensive experiments on two real-world datasets validate that TUT4CRS can significantly improve recommendation performance while reducing the number of conversation turns.

Abstract:
This paper presents a pioneering method for teaching computer sketching that transforms input images into sequential, parameterized strokes. However, two challenges are raised for this sketching task: weak stimuli during stroke decomposition and maintaining semantic correctness, stylistic consistency, and detail integrity in the final drawings. To tackle the challenge of weak stimuli, our method incorporates an attention agent, which enhances the algorithm's sensitivity to subtle canvas changes by focusing on smaller, magnified areas. Moreover, in enhancing the perceived quality of drawing outcomes, we integrate a sketching style feature extractor to seamlessly capture semantic information and execute style adaptation at feature level, alongside a drawing agent that decomposes strokes under the guidance of a fine-grained reward, thereby ensuring the integrity of sketch details. Based on dual intelligent agents, we have constructed an efficient sketching model. Experimental results attest to the superiority of our approach in both visual effects and perceptual metrics when compared to state-of-the-art techniques, confirming its efficacy in achieving realistic sketching.

Abstract:
Unanticipated domain shifts can severely degrade model performance, prompting the need for model adaptation techniques (i.e., Source-free Domain Adaptation (SFDA)) to adapt a model to new domains without accessing source data. However, existing SFDA methods often sacrifice source domain performance to improve adaptation on the target, limiting overall model capability. In this paper, we focus on a more challenging paradigm in semantic segmentation, Generalized SFDA (G-SFDA), aiming to achieve robust performance on both source and target domains. To achieve this, we propose a novel G-SFDA framework, Reliable Knowledge Propagation (RKP), for semantic segmentation tasks, which leverages the text-to-image diffusion model to propagate reliable semantic knowledge from the segmentation model. The key of RKP lies in aggregating the predicted reliable but scattered segments into a complete semantic layout and using them to activate the diffusion model for conditional generation. Subsequently, diverse images with multiple domain factors can be synthesized to retrain the segmentation model. This enables the segmentation model to learn domain-invariant knowledge across multiple domains, improving its adaptability to target domain, maintaining discriminability to source domain, and even handling unseen domains. Our model-agnostic RKP framework establishes new state-of-the-art across current SFDA segmentation benchmarks, significantly advancing various SFDA methods. The code will be open source.

Abstract:
Arbitrary style transfer aims to render artistic features from a style reference onto an image while retaining its original content. Previous methods either focus on learning the holistic style from a specific artist or extracting instance features from a single artwork. However, they often fail to apply style elements uniformly across the entire image and lack adaptation to the style of different artworks. To solve these issues, our key insight is that the art genre has better generality and adaptability than the overall features of the artist. To this end, we propose a Dual-head Genre-instance Transformer (DGiT) framework to simultaneously capture the genre and instance features for arbitrary style transfer. To the best of our knowledge, this is the first work to integrate the genre features and instance features to generate a high-quality stylized image. Moreover, we design two contrastive losses to enhance the capability of the network to capture two style features. Our approach ensures the uniform distribution of the overall style across the stylized image while enhancing the details of textures and strokes in local regions. Qualitative and quantitative evaluations demonstrate that our approach exhibits superior visual quality and efficiency.

Abstract:
With the increasing spatial and temporal resolutions of obtained remote sensing (RS) images, effective compression becomes critical for storage, transmission, and large-scale in-memory processing. Although image compression methods achieve a series of breakthroughs for daily images, a straightforward application of these methods to RS domain underutilizes the properties of the RS images, such as content duplication, homogeneity, and temporal redundancy. This paper proposes a Spatial-Temporal Context model (STCM) for RS image compression, jointly leveraging context from a broader spatial scope and across different temporal images. Specifically, we propose a stacked diagonal masked module to expand the contextual reference scope, which is stackable and maintains its parallel capability. Furthermore, we propose spatial-temporal contextual adaptive coding to enable the entropy estimation to reference context across different temporal RS images at the same geographic location. Experiments show that our method outperforms previous state-of-the-art compression methods on rate-distortion (RD) performance. For downstream tasks validation, our method reduces the bitrate by 52 times for single temporal images in the scene classification task while maintaining accuracy.

Abstract:
Traditional multi-task learning often relies on explicit task interaction mechanisms to enhance multi-task performance. However, these approaches encounter challenges such as negative transfer when jointly learning multiple weakly correlated tasks. Additionally, these methods handle encoded features at a large scale, which escalates computational complexity to ensure dense prediction task performance. In this study, we introduce a Task-Interaction-Free Network (TIF) for multi-task learning, which diverges from explicitly designed task interaction mechanisms. Firstly, we present a Scale Attentive-Feature Fusion Module (SAFF) to enhance each scale in the shared encoder to have rich task-agnostic encoded features. Subsequently, our proposed task and scale-specific decoders efficiently decode the enhanced features shared across tasks without necessitating task-interaction modules. Concretely, we utilize a Self-Feature Distillation Module (SFD) to explore task-specific features at lower scales and the Low-To-High Scale Feature Diffusion Module (LTHD) to diffuse global pixel relationships from low-level to high-level scales. Experiments on publicly available multi-task learning datasets validate that our TIF attains state-of-the-art performance.

Abstract:
Underwater images, often plagued by complex degradation, pose significant challenges for image enhancement. To address these challenges, the paper redefines underwater image enhancement as an image decomposition problem and proposes a deep invertible neural network (INN) that accurately predicts both the latent image and the degradation effects. Instead of using an explicit formation model to describe the degradation process, the INN adheres to the constraints of the image decomposition model, providing necessary regularization for model training, particularly in the absence of supervision on degradation effects. Taking into account the diverse scales of degradation factors, the INN is structured on a multi-scale basis to effectively manage the varied scales of degradation factors. Moreover, the INN incorporates several asymmetric design elements that are specifically optimized for the decomposition model and the unique physics of underwater imaging. Comprehensive experiments show that our approach provides significant performance improvement over existing methods.

Abstract:
In this work, we introduce a novel approach to single-source domain generalization (SDG) in medical imaging, focusing on overcoming the challenge of style variation in out-of-distribution (OOD) domains without requiring domain labels or additional generative models. We propose a Universal Frequency Perturbation framework for SDG termed as UniFreqSDG, that performs hierarchical feature-level frequency domain perturbations, facilitating the model's ability to handle diverse OOD styles. Specifically, we design a learnable spectral perturbation module that adaptively learns the frequency distribution range of samples, allowing for precise low-frequency (LF) perturbation. This adaptive approach not only generates stylistically diverse samples but also preserves domain-invariant anatomical features without the need for manual hyperparameter tuning. Then, the frequency features before and after perturbation are decoupled and recombined through the Content Preservation Reconstruction operation, effectively preventing the loss of discriminative content information. Furthermore, we introduce the Active Domain-variance Inducement Loss to encourage effective perturbation in the frequency domain while ensuring the sufficient decoupling of domain-invariant and domain-style features. Extensive experiments demonstrate that UniFreqSDG increases the dice score by an average of 7.47% (from 77.98% to 85.45%) on the fundus dataset and 4.99% (from 71.42% to 76.73%) on the prostate dataset compared to the state-of-the-art approaches.
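
The Dice score reported above is the standard overlap measure for segmentation, 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a target mask T. A minimal sketch follows, assuming masks are represented as sets of pixel coordinates; this representation and the convention for two empty masks are illustrative choices, not details from the abstract.

```python
def dice_score(pred_pixels, target_pixels):
    """Dice coefficient between two masks given as sets of (row, col) pixels."""
    if not pred_pixels and not target_pixels:
        return 1.0  # assumed convention: two empty masks agree perfectly
    overlap = len(pred_pixels & target_pixels)
    return 2.0 * overlap / (len(pred_pixels) + len(target_pixels))

# Toy masks that share two of their three pixels.
pred = {(0, 0), (0, 1), (1, 1)}
target = {(0, 1), (1, 1), (1, 0)}
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```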

Abstract:
Nowadays, the abuse of AI-generated content (AIGC), especially the facial images known as deepfake, on social networks has raised severe security concerns, which might involve the manipulations of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision transformers (ViTs) and perform cross-modal interaction to expose the audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., the Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which could bridge the domain gap between audio and visual modalities. The ACI is then used to well learn the inherent audio-visual relationships by cross-modal interaction. Extensive experiments demonstrate that the proposed framework could outperform other state-of-the-art multimodal deepfake detection methods under various circumstances.

Abstract:
One of the serious impacts brought by artificial intelligence is the abuse of deepfake techniques. Despite the proliferation of deepfake detection methods aimed at safeguarding the authenticity of media across the Internet, they mainly consider the improvement of detector architecture or the synthesis of forgery samples. The forgery perceptions, including the feature responses and prediction scores for forgery samples, have not been well considered. As a result, the generalization across multiple deepfake techniques always comes with complicated detector structures and expensive training costs. In this paper, we shift the focus to real-time perception analysis in the training process and generalize deepfake detectors through an efficient method dubbed Forgery Perception Guidance (FPG). In particular, after investigating the deficiencies of forgery perceptions, FPG adopts a sample refinement strategy to pertinently train the detector, thereby elevating the generalization efficiently. Moreover, FPG introduces more sample information as explicit optimizations, which makes the detector further adapt the sample diversities. Experiments demonstrate that FPG improves the generality of deepfake detectors with small training costs, minor detector modifications, and the acquirement of real data only. In particular, our approach not only outperforms the state-of-the-art on both the cross-dataset and cross-manipulation evaluation but also surpasses the baseline that needs more than 3× training time.

Abstract:
Portrait video editing has attracted wide attention thanks to its practical applications. Existing methods either target fixed-length clips or perform temporally inconsistent per-frame editing. In this work, we present a brand new system, StreamEdit, which is primarily designed to edit streaming videos. Our system follows the ideology of editing propagation to ensure temporal consistency. Concretely, we choose to edit only one reference frame and warp the outcome to obtain the editing results of other frames. For this purpose, we employ a warping module, aided by a probabilistic pixel correspondence estimation network, to help establish the pixel-wise mapping between two frames. However, such a pipeline requires the reference frame to contain all contents appearing in the video, which is scarcely possible especially when there exist large motions and occlusions. To address this challenge, we propose to adaptively replace the reference frame, benefiting from a heuristic strategy referring to the overall pixel mapping uncertainty. That way, we can easily align the editing of the before- and after-replacement reference frames via image inpainting. Extensive experimental results demonstrate the effectiveness and generalizability of our approach in editing streaming portrait videos. Code will be made public.

Abstract:
Face swapping, the technique of transferring the identity from one face to another, emerges as a field with significant practical applications. However, previous swapping methods often result in visible artifacts. To address this issue, we propose CodeSwap, a symmetrical framework that achieves face swapping with high fidelity and realism. Specifically, our method first utilizes a codebook that captures the knowledge of high-quality facial features. Building on this foundation, face swapping is then converted into a code manipulation task in a code space. To achieve this, we design a Transformer-based architecture that updates each code independently, which enables more precise manipulations. Furthermore, we incorporate a mask generator to achieve seamless blending of the generated face with the background of the target image. A distinctive characteristic of our method is its symmetrical approach to processing both target and source images, simultaneously extracting information from each to improve the quality of face swapping. This symmetry also simplifies the bidirectional exchange of faces in a single operation. Through extensive experiments on CelebA-HQ and FF++, our method is shown not only to achieve efficient identity transfer but also to substantially reduce visible artifacts.

Abstract:
Diffusion models have garnered significant success in generative tasks, emerging as the predominant model in this domain. Despite their success, the substantial computational resources required for training diffusion models restrict their practical applications. In this paper, we resort to the optimal transport theory to accelerate the training of diffusion models, providing an in-depth analysis of the forward diffusion process. It shows that the upper bound on the Wasserstein distance of the distribution between any two timesteps in the diffusion process is an exponential decrease of the initial distance by a factor of times. This finding suggests that the state distribution of the diffusion model has a non-uniform rate of change at different points in time, thus highlighting the different importance of the diffusion timestep. To this end, we propose a novel non-uniform timestep sampling method based on the Bernoulli distribution, which favors more frequent sampling in significant timestep intervals. The key idea is to make the model focus on timesteps with larger differences, thus accelerating the training of the diffusion model. Experiments on benchmark datasets reveal that the proposed method significantly reduces the computational overhead while improving the quality of the generated images.
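
One simple way to realize a Bernoulli-driven non-uniform timestep sampler is sketched below. It is an assumption-laden illustration, not the paper's algorithm: a Bernoulli trial with probability `p_focus` decides whether the timestep is drawn from a designated interval or from the remaining timesteps. The interval bounds, `p_focus`, and all names are hypothetical, and the abstract does not say which timesteps the authors treat as significant.

```python
import random

def sample_timestep(num_steps, focus_start, focus_end, p_focus, rng):
    """Draw a timestep in [0, num_steps).

    With probability p_focus the draw is uniform over the focus interval
    [focus_start, focus_end); otherwise it is uniform over the remaining
    timesteps. Assumes the focus interval is a proper subset of the range.
    """
    width = focus_end - focus_start
    if rng.random() < p_focus:  # Bernoulli(p_focus) trial
        return rng.randrange(focus_start, focus_end)
    k = rng.randrange(num_steps - width)
    # Skip over the focus interval when mapping k back to a timestep.
    return k if k < focus_start else k + width

rng = random.Random(0)
draws = [sample_timestep(1000, 0, 200, 0.7, rng) for _ in range(10000)]
focus_fraction = sum(t < 200 for t in draws) / len(draws)
```

Here roughly 70% of training steps land in the first 200 timesteps, compared with 20% under uniform sampling.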

Abstract:
Video Virtual Try-On aims to transfer a garment onto a person in the video. Previous methods typically focus on image-based virtual try-on, but directly applying these methods to videos often leads to temporal discontinuity due to inconsistencies between frames. Limited attempts in video virtual try-on also suffer from unrealistic results and poor generalization ability. In light of previous research, we posit that the task of video virtual try-on can be decomposed into two key aspects: (1) single-frame results are realistic and natural, while retaining consistency with the garment; (2) the person's actions and the garment are coherent throughout the entire video. To address these two aspects, we propose a novel two-stage framework based on Latent Diffusion Model, namely Garment-Preserving Diffusion for Video Virtual Try-On (GPD-VVTO). In the first stage, the model is trained on single-frame data to improve the ability of generating high-quality try-on images. We integrate both low-level texture features and high-level semantic features of the garment into the denoising network to preserve garment details while ensuring a natural fit between the garment and the person. In the second stage, the model is trained on video data to enhance temporal consistency. We devise a novel Garment-aware Temporal Attention (GTA) module that incorporates garment features into temporal attention, enabling the model to maintain the fidelity to the garment during temporal modeling. Furthermore, we collect a video virtual try-on dataset containing high-resolution videos from diverse scenes, addressing the limited variety of current datasets in terms of video background and human actions. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods in both image-based and video-based virtual try-on tasks, indicating the effectiveness of our proposed framework.

Abstract:
Federated Learning (FL) is an emerging direction in distributed machine learning that enables jointly training a global model without sharing data with the server. However, data heterogeneity biases the parameter aggregation at the server, leading to slower convergence and poorer accuracy of the global model. To cope with this, most existing works enforce regularization in local optimization or improve the model aggregation scheme at the server. Though effective, they lack a deep understanding of cross-client features. In this paper, we propose a saliency latent space feature aggregation method (FedSLS) across federated clients. Using Guided BackPropagation (GBP), we transform deep models into powerful and flexible visual fidelity encoders, applicable to general state inputs across different image domains, and achieve powerful aggregation in the form of saliency latent features. Notably, since GBP is label-insensitive, it is sufficient to capture saliency features only once on each client. Experimental results demonstrate that FedSLS leads to significant improvements in accuracy over the state of the art, especially in highly heterogeneous settings. For example, on the CIFAR-10 dataset, FedSLS achieves 63.43% accuracy in the strongly heterogeneous environment α=0.05, which is 6% to 23% higher than other baselines.
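
Heterogeneity levels such as α=0.05 are commonly produced in federated learning benchmarks by splitting each class across clients according to a symmetric Dirichlet distribution, where smaller α gives more skewed clients. The abstract does not state that this is how α is defined in the paper, so the sketch below should be read as an illustration of that common convention only; the function names are hypothetical.

```python
import random

def dirichlet_sample(alpha, k, rng):
    """Sample a k-dimensional probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]

def partition_by_label(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients, class by class, with Dirichlet shares."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += shares[c]
            # The last client takes whatever remains so that no index is lost.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            end = max(end, start)
            clients[c].extend(idxs[start:end])
            start = end
    return clients

# Toy label vector: 10 classes with 100 samples each.
labels = [i % 10 for i in range(1000)]
parts = partition_by_label(labels, num_clients=5, alpha=0.05)
```

Every sample index is assigned to exactly one client, while the per-client class mix depends on the drawn shares.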

Abstract:
Fine-grained video color enhancement delivers superior visual results by making precise adjustments to specific areas of the frame, maintaining more natural color relationships compared to global enhancement techniques. However, dynamically applying these specific enhancements can lead to flickering artifacts and unsatisfactory color blending at object boundaries, issues caused by the coarse and unstable masks produced by current video segmentation algorithms. To overcome these challenges, we introduce MovingColor, featuring a novel self-supervised training approach that leverages large-scale video datasets. This approach redefines color fusion as a generation process using original full-frame textures and color editing information from non-edge areas. We address spatio-temporal inconsistencies with a spectral-spatial hybrid encoder that captures multi-scale spatial and frequency features, thus enhancing color adjustments in complex scenes. Additionally, our global-local feature propagation module, incorporating Transformer blocks, consolidates spatio-temporal contexts to ensure consistency among frames. Both quantitative and subjective evaluations validate the effectiveness of MovingColor in delivering state-of-the-art spatio-temporal consistency for video color enhancements, adhering closely to the intended color editing operations. These results demonstrate that MovingColor can effectively enhance fine-grained video color grading, making it more efficient and accessible to a wider range of users. Please refer to the project page for code and models: https://yidong.pro/projects/movingcolor/.

Abstract:
Audio-Visual Event (AVE) Localization aims to identify and classify video segments that are both audible and visible, a field that has seen substantial progress in recent years. Existing methods operate under a closed-set assumption and struggle to recognize unknown events in open-world scenarios. To better adapt to real-life applications, we introduce the Open Set Audio-Visual Event Localization task and propose a novel and effective network called OpenAVE based on evidential deep learning. To the best of our knowledge, this is the first effort to address this challenge. Our approach encompasses deep evidential AVE classification and event-relevant prediction, targeting the nuanced demands of open-set environments. The deep evidential AVE classification manages event classification uncertainty by extracting class evidence from segment-specific representations enriched with multi-scale context. To effectively distinguish between unknown events and background segments, event-relevant prediction utilizes positive-unlabeled learning. Furthermore, a learnable Gaussian-prior prediction branch is adopted to enhance the performance of event-relevant prediction. Experimental results demonstrate that OpenAVE significantly outperforms state-of-the-art models on the Audio-Visual Event dataset, confirming the effectiveness of our proposed method.

Abstract:
Correspondence pruning has recently drawn considerable attention as a crucial step in image matching. Existing methods typically achieve this by constructing neighborhoods for each feature point and imposing neighborhood consistency. However, the nearest-neighbor matching strategy often results in numerous many-to-one correspondences, thereby reducing the reliability of neighborhood information. Furthermore, the smoothness constraint fails in cases of large-scale rotations, leading to misjudgments. To address the above issues, this paper proposes a novel robust correspondence pruning method termed RoSe, which is based on rotation-invariant sequence-aware consensus. We formulate the correspondence pruning problem as a mathematical optimization problem and derive a closed-form solution. Specifically, we devise a rectified local neighborhood construction strategy that effectively enlarges the distribution between inliers and outliers. Meanwhile, to accommodate large-scale rotation, we propose a relative sequence-aware consistency as an alternative to existing smoothness constraints, which can better characterize the topological structure of inliers. Experimental results on image matching and registration tasks demonstrate the effectiveness of our method. Robustness analysis involving diverse feature descriptors and varying rotation degrees further showcases the efficacy of our method.

Abstract:
Scanpath generation in 360° images aims to model the realistic trajectories of gaze points that viewers follow when exploring panoramic environments. Existing methods for scanpath generation suffer from various limitations, including a lack of global attention to panoramic environments, insufficient diversity in generated scanpaths, and inadequate consideration of the temporal sequence of gaze points. To address these challenges, we propose a novel approach, named ScanTD, which employs a conditional Diffusion Model-based method to generate multiple scanpaths. Notably, a transformer-based time-series (TTS) module with a novel attention mechanism is integrated into ScanTD to capture the temporal dependency of gaze points effectively. Additionally, ScanTD utilizes a Vision Transformer-based method for image feature extraction, enabling better learning of scene semantic information. Experimental results demonstrate that our approach outperforms state-of-the-art methods across three datasets. We further demonstrate its generalizability by applying it to the 360° saliency detection task.

Abstract:
This paper introduces a novel approach to Image Quality Assessment (IQA) by presenting a new loss function, Dual-Criterion Quality (DCQ) Loss, which integrates the Mean Squared Error (MSE) framework with a Relative Perception Constraint (RPC). The RPC comprises two main components: the Quantitative Discrepancy Constraint (QDC) and the Qualitative Alignment Constraint (QAC). The QDC focuses on capturing the numerical relationships of relative differences by minimizing the mean squared error between the differences in predicted scores among samples within a batch and the corresponding differences in Mean Opinion Scores (MOS). Meanwhile, the QAC aims to capture the ordinal relationships between these differences. This method is designed to closely align with human subjective assessments of image quality, which are frequently quantified using the MOS, and to enhance the interpretability and reliability of IQA. Unlike existing ranking methods that suffer from complex pipelines and the introduction of errors through the generation of pair-wise or ordering data, DCQ Loss provides a more straightforward and efficient approach. Moreover, the loss function outperforms current rank-based IQA methods in terms of convergence, stability, and the ability to emulate human perception of visual quality. The effectiveness of this approach is validated through extensive experiments on various mainstream datasets and IQA network architectures, demonstrating significant performance gains over traditional rank loss approaches and contributing to the ongoing development of IQA.
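
As a rough illustration of the loss described above, the sketch below combines an MSE term with a QDC term (the mean squared error between pairwise predicted-score differences and pairwise MOS differences within a batch) and a QAC term. The abstract does not give the form of the QAC or the weighting of the terms, so the hinge-style ordinal penalty, the `margin` parameter, and the weights `w_qdc` and `w_qac` are assumptions made only for illustration; `dcq_loss` is a hypothetical name.

```python
from itertools import combinations


def dcq_loss(pred, mos, w_qdc=1.0, w_qac=1.0, margin=0.0):
    """Illustrative DCQ-style loss over one batch of scores (plain floats)."""
    n = len(pred)
    if n < 2:
        raise ValueError("need at least two samples to form pairs")

    # Absolute term: ordinary MSE between predictions and MOS.
    mse = sum((p - m) ** 2 for p, m in zip(pred, mos)) / n

    pairs = list(combinations(range(n), 2))
    qdc = 0.0
    qac = 0.0
    for i, j in pairs:
        d_pred = pred[i] - pred[j]
        d_mos = mos[i] - mos[j]
        # QDC: predicted differences should match MOS differences numerically.
        qdc += (d_pred - d_mos) ** 2
        # QAC (assumed hinge form): predicted difference should have the
        # same sign as the MOS difference.
        sign = (d_mos > 0) - (d_mos < 0)
        qac += max(0.0, margin - sign * d_pred)
    qdc /= len(pairs)
    qac /= len(pairs)

    return mse + w_qdc * qdc + w_qac * qac
```

In this sketch, predictions that are offset from the MOS by a constant are penalized only by the MSE term, because both relative terms depend solely on differences between samples.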

Abstract:
Due to the limited permissions for upgrading dual-side (i.e., server-side and client-side) loss tolerance schemes from the perspective of CDN vendors in a multi-supplier market, modern large-scale live streaming services are still using the automatic-repeat-request (ARQ) based paradigm for loss recovery, which only requires server-side modifications. In this paper, we first conduct a large-scale measurement study with up to 50 million live streams. We find that loss shows dynamics and live streaming contains frequent on-off mode switching in the wild. We further find that the recovery latency, enlarged by the ubiquitous retransmission loss, is a critical factor affecting live streaming's client side QoE (e.g., video freezing). We then propose an enhanced recovery mechanism called AutoRec, which can transform the disadvantages of on-off mode switching into an advantage for reducing loss recovery latency without any modifications on the client side. AutoRec also adopts an online learning-based policy to fit the dynamics of loss, balancing the tradeoff between the recovery latency and the incurred overhead. We implement AutoRec upon QUIC and evaluate it via both testbed and real-world commercial services deployments. The experimental results demonstrate the practicability and profitability of AutoRec, in which the average times and duration of client-side video freezing can be lowered by 11.4% and 5.2%, respectively.

Abstract:
The integration of large language models into open-world detection frameworks significantly improves versatility in new environments. Prompt representations derived from these models help establish classification boundaries for both base and novel categories within open-world detectors. However, we are the first to discover that directly fine-tuning language models in detection systems results in redundant attention patterns and leads to suboptimal prompt representations. In order to fully leverage the capabilities of large language models and augment prompt encoding for detection, this study introduces a redundancy assessment metric to identify uniform attention patterns. Furthermore, in areas with high redundancy, we incorporate multimodal inplace prompt tuning (MIPT) to enrich the text prompt with visual clues. Experimental results validate the efficacy of our MIPT framework, achieving a notable increase across benchmarks, e.g. elevating GLIP-L from 22.6% to 25.0% on ODinW-35, and 9.0% improvement on LVIS.

Abstract:
Weakly-supervised Temporal Action Localization (WTAL) following a localization-by-classification paradigm has achieved significant results, yet still grapples with confounding arising from ambiguous snippets. Previous works have attempted to distinguish these ambiguous snippets from action snippets without investigating the underlying causes of their formation, thus failing to effectively eliminate the bias on both action-context and action-content. In this paper, we revisit WTAL from the perspective of structural causal model to identify the true origins of confounding, and propose an efficient dual-confounding eliminating framework to alleviate these biases. Specifically, we construct a Substituted Confounder Set (SCS) to eliminate the confounding bias on action-context by leveraging the modal disparity between RGB and FLOW. Then, a Multi-level Consistency Mining (MCM) method is designed to mitigate the confounding bias on action-content by utilizing the consistency between discriminative snippets and corresponding proposals at both the feature and label levels. Notably, SCS and MCM could be seamlessly integrated into any two-stream models without additional parameters by Expectation-Maximization (EM) algorithm. Extensive experiments on two challenging benchmarks including THUMOS14 and ActivityNet-1.2 demonstrate the superior performance of our method.

Abstract:
We propose a method for lighting and shadow editing of outdoor disharmonious composite images, including foreground harmonization and cast shadow generation. Most existing works either perform only foreground appearance editing or focus only on shadow generation. In fact, lighting not only affects the brightness and color of objects, but also produces corresponding cast shadows. In recent years, diffusion models have demonstrated strong generative capabilities, and their iterative denoising properties give them a significant advantage in image restoration tasks. However, they fail to preserve the content structure of the image. To this end, we propose an effective model to tackle the problem of foreground lighting-shadow editing. Specifically, we use a coarse shadow prediction module (SP) to generate coarse shadows for foreground objects. Then, we use the predicted results as prior knowledge to guide the generation of the harmony diffusion model. In this process, the primary task is to learn lighting variation to harmonize foreground regions, and the secondary task is to generate high-quality cast shadows containing more details. Considering that existing datasets do not support the dual tasks of image harmonization and shadow generation, we construct a real outdoor dataset, named IH-SG, covering various lighting conditions. Extensive experiments conducted on existing benchmark datasets and the IH-SG dataset demonstrate the superiority of our method.

Abstract:
In recent years, the Few-Shot Fine-Grained Image Classification (FS-FGIC) problem has gained widespread attention. A number of effective methods have been proposed that focus on extracting discriminative information within high-level features in a single episode/task. However, this is insufficient for addressing the cross-task challenges of FS-FGIC, which is represented in two aspects. On the one hand, from the perspective of the Fine-Grained Image Classification (FGIC) task, there is a need to supplement the model with mid-level features containing rich fine-grained information. On the other hand, from the perspective of the Few-Shot Learning (FSL) task, explicit modeling of cross-task general knowledge is required. In this paper, we propose a novel Bi-directional Task-Guided Network (BTG-Net) to tackle these issues. Specifically, from the FGIC task perspective, we design the Semantic-Guided Noise Filtering (SGNF) module to filter noise on mid-level features rich in detailed information. Further, from the FSL task perspective, the General Knowledge Prompt Modeling (GKPM) module is proposed to retain the cross-task general knowledge by utilizing the prompting mechanism, thereby enhancing the model's generalization performance on novel classes. We have conducted extensive experiments on five fine-grained benchmark datasets, and the results demonstrate that BTG-Net outperforms state-of-the-art methods comprehensively.

Abstract:
Multi-View Clustering (MVC) aims to mine complementary information across different views to partition multi-view data more effectively and has attracted considerable interest. However, existing deep multi-view clustering methods frequently neglect the exploration of structural information within individual views and lack the learning of structural consistency among views, which limits clustering performance. In this paper, we introduce a novel multi-view clustering framework based on graph consistency learning to address this issue. Specifically, we design intra-view graph contrastive learning to uncover structural information within each view and achieve structural consistency objectives through cross-view graph consistency learning. Additionally, to address the conflict between different learning objectives when trained in the same space, we introduce two new feature spaces, one for cluster-level contrastive learning and the other for instance-level contrastive learning. Subsequently, to make the most of discriminative information from all views, we concatenate high-level features from all views to form global features and employ self-supervision to promote clustering consistency across different views. Experimental results on several challenging datasets demonstrate the outstanding performance of our proposed method.

Abstract:
The rapid development of multi-media techniques boosts the emergence of multi-view data, and uncovering its intrinsic structure and using it in downstream tasks is crucial in data analysis. Multi-view clustering is a representative approach to handling multi-view data. The anchor-based method has received widespread attention for its excellent performance and low time complexity. However, existing methods have two drawbacks that limit their performance, i.e., the assumption that all views are available and the limited interaction of anchor generation among views. In some scenarios, views arrive sequentially, and storing them is challenging owing to limited space or privacy considerations, so existing anchor-based MVC is unsuitable. Additionally, recent works fail to generate anchors with the guidance of other views, which makes it difficult to align the anchor graphs. To this end, we propose A Lightweight Anchor-Based Incremental Framework for Multi-view Clustering. Specifically, we first initialize an anchor graph with the assistance of k-means when a new view arrives. Then, the consensus anchor graph is updated with the newly collected view via a permutation matrix. Our proposed method is more capable of anchor alignment because, in incremental MVC, the anchor graphs of previous views can serve as a reference to guide the generation of the anchor graph of the incoming view. Furthermore, we design a three-step iterative and convergent algorithm to address the resultant problem. Notably, the proposed algorithm shows outstanding effectiveness and time/space efficiency in extensive experiments.

Abstract:
The core of active semi-supervised crowd counting is the sample selection criteria. However, the scale factor has been neglected in active learning approaches despite the fact that the scale of heads varies drastically in the crowd images. In this paper, we propose a simple yet effective active labeling strategy to explicitly select informative unlabeled images, guided by the intra-scale uncertainty and inter-scale inconsistency metrics. The intra-scale uncertainty is quantified through the sum of the query-level entropy of images at different scales. Images are initially ranked based on this uncertainty for preselection. Inter-scale inconsistency is measured by the divergence between the query-level predictions of upscaled and downscaled images, allowing for the identification of the most informative images exhibiting the highest inconsistency. Additionally, we implement a progressive updating scheme for the semi-supervised crowd counting framework, in which the pseudo-labels for unlabeled images are refined iteratively. It further improves the counting accuracy. Through extensive experiments on widely used benchmarks, the proposed approach has demonstrated superior performance compared to previous state-of-the-art semi-supervised and active semi-supervised crowd counting methods.
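
The two-stage selection described above can be sketched as follows. This is a minimal illustration under several assumptions that the abstract does not state: each query-level prediction is a probability distribution, queries from the upscaled and downscaled images are aligned one-to-one, the "divergence" is a symmetric KL divergence, and all function names and the `k_pre`/`k_final` budgets are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy of one query-level probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def intra_scale_uncertainty(preds_by_scale):
    """Sum of query-level entropies over all scales of one image."""
    return sum(entropy(q) for scale in preds_by_scale for q in scale)


def kl(p, q, eps=1e-12):
    """KL(p || q), with a small epsilon to avoid division by zero."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def inter_scale_inconsistency(up, down):
    """Symmetric KL between aligned queries of upscaled and downscaled inputs."""
    return sum(kl(p, q) + kl(q, p) for p, q in zip(up, down))


def select_images(images, k_pre, k_final):
    """Preselect by uncertainty, then keep the most inconsistent images."""
    preselected = sorted(
        images,
        key=lambda im: intra_scale_uncertainty([im["up"], im["down"]]),
        reverse=True,
    )[:k_pre]
    ranked = sorted(
        preselected,
        key=lambda im: inter_scale_inconsistency(im["up"], im["down"]),
        reverse=True,
    )
    return [im["name"] for im in ranked[:k_final]]


# Toy pool: each image has one query with a two-class distribution per scale.
pool = [
    {"name": "A", "up": [[0.5, 0.5]], "down": [[0.5, 0.5]]},
    {"name": "B", "up": [[0.9, 0.1]], "down": [[0.1, 0.9]]},
    {"name": "C", "up": [[1.0, 0.0]], "down": [[1.0, 0.0]]},
]
chosen = select_images(pool, k_pre=2, k_final=1)
```

In the toy pool, image C is confident and consistent, so it is dropped at preselection; of the remaining two, B is chosen because its predictions disagree across scales.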

Abstract:
Multilingual text recognition (MLTR) is increasingly essential for facilitating cultural communication. However, existing methods often struggle to retain previous knowledge when learning new languages. A straightforward solution is to perform incremental learning (IL) on MLTR tasks. However, it ignores the words and characters shared across incremental languages, which we are the first to term the incremental sharing problem. Motivated by this observation, we propose a HierArchical Multi-label learning framework for Multilingual tExt Recognition, termed HAMMER. An online knowledge analysis is designed to identify shared knowledge and provide corresponding multi-label language supervision. Specifically, only words and characters appearing simultaneously in multiple languages are considered shared knowledge. Additionally, to further capture language dependencies, we introduce a hierarchical language evaluation mechanism to predict language scores at word and character levels. These scores, supervised by the knowledge analysis, guide the specific recognizers to effectively utilize both old and new language knowledge, thereby mitigating catastrophic forgetting caused by imbalanced rehearsal sets. Extensive experiments conducted on benchmark datasets, MLT17 and MLT19, show that HAMMER exhibits remarkable results and outperforms other state-of-the-art approaches.

Abstract:
The remarkable success of neural radiance fields in low-level vision tasks such as novel view synthesis has motivated its extension to high-level semantic understanding, giving rise to the concept of the neural semantic field (NeSF). NeSF aims to simultaneously synthesize novel view images and associated semantic segmentation maps. Generalizable NeSF, in particular, is an appealing direction as it can generalize to unseen scenes for synthesizing images and semantic maps for novel views, thereby avoiding the need for tedious per-scene optimization. However, existing approaches to generalizable NeSF fall short in fully exploiting the geometric and semantic features as well as their mutual interactions, resulting in suboptimal performance in both novel-view image synthesis and semantic segmentation. To address this limitation, we propose Geometry-Semantics Synergy for Generalized Neural Semantic Fields (GS2-GNeSF), a novel approach aimed at improving the performance of generalizable NeSF through the comprehensive construction and synergistic interaction of geometric and semantic features. In GS2-GNeSF, we introduce a robust geometric prior generator to generate the cost volumes and depth prior, which aid in constructing geometric features and facilitating geometry-aware sampling. Leveraging the depth prior, we additionally construct a global semantic context for the target view. This context provides two types of compensation information to enhance geometry and semantic features, achieved through boundary detection and semantic segmentation, respectively. Lastly, we present an efficient dual-directional interactive attention mechanism to foster deep interactions between the enhanced geometric and semantic features. Experiments conducted on both synthetic and real datasets demonstrate that our GS2-GNeSF outperforms existing methods in both novel view and semantic map synthesis, highlighting its effectiveness in generalizing neural semantic fields for unseen scenes.

Abstract:
In the realm of CLIP adaptation through prompt learning, it is important to emphasize the pivotal role that the proper alignment of visual and textual representations plays when adapting the CLIP to downstream tasks. We propose that the proper alignment for downstream tasks is determined by the flexibility of the interaction between cross-modal information, which compensates for the absence of contrastive loss during the adaptation process. However, the current prompt learning methods, such as isolated modifications to the visual or language branches of CLIP or the employment of uni-directional cross-modal fusion, are not sufficient to explore the full potential of the mutual interaction between visual and textual modalities. To overcome this limitation, we propose a new paradigm for the CLIP prompt learning community, named Bilateral Adaptive Cross-Modal Fusion Prompt Learning (Bloom), which includes two enhancements. First, we propose using projection functions for bi-directional modality transformation and fusion functions to encourage the mutual interaction between corresponding layers within both the image and text encoders. Second, we propose an adaptive manner that automatically searches the optimal combination of cross-modal information at each layer. These two improvements ensure a more efficient and flexible integration of the two modalities, thereby achieving proper alignment for specific downstream tasks. We put our method to the test in terms of base-to-novel, cross-dataset, and cross-domain evaluations on 15 image classification datasets. The results demonstrate a significant performance enhancement achieved by Bloom.

Abstract:
As concerns over privacy protection grow and relevant laws come into effect, machine unlearning (MU) has emerged as a pivotal research area. Due to the complexity of the forgetting data distribution, sample-wise MU remains an open challenge. Gradient ascent, as the inverse of gradient descent, is naturally applied to machine unlearning, which is also the inverse process of machine learning. However, the straightforward gradient ascent MU method suffers from the trade-off between effectiveness, fidelity, and efficiency. In this work, we analyze the gradient ascent MU process from a multi-task learning (MTL) view. This perspective reveals two problems that cause the trade-off, i.e., the gradient direction problem and the gradient dominant problem. To address these problems, we propose a novel MU method, namely GDR-GMA, consisting of Gradient Direction Rectification (GDR) and Gradient Magnitude Adjustment (GMA). For the gradient direction problem, GDR rectifies the direction between the conflicting gradients by projecting a gradient onto the orthonormal plane of the conflicting gradient. For the gradient dominant problem, GMA dynamically adjusts the magnitude of the update gradients by assigning a dynamic magnitude weight parameter to the update gradients. Furthermore, we evaluate GDR-GMA against several baseline methods in three sample-wise MU scenarios: random data forgetting, sub-class forgetting, and class forgetting. Extensive experimental results demonstrate the superior performance of GDR-GMA in effectiveness, fidelity, and efficiency.
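
The projection step behind GDR can be illustrated with a short sketch. It assumes that two gradients "conflict" when their inner product is negative and that rectification removes the component of one gradient along the other, leaving a vector in the plane orthogonal to the conflicting gradient. Both the conflict test and the helper names `dot` and `rectify_direction` are assumptions for illustration; the abstract does not give these details, and GMA is not covered here.

```python
def dot(a, b):
    """Inner product of two equally sized gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def rectify_direction(g, g_conflict):
    """Project g onto the plane orthogonal to g_conflict if they conflict.

    Assumed conflict test: a negative inner product. Non-conflicting
    gradients are returned unchanged.
    """
    d = dot(g, g_conflict)
    if d >= 0:
        return list(g)
    scale = d / dot(g_conflict, g_conflict)
    # Subtract the component of g that points along g_conflict.
    return [x - scale * y for x, y in zip(g, g_conflict)]


# A forgetting-task gradient and a conflicting retaining-task gradient (toy values).
g_forget = [1.0, 0.0]
g_retain = [-1.0, 1.0]
g_rectified = rectify_direction(g_forget, g_retain)
```

After rectification the updated gradient has zero inner product with the conflicting gradient, so following it no longer directly opposes the other objective.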

Abstract:
Few-shot fine-grained image classification aims to use only a few labelled samples to successfully recognize subtle sub-classes within the same parent class. This task is extremely challenging, due to the co-occurrence of large inter-class similarity, low intra-class similarity, and only a few labelled samples. In this paper, to address these challenges, we propose a new Channel-Spatial Cross-Attention Module (CSCAM), which can effectively drive a model to extract discriminative fine-grained feature representations with only a few shots. CSCAM collaboratively integrates a channel cross-attention module and a spatial cross-attention module to compute attention across support and query samples. In addition, to fit the characteristics of fine-grained images, a support averaging method is proposed in CSCAM to reduce the intra-class distance and increase the inter-class distance. Extensive experiments on four few-shot fine-grained classification datasets validate the effectiveness of CSCAM. Furthermore, CSCAM is a plug-and-play module, conveniently enabling effective improvement of state-of-the-art methods for few-shot fine-grained image classification.

Abstract:
Multi-model fitting aims to robustly estimate the parameters of various model instances in data contaminated by noise and outliers. Most previous works employ only a single type of consensus or implicit fusion model to represent the correlation between data points and model hypotheses. This approach often results in unrealistic and incorrect model fitting in the presence of noise and uncertainty. In this paper, we propose a novel method of diverse Consensuses paired with Motion estimation-based multi-Model Fitting (CMMF), which leverages three types of diverse consensuses along with inter-model collaboration to enhance the effectiveness of multi-model fusion. We design a Tangent Consensus Residual Reconstruction (TCRR) module to capture motion structure information of two points at the pixel level. Additionally, we introduce a Cross Consensus Affinity (CCA) framework to strengthen the correlation between data points and model hypotheses. To address the challenge of multi-body motion estimation, we propose a Nested Consensus Clustering (NCC) strategy, which formulates multi-model fitting as a motion estimation problem. It explicitly establishes motion collaboration between models and ensures that multiple models are well-fitted. Extensive quantitative and qualitative experiments are conducted on four public datasets (i.e., AdelaideRMF-F, Hopkins155, KITTI, MTPV62), and the results demonstrate that our proposed method outperforms several state-of-the-art methods.

Abstract:
Audio Identification aims to precisely retrieve exact matches from a vast music repository through a query audio snippet. The need for specificity and granularity has traditionally led to representing music audio using numerous short fixed-duration overlapped segment/shingle features in fingerprinting approaches. However, fingerprinting imposes constraints on scalability and efficiency, as hundreds or even thousands of embeddings are generated to represent a single piece of music audio. In this paper, we present an innovative self-supervised approach called Angular Margin Guided Embedding (AMG-Embedding). AMG-Embedding is built on a traditional fingerprinting encoder and aims to represent variable-duration non-overlapped segments as embeddings through a two-stage embedding and class-level learning process. AMG-Embedding significantly reduces the number of generated embeddings while still achieving highly specific fragment-level audio identification. Experimental results demonstrate that AMG-Embedding achieves retrieval accuracy comparable to the underlying fingerprinting approach while consuming less than 1/10th of its storage and retrieval time. The efficiency gains of our approach position it as a promising solution for scalable and efficient audio identification systems.

Abstract:
One-shot object detection (OSOD) uses a query patch to identify the same category of object in a target image. In the OSOD setting, the target images are required to contain the object category of the query patch, and the image styles (domains) of the query patch and target images are always similar. However, in practical applications, these requirements are not commonly satisfied. Therefore, we propose a new problem, namely Cross-Domain Object Search (CDOS), where the object categories of the query patch and target image are decoupled, and the image styles between them may also be significantly different. For this problem, we develop a new method, which incorporates both foreground-background contrastive learning heads and a domain-generalized feature augmentation technique. This enables our method to effectively handle the object category gap and the domain distribution gap between the query patch and target image in the training and testing datasets. We further build a new benchmark for the proposed CDOS problem, on which our method shows significant performance improvements over the comparison methods.

Abstract:
Learning semantic-rich representations from unlabeled time series data with intricate dynamics is a notable challenge. Traditional contrastive learning techniques predominantly focus on segment-level augmentations through time slicing, a practice that, while valuable, often results in sampling bias and suboptimal performance due to the loss of global context. Furthermore, they typically disregard the vital frequency information that could enrich data representations. To this end, we propose a novel self-supervised general-purpose framework called Temporal-Frequency and Contextual Consistency (TFCC). Specifically, this framework first performs two instance-level augmentation families over the entire series to capture nuanced representations alongside critical long-term dependencies. Then, TFCC initiates dual cross-view forecasting tasks between the original series and its augmented counterpart in both time and frequency domains to learn robust representations. Finally, three specially designed consistency modules (temporal, frequency, and temporal-frequency) aid in further developing discriminative representations on top of the learned robust representations. Extensive experiments on multiple benchmarks demonstrate TFCC's superiority over state-of-the-art classification and forecasting methods and show exceptional efficiency in semi-supervised and transfer learning scenarios.

Abstract:
Deep Cross-Modal Hashing (CMH) has become one of the most popular solutions for cross-modal retrieval. Existing methods need to first collect data and then be trained on the accumulated data. However, in the real world, data may be generated and possessed by different owners. Due to privacy concerns, data may not be shared or transmitted, which prevents sufficient training of CMH. To solve the problem, we propose a new framework called Federated Cross-modal Hashing with Adaptive Feature Enhancement (FedCAFE). FedCAFE is a federated method that can use distributed data to train existing CMH methods under privacy protection. To overcome the data heterogeneity challenge of distributed data and improve the generalization ability of the global model, FedCAFE is endowed with a novel adaptive feature enhancement module and a new weighted aggregation strategy. Besides, it can fully utilize the rich global information carried in the global model to constrain the model during the local training process. We have conducted extensive experiments on four widely used datasets in the CMH domain with both IID and non-IID settings. The reported results demonstrate that the proposed FedCAFE achieves better performance than several state-of-the-art baselines.

Abstract:
The security of Deep Neural Networks (DNNs) has proven to be critical for their applicability in real-world scenarios. However, DNNs are well known to be vulnerable to adversarial attacks, such as adding artificially designed perturbations of imperceptible magnitude to the benign input. Therefore, adversarial robustness is essential for DNNs to defend against malicious attacks. Stochastic Neural Networks (SNNs) have recently proven effective at enhancing adversarial robustness by injecting uncertainty into models. Nevertheless, existing SNNs remain limited for adversarial defense because the fixed uncertainty gives them insufficient representation capability. In this paper, to elevate the feature representation capability of SNNs, we propose a novel yet practical stochastic neural network that maximizes feature distribution variance (MFDV-SNN). In addition, we provide theoretical insights to support the adversarial resistance of MFDV, which is primarily derived from the stochastic noise we inject into DNNs. Our research demonstrates that by gradually increasing the level of stochastic noise in a DNN, the model naturally becomes more resistant to input perturbations. Since adversarial training is not required, MFDV-SNN does not compromise clean data accuracy and saves up to 7.5 times computation time. Extensive experiments on various attacks demonstrate that MFDV-SNN improves adversarial robustness significantly compared to other methods.

Abstract:
Multi-view clustering methods have been extensively explored in recent decades. These methods are built on the assumption that the data are sampled from multiple low-dimensional subspaces and that each group fits into one of these subspaces. The quadratic or cubic computational complexity of these methods is unavoidable, which makes it difficult to cluster large-scale multi-view datasets. Some efforts select key anchors beforehand to capture the data distributions in different views. Despite significant progress, these methods pay little attention to deriving a provably scalable and correct method for finding the optimal shared anchor graph from a geometric interpretation perspective. They also neglect to strike a good balance between the connectedness and subspace-preserving properties of the shared anchor graph. In this paper, we propose Fast Elastic-Net Multi-view Clustering (FENMC) from a geometric interpretation perspective. We provide a geometric analysis for determining the optimal shared anchor graph based on the introduced elastic-net regularizer for fast multi-view clustering, where the elastic-net regularizer is built on a mixture of the L_2 and L_1 norms. We also give a theoretical justification for the balance between the connectedness and subspace-preserving properties of the shared anchor graph for multi-view clustering. Our experiments on different datasets show that the proposed method not only obtains satisfactory clustering performance but also handles large-scale datasets with high efficiency.
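
To make the "mixture of L_2 and L_1 norms" concrete, the sketch below evaluates an elastic-net penalty on a coefficient vector. It assumes one common parameterization, lam * (gamma * ||z||_1 + (1 - gamma) / 2 * ||z||_2^2); the abstract does not state the exact weighting used in FENMC, so the names `lam` and `gamma` are illustrative.

```python
from typing import Sequence


def elastic_net_penalty(z: Sequence[float], lam: float, gamma: float) -> float:
    """Elastic-net penalty under one common parameterization (assumed here).

    gamma = 1 recovers a pure L_1 (sparsity-promoting) penalty, while
    gamma = 0 recovers a pure squared-L_2 (ridge-style) penalty.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    l1 = sum(abs(v) for v in z)        # ||z||_1
    l2_squared = sum(v * v for v in z)  # ||z||_2^2
    return lam * (gamma * l1 + (1.0 - gamma) / 2.0 * l2_squared)


# Equal mix of the two terms on a small illustrative vector.
penalty = elastic_net_penalty([3.0, -4.0], lam=1.0, gamma=0.5)
```

In subspace-clustering settings, the L_1 term is usually associated with sparse, subspace-preserving coefficients and the L_2 term with denser, better-connected ones, which is the trade-off the abstract refers to.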

Abstract:
Mainstream painting agents based on stroke-based rendering (SBR) attempt to translate visual appearance into a sequence of vectorized painting-style strokes. Lacking a direct mapping (and consequently differentiability) between the pixel domain and the stroke parameter search space, these methods often yield non-realistic or artist-incompatible stroke decompositions, hindering their further application in high-quality art generation. To explicitly address this issue, we propose a novel SBR-based image-to-painting framework that aligns with artistic oil painting behaviors and techniques. At its heart is a semantic content stratification module that decomposes images into hierarchical painting regions encapsulated with semantics, according to which a coarse-to-fine strategy is developed to first fill in the abstract structure of the painting with coarse brushstrokes and then depict the detailed texture with a parallel, localized multi-scale stroke search. In addition, we propose a novel method that integrates SBR frameworks into a simulation-based interactive painting system for stroke quality assessment. Extensive experimental results on a wide range of images show that our method not only achieves high-fidelity, artist-like painting rendering with a reduced number of strokes, but also exhibits higher stroke quality than prior methods.

Abstract:
Backdoor attacks pose a significant security vulnerability for deep neural networks (DNNs), enabling them to operate normally on clean inputs but manipulate predictions when specific trigger patterns occur. In this paper, we consider a practical post-training scenario backdoor defense, where the defender aims to evaluate whether a trained model has been compromised by backdoor attacks. Currently, post-training backdoor detection approaches often operate under the assumption that the defender has knowledge of the attack information, logit output from the model, and knowledge of the model parameters, limiting their implementation in practical scenarios.

Abstract:
Stress has rapidly emerged as a significant public health concern in contemporary society, necessitating prompt identification and effective intervention strategies. Video-based stress detection offers a non-invasive, low-cost, and mass-reaching approach for identifying stress. In this paper, we propose a three-level content-semantic-world knowledge framework, addressing three particular issues for video-based stress detection. (1) How to abstract and encode video semantics with frame contents into visual representation? (2) How to leverage general-purpose Large Multimodal Models (LMMs) to augment task-specific visual representation? (3) To what extent could general-purpose LMMs contribute to video-based stress detection? We design a Slow-Emotion-Fast-Action scheme to encode fast temporal changes of body actions revealed from video frames, as well as subtle details of emotions per video segment, into visual representation. We augment task-specific visual representation with linguistic facial expression descriptions by prompting general-purpose LMMs. A knowledge retriever is designed to evaluate and select the most suitable LMM output. Experimental results on two video-based stress detection datasets show that 1) our proposed three-level framework achieves 90.89% F1-score on the UVSD dataset and 80.79% F1-score on the RSL dataset, outperforming the state of the art; 2) leveraging LMMs helps to improve the F1-score by 2.25% on UVSD and 3.55% on RSL, compared to using the traditional Facial Action Coding System; 3) purely relying on general-purpose LMMs is insufficient, with 88.73% F1-score on the UVSD dataset and 77.48% F1-score on the RSL dataset, demonstrating the necessity of combining task-specific dedicated solutions with world knowledge given by LMMs.

Abstract:
DreamBooth has demonstrated significant potential in subject-driven text-to-image generation, especially in scenarios requiring precise preservation of a subject's appearance. However, it still suffers from inefficiency and requires extensive iterative training to customize concepts using a small set of reference images. To address these issues, we introduce DreamBooth++, a region-level training strategy designed to significantly improve the efficiency and effectiveness of learning specific subjects. In particular, our approach employs a region-level data re-formulation technique that packs a set of reference images into a single sample, significantly reducing computational costs. Moreover, we adapt convolution and self-attention layers to ensure their processings are restricted within individual regions. Thus their operational scope (i.e., receptive field) can be preserved within a single subject, avoiding generating multiple sub-images within a single image. Last but not least, we design a text-guided prior regularization between our model and the pretrained one to preserve the original semantic generation ability. Comprehensive experiments demonstrate that our training strategy not only accelerates the subject-learning process but also significantly boosts fidelity to both subject and prompts in subject-driven generation.

Abstract:
In this paper, we replicate the experimental results from our previous work titled "Aesthetics-Driven Virtual Time-Lapse Photography Generation", which was presented at ACM Multimedia 2023. Our primary objective is to confirm the validity of our earlier findings and to provide a more comprehensive understanding of our software framework. We provide the necessary artifacts to reproduce the results from our prior research. This paper details the technical aspects of our package, including dataset preparation, source code structure, and the experimental environment. By utilizing these artifacts, we demonstrate the reproducibility of our results. We encourage others to use our software framework for purposes beyond reproducibility.

Abstract:
Quantization index modulation (QIM) based VoIP steganography can conceal secret information in VoIP streams. Malicious users could use this technology to conduct illegal activities, threatening network and public security. Hence, practical steganalysis models that can detect QIM-based VoIP steganography urgently need to be developed. In recent years, deep learning (DL) models have been investigated for this task, and exciting outcomes have been achieved. However, existing models are far from practical. Two major challenges must be addressed. First, there is still significant room for improvement in detection accuracy. Second, studies that balance detection accuracy and response time are still insufficient. In this context, our main research topic fits in the QIM-based VoIP steganalysis theme, which aims to detect QIM-based steganography in VoIP streams in a fast and accurate manner.

Abstract:
Effective and smooth video communication has become an integral tool for interaction in our modern society. The well-known WebRTC project is optimized for server-based communication architectures with less stringent latency requirements. This paper introduces an open-source, peer-to-peer (P2P) mesh-based application called uvgComm 1.0 for low-latency multi-party video communication. Our experimental results show that uvgComm attains 25% lower latency than server-based solutions at 219 ms. In addition, it offers advanced privacy protection mechanisms with end-to-end encryption and P2P mesh topology that delivers media without any media servers. The ambitious goal of uvgComm is to take P2P mesh-based conferencing architecture to the next level and make this architecture a serious competitor to server-based communication architectures.

Affiliations: School of Mathematical and Computer Sciences, Heriot-Watt University Malaysia, Malaysia ; Institute of Psychology, University of the Chinese Academy of Sciences, China ; Adrian K. Davison, Manchester Metropolitan University, United Kingdom ; Gen Bing Liong, Universiti Malaya, Kuala Lumpur, Malaysia ; Moi Hoon Yap, Manchester Metropolitan University, United Kingdom ; National Taiwan University, Taiwan ; Zhejiang University, China ; Harbin Institute of Technology, China ; Institute of Psychology

Abstract:
Facial micro-expressions (MEs) are involuntary spontaneous movements of the face that typically appear in high-stakes situations where a person attempts to conceal a certain emotion from being known. A decade after the inception of the widely used CASME II and SMIC datasets, research in computational analysis of MEs has now advanced toward new pathways, exploring problems crucial to model generalization and real-world practicality. It is often challenging to design robust algorithms or models for spotting micro-expressions due to the high variability across diverse cultural backgrounds. Also, treating spotting and recognition as separate tasks is undesirable when handling long-spanning videos under realistic settings. This Grand Challenge comprises two distinct tracks: the Cross-Cultural Spotting (CCS) track, and the Spot-Then-Recognize (STR) track. All participating solutions submitted their results to a leaderboard, and several submissions performed well surpassing their respective baseline results. More details are available at: https://megc2024.github.io.

Abstract:
In multi-modal classification tasks, a good fusion algorithm can effectively integrate and process multi-modal data, thereby significantly improving its performance. Researchers often focus on the design of complex fusion operators and have proposed numerous fusion operators, while paying less attention to the design of feature fusion usage, specifically how features should be fused to better facilitate multi-modal classification tasks. In this article, we propose a progressive skip reasoning fusion network (PSRFN) to make some attempts to address this issue. Firstly, unlike most existing multi-modal fusion methods that only use one fusion operator in a single stage to fuse all view features, PSRFN utilizes the progressive skip reasoning (PSR) block to fuse all views with a fusion operator at each layer. Specifically, each PSR block utilizes all view features and the fused features from the previous layer to jointly obtain the fused features for the current layer. Secondly, each PSR block utilizes a dual-weighted fusion strategy with learnable parameters to adaptively allocate weights during the fusion process. The first level of weighting assigns weights to each view feature, while the second level assigns weights to the fused features from the previous layer and the fused features obtained from the first level of weighting in the current layer. This strategy ensures that the PSR block can dynamically adjust the weights based on the actual contribution of features. Finally, to enable the model to fully utilize feature information from different levels for feature fusion, the skip connections are adopted between PSR blocks. Extensive experiment results on six real multi-modal datasets show that a better usage for fusion operator is indeed able to improve performance.
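
The dual-weighted fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are produced by a softmax over learnable logits, and a plain weighted sum stands in for the fusion operator, neither of which is specified in the abstract. The names `psr_fuse`, `view_logits`, and `mix_logits` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax turning logits into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def psr_fuse(view_feats: Sequence[Sequence[float]],
             prev_fused: Sequence[float],
             view_logits: Sequence[float],
             mix_logits: Sequence[float]) -> List[float]:
    """One hypothetical PSR-style fusion step with two levels of weighting."""
    # Level 1: assign a weight to each view feature and fuse the views.
    view_weights = softmax(view_logits)
    dim = len(view_feats[0])
    current = [
        sum(w * feat[i] for w, feat in zip(view_weights, view_feats))
        for i in range(dim)
    ]
    # Level 2: weight the previous layer's fused features against the
    # features just fused from the views in the current layer.
    w_prev, w_cur = softmax(mix_logits)
    return [w_prev * p + w_cur * c for p, c in zip(prev_fused, current)]


# Two views, equal logits at both levels, previous fused features all zero.
fused = psr_fuse([[1.0, 3.0], [3.0, 5.0]], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained network the logits would be learnable parameters, so the weights can shift toward whichever views or layers contribute most; stacking several such steps gives the progressive, layer-by-layer fusion the abstract describes.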

Abstract:
In the literature, existing studies on text-to-motion generation (TMG) routinely focus on exploring the objective alignment of text and motion, largely ignoring subjective emotion information, especially limb-level emotion information. With this in mind, this paper proposes a new Emotion-enriched Text-to-Motion Generation (ETMG) task, aiming to generate motions with subjective emotion information. Further, this paper argues that injecting emotions into limbs (named intra-limb emotion injection) and ensuring the coordination and coherence of emotional motions after injecting emotion information (named inter-limb emotion disturbance) are both important and challenging in this ETMG task. To this end, this paper proposes an LLM-guided Limb-level Emotion Manipulating (L3EM) approach to ETMG. Specifically, this approach designs an LLM-guided intra-limb emotion modeling block to inject emotion into limbs, followed by a graph-structured inter-limb relation modeling block to ensure the coordination and coherence of emotional motions. In particular, this paper constructs a coarse-grained Emotional Text-to-Motion (EmotionalT2M) dataset and a fine-grained Limb-level Emotional Text-to-Motion (Limb-ET2M) dataset to justify the effectiveness of the proposed L3EM approach. Detailed evaluation demonstrates the significant advantage of our L3EM approach to ETMG over the state-of-the-art baselines. This justifies the importance of limb-level emotion information for ETMG and the effectiveness of our L3EM approach in coherently manipulating such information.

Abstract:
Neuromorphic event sensors are novel visual cameras that feature high-speed illumination-variation sensing and have found widespread application in guiding frame-based imaging enhancement. This paper focuses on color restoration in the event-guided image deblurring task: we fuse blurry images with mosaic color events instead of mono events to avoid artifacts such as color bleeding. The challenges associated with this approach include demosaicing color events to reconstruct full-resolution sampled signals and fusing bimodal signals to achieve image deblurring. To meet these challenges, we propose a novel network called Color4E to enhance the color restoration quality for the image deblurring task. Color4E leverages an event demosaicing module to upsample the spatial resolution of mosaic color events and a cross-encoding image deblurring module to fuse bimodal signals. In addition, a refinement module is designed to fuse full-color events and refine the initial deblurred images. Furthermore, to avoid the real-simulated gap of events, we implement a display-filter-camera system that captures mosaic and full-color event data synchronously, and use it to collect a real-captured dataset for network training and validation. The results on the public dataset and our collected dataset show that Color4E enables high-quality event-based image deblurring compared to state-of-the-art methods.

Abstract:
Cartoon parsing, which segments the body parts of cartoon images, is an important task for cartoon-centric applications. Due to the complex appearances, abstract drawing styles, and irregular structures of cartoon characters, cartoon parsing remains a challenging task. In this paper, a novel approach, named CartoonNet, is proposed for cartoon parsing, in which semantic consistency and structure correlation are integrated to address the visual diversity and structural complexity of cartoon parsing. A memory-based semantic consistency module is designed to learn the diverse appearances exhibited by cartoon characters. The memory bank stores features of diverse samples and retrieves the samples related to new samples for consistency, which aims to improve the semantic reasoning capability of the network. A self-attention mechanism is employed to conduct consistency learning among diverse body parts belonging to the retrieved samples and new samples. To capture the intricate structural information of cartoon images, a structure correlation module is proposed. Leveraging graph attention networks and a main body-aware mechanism, the proposed approach enables structural correlation, allowing it to parse cartoon images with complex structures. Experiments conducted on cartoon parsing and human parsing datasets demonstrate the effectiveness of the proposed method, which outperforms the state-of-the-art approaches for cartoon parsing and achieves competitive performance on human parsing.

Abstract:
Incomplete Multi-View Clustering (IMVC) is a promising topic in multimedia as it breaks the data completeness assumption. Most existing methods solve IMVC from the perspective of graph learning. In contrast, self-representation learning enjoys a superior ability to explore relationships among samples. However, only a few works have explored the potential of self-representation learning in IMVC. These self-representation methods infer missing entries from the perspective of whole samples, resulting in redundant information. In addition, designing an effective strategy to retain salient features while eliminating noise is rarely considered in IMVC. To tackle these issues, we propose a novel self-representation learning method with missing sample recovery and enhanced low-rank tensor regularization. Specifically, the missing samples are inferred by leveraging the local structure of each view, which is constructed from available samples at the feature level. Then an enhanced tensor norm, referred to as the Logarithm-p norm, is devised, which can obtain an accurate cross-view description through adaptive weights. Our proposed method achieves exact subspace representation in IMVC by leveraging high-order correlations and inferring missing information at the feature level. Extensive experiments on several widely used multi-view datasets demonstrate the effectiveness of the proposed method.

Abstract:
Visual-language models based on CLIP have shown remarkable abilities in general few-shot image classification. However, their performance drops in specialized fields such as healthcare or agriculture, because CLIP's pre-training does not cover all category data. Existing methods excessively depend on the multi-modal information representation and alignment capabilities acquired from CLIP pre-training, which hinders accurate generalization to unfamiliar domains. To address this issue, this paper introduces a novel visual-language collaborative representation network (MCRNet), aiming at acquiring a generalized capability for collaborative fusion and representation of multi-modal information. Specifically, MCRNet learns to generate relational matrices from an information fusion perspective to acquire aligned multi-modal features. This relationship generation strategy is category-agnostic, so it can be generalized to new domains. A class-adaptive fine-tuning inference technique is also introduced to help MCRNet efficiently learn alignment knowledge for new categories using limited data. Additionally, the paper establishes a new broad-domain few-shot image classification benchmark containing seven evaluation datasets from five domains. Comparative experiments demonstrate that MCRNet outperforms current state-of-the-art models, achieving an average improvement of 13.06% and 13.73% in the 1-shot and 5-shot settings, highlighting the superior performance and applicability of MCRNet across various domains.

Abstract:
Spiking Neural Networks (SNNs) have indeed shown remarkable promise in the field of computer vision, emerging as a low-energy alternative to traditional Artificial Neural Networks (ANNs). However, SNNs also face several challenges: i) Existing SNNs are not purely additive and involve a substantial amount of floating-point computations, which contradicts the original design intention of adapting to neuromorphic chips; ii) The incorrect positioning of convolutional and pooling layers relative to spiking layers leads to reduced accuracy; iii) Leaky Integrate-and-Fire (LIF) neurons have limited capability in representing local information, which is disadvantageous for downstream visual tasks like semantic segmentation.

Abstract:
Automatic Speech Recognition (ASR) models pre-trained on large-scale speech datasets have achieved significant breakthroughs compared with traditional methods. However, mainstream pre-trained ASR models encounter challenges in distinguishing homophones, which have close or identical pronunciations. Previous studies have introduced visual auxiliary cues to address this challenge, yet the sophisticated use of lip movements falls short in correcting homophone errors. On the other hand, the fusion and utilization of scene images remain in an exploratory stage, with performance still inferior to the pre-trained speech model. In this paper, we introduce CIEASR (Contextual Image-Enhanced Automatic Speech Recognition), a novel multimodal speech recognition model that incorporates a new cue fusion method, using scene images as soft prompts to correct homophone errors. To mitigate data scarcity, we refine and expand the VSDial dataset for extensive experiments, illustrating that scene images contribute to the accurate recognition of entity nouns and personal pronouns. Our proposed CIEASR achieves state-of-the-art results on VSDial and Flickr8K, significantly reducing the Character Error Rate (CER) on VSDial from 3.61% to 0.92%.
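
For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit (Levenshtein) distance between the reference and the hypothesis, divided by the reference length. The sketch below assumes that standard definition; the abstract does not describe the authors' evaluation script, and the example strings are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate under the conventional definition."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A homophone confusion: two of five characters differ.
rate = cer("their", "there")  # 0.4
```

Homophone errors such as the one above leave the audio nearly unchanged but still count fully against CER, which is why the abstract treats them as a separate challenge for ASR models.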

Abstract:
Scene Graph Generation (SGG) is an important cross-modal task in scene understanding, aiming to detect visual relations in an image. However, due to the various appearance features, the feature distributions of different categories have suffered from a severe overlap, which makes the decision boundaries ambiguous. The current SGG methods mainly attempt to re-balance the data distribution, which is dataset-dependent and limits the generalization. To solve this problem, a Synergetic Prototype Learning Network (SPLN) is proposed here, where the generalized semantic space is modeled and the synergetic effect among different semantic subspaces is delved into. In SPLN, a Collaboration-induced Prototype Learning method is proposed to model the interaction of visual semantics and structural semantics. The conventional visual semantics is focused on with a residual-driven representation enhancement module to capture details. And the intersection of structural semantics and visual semantics is explicitly modeled as conceptual semantics, which has been ignored by existing methods. Meanwhile, to alleviate the noise of unrelated and meaningless words, an Intersection-induced Prototype Learning method is also proposed specially for conceptual semantics with an essence-driven prototype enhancement module. Moreover, a Selective Fusion Module is proposed to synergetically integrate the results of visual, structural, conceptual branches and the generalized semantics projection. Experiments on VG and GQA datasets show that our method achieves state-of-the-art performance on the unbiased metrics.

Abstract:
Traditional deep learning models often struggle in few-shot learning scenarios, where limited labeled data is available. While the Contrastive Language-Image Pre-training (CLIP) model demonstrates impressive zero-shot capabilities, its performance in few-shot scenarios remains limited. Existing methods primarily aim to leverage the limited labeled dataset, but this offers limited potential for improvement. To overcome the limitations of small datasets in few-shot learning, we introduce a novel framework, SSAT-Adapter, that leverages CLIP's language understanding to generate informative auxiliary tasks and improve CLIP's performance and adaptability in few-shot settings. We utilize CLIP's language understanding to create decision-boundary-focused image latents. These latents form auxiliary tasks, including inter-class instances to bridge CLIP's pre-trained knowledge with the provided examples, and intra-class instances to subtly expand the representation of target classes. A self-paced training regime, progressing from easier to more complex tasks, further promotes robust learning. Experiments show our framework outperforms the state-of-the-art online few-shot learning method by an average of 2.2% on eleven image classification datasets. Further ablation studies on various tasks demonstrate the effectiveness of our approach to enhance CLIP's adaptability in few-shot image classification.

Abstract:
Temporal relation modeling is one of the core aspects of few-shot action recognition. Most previous works mainly focus on temporal relation modeling based on coarse-level actions, without considering the atomic action details and fine-grained temporal information. This oversight represents a significant limitation in this task. Specifically, coarse-level temporal relation modeling can make the few-shot models overfit in high-discrepancy temporal context, and ignore the low-discrepancy but high-semantic relevance action details in the video. To address these issues, we propose a saliency-guided fine-grained temporal mask learning method that models the temporal atomic action relation for few-shot action recognition in a finer manner. First, to model the comprehensive temporal relations of video instances, we design a temporal mask learning architecture to automatically search for the best matching of each atomic action snippet. Next, to exploit the low-discrepancy atomic action features, we introduce a saliency-guided temporal mask module to adaptively locate and excavate the atomic action information. After that, the few-shot predictions can be obtained by feeding the embedded rich temporal-relation features to a common feature matcher. Extensive experimental results on standard datasets demonstrate our method's superior performance compared to existing state-of-the-art methods.

Abstract:
In this work, we present AerialGait, a comprehensive dataset for aerial-ground gait recognition. This dataset comprises 82,454 sequences totaling over 10 million frames from 533 subjects, captured from both aerial and ground perspectives. To align with real-life scenarios of aerial and ground surveillance, we utilize a drone and a ground surveillance camera for data acquisition. The drone is operated at various speeds, directions, and altitudes. Meanwhile, we conduct data collection across five diverse surveillance sites to ensure a comprehensive simulation of real-world settings. AerialGait has several unique features: 1) The gait sequences exhibit significant variations in views, resolutions, and illumination across five distinct scenes. 2) It incorporates challenges of motion blur and frame discontinuity due to drone mobility. 3) The dataset reflects the domain gap caused by the view disparity between aerial and ground views, presenting a realistic challenge for drone-based gait recognition. Moreover, we perform a comprehensive analysis of existing gait recognition methods on AerialGait dataset and propose the Aerial-Ground Gait Network (AGG-Net). AGG-Net effectively learns discriminative features from aerial views by uncertainty learning and clusters features across aerial and ground views through prototype learning. Our model achieves state-of-the-art performance on both AerialGait and DroneGait datasets.

Abstract:
Vision-Language Tracking (VLT) requires locating a specific target in video sequences, given a natural language prompt and an initial object box. Despite recent advancements, existing approaches heavily rely on expensive and time-consuming human annotations. To mitigate this limitation, directly generating pseudo labels from raw videos seems to be a straightforward solution; however, it inevitably introduces undesirable noise during the training process. Moreover, we insist that an efficient tracker should excel in tracking the target, regardless of the temporal direction. Building upon these insights, we propose the pioneering semi-supervised learning scheme for VLT task, representing a crucial step towards reducing the dependency on high-quality yet costly labeled data. Specifically, drawing inspiration from the natural attributes of a video (i.e., space, time, and semantics), our approach progressively leverages inherent consistencies from these aspects: (1) Spatially, each frame and any object cropped from it naturally form an image-bbox (bounding box) pair for self-training; (2) Temporally, bidirectional tracking trajectories should exhibit minimal differences; (3) Semantically, the correlation between visual and textual features is expected to remain consistent. Furthermore, the framework is validated with a simple yet effective tracker we devised, named ATTracker (Asymmetrical Transformer Tracker). It modifies the self-attention operation in an asymmetrical way, striving to enhance target-related features while suppressing noise. Extensive experiments confirm that our ATTracker serves as a robust baseline, outperforming fully supervised base trackers. By unveiling the potential of learning with limited annotations, this study aims to attract attention and pave the way for Semi-supervised Vision-Language Tracking (SS-VLT).

Abstract:
This work tackles the persistent challenge of image-text retrieval, a key problem at the intersection of computer vision and natural language processing. Despite significant advancements facilitated by large-scale Contrastive Language-Image Pretraining (CLIP) models, we found that existing methods fall short in bridging the fine-grained semantic gap between visual and textual representations. To address these shortcomings, we propose a model called Local and Generative-driven Modality Gap Correction (LG-MGC), which is devoted to simultaneously enhancing representation learning and alleviating the modality gap in cross-modal retrieval. The proposed model consists of two main components: a local-driven semantic completion module, which complements specific local context information that is overlooked by traditional models within global features, and a generative-driven semantic translation module, which leverages generated features as a bridge to mitigate the modality gap. Our model not only tackles the granularity of semantic correspondence and improves the performance of existing methods without requiring additional trainable parameters, but is also designed to be plug-and-play, allowing for easy integration into existing retrieval models without altering their architectures. Extensive experiments demonstrate the effectiveness of LG-MGC by achieving consistent state-of-the-art performance over strong baselines.

Abstract:
Recent achievements have shown that model-based steganographic schemes hold promise for better security than heuristic-based ones, as they can provide theoretical guarantees on secure steganography under a given statistical model. However, it remains a challenge to exploit the correlations between DCT coefficients for secure steganography in practical scenarios where only a single compressed JPEG image is available. To cope with this, we propose a novel model-based steganographic scheme using the Conditional Random Field (CRF) model with four-element cross-neighborhood to capture the dependencies among DCT coefficients for JPEG steganography with symmetric embedding. Specifically, the proposed CRF model is characterized by the delicately designed energy function, which is defined as the weighted sum of a series of unary and pairwise potentials, where the potentials associated with the statistical detectability of steganography are formulated as the KL divergence between the statistical distributions of cover and stego. By optimizing the constructed energy function with the given payload constraint, the non-independent distortion cost corresponding to the least detectability can be accordingly obtained. Extensive experimental results validate the effectiveness of our proposed scheme, especially outperforming the previous independent art J-MiPOD.
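
The detectability potentials above are formulated as the KL divergence between cover and stego distributions. The sketch below computes the discrete KL divergence D(P || Q) in nats for two probability vectors; the toy distributions are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete KL divergence D(P || Q) in nats.

    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


# Hypothetical cover and stego distributions over two coefficient values.
cover = [0.5, 0.5]
stego = [0.25, 0.75]
detectability = kl_divergence(cover, stego)
```

A divergence of zero means the stego distribution is identical to the cover distribution under the chosen model, so a smaller value corresponds to lower statistical detectability.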

Abstract:
Multiple kernel clustering (MKC) has garnered considerable attention owing to its efficacy in handling nonlinear data in high-dimensional space. However, current MKC methods have three primary issues: (1) they focus solely on clustering information while neglecting energy information and potential noise interference within the kernel; (2) the inherent manifold structure in the high-dimensional space is complex, and they do not sufficiently explore its topological structure; (3) most incur cubic computational complexity, which makes them resource-intensive. To tackle these issues, we propose a novel MKC method with shifted Laplacian on Grassmann manifold (sLGm). First, sLGm constructs an r-rank shifted Laplacian and subsequently reconstructs it, retaining the clustering-related and energy-related information while reducing the influence of noise. Additionally, sLGm introduces a Grassmann manifold for information fusion, which preserves topological information in the high-dimensional space. Notably, an optimal consensus partition can be learned concurrently from the above two procedures, thereby yielding the clustering assignments, and the computational complexity of the whole procedure drops to quadratic. Finally, a comprehensive suite of experiments demonstrates the effectiveness of sLGm.

Abstract:
Deep network-based image Compressive Sensing (CS) has attracted much attention in recent years. However, two issues remain: 1) Existing methods typically use fixed-scale sampling, which leads to limited insight into the image content. 2) Most pre-trained models can only handle fixed sampling rates and fixed block scales, which restricts the scalability of the model. In this paper, we propose a novel scale-aware scalable CS network (dubbed S2-CSNet), which achieves scale-aware adaptive sampling, fine-grained scalability and high-quality reconstruction with a single model. Specifically, to enhance the scalability of the model, a structural sampling matrix with a predefined order is first designed, which is a universal sampling matrix that can sample multi-scale image blocks with arbitrary sampling rates. Then, based on the universal sampling matrix, a distortion-guided scale-aware scheme is presented to achieve scale-variable adaptive sampling, which predicts the reconstruction distortion at different sampling scales from the measurements and selects the optimal division scale for sampling. Furthermore, a multi-scale hierarchical sub-network under a well-defined compact framework is put forward to reconstruct the image. In the multi-scale feature domain of the sub-network, a dual spatial attention is developed to explore the local and global affinities between dense feature representations for deep fusion. Extensive experiments demonstrate that the proposed S2-CSNet outperforms existing state-of-the-art CS methods.

Abstract:
In this paper, we propose a neural representation for videos that enables real-time quality-scalable decoding, called QS-NeRV. QS-NeRV comprises a Self-Learning Distribution Mapping Network (SDMN) and Extensible Enhancement Networks (EENs). Firstly, SDMN functions as the base layer (BL) for scalable video coding, focusing on encoding videos of lower quality. Within SDMN, we employ a methodology that minimizes the bitstream overhead to achieve efficient information exchange between the encoder and decoder instead of direct transmission. Specifically, we utilize an invertible network to map the multi-scale information obtained from the encoder to a specific distribution. Subsequently, during the decoding process, this information is recovered from a randomly sampled latent variable to assist the decoder in achieving improved reconstruction performance. Secondly, EENs serve as the enhancement layers (ELs) and are trained in an overfitting manner to obtain robust restoration capability. By integrating the fixed BL bitstream with the parameters of EEN as an extension pack, the decoder can produce higher-quality enhanced videos. Furthermore, the scalability of the method allows for adjusting the number of combined packs to accommodate diverse quality requirements. Experimental results demonstrate our proposed QS-NeRV outperforms the state-of-the-art real-time decoding INR-based methods on various datasets for video compression and interpolation tasks.

Abstract:
We design ScaleTraversal, an interactive tool for creating multi-scale 3D demonstration animations with limited resources, intended for users who lack access to high-performance machines such as clusters or supercomputers. Creating 3D demonstration animations for multi-scale data is difficult. First, it is challenging to balance flexibility and user-friendliness when designing a user interface for customizing demonstration animations. Second, multi-scale biomedical data is often so large that it is hard for users to handle on a desktop PC. We design an interactive bi-functional user interface for creating multi-scale biomedical demonstration animations intuitively. It combines the user-friendliness of a graphical interface with the flexibility of a textual interface, which enables users to customize demonstration animations from macro-scales to meso- and micro-scales. Furthermore, we design three scale-based memory management strategies to address the issues posed by multi-scale data: a streaming data processing strategy, a scale-based data prefetching strategy and a GPU acceleration strategy for rendering. Finally, we conduct both quantitative and qualitative evaluations to demonstrate the efficiency, expressiveness and usability of ScaleTraversal.

Abstract:
Single hyperspectral image super-resolution aims to reconstruct a high-resolution hyperspectral image (HRHSI) from an observed low-resolution hyperspectral image (LRHSI). Most current methods combine CNN and Transformer structures to directly extract features of all channels in the LRHSI for image reconstruction, but they do not consider the interference of redundant information in adjacent bands, which results in spectral and spatial distortions in the reconstruction and increases the computational complexity of the model. To address this issue, this paper proposes a spectral clustering-based pyramid super-resolution network (SCPSN) to progressively reconstruct the HRHSI at different scales. In each image reconstruction layer, a clustering super-resolution block (CSRB) consisting of a spectral clustering block (SCB), a patch non-local attention block (PNAB), and a dynamic fusion block (DFB) is designed to reconstruct detail features. Specifically, given the high correlation between adjacent spectral bands in the LRHSI, an SCB is first constructed to cluster spectral channels and filter hyperchannels. This reduces both the interference of redundant spectral information and the computational complexity of the model. Then, by utilizing the non-local similarity of features within the channel, a PNAB is constructed to enhance the features of hyperchannels. Next, a DFB is designed to reconstruct the features of all channels in the LRHSI by establishing correlations between enhanced hyperchannels and other channels. Finally, the reconstructed channels are upsampled and added to the corresponding channels to obtain the reconstructed HRHSI. Extensive experiments validate that the performance of SCPSN is superior to that of some other state-of-the-art (SOTA) HSSR methods in terms of visual effects and quantitative metrics. In addition, unlike other methods, our model does not require training on large-scale datasets. The dataset and code will be released on GitHub.

Abstract:
Social video platforms have emerged as significant channels for information dissemination, facilitating lively public discussions that often give rise to controversies. However, existing approaches to controversy detection primarily focus on textual features, which raises three key concerns: it underutilizes the potential of visual information available on social media platforms; it is ineffective when faced with incomplete or absent textual information; and the existing datasets fail to adequately address the need for comprehensive multimodal resources on social media platforms. To address these challenges, we construct a large-scale Multimodal Controversial Dataset (MMCD) in Chinese. Additionally, we propose a novel framework named Multi-view Controversy Detection (MVCD) to effectively model controversies from multiple perspectives. Through extensive experiments using state-of-the-art models on the MMCD, we demonstrate MVCD's effectiveness and potential impact.

Abstract:
With the growth of the VR and AR industry, 3D reconstruction has become an increasingly important topic in multimedia. Although 3D Gaussian Splatting achieves state-of-the-art results in 3D reconstruction, a large number of Gaussians are needed to fit a 3D scene due to the Gibbs phenomenon. Compressing 3D Gaussian Splatting and reducing its memory overhead has long been a focal point, and our study aims to mitigate these challenges. Inspired by the tangram, an ancient Chinese puzzle, we introduce a novel method (Tangram-Splatting) that leverages shape priors to optimize 3D scene fitting. Central to our approach is a technique that diversifies Gaussian function types while preserving algorithmic efficiency. Through extensive experiments, we demonstrate that our method achieves an average reduction of 62.4% in the memory used to store optimized parameters and decreases the training time by at least 10 minutes, with only marginal sacrifices in PSNR, typically under 0.3 dB; on some datasets our algorithm even performs better. This reduction in memory burden is significant for real-world applications, as it mitigates the substantial memory footprint and transmission burden traditionally associated with such algorithms. Our results underscore the potential of Tangram-Splatting in advancing multimedia applications.

Abstract:
The quality of 3D models reconstructed by PatchMatch Multi-View Stereo remains a challenging problem due to unreliable photometric consistency at object boundaries and in textureless areas. Since textureless areas usually exhibit strong planarity, previous methods used planar priors to improve reconstruction performance. However, their planar priors ignore the depth discontinuity at object boundaries, making the boundaries inaccurate (not sharp). In addition, due to unreliable planar models in large-scale low-textured objects, the reconstruction results are incomplete. To address these issues, we introduce the segmentation generated by the Segment Anything Model into PatchMatch. Because segmentation and depth share boundaries, we use the segmentation to determine whether the depth is continuous. We then construct a Boundary Plane that fits the object boundary and an Object Plane that increases the consistency of planes in large-scale textureless objects. Finally, we use a probabilistic graphical model to calculate an Aggregated Prior guided by Multiple Planes and embed it into the matching cost. The experimental results indicate that our method achieves state-of-the-art boundary sharpness on ETH3D and improves the completeness of weakly textured objects.

Abstract:
Adversarial attacks on point clouds are crucial for assessing and improving the adversarial robustness of 3D deep learning models. Despite leveraging various geometric constraints, current adversarial attack strategies often suffer from inadequate imperceptibility. Given that adversarial perturbations tend to disrupt the inherent symmetry in objects, we recognize this disruption as the primary cause of the lack of imperceptibility in these attacks. In this paper, we introduce a novel framework, symmetry-aware imperceptible adversarial attacks on 3D point clouds (SymAttack), to address this issue. Our approach starts by identifying part- and patch-level symmetry elements, and grouping points based on semantic and Euclidean distances, respectively. During the adversarial attack iterations, we intentionally adjust the perturbation vectors on symmetric points relative to their symmetry plane. By preserving symmetry within the attack process, SymAttack significantly enhances imperceptibility. Extensive experiments validate the effectiveness of SymAttack in generating imperceptible adversarial point clouds, demonstrating its superiority over the state-of-the-art methods.

Abstract:
Recent advancements in 3D generation have garnered considerable interest due to their potential applications. Despite these advancements, the field faces persistent challenges in multi-conditional control, primarily due to the lack of paired datasets and the inherent complexity of 3D structures. To address these challenges, we introduce ImageBind3D, a novel framework for controllable 3D generation that integrates text, hand-drawn sketches, and depth maps to enhance user controllability. Our innovative contribution is adopting an inversion-align strategy, facilitating controllable 3D generation without requiring paired datasets. Firstly, utilizing GET3D as a baseline, our method innovates a 3D inversion technique that synchronizes 2D images with 3D shapes within the latent space of 3D GAN. Subsequently, we leverage images as intermediaries to facilitate pseudo-pairing between the shapes and various modalities. Moreover, our multi-modal diffusion model design strategically aligns external control signals with the generative model's latent knowledge, enabling precise and controllable 3D generation. Extensive experiments validate that ImageBind3D surpasses existing state-of-the-art methods in both fidelity and controllability. Additionally, our approach can offer composable guidance for any feed-forward 3D generative models, significantly enhancing their controllability.

Abstract:
Virtual reality enables us to access and interact with immersive virtual environments anytime and anywhere in various fields such as entertainment, training, and education. However, users immersed in virtual scenes remain physically connected to their real-world surroundings, which can pose safety and immersion challenges. Although virtual scene synthesis has attracted widespread attention, many popular methods are limited to generating purely virtual scenes independent of physical environments or simply mapping physical objects as obstacles. To this end, we propose a scene agent that synthesizes situated 3D virtual scenes as a kind of ubiquitous embodied interface in VR for users. The scene agent synthesizes scenes by perceiving the user's physical environment as well as inferring the user's demands. The synthesized scenes maintain the affordances of the physical environment, enabling immersive users to interact with the physical environment and improving the user's sense of security. Meanwhile, the synthesized scenes maintain the style described by the user, improving the user's immersion. The comparison results show that the proposed scene agent can synthesize virtual scenes with better affordance maintenance, scene diversity, style maintenance, and 3D intersection over union compared to baselines. To the best of our knowledge, this is the first work that achieves in situ scene synthesis with virtual-real affordance consistency and user demand.

Abstract:
Quantum networks have the potential to transmit multimedia data with high security and efficiency. However, ensuring high-fidelity transmission links remains a significant challenge. Current work mainly focuses on selecting high-fidelity link transmissions for single packages, neglecting the link allocation problem for multi-package transmissions. This limitation leads to reduced scalability in the practical applications of quantum networks. In addition, when selecting a single link, existing methods can easily fall into the exploration and exploitation dilemma, given various fidelity distributions. To address this issue, this paper proposes a new framework that selects high-fidelity link transmission for multiple tasks through median elimination to estimate fidelity and transmission strategies, thereby improving the application scalability of quantum networks. To optimize the transmission of multimedia chunks in a quantum network, we can employ the scheduling strategy to maximize the cumulative profit of chunk transmissions while considering the fidelity of the links and the overall network utilization. Through extensive experiments, our proposal demonstrates significant advantages. Compared to the randomized method, Minerva reduces bounce number and execution time by 12% ~ 28% and 8% ~ 32%, respectively, while improving average fidelity by 15%. Compared with the uniformly distributed method, our approach decreases bounce number by 24% ~ 30% and execution time by 8% ~ 32% and enhances average fidelity by 11% ~ 21%.
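The abstract above names median elimination as the fidelity-estimation step. The sketch below is the textbook median elimination procedure for best-arm identification, not the paper's framework: treating each link as an arm and a 0/1 fidelity check as the reward is an assumption, and `make_link` and `median_elimination` are hypothetical names introduced here for illustration.

```python
import math
import random


def make_link(fidelity, rng):
    """Hypothetical link model: each probe succeeds with probability `fidelity`."""
    return lambda: 1.0 if rng.random() < fidelity else 0.0


def median_elimination(sample_fns, eps=0.2, delta=0.1):
    """Return the index of an eps-optimal arm with probability >= 1 - delta.

    Each round samples every surviving arm, then keeps the better half
    (by empirical mean) and tightens the accuracy and confidence targets.
    """
    arms = list(range(len(sample_fns)))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Number of samples per arm for this round.
        n = math.ceil(math.log(3.0 / delta_l) / (eps_l / 2.0) ** 2)
        means = {a: sum(sample_fns[a]() for _ in range(n)) / n for a in arms}
        ranked = sorted(arms, key=means.get, reverse=True)
        arms = ranked[: math.ceil(len(ranked) / 2)]  # drop arms below the median
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]


if __name__ == "__main__":
    rng = random.Random(0)
    links = [make_link(p, rng) for p in (0.9, 0.5, 0.6, 0.4)]
    print("selected link:", median_elimination(links))
```

Because half of the candidates are discarded each round, the number of rounds grows only logarithmically with the number of links, which is the usual reason to prefer this scheme over sampling every link uniformly.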

Abstract:
Speech-preserving Facial Expression Manipulation (SPFEM) aims to alter facial emotions in video content while preserving the facial movements associated with speech. Current works often fall short due to the inadequate representation of emotion as well as the absence of time-aligned paired data, that is, two corresponding frames from the same speaker that showcase the same speech content but differ in emotional expression. In this work, we introduce a novel framework, Self-Supervised Emotion Representation Disentanglement (SSERD), to disentangle emotion representation for accurate emotion transfer while implementing a paired data construction module to facilitate automated, photorealistic facial animations. Specifically, we develop a module for learning emotion latent codes using StyleGAN's latent space, employing a cross-attention mechanism to extract and predict emotion editing codes, with contrastive learning to differentiate emotions. To overcome the lack of strictly paired data in the SPFEM task, we exploit a pretrained StyleGAN to generate paired data, focusing on expression vectors unrelated to mouth shape. Additionally, we employ a hybrid training strategy using both synthetic paired and real unpaired data to enhance the realism of the images generated by the SPFEM model. Extensive experiments conducted on benchmark datasets, including MEAD and RAVDESS, validate the effectiveness of our framework, demonstrating its superior capability in generating photorealistic and expressive facial animations.

Abstract:
Reconstructing visual stimuli from brain activities is crucial for deciphering the underlying mechanism of the human visual system. While recent studies have achieved notable results by leveraging deep generative models, challenges persist due to the lack of large-scale datasets and the inherent noise from non-invasive measurement methods. In this study, we draw inspiration from the mechanism of human memory and propose BrainRAM, a novel two-stage dual-guided framework for visual stimuli reconstruction. BrainRAM incorporates a Retrieval-Augmented Module (RAM) and diffusion prior to enhance the quality of reconstructed images from the brain. Specifically, in stage I, we transform fMRI voxels into the latent space of image and text embeddings via diffusion priors, obtaining preliminary estimates of the visual stimuli's semantics and structure. In stage II, based on previous estimates, we retrieve data from the LAION-2B-en dataset and employ the proposed RAM to refine them, yielding high-quality reconstruction results. Extensive experiments demonstrate that our BrainRAM outperforms current state-of-the-art methods both qualitatively and quantitatively, providing a new perspective for visual stimuli reconstruction.

Abstract:
Identifying social media posts that spread vaccine misinformation can inform emerging public health risks and aid in designing effective communication interventions. Existing studies, while promising, often rely on single user posts, potentially leading to flawed conclusions. This highlights the necessity to model users' historical posts for a comprehensive understanding of their stance towards vaccines. However, users' historical posts may contain a diverse range of content that adds noise and leads to low performance. To address this gap, in this study, we present VaxMine, a cooperative multi-agent reinforcement learning method that automatically selects relevant textual and visual content from a user's posts, reducing noise. To evaluate the performance of the proposed method, we create and release a new dataset of 2,072 users with historical posts due to the unavailability of publicly available datasets. The experimental results show that our approach outperforms state-of-the-art methods with an F1-Score of 0.94 (an absolute increase of 13%), demonstrating that extracting relevant content from users' historical posts and understanding both modalities are essential to detecting anti-vaccine users on social media. We further analyze the robustness and generalizability of VaxMine, showing that extracting relevant textual and visual content from a user's posts improves performance. We conclude with a discussion of the practical implications of our study by explaining how computational methods used in surveillance can benefit from our work, with flow-on effects on the design of health communication interventions to counter vaccine misinformation on social media.

Abstract:
3D face recognition is subject to frequent spoofing attacks, among which the 3D face presentation attack is one of the most notorious. The attacker takes advantage of 3D scanning and printing techniques to generate masks of targets, an approach that has succeeded in numerous real-life cases. The salient feature of such attacks is that 3D face models are obtained through 3D scanning, which is more expensive and inconvenient than taking 2D photos. In this work, we propose a new method, DREAM, to recover 3D face models from a single 2D image. Specifically, we adopt a black-box approach, which recovers 'sufficient' depths to defeat target recognition models (e.g., face identification and face authentication models) by accessing their output and the corresponding RGB photo. The key observation is that it is not necessary to restore the true depth values; it suffices to recover the essential features relevant to the target model. We use four public 3D face datasets to verify the effectiveness of DREAM. The experimental results show that DREAM achieves a success rate of 94% on a face authentication model, even in cross-dataset testing, and a success rate of 36% on a face identification model.

Abstract:
Vision-Language Models (VLMs) built on contrastive learning, such as CLIP, demonstrate great transferability and excel in downstream tasks like zero-shot classification and retrieval. To further enhance the performance of VLMs, existing methods have introduced additional parameter modules or fine-tuned VLMs on downstream datasets. However, these methods often fall short in scenarios where labeled data for downstream tasks is either unavailable or insufficient for fine-tuning, and the training of additional parameter modules may considerably impair the existing transferability of VLMs on open-set tasks. To alleviate this issue, we introduce WaveDN, a wavelet-based distribution normalization method that can boost the VLMs' performance on downstream tasks without parametric modules or labeled data. Initially, wavelet distributions are extracted from the embeddings of the sampled, unlabeled test samples. Subsequently, WaveDN conducts a hierarchical normalization across the wavelet coefficients of all embeddings, thereby incorporating the distributional characteristics of the test data. Finally, the normalized embeddings are reconstructed via inverse wavelet transformation, facilitating the computation of similarity metrics between the samples. Through extensive experiments on two downstream tasks, using a total of 14 datasets covering text-image and text-audio modal data, WaveDN has demonstrated superiority compared to state-of-the-art methods.
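The three steps in the abstract above (wavelet decomposition, normalization of the coefficients using unlabeled samples, inverse transform) can be illustrated with a small NumPy sketch. This is not WaveDN itself: the Haar wavelet, the number of levels, and the per-coefficient standardization rule are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def haar_forward(x, levels):
    """Multi-level Haar transform along the feature axis.

    x has shape (n, d) with d divisible by 2**levels.
    Returns [approx_L, detail_L, ..., detail_1].
    """
    details = []
    approx = x
    for _ in range(levels):
        even, odd = approx[:, 0::2], approx[:, 1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return [approx] + details[::-1]


def haar_inverse(coeffs):
    """Invert haar_forward, going from the coarsest level to the finest."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        out = np.empty((approx.shape[0], approx.shape[1] * 2))
        out[:, 0::2] = (approx + detail) / np.sqrt(2.0)
        out[:, 1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx


def wavelet_normalize(embeddings, reference, levels=2, eps=1e-8):
    """Standardize each wavelet coefficient with statistics of `reference`.

    `reference` plays the role of the sampled, unlabeled test embeddings.
    The standardization rule is an assumption, not the paper's definition.
    """
    emb = haar_forward(embeddings, levels)
    ref = haar_forward(reference, levels)
    normed = [
        (c - r.mean(axis=0, keepdims=True)) / (r.std(axis=0, keepdims=True) + eps)
        for c, r in zip(emb, ref)
    ]
    return haar_inverse(normed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    print(wavelet_normalize(feats, feats).shape)
```

The reconstructed embeddings can then be compared with an ordinary similarity measure such as cosine similarity; no trainable parameters or labels are involved at any step.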

Abstract:
In this paper, we address the challenge of adapting vision-language models (VLMs) to few-shot image recognition in a training-free manner. We observe that existing methods are not able to effectively characterize the semantic relationship between support and query samples in a training-free setting. We recognize that, in the semantic feature space, the feature of the query image is a linear and sparse combination of support image features, since support-query pairs are from the same class and share the same small set of distinctive visual attributes. Motivated by this observation, we propose a novel method called Training-free Feature ReConstruction with Sparse optimization (TaCo), which formulates the few-shot image recognition task as a feature reconstruction and sparse optimization problem. Specifically, we exploit the VLM to encode the query and support images into features. We utilize sparse optimization to reconstruct the query feature from the corresponding support features. The feature reconstruction error is then used to define the reconstruction similarity. Coupled with the text-image similarity provided by the VLM, our reconstruction similarity analysis accurately characterizes the relationship between support and query images. This results in significantly improved performance in few-shot image recognition. Our extensive experimental results on few-shot recognition demonstrate that our method outperforms existing state-of-the-art approaches by substantial margins.
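The reconstruction idea in the abstract above can be sketched with a standard sparse solver. The code below is an illustration under assumptions, not the TaCo implementation: it uses an L1-regularized least-squares objective solved with ISTA, reconstructs the query separately from each class's support features, and omits the text-image similarity term. All function names and the values of `lam` and `n_iter` are hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_reconstruct(query, support, lam=0.05, n_iter=200):
    """ISTA for min_w 0.5 * ||query - support.T @ w||^2 + lam * ||w||_1.

    query: (d,) feature vector; support: (n, d) support features.
    """
    A = support.T                                      # (d, n) dictionary
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - query)
        w = soft_threshold(w - step * grad, step * lam)
    return w


def reconstruction_error(query, support, **kw):
    """Residual norm after sparse reconstruction; lower means more similar."""
    w = sparse_reconstruct(query, support, **kw)
    return float(np.linalg.norm(query - support.T @ w))


def classify(query, supports_by_class, **kw):
    """Assign the class whose support features reconstruct the query best."""
    errors = {c: reconstruction_error(query, s, **kw)
              for c, s in supports_by_class.items()}
    return min(errors, key=errors.get), errors


if __name__ == "__main__":
    basis = np.eye(4)
    supports = {"a": basis[:2], "b": basis[2:]}
    query = 0.6 * basis[0] + 0.8 * basis[1]
    print(classify(query, supports))
```

In practice the features would come from the VLM image encoder, and the negated reconstruction error would be combined with the VLM's text-image similarity; how the two scores are weighted is not stated in the abstract.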

Abstract:
Pre-trained language models (PLMs) that rely solely on textual corpus may present limitations in multimodal semantics comprehension. Existing studies attempt to alleviate this issue by incorporating additional modal information through image retrieval or generation. However, these methods: (1) inevitably encounter modality gaps and noise; (2) treat all modalities indiscriminately; and (3) ignore visual or acoustic semantics of key entities. To tackle these challenges, we propose a novel principled iterative framework for multimodal-augmented PLMs termed MASE, which achieves efficient and balanced injection of multimodal semantics under the proposed Expectation Maximization (EM) based iterative algorithm. Initially, MASE utilizes multimodal proxies instead of explicit data to enhance PLMs, which avoids noise and modality gaps. In E-step, MASE adopts a novel information-driven self-balanced strategy to estimate allocation weights. Furthermore, MASE employs heterogeneous graph attention to capture entity-level fine-grained semantics on the proposed multimodal-semantic scene graph. In M-step, MASE injects global multimodal knowledge into PLMs through a cross-modal contrastive loss. Experimental results show that MASE consistently outperforms competitive baselines on multiple tasks across various architectures. More impressively, MASE is compatible with existing efficient parameter fine-tuning methods, such as prompt learning.

Abstract:
Contrastive vision-language pre-training has shown great promise in representation transfer learning and cross-modality learning in the medical field. However, without fully exploiting the intrinsic properties and correlations of multimodal medical data within patient studies, current research fails to explore all the potential of available data, leading to suboptimal performance on representation learning. In this paper, we propose a novel pre-training framework for learning better medical vision-language embedding, oriented on patients' study-level data. Based on the order-agnostic property of radiology report, we adopt a two-stage feature extraction method for more representative textual characterization. Then, by leveraging momentum encoders and memory queues, study-level semantics are explored with three contrastive objectives to provide comprehensive supervision from three perspectives, i.e., cross-modal, multi-modal, and uni-modal, such that the potential information neglected by previous research can be fully exploited. The superiority of the proposed framework is demonstrated by the impressive improvements on four typical downstream tasks, including zero-shot/data-efficient image classification, image segmentation, and cross-modal retrieval.

Abstract:
This paper addresses the issue of cross-class domain adaptation (CCDA) in semantic segmentation, where the target domain contains both shared and novel classes that are either unlabeled or unseen in the source domain. This problem is challenging, as the absence of labels for novel classes hampers the effective solutions of both cross-domain and cross-class problems. Since Visual Language Models (VLMs) have exhibited impressive generalization across diverse data distributions and are capable of generating zero-shot predictions without requiring task-specific training examples, we propose a label alignment method by leveraging VLMs to relabel pseudo labels for novel classes. Considering that VLMs typically provide only image-level predictions, we embed a two-stage method to enable fine-grained semantic segmentation and design a threshold based on the uncertainty of pseudo labels to exclude noisy VLM predictions. To further augment the supervision of novel classes, we devise memory banks with an adaptive update scheme to effectively manage accurate VLM predictions, which are then resampled to increase the sampling probability of novel classes. Through comprehensive experiments, we demonstrate the effectiveness and versatility of our proposed method across various CCDA scenarios.

Abstract:
Multi-View Clustering (MVC) commonly utilizes the anchor technique to mitigate computational complexity. Existing methods generally assume a pre-selection of anchors to facilitate subsequent clustering tasks. However, determining the optimal number of anchors is often non-trivial, so it must be treated as a tunable parameter, which incurs additional computational overhead. Moreover, it is not reasonable to assume an identical number of anchors across all views, as this assumption restricts the representational capacity of anchors in individual views. To address these issues, we propose a view-adaptive anchor-based multi-view clustering method called Multi-view Clustering with Automatic and Aligned Anchor (3AMVC). We introduce a Hierarchical Bipartite Neighbor Clustering (HBNC) strategy to adaptively select a suitable number of representative anchors in each view. Specifically, when the representative difference of anchors lies in an acceptable range, the HBNC process halts and outputs the final anchors. Moreover, we propose an anchor alignment strategy in response to the varying numbers of anchors across different views. This approach first evaluates the quality of anchors in each view based on the intra-cluster distance criterion and then aligns the anchors to the view with the highest-quality anchors. Carefully designed experiments validate the effectiveness and strengths of 3AMVC.

Abstract:
Video frame interpolation based on optical flow has made great progress in recent years. Most of the previous studies have focused on improving the quality of clean videos. However, many real-world videos contain large obstructions making the video discontinuous. To address this challenge, we propose our Obstruction Robustness Framework (ORF) that enhances the robustness of existing VFI networks in the face of large obstructions. The ORF contains two components: (1) A feature repair module that first captures ambiguous pixels in the synthetic frame by a region similarity map, then repairs them with a cross-overlap attention module. (2) A data augmentation strategy that enables the network to handle dynamic obstructions without extra data. To the best of our knowledge, this is the first work that explicitly addresses the error caused by large obstructions in video frame interpolation. By using previous state-of-the-art methods as backbones, our method does not only improve the results in original benchmarks but also significantly enhances the interpolation quality for videos with obstructions.

Abstract:
Few-shot semantic segmentation (FSS) aims to locate pixels of unseen classes with clues from a few labeled samples. Recently, thanks to profound prior knowledge, diffusion models have been expanded to achieve FSS tasks. However, due to probabilistic noising and denoising processes, it is difficult for them to maintain spatial relationships between inputs and outputs, leading to inaccurate segmentation masks. To address this issue, we propose a Diffusion-based Segmentation network (DiffSeg), which decouples probabilistic denoising and segmentation processes. Specifically, DiffSeg leverages attention maps extracted from a pretrained diffusion model as support-query interaction information to guide segmentation, which mitigates the impact of probabilistic processes while benefiting from rich prior knowledge of diffusion models. In the segmentation stage, we present a Perceptual Attention Module (PAM), where two cross-attention mechanisms capture semantic information of support-query interaction and spatial information produced by the pretrained diffusion model. Furthermore, a self-attention mechanism within PAM ensures a balanced dependence for segmentation, thus preventing inconsistencies between the aforementioned semantic and spatial information. Additionally, considering the uncertainty inherent in the generation process of diffusion models, we equip DiffSeg with a Spatial Control Module (SCM), which models spatial structural information of query images to control boundaries of attention maps, thus aligning the spatial location between knowledge representation and query images. Experiments on PASCAL-5i and COCO datasets show that DiffSeg achieves new state-of-the-art performance with remarkable advantages.

Abstract:
Stereo image super-resolution (stereoSR) strives to improve the quality of super-resolution by leveraging the auxiliary information provided by another perspective. Most approaches concentrate on refining module design and stacking massive network blocks to extract and integrate information. Although there have been advancements, the memory and computation costs are increasing as well. To tackle this issue, we propose a lattice structure that autonomously learns the optimal combination patterns of network blocks, which enables the efficient and precise acquisition of feature representations and ultimately achieves lightweight stereoSR. Specifically, we draw inspiration from the lattice phase equalizer and design the lattice stereo NAFBlock (LSNB), which bridges pairs of NAFBlocks using a re-weight block (RWBlock) through a coupled butterfly-style topological structure. RWBlock empowers LSNB to explore various combination patterns of pairwise NAFBlocks by adaptively re-weighting features. Moreover, we propose a lattice stereo attention module (LSAM) to search and transfer the most relevant features from another view. We name the resulting tightly interlinked architecture LSSR. Extensive experiments demonstrate that our method performs competitively with the state of the art.

Abstract:
Graph Neural Networks (GNNs) are widely employed to derive meaningful node representations from graphs. Despite their success, deep GNNs frequently grapple with the oversmoothing issue, where node representations become highly indistinguishable due to repeated aggregations. In this work, we consider the oversmoothing issue from two aspects of the node embedding space: dimension and instance. Specifically, while existing methods primarily concentrate on instance-level node relations to mitigate oversmoothing, we propose to mitigate oversmoothing at dimension level. We reveal the heightened information redundancy between dimensions which diminishes information diversity and impairs node differentiation in GNNs. Motivated by this insight, we propose the Dimension-Level Decoupling (DLD) to reduce dimension redundancy, enhancing dimensional-level node differentiation. Besides, at the instance level, the neglect of class differences leads to vague classification boundaries. Hence, we introduce the Instance-Level Class-Difference Decoupling (ICDD) that repels inter-class nodes and attracts intra-class nodes, improving the instance-level node discrimination with clear classification boundaries. Additionally, we introduce a novel evaluation metric that considers the impact of class differences on node distances, facilitating precise oversmoothing measurement. Extensive experiments demonstrate the effectiveness of our method Dual-Dimensional Class-Difference Decoupling (DDCD) across diverse scenarios.

Abstract:
Unpaired point cloud completion involves filling in missing parts of a point cloud without requiring partial-complete correspondence. Meanwhile, since point cloud completion is an ill-posed problem, there are multiple ways to generate the missing parts. Existing unpaired completion methods usually leverage generative adversarial training by transforming partial shape encoding into a complete one in the low-dimensional latent feature space. However, "mode collapse" often occurs, where only a subset of the shapes is represented in the low-dimensional space, reducing the diversity of the generated shapes. In this paper, we propose a novel unpaired multimodal shape completion approach that directly operates on point coordinate space. We achieve unpaired completion via a single diffusion model trained on complete data by "hijacking" the generative process. We further augment the diffusion model by introducing two guidance mechanisms to facilitate mapping the partial point cloud to the complete one while preserving its original structure. We conduct extensive evaluations of our approach, which show that our method generates shapes that are more diverse and better preserve the original structures compared to alternative methods.

Abstract:
Moving infrared small target detection, crucial in contexts like traffic management and maritime rescue, encounters challenges from factors such as complex backgrounds, target occlusion, camera shake, and motion blur. Existing algorithms fall short of comprehensively addressing these issues through hybrid modeling, which impedes generalization in complex and dynamic motion scenes. In this paper, we propose a hybrid modeling method for moving infrared small target detection via smoothed-particle hydrodynamics (SPH) and Markov decision processes (MDP). SPH can simulate the motion trajectories of targets and background scenes, while MDP can optimize detection system strategies for optimal action selection based on contexts and target states. Specifically, we develop an SPH-inspired image-level enhancement algorithm that models the image sequence of an infrared video as a 3D spatiotemporal graph in SPH and enhances the spatiotemporal information of the target using the designed sliding window and fluid dynamics formula. In addition, we design an MDP-guided temporal feature perception module. This module selects reference frames and aggregates features from both the reference frames and the current frame. The previous and current frames are modeled as an MDP tailored for multi-frame infrared small target detection tasks, aiding detection in the current frame. Extensive experiments on two public datasets, DAUB and DATR, show that the proposed network surpasses state-of-the-art methods in terms of objective metrics and visual quality.

Abstract:
Neural implicit representations have recently revolutionized simultaneous localization and mapping (SLAM), giving rise to a groundbreaking paradigm known as NeRF-based SLAM. However, existing methods often fall short in accurately estimating poses and reconstructing scenes. This limitation largely stems from their reliance on volume rendering techniques, which oversimplify the modeling process. In this paper, we introduce a novel neural implicit SLAM system named SAR-SLAM to address these shortcomings. Our approach reconstructs Neural Radiance Fields (NeRFs) using a self-attentive architecture and represents scenes through neural point cloud encoding. Unlike previous NeRF-based SLAM methods, which depend on traditional volume rendering equations for scene representation and view synthesis, our method employs a self-attentive rendering framework with the Transformer architecture during mapping and tracking stages. To enable incremental mapping, we anchor scene features within a neural point cloud, striking a balance between estimation accuracy and computational cost. Experimental results on three challenging datasets show the superior performance and robustness of our SAR-SLAM compared to recent NeRF-based SLAM systems. The code will be released.

Abstract:
Scene synthesis has gained significant attention recently, and interactive scene synthesis focuses on yielding scenes according to user preferences. Existing literature either generates floor plans or generates scenes according to given floor plans. The system proposed in this paper generates scenes over floor plans in real time. Given an initial scene, the only interaction a user needs is changing the room shapes. Our framework splits/merges rooms and adds/rearranges/removes objects for each transient moment during interactions. A systematic pipeline realizes our framework by compressing objects' arrangements over modified room shapes in a transient moment, thus enabling real-time performance. We also propose elastic boxes that indicate how objects should be arranged according to their continuously changing contexts, such as room shapes and other objects. Through a few interactions, a floor plan filled with object layouts is generated that reflects user preferences on both the floor plan and the object layouts within it. Experiments show that our framework is efficient at user interactions and plausible for synthesizing 3D scenes.

Abstract:
Due to the inherent vulnerability of neural networks, adversarial attacks present formidable challenges to the robustness and reliability of deep learning models. In contrast to traditional adversarial training (AT) methods that prioritize semantic distillation and purification, our work attributes the insufficient adversarial robustness of models to the challenges of spatial attention shift and channel activation disarray. To mitigate these issues, we propose a robust spatial-aligned and channel-adapted learning paradigm, which we term "StayFocused", that integrates spatial alignment and channel adaptation to enhance the focus region against adversarial attacks by adaptively recalibrating the spatial attention and channel responses. Specifically, the proposed StayFocused mainly benefits from two flexible mechanisms, i.e., Spatial-aligned Hypersphere Constraint (SHC) and Channel-adapted Prompting Calibration (CPC). SHC aims to enhance intra-class compactness and inter-class separation between adversarial and natural samples by measuring the angular margins and distribution distance within the hypersphere space. Inspired by the top-K candidate prompts from the clean sample, CPC is designed to dynamically recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. To comprehensively learn feature representations, the StayFocused framework can be easily extended with additional branches in a multi-head training manner, further enhancing the model's robustness and adaptability. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and superiority of our StayFocused over state-of-the-art baselines.

Abstract:
Recent text-to-image (T2I) synthesis models have demonstrated intriguing abilities to produce high-quality images based on text prompts. However, current models still face the text-image misalignment problem (e.g., attribute errors and relation mistakes) in compositional generation. Existing models have attempted to condition T2I models on grounding inputs to improve controllability while ignoring the explicit supervision from the layout conditions. To tackle this issue, we propose Grounded jOint lAyout aLignment (GOAL), an effective framework for T2I synthesis. Two novel modules, discriminative semantic alignment (DSAlign) and masked attention alignment (MAAlign), are proposed and incorporated in this framework to improve the text-image alignment. DSAlign leverages discriminative tasks at the region-wise level to ensure low-level semantic alignment. MAAlign provides high-level attention alignment by guiding the model to focus on the target object. We also build a dataset, GOAL2K, for model fine-tuning, which comprises 2000 semantically accurate image-text pairs and their layout annotations. Comprehensive evaluations on T2I-Compbench, NSR-1K, and Drawbench demonstrate the superior generation performance of our method. In particular, there are improvements of 19%, 13%, and 12% in the color, shape, and texture metrics on T2I-Compbench. Additionally, Q-Align metrics demonstrate that our method can generate images of higher quality.

Abstract:
Uncertainty-aware multi-view deep classification methods have markedly improved the reliability of results amidst the challenges posed by noisy multi-view data, primarily by quantifying the uncertainty of predictions. Despite their efficacy, these methods encounter limitations in real-world applications: 1) They are limited to providing a single class prediction per instance, which can lead to inaccuracies when dealing with samples that are difficult to classify due to inconsistencies across multiple views. 2) While these methods offer a quantification of prediction uncertainty, the magnitude of such uncertainty often varies with different datasets, leading to confusion among decision-makers due to the lack of a standardized measure for uncertainty intensity. To address these issues, we introduce Conformalized Multi-view Deep Classification (CMDC), a novel method that generates set-valued rather than single-valued predictions and integrates uncertain predictions as an explicit class category. Through end-to-end training, CMDC minimizes the size of prediction sets while guaranteeing that the set-valued predictions contain the true label with a user-defined probability, building trust in decision-making. The superiority of CMDC is validated through comprehensive theoretical analysis and empirical experiments on various multi-view datasets.
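
The set-valued prediction idea can be illustrated with standard split conformal prediction. The sketch below is a generic illustration, not the CMDC method itself: it does not reproduce the end-to-end training or the explicit "uncertain" class described above. The function names, the nonconformity score (one minus the predicted probability of the true label), and the held-out calibration split are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold computed from held-out calibration data.

    The nonconformity score of a calibration sample is 1 - p(true label).
    The threshold is the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: keep every class.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every class index whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy usage with three calibration samples and a miscoverage level of 0.25.
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 0, 1]
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
print(prediction_set([0.7, 0.3], tau))
```

Under the usual exchangeability assumption between calibration and test samples, sets built this way contain the true label with probability at least 1 - alpha, which corresponds to the user-defined probability mentioned in the abstract.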

Abstract:
By integrating various modules with the Vision Transformer (ViT), we facilitate an interpretation of image processing across each layer and attention head. This method allows us to explore the connections both within and across the layers, enabling an analysis of how images are processed at different layers. We conduct an analysis of the contributions from each layer and attention head, shedding light on the intricate interactions and functionalities within the model's layers. This in-depth exploration not only highlights the visual cues between layers but also examines their capacity to navigate the transition from abstract concepts to tangible objects. It unveils the model's mechanism for building an understanding of images, providing a strategy for adjusting attention heads between layers, thus enabling targeted pruning and enhancement of performance for specific tasks. Our research indicates that achieving a scalable understanding of transformer models is within reach, offering ways to refine and enhance such models.

Abstract:
As billions of face images stored on cloud platforms contain sensitive information to human vision, the public confronts substantial threats to visual face privacy. In response, the community has proposed some perturbation-based schemes to mitigate visual privacy leakage. However, these schemes need to generate a new protective perturbation for each image, failing to satisfy the real-time requirement of cloud platforms. To address this issue, we present an efficient visual face privacy protection scheme that uses person-specific veils, which can be conveniently applied to all images of the same user without regeneration. The protected images exhibit significant visual differences from the originals but remain identifiable to face recognition models. Furthermore, the protected images can be recovered to the originals under certain circumstances. In the process of generating the veils, we propose a feature alignment loss to promote consistency between the recognition outputs of protected and original images with approximate construction of the feature subspace. Meanwhile, a block variance loss is designed to enhance the concealment of visual identity information. Extensive experimental results demonstrate that our scheme can significantly eliminate the visual appearance of original images and has almost no impact on face recognition models.

Abstract:
Understanding human assessment of semantically salient parts of multimedia content is crucial for developing human-centric applications, such as annotation tools, search and recommender systems, and systems able to generate new media matching human interests. However, the challenge of acquiring suitable supervision signals to detect semantic saliency without extensive manual annotation remains significant. Here, we explore a novel method that utilizes signals measured directly from human cognition via electroencephalogram (EEG) in response to natural visual perception. These signals are used for supervising representation learning to capture semantic saliency. Through a contrastive learning framework, our method aligns EEG data with visual stimuli, capturing human cognitive responses without the need for any manual annotation. Our approach demonstrates that the learned representations closely align with human-centric notions of visual saliency and achieve competitive performance in several downstream tasks. We also introduce an open EEG/image dataset to facilitate research in utilizing cognitive signals for multimodal data analysis and developing models for cross-modal representation learning.
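
The contrastive alignment between EEG and image representations can be sketched with a symmetric InfoNCE loss, a common choice for cross-modal contrastive learning. The abstract does not specify the exact loss, so the objective, the temperature value, and the names `info_nce`, `eeg_embs`, and `img_embs` are assumptions; the embeddings are assumed to come from separate encoders that are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(eeg_embs, img_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch; row i of each list is a matched pair."""
    n = len(eeg_embs)
    logits = [[cosine(e, v) / temperature for v in img_embs] for e in eeg_embs]

    def cross_entropy(rows):
        # Negative log-softmax of the matched (diagonal) entry, averaged over rows.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    # Transposing the logits gives the image-to-EEG direction.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Matched pairs should yield a lower loss than mismatched pairs.
eeg = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(eeg, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(eeg, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each EEG embedding toward the embedding of the image that evoked it and pushes it away from the other images in the batch, which requires no manual annotation beyond the stimulus pairing.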

Abstract:
As a widely explored multi-modal task, 3D object grounding endeavors to localize a unique pre-existing object within a single 3D scene given a natural language description. However, such a strict setting is unnatural as it is not always possible to know whether a target object exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available, some of which may not contain the described object while others may contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, to simultaneously process a group of related 3D scenes, allowing a flexible number of target objects to exist in each scene. We argue that localizing target objects in each scene individually ignores the rich visual information contained in other related 3D scenes within the same group and may lead to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting, which extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism to explicitly exploit the intra-group visual connections. Specifically, based on context-aware spatial-semantic alignment, a language-guided consensus aggregation module is developed to aggregate the visual features of target objects in each 3D scene to form a visual consensus representation, which is then distributed and injected into a consensus-modulated feature refinement module for refining visual features, thus benefiting the subsequent multi-modal reasoning. To validate the effectiveness of the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate that GNL3D achieves state-of-the-art results on the group-wise setting and the traditional 3D object grounding task.

Abstract:
Despite the advancements that deep learning has brought to medical image analysis (MIA), protecting the privacy of images remains a challenge. In a client-server MIA framework, especially after deployment, patients' private medical images can be easily captured by attackers from the transmission channel or malicious third-party servers. Previous MIA privacy-enhancing methods, whether based on distortion or homomorphic encryption, expose the fact that the transmitted images are medical images or transform the images into semantic-lacking noise. This tends to alert attackers, thereby falling into a cat-and-mouse game of theft and protection. To address this issue, we propose a covert MIA framework based on deep image hiding, namely HideMIA, which secures medical images by embedding them within natural cover images that are unlikely to raise suspicion. By directly analyzing the hidden medical images in the steganographic domain, HideMIA makes it difficult for attackers to notice the presence of medical images. Specifically, we propose the Mixture-of-Difference-Convolutions (MoDC) and Asymmetric Wavelet Attention (AsyWA) to enable HideMIA to conduct fine-grained analysis on each wavelet sub-band within the steganographic domain, mining features that are specific to medical images. Moreover, to reduce resource consumption on client devices, we design function-aligned knowledge distillation to obtain a lightweight hiding network, namely LightIH. Extensive experiments on six medical datasets demonstrate that our HideMIA achieves superior MIA performance and protective imperceptibility on medical image segmentation and classification.

Abstract:
Panoramic activity recognition is a comprehensive yet challenging task in crowd scene understanding, which aims to concurrently identify multi-grained human behaviors, including individual actions, social group activities, and global activities. Previous studies tend to capture cross-granularity activity-semantics relations from solely the video input, thus ignoring the intrinsic semantic hierarchy in label-text space. To this end, we propose a label text-aided hierarchical semantics mining (THSM) framework, which explores multi-level cross-modal associations by learning hierarchical semantic alignment between visual content and label texts. Specifically, a hierarchical encoder is first constructed to encode the visual and text inputs into semantics-aligned representations at different granularities. To fully exploit the cross-modal semantic correspondence learned by the encoder, a hierarchical decoder is further developed, which progressively integrates the lower-level representations with the higher-level contextual knowledge for coarse-to-fine action/activity recognition. Extensive experimental results on the public JRDB-PAR benchmark validate the superiority of the proposed THSM framework over state-of-the-art methods.

Abstract:
Talking Face Generation (TFG) reconstructs lip-related facial motions from speech input, aiming to generate high-quality, synchronized, and lip-readable videos. Previous efforts have achieved success in generation quality and synchronization, and recently, there has been an increasing focus on the importance of intelligibility. Despite these efforts, there remains a challenge in achieving a balance among quality, synchronization, and intelligibility, often resulting in trade-offs that compromise one aspect in favor of another. In light of this, we propose SyncTalklip, a novel dual-tower framework designed to overcome the challenges of synchronization while improving lip-reading performance. To enhance the performance of SyncTalklip in both synchronization and intelligibility, we design AV-SyncNet, a pre-trained multi-task model that focuses on both synchronization and intelligibility. Moreover, we propose a novel cross-modal contrastive learning scheme that brings audio and video closer to enhance synchronization. Experimental results demonstrate that SyncTalklip achieves state-of-the-art performance in quality, intelligibility, and synchronization. Furthermore, extensive experiments demonstrate our model's generalizability across domains. The code and demo are available at https://sync-talklip.github.io.

Abstract:
This paper revisits the development of generative self-supervised learning in 2D images and 3D point clouds in autonomous driving. In 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, through exploratory model analysis, we find that the gap in weight distribution between self-supervised learning and supervised learning is substantial when employing only low-level features as the pretext task in 3D point clouds. Low-level features represented by PoInt Cloud reconsTruction are insUfficient to learn 3D REpresentations (dubbed PICTURE). To advance the development of pretext tasks, we propose a unified generative self-supervised framework. First, high-level features are demonstrated to exhibit semantic consistency with downstream tasks. We use the high-level features as an additional pretext task to enhance the understanding of semantic information during pre-training. Next, we propose inter-class and intra-class discrimination-guided masking (I2Mask) based on the attributes of the high-level features, adaptively setting the masking ratio for each superclass. On the Waymo and nuScenes datasets, we achieve 75.13% mAP and 72.69% mAPH for 3D object detection, 79.4% mIoU for 3D semantic segmentation, and 18.4% mIoU for occupancy prediction. Extensive experiments demonstrate the effectiveness and necessity of high-level features.

Abstract:
Video Scene Graph Generation (VidSGG) plays a crucial role in various visual-language tasks by providing accessible structured visual relation knowledge. However, prevailing VidSGG methods require annotations for all categories, which limits their application in real-world scenarios. Although popular VLMs have facilitated preliminary exploration of open-vocabulary VidSGG tasks, the correspondence between visual union regions and relation predicates is usually ignored. Therefore, we propose an open-vocabulary VidSGG framework named Union-Aware Semantic Alignment Network (UASAN) to explore the alignment between visual union regions and relation predicate concepts in the same semantic space. Specifically, a visual refiner is designed to acquire open-vocabulary knowledge and the ability to bridge different modalities. To achieve better alignment, we first design a semantic-aware context encoder that performs a comprehensive semantic interaction between object trajectories, visual union regions, and trajectory motion information to obtain semantic-aware union region representations. Then, a union-relation alignment decoder is used to generate the discriminative relation token for each union region for final relation prediction.

Abstract:
Visual abductive reasoning aims to find the most plausible explanation for incomplete observations and suffers from inherent uncertainties and ambiguities, which mainly stem from latent causal relations, incomplete observations, and the reasoning itself. To address this, we propose a probabilistic model named Uncertainty-Guided Probabilistic Distillation Transformer (UPD-Trans) to model uncertainties for visual abductive reasoning. To better discover the correct cause-effect chain, we model all the potential causal relations in a unified reasoning framework, so that both the direct relations and latent relations are considered. To reduce the effect of stochasticity and uncertainty on reasoning: 1) we extend the deterministic Transformer to a probabilistic Transformer by considering those uncertain factors as Gaussian random variables and explicitly modeling their distribution; 2) we introduce a distillation mechanism between the posterior branch with complete observations and the prior branch with incomplete observations to transfer posterior knowledge. Evaluation results on the benchmark datasets consistently demonstrate the commendable performance of our UPD-Trans, with significant improvements after latent relation modeling and uncertainty modeling.

Abstract:
Parameter-Efficient-Tuning (PET) for pre-trained deep models (e.g., transformers) holds significant potential for domain incremental learning (DIL). Recent prevailing approaches resort to prompt learning, which typically involves learning a small number of prompts for each domain to avoid the issue of catastrophic forgetting. However, previous studies have pointed out that prompt-based methods are often challenging to optimize, and their performance may vary non-monotonically with the number of trainable parameters. In contrast to previous prompt-based DIL methods, we put forward an importance-aware shared parameter subspace learning method for domain incremental learning, on the basis of low-rank adaptation (LoRA). Specifically, we propose to incrementally learn a domain-specific and a domain-shared low-rank parameter subspace for each domain, in order to effectively decouple the parameter space and capture shared information across different domains. Meanwhile, we present a momentum update strategy for learning the domain-shared subspace, allowing for the smooth accumulation of knowledge in the current domain while mitigating the risk of forgetting the knowledge acquired from previous domains. Moreover, given that domain-shared information might hold varying degrees of importance across different domains, we design an importance-aware mechanism that adaptively assigns an importance weight to the domain-shared subspace for the corresponding domain. Finally, we devise a cross-domain contrastive constraint to encourage domain-specific subspaces to capture distinctive information within each domain effectively, and enforce orthogonality between domain-shared and domain-specific subspaces to minimize interference between them. Extensive experiments on image domain incremental datasets demonstrate the effectiveness of the proposed method in comparison to the related state-of-the-art methods.
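
One way to picture the combination of a domain-shared and a domain-specific low-rank subspace is sketched below. This is a hypothetical illustration only: the abstract does not give the exact formulation, so the additive form W0 + importance * (B_s A_s) + (B_d A_d), the exponential-moving-average form of the momentum update, and all names (`effective_weight`, `momentum_update`, `importance`) are assumptions.

```python
def matmul(a, b):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(w0, shared, specific, importance):
    """Assumed form: W = W0 + importance * (B_s @ A_s) + (B_d @ A_d).

    `shared` and `specific` are (B, A) pairs of low-rank factors, and
    `importance` scales the domain-shared update for the current domain.
    """
    delta_s = matmul(*shared)
    delta_d = matmul(*specific)
    return [
        [w + importance * s + d for w, s, d in zip(rw, rs, rd)]
        for rw, rs, rd in zip(w0, delta_s, delta_d)
    ]


def momentum_update(old, new, momentum=0.9):
    """Exponential moving average: keep `momentum` of the old shared factor."""
    return [
        [momentum * o + (1.0 - momentum) * n for o, n in zip(ro, rn)]
        for ro, rn in zip(old, new)
    ]


# Rank-1 toy example on a 2x2 frozen weight matrix.
w0 = [[1.0, 0.0], [0.0, 1.0]]
shared = ([[1.0], [0.0]], [[2.0, 0.0]])    # B_s (2x1), A_s (1x2)
specific = ([[0.0], [1.0]], [[0.0, 3.0]])  # B_d (2x1), A_d (1x2)
print(effective_weight(w0, shared, specific, importance=0.5))
```

In this reading, a high momentum keeps the shared factors close to what earlier domains learned, while the per-domain importance weight controls how much of that shared knowledge each domain actually uses.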

Abstract:
Deep neural networks (DNNs) are susceptible to backdoor attacks due to their black-box nature and lack of interpretability. Backdoor attacks intend to manipulate the model's prediction when hidden backdoors are activated by predefined triggers. Although considerable progress has been made in backdoor detection and removal at the model deployment stage, an effective defense against backdoor attacks during the training time is still under-explored. In this paper, we propose a novel training-time backdoor defense method called Learning from Distinction (LfD), allowing training a backdoor-free model on the backdoor-poisoned data. LfD uses a low-capacity model as a teacher to guide the learning of a backdoor-free student model via a dynamic weighting strategy. Extensive experiments on CIFAR-10, GTSRB and ImageNet-subset datasets show that LfD significantly reduces attack success rates to 0.67%, 6.14% and 1.42%, respectively, with minimal impact on clean accuracy (less than 1%, 3% and 1%).

Abstract:
Most existing NAS-based multi-modal classification (MMC-NAS) methods are optimized using classification accuracy. They cannot simultaneously provide multiple models with diverse preferences, such as model complexity and classification performance, to meet different users' demands. Combining MMC-NAS with multi-objective optimization is a natural way to address this issue. However, the main challenge of this solution is its high computational cost. For multi-objective optimization, the computing bottleneck is the Pareto front search. Some higher-quality MMC models (namely core structures, CSs) consisting of high-quality features and fusion operators are easier to identify. We find that CSs are closely related to the Pareto front (PF), i.e., the individuals lying on the PF contain the CSs. Based on this finding, we propose an efficient multi-objective neural architecture search for multi-modal classification that applies CSs to guide the PF search (CoMO-NAS). Experimental results demonstrate the effectiveness of our CoMO-NAS. Compared to state-of-the-art competitors on benchmark multi-modal tasks, we achieve comparable performance with lower model complexity in shorter search time.

Abstract:
In the task of image dehazing, it has been shown that high-quality codebook priors can be used to compensate for the distribution differences between real-world hazy images and synthetic hazy images, thereby helping the model improve its performance. However, because the concentration and distribution of haze in an image are irregular, approaches that simply replace or blend the prior information in the codebook with the original image features are inconsistent with this irregularity, which leads to non-ideal dehazing performance. To this end, we propose a haze concentration aware network (HcaNet), whose haze-concentration-aware module (HcaM) can reduce the information loss in the vector quantization stage and achieve an adaptive domain transfer for regions with different degrees of degradation. To further capture detailed texture information, we develop a frequency selective fusion module (FSFM) to facilitate the transmission of shallow information retained in haze areas to deeper layers, thereby enhancing the fusion with high-quality feature priors. Extensive evaluations demonstrate that the proposed model can be trained merely on synthetic hazy-clean pairs and still generalize effectively to real-world data. Several experimental results confirm that the proposed dehazing model outperforms state-of-the-art methods significantly on real-world images.

Abstract:
Multi-view clustering has proven to be highly effective in exploring consistency information across multiple views/modalities when dealing with large-scale unlabeled data. However, in the real world, multi-view data is often distributed across multiple entities, and due to privacy concerns, federated multi-view clustering solutions have emerged. Existing federated multi-view clustering algorithms often result in misalignment in feature representations among clients, difficulty in integrating information across multiple views, and poor performance in heterogeneous scenarios. To address these challenges, we propose HFMVC, a heterogeneity-aware federated deep multi-view clustering method. Specifically, HFMVC adaptively perceives the degree of heterogeneity in the environment and employs contrastive learning to explore consistency and complementarity information across clients' multi-view data. Besides, we seek consensus among clients where local data originates from the same view, incorporating a contrastive loss between local models and the global model during local training to adjust consistency among local models. Furthermore, we elucidate the sample representation logic for local clustering in different heterogeneous environments, identifying the degree of heterogeneity by computing the within-cluster sum of squares (WCSS) and the average inter-cluster distance (AICD). Extensive experiments verify the superior performance of HFMVC across both IID and Non-IID settings.
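
The two heterogeneity indicators named above can be illustrated with a short sketch. It assumes hard cluster labels and takes AICD to be the mean Euclidean distance over all pairs of cluster centroids; the abstract does not define AICD precisely, so that definition and the function name `wcss_and_aicd` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def wcss_and_aicd(features, labels):
    """Return (WCSS, AICD) for a hard cluster assignment."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    centroids = np.stack(
        [features[labels == c].mean(axis=0) for c in cluster_ids]
    )

    # WCSS: squared distance of each sample to its own centroid, summed.
    wcss = 0.0
    for c, centroid in zip(cluster_ids, centroids):
        diffs = features[labels == c] - centroid
        wcss += float((diffs ** 2).sum())

    # AICD (assumed definition): mean distance over all centroid pairs.
    k = len(cluster_ids)
    if k < 2:
        return wcss, 0.0
    pairwise = np.linalg.norm(
        centroids[:, None, :] - centroids[None, :, :], axis=-1
    )
    upper = pairwise[np.triu_indices(k, k=1)]
    return wcss, float(upper.mean())
```

A low WCSS combined with a high AICD indicates compact, well-separated local clusters; how HFMVC maps these two values to a degree of heterogeneity is not specified in the abstract.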

Abstract:
Simultaneous Localization and Mapping (SLAM) plays a pivotal role in autonomous driving and robotics. Existing methods often rely on hand-crafted feature extraction and cross-modal fusion techniques, resulting in limited feature representation capability and reduced robustness. To address this challenge, we introduce DeepPointMap2, a novel learning-based LiDAR-Visual SLAM architecture that leverages neural descriptors to tackle multiple SLAM sub-tasks in a unified manner. Our approach employs neural networks to extract multi-modal tokens, which are then adaptively fused by the Visual-Point Fusion Module to generate sparse 3D neural descriptors, ensuring precise and robust performance. As a pioneering work, our method achieves state-of-the-art localization performance among various Visual-, LiDAR-, and Visual-LiDAR-based methods on widely used benchmarks, as shown in the experimental results. Furthermore, the approach proves to be robust in scenarios involving camera failure and LiDAR obstruction.

Abstract:
Diversity plays a crucial role in Recommender Systems (RSs) as it ensures a wide range of recommended items, providing users with access to new and varied options. Without diversity, users often encounter repetitive content, limiting their exposure to novel choices. While significant efforts have been made to enhance recommendation diversification in static offline scenarios, relatively little attention has been given to online Conversational Recommender Systems (CRSs). However, the lack of recommendation diversity in CRSs is increasingly exacerbated over time by the dynamic user-system feedback loop, resulting in challenges such as the Matthew effect, filter bubbles, and echo chambers. To address these issues, we propose a novel paradigm, User-Centric Multi-Interest Learning for Conversational Movie Recommendation (CoMoRec), which aims to learn multiple user interests to improve result diversity for movie recommendations. First, CoMoRec automatically models various facets of user interests, including context-, graph-, and review-based interests, to explore a wide range of potential user intentions. Then, it leverages these multi-aspect user interests to accurately predict personalized and diverse movie recommendations and to generate fluent and informative responses during conversations. Extensive experiments on two public CRS-based movie datasets show that CoMoRec achieves new state-of-the-art performance and is superior in improving recommendation diversity in CRSs.

Abstract:
Multimodal sentiment analysis, which has garnered widespread attention in recent years, aims to predict human emotional states using multimodal data. Previous studies have primarily focused on enhancing multimodal fusion and integrating information across different modalities, while overlooking the impact of noisy data on the internal features of each single modality. In this paper, we propose the Enhanced experts with Uncertainty-Aware Routing (EUAR) method to address the influence of noisy data on multimodal sentiment analysis by capturing uncertainty and dynamically altering the network. Specifically, we introduce the Mixture of Experts approach into multimodal sentiment analysis for the first time, leveraging its properties under conditional computation to dynamically alter the network in response to different types of noisy data. Particularly, we refine the experts within the MoE framework to capture uncertainty in the data and extract clearer features. Additionally, a novel routing mechanism is introduced. Through our proposed U-loss, which utilizes the quantified uncertainty by experts, the network learns to route different samples to experts with lower uncertainty for processing, thus obtaining clearer, noise-free features. Experimental results demonstrate that our method achieves state-of-the-art performance on three widely used multimodal sentiment analysis datasets. Moreover, experiments on noisy datasets show that our approach outperforms existing methods in handling noisy data.

Abstract:
Cross-Domain Recommendation (CDR) has been proposed to improve the recommendation accuracy in the target domain (the sparser dataset) by benefiting from the auxiliary information transferred or the knowledge learned from one or many source domains (the denser datasets). However, most of the existing CDR approaches still suffer from the problem of negative transfer caused by undifferentiated knowledge transfer, and thus the recommendation accuracy in some domains, especially in the sparser domains, is still too low, which is not practical in real application scenarios. To address this problem, we propose a novel Active Masked Attention framework, i.e., AMA-CDR, for many-to-many CDR scenarios. Our AMA-CDR pursues a higher goal for CDR approaches, i.e., improving the recommendation performance in the target domain to achieve a practically usable level, which is meaningful and challenging in real CDR systems. Specifically, AMA-CDR adopts an end-to-end graph embedding to reduce the objective distortion between graph embedding and embedding combination. More importantly, we propose an active mask for the embedding combination to ease negative transfer, which leverages both the prior knowledge, i.e., data density, and the posterior knowledge, i.e., sample uncertainty. Extensive experiments conducted on two public datasets demonstrate that our proposed AMA-CDR models significantly outperform the state-of-the-art approaches and achieve the new goal.

Abstract:
Recent advances in vision-language pre-trained models like CLIP have greatly enhanced general domain image-text retrieval performance. This success has led scholars to develop methods for applying CLIP to Specific Domain Image-Text Retrieval (SDITR) tasks such as Remote Sensing Image-Text Retrieval (RSITR) and Text-Image Person Re-identification (TIReID). However, these methods for SDITR often neglect two critical aspects: the enhancement of modal-level distribution consistency within the retrieval space and the reduction of CLIP's computational cost during inference. To address these issues, this paper presents a novel framework built on CLIP, Accurate and lightweight learning for specific domain Image-text Retrieval (AIR). AIR incorporates a Modal-Level distribution Consistency Enhancement regularization (MLCE) loss and a Self-Pruning Distillation Strategy (SPDS) to improve retrieval precision and computational efficiency. The MLCE loss harmonizes the sample distance distributions within image and text modalities, fostering a retrieval space closer to the ideal state. SPDS employs a strategic knowledge distillation process to transfer deep multimodal insights from CLIP to a shallower level, maintaining only the essential layers for inference, thus achieving a lightweight model. Comprehensive experiments across various datasets in RSITR and TIReID reveal that MLCE loss secures optimal retrieval, while SPDS achieves a favorable balance between accuracy and computational demand during testing.

Abstract:
The increasing use of smartphones for capturing documents in various real-world conditions has underscored the need for robust document localization technologies. Current challenges in this domain include handling diverse document types, complex backgrounds, and varying photographic conditions such as low contrast and occlusion. However, there are currently no publicly available datasets containing these complex scenarios, and few methods demonstrate their capabilities on such complex scenes. To address these issues, we create a new comprehensive real-world document localization benchmark dataset that contains the complex scenarios mentioned above, and we propose a novel Real-world Document Localization Network (RDLNet) for locating targeted documents in the wild. RDLNet consists of an innovative light-SAM encoder and a masked attention decoder. Using the light-SAM encoder, RDLNet transfers the strong generalization capability of SAM to the document localization task. In the decoding stage, RDLNet exploits masked attention and the object query method to efficiently output triple-branch predictions consisting of corner point coordinates, instance-level segmentation areas, and categories of different documents without extra post-processing. We compare the performance of RDLNet with other state-of-the-art approaches for real-world document localization on multiple benchmarks. The results reveal that RDLNet remarkably outperforms contemporary methods, demonstrating its superiority in terms of both accuracy and practicability.

Abstract:
Unsupervised visible-infrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to significant modality discrepancy and lack of annotations. Many existing approaches utilize variants of bipartite graph global matching algorithms to address this issue, aiming to establish cross-modality correspondences. However, these methods may encounter mismatches due to significant modality gaps and limited model representation. To mitigate this, we propose a simple yet effective framework for USL-VI-ReID, which gradually establishes associations between different modalities. To measure the confidence that samples from different modalities belong to the same identity, we introduce a bidirectional-consistency criterion, which not only considers direct relationships between samples from different modalities but also incorporates potential hard negative samples from the same modality. Additionally, we propose a cross-modality correlation preserving module to further enhance the semantic representation of the model by maintaining consistency in correlations across modalities. Extensive experiments conducted on the public SYSU-MM01 and RegDB datasets demonstrate the superiority of our method over existing USL-VI-ReID approaches across various settings, despite its simplicity.

Abstract:
With the continuous development of imaging technology and the gradual expansion of the amount of image data, achieving high compression efficiency for high-resolution images is a challenging problem for storage and transmission. Image rescaling aims to reduce the original data amount through downscaling to facilitate data transmission and storage before encoding, and to reconstruct the quality through upscaling after decoding, which makes it a key technology for assisting high-ratio image compression. However, existing rescaling approaches focus more on reconstruction quality than on image compressibility. In repetitive observation scenarios, multi-temporal images produced by periodic observations provide an opportunity to alleviate the conflict between reconstruction quality and compressibility: historical images used as references indicate what information can be dropped at downscaling to reduce the information content of the downscaled image, and supply the dropped information to improve the image restoration quality at upscaling. Based on this consideration, we propose a novel multi-temporal assisted reference-based image rescaling framework (RefScale). Specifically, a referencing network is proposed to calculate the similarity map to provide the referencing condition, which is then injected into the conditional invertible neural network to guide the information drop at the downscaling stage and information fusion at the upscaling stage. Additionally, a low-resolution guidance loss is proposed to further constrain the data amount of the downscaled image. Experiments conducted on both satellite imaging and autonomous driving show the superior performance of our approach over state-of-the-art methods.

Abstract:
Computer vision models based on deep neural networks are proven to be vulnerable to adversarial attacks. Robustness distillation, as a countermeasure, takes both robustness challenges and efficiency challenges of edge models into consideration. However, most existing robustness distillations are data-driven, which can hardly be deployed in data-privacy scenarios. Also, the trade-off between robustness and accuracy tends to transfer from the teacher to the student, and there has been no discussion on mitigating this trade-off in the data-free scenario yet. In this paper, we propose a Data-free Experts-guided Robustness Distillation (DERD) to extend robustness distillation to the data-free paradigm, which offers three advantages: (1) Dual-level adversarial learning strategy achieves robustness distillation without real data. (2) Expert-guided distillation strategy brings a better trade-off to the student model. (3) A novel stochastic gradient aggregation module reconciles the task conflicts of the multi-teacher from a consistency perspective. Extensive experiments demonstrate that the proposed DERD can even achieve comparable results to data-driven methods.

Abstract:
Text to Motion Retrieval (TMR) is an emerging task that retrieves relevant motion sequences given natural language descriptions. The dominant approach learns a joint embedding space to measure global-level similarities. However, simple global embeddings are insufficient to represent complicated motion and textual details, such as the movement of specific body parts and the coordination among these body parts. In addition, most motion variations occur subtly and locally, resulting in semantic vagueness among these motions, which further presents considerable challenges in precisely aligning motion sequences with texts. To address these challenges, we propose a novel Modal-Enhanced Semantic Modeling (MESM) method, focusing on fine-grained alignment through enhanced modal semantics. Specifically, we develop a prompt-enhanced textual module (PTM) to generate detailed descriptions of specific body part movements, which comprehensively captures the fine-grained textual semantics for precise matching. We employ a skeleton-enhanced motion module (SMM) to effectively enhance the model's capability to represent intricate motions. This module leverages a graph convolutional network to model the intricate spatial dependencies among relevant body parts. To improve the sensitivity to subtle motions, we further propose a text-driven semantics interaction module (TSIM). The TSIM assigns motion features to a set of aggregated descriptors and employs cross-attention to aggregate discriminative motion embeddings guided by text, enabling precise semantic alignment between subtle motions and corresponding texts. Extensive experiments conducted on two widely used benchmark datasets, HumanML3D and KIT-ML, demonstrate the effectiveness of our proposed method. Our approach outperforms existing state-of-the-art retrieval methods, achieving significant Rsum improvements of 24.28% on HumanML3D and 25.80% on KIT-ML.
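
The abstract reports Rsum without defining it. In cross-modal retrieval, Rsum conventionally denotes the sum of Recall@K values over several cutoffs in both retrieval directions. The sketch below follows that convention; the cutoffs `(1, 5, 10)`, the one-to-one pairing of text `i` with motion `i`, and the function names are assumptions, and the paper's exact protocol may differ.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Percentage of queries (rows) whose paired item (same index) is in the top k."""
    similarity = np.asarray(similarity, dtype=float)
    # Rank of the paired item = number of candidates scored strictly higher.
    true_scores = np.diag(similarity)[:, None]
    ranks = (similarity > true_scores).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


def rsum(similarity, ks=(1, 5, 10)):
    """Sum of Recall@K over both directions (rows as queries, then columns)."""
    sim = np.asarray(similarity, dtype=float)
    return sum(recall_at_k(sim, k) + recall_at_k(sim.T, k) for k in ks)
```

Here `similarity[i, j]` is the score between text `i` and motion `j`. With three cutoffs and two directions, the maximum attainable Rsum under these assumptions is 600.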

Abstract:
Pose-agnostic anomaly detection refers to the situation where the pose of test samples is inconsistent with the training dataset, allowing anomalies to appear at any position in any pose. We propose a novel method IGSPAD to address this challenge. Specifically, we employ 3D Gaussian splatting to represent the normal information from the training dataset. To accurately determine the pose of the test sample, we introduce an approach termed Inverting 3D Gaussian Splatting (IGS) to address the challenge of 6D pose estimation for anomalous images. The pose derived from IGS is utilized to render a normal image well-aligned with the test sample. Subsequently, the image encoder of the Segment Anything Model is employed to identify discrepancies between the rendered image and the test sample, predicting the location of anomalies. Experimental results on the MAD dataset demonstrate that the proposed method significantly surpasses the existing state-of-the-art method in terms of precision (from 97.8% to 99.7% at pixel level and from 90.9% to 98.0% at image level) and efficiency.

Abstract:
The emergence of video captioning makes it possible to automatically generate natural language descriptions for a given video. However, generating detailed video descriptions that incorporate domain-specific information remains an unsolved challenge with significant research and application value, particularly in domains such as sports commentary generation. Moreover, sports event commentary goes beyond being a mere game report; it involves entertaining, metaphorical, and emotional descriptions. To promote the field of automatic sports commentary generation, in this paper we introduce a novel dataset, the Basketball Highlight Commentary (BH-Commentary), comprising approximately 4K basketball highlight videos with ground-truth commentaries from professional commentators. In addition, we propose an end-to-end framework as a benchmark for the basketball highlight commentary generation task, in which a lightweight and effective prompt strategy is designed to enhance alignment and fusion among visual and textual features. Experimental results on the BH-Commentary dataset demonstrate the validity of the dataset and the effectiveness of the proposed benchmark for sports highlight commentary generation.

Abstract:
The emergence of virtual reality technology has made stereoscopic omnidirectional images (SOI) easily accessible and prompted the need to evaluate their perceptual quality. At present, most stereoscopic omnidirectional image quality assessment (SOIQA) methods rely on one of the projection formats, i.e., Equirectangular Projection (ERP) or CubeMap Projection (CMP). However, while ERP provides global information and the less distorted CMP complements it by providing local structural guidance, research on leveraging both ERP and CMP in SOIQA remains limited, hindering a comprehensive understanding of both global and local visual cues. Motivated by this gap, our study introduces a novel dual-stream perception-driven network for blind quality assessment of stereoscopic omnidirectional images. By integrating both ERP and CMP, our method effectively captures both global and local information, marking the first attempt to bridge this gap in SOIQA, particularly through deep learning methodologies. We employ an inter-intra feature fusion module, which considers both the inter-complementarity between ERP and CMP and the intra-relationships within CMP images. This module dynamically and complementarily adjusts the contributions of features from both projections and effectively integrates them to achieve a more comprehensive perception. Besides, deformable convolution is employed to extract the local region of interest, simulating the orientation selectivity of the primary visual cortex. Finally, with the features of left and right views of SOI, a stereo cross attention module that simulates the binocular fusion mechanism is proposed to predict the quality score. Extensive experiments are conducted to evaluate our model and the state-of-the-art competitors, demonstrating that our model has achieved the best performance on the databases of LIVE 3D VR, SOLID, and NBU.

Abstract:
Motion transitions, which serve as bridges between two sequences of character animation, play a crucial role in creating long, variable animation for real-time 3D interactive applications. In this paper, we present a framework to produce hybrid character animation, which combines motion capture animation with physically simulated animation to seamlessly connect the preceding and following motion clips. In contrast to previous works that use interpolation for transitions, our physics-based approach inherently ensures physical validity, and both the transition moment of the source motion clip and the horizontal rotation of the target motion clip can be specified arbitrarily within a certain range, which achieves high responsiveness and wide latitude for user control. The control policy of the character can be trained automatically using only the motion capture data that requires transition, and is enhanced by our proposed Self-Behavior Cloning (SBC), an approach to improve the unsupervised reinforcement learning of motion transitions. We show that our framework can accomplish interactive transition tasks from a fully-connected state machine constructed from nine motion clips with high accuracy and naturalness.

Abstract:
Voice large language models (LLMs) cast voice synthesis as a language modeling task in a discrete space, and have demonstrated significant progress to date. Despite the recent success, the current development of voice LLMs in low-resource applications is hampered by data scarcity and high computational cost. In this work, we propose VoiceTuner, a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. Specifically, 1) to mitigate data scarcity, we leverage a large-scale unlabeled dataset and pre-train VoiceTuner-SSL without pre-defined applications, so that it can be fine-tuned on downstream tasks; 2) to further reduce the high training cost of complete fine-tuning, we introduce a multiscale transformer adapter as a plug-and-play module that updates only around 1% of the parameters. Experimental results demonstrate that VoiceTuner-SSL presents strong acoustic continuations, and VoiceTuner achieves state-of-the-art results in rich-resource TTS evaluation compared with competitive baseline models. Low-resource (1h, 10h, 30h) downstream applications including zero-shot TTS, instruction TTS, and singing voice synthesis demonstrate VoiceTuner's superior audio quality and style similarity with reduced data requirements and computational cost. Audio samples are available at https://VoiceTuner.github.io

Abstract:
As a critical component in graphic design, artistic posters are widely applied in the advertising and entertainment industry, thus the automatic poster creation from user-provided prompts has become increasingly desired recently. Although existing Text2Image methods create impressive images aligned with given prompts, they fail to generate ideal artistic posters, especially with Chinese texts. To create desired artistic Chinese posters including an aligned background, reasonable layouts, and stylized graphical texts from given prompts only, we propose an automatic poster creation framework, named Prompt2Poster. Our framework utilizes the capacity of the powerful Large Language Model (LLM) to extract user intention from provided prompts and generate the aligned background. Although only taking a user prompt as the input, linguistic, visual, and geometrical information is fully utilized in the framework, bringing the ability to fit different distributions. To achieve the use of multi-modal information in the framework, two carefully designed modules, Controllable Layout Generator (CLG) and Graphical Text Generator (GTG) are proposed, leading to accurate and pleasurable visual results. Comprehensive experiments demonstrate that our Prompt2Poster achieves superior performance, especially in text quality and visual harmony.

Abstract:
Due to the limitations of infrared image acquisition conditions, many essential tasks currently rely on visible images as the main source of training data. However, single-modal data makes it difficult for downstream networks to show optimal performance. Therefore, converting the more easily obtainable visible images into infrared images emerges as an effective remedy to alleviate the critical shortage of infrared data. Yet current methods typically focus solely on transferring visible images to infrared style, while overlooking the crucial infrared thermal features during cross-modal translation. To elevate the authenticity of cross-modal translation at the feature level, this paper introduces MappingFormer, a translation network based on frequency feature mapping and dual patch contrast, which achieves cross-modal image generation from visible to infrared. Specifically, the generator incorporates two branches, low-frequency feature mapping (LFM) and high-frequency feature refinement (HFR), both of which embed Swin Transformer blocks. The LFM branch captures the fuzzy structure of visible images, while the HFR branch focuses on mapping edge and texture features. The extracted dual-branch frequency features undergo refinement and fusion through cross-attention mechanisms. Additionally, a dual contrast learning mechanism based on feature patches (DFPC) is designed to infer effective mappings between unaligned cross-modal data. Numerous experimental results demonstrate the effectiveness of this method in cross-modal feature mapping and image generation from visible to infrared. This method holds significant potential for downstream tasks where infrared data is limited.

Abstract:
In various domains such as transportation, resource management, and weather forecasting, there is an urgent need for methods that can provide predictions over a sufficiently long time horizon to encompass the period required for decision-making and implementation. Compared to traditional time series forecasting, ultra-long time series forecasting requires enhancing the model's ability to infer long time series while keeping inference costs within an acceptable range. To address this challenge, we propose the Boundary-Aware Periodicity-based sparsification strategy for Ultra-Long time series forecasting (BAP-UL). This method effectively captures periodic features in time series and reorganizes inputs and outputs into shorter sub-sequences for improved prediction accuracy. In the paper, we investigate several commonly used benchmark datasets and demonstrate that the proposed method can yield comparable performance across them.
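
As a rough illustration of reorganizing a long series into shorter sub-sequences by period, the sketch below groups samples that share the same phase within each cycle. This is generic periodic downsampling, not BAP-UL itself: the boundary-aware component is not described in the abstract and is not modeled, and the function names and the choice to drop an incomplete trailing cycle are assumptions.

```python
import numpy as np


def split_by_period(series, period):
    """Reorganize a 1-D series into `period` sub-sequences, one per phase."""
    series = np.asarray(series, dtype=float)
    # Assumption: an incomplete trailing cycle is simply dropped.
    usable = (len(series) // period) * period
    # Row p holds the samples at phase p of every complete cycle.
    return series[:usable].reshape(-1, period).T


def merge_by_period(subsequences):
    """Inverse of split_by_period: interleave the phases back into one series."""
    return np.asarray(subsequences).T.reshape(-1)
```

Each sub-sequence is `period` times shorter than the original, so a forecaster applied per phase handles much shorter inputs and outputs; `merge_by_period` then interleaves the per-phase forecasts back into a single horizon.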

Abstract:
Graph-based fraud detection (GFD) has garnered increasing attention due to its effectiveness in identifying fraudsters within multimedia data such as online transactions, product reviews, or telephone voices. However, the prevalent in-distribution (ID) assumption significantly impedes the generalization of GFD approaches to out-of-distribution (OOD) scenarios, which is a pervasive challenge considering the dynamic nature of fraudulent activities. In this paper, we introduce the Heterophilic Graph Invariant Learning Framework (HGIF), a novel approach to bolster the OOD generalization of GFD. HGIF addresses two pivotal challenges: creating diverse virtual training environments and adapting to varying target distributions. Leveraging edge-aware augmentation, HGIF efficiently generates multiple virtual training environments characterized by generalized heterophily distributions, thereby facilitating robust generalization against fraud graphs with diverse heterophily degrees. Moreover, HGIF employs a shared dual-channel encoder with heterophilic graph contrastive learning, enabling the model to acquire stable high-pass and low-pass node representations during training. During the Test-time Training phase, the shared dual-channel encoder is flexibly fine-tuned to adapt to the test distribution through graph contrastive learning. Extensive experiments showcase HGIF's superior performance over existing methods in OOD generalization, setting a new benchmark for GFD in OOD scenarios.

Abstract:
Optimizing user Quality of Experience (QoE) for live video streaming remains a long-standing challenge. The Bitrate Control Algorithm (BCA) plays a crucial role in shaping user QoE. Recent advancements have seen RL-based algorithms overtake traditional rule-based methods, promising enhanced QoE optimization. Nevertheless, our comprehensive study reveals a pressing issue: current RL-based BCAs are limited to fixed and formulaic reward functions, rendering them ill-equipped to adapt to dynamic network environments and varied viewer preferences. In this work, we present AraLive, an automatically adaptive reward learning method designed for seamless integration with any existing learning-based approach in live streaming contexts. To achieve this goal, we make two main design choices. First, we construct a dedicated user QoE assessment dataset for live streaming, which includes thousands of videos with millisecond-level metrics. Second, we custom-design an adversarial model that aligns human feedback with actual network scenarios. We deploy AraLive in practical video streaming systems and compare it against a series of state-of-the-art BCAs. The experimental results demonstrate that AraLive not only elevates overall QoE but also exhibits remarkable adaptability to varied user preferences.

Abstract:
We introduce OpenLEAF, a new benchmark designed for the emerging open-domain interleaved image-text generation task. This task aims to generate arbitrarily interleaved multimodal content from input queries. It goes beyond commonly seen single-modality image or text generation, thereby enabling various novel applications by creating content such as visual storybooks and how-to instructions. Despite the importance of this task, no established benchmark exists due to the challenges in defining evaluation scenarios and formulating effective metrics. To introduce and facilitate the new task of interleaved image-text generation, we create a new dataset covering queries with various input-output formats and 10 different application scenarios. We also propose a novel evaluation pipeline named "detection-summarization-scoring," which breaks down the evaluation into multiple reasoning steps. This pipeline leverages large multimodal models (LMMs) to thoroughly evaluate ten aspects of the generated content, which leads to the final rating. With experiments on a proposed agent system, we demonstrate that our evaluation method aligns closely with human judgments, thus together with the dataset, offering the research community a valuable benchmark for exploring interleaved image-text generation.

Abstract:
Visual similarity estimation plays a fundamental role in both human cognition and multimedia information processing, as it is the basis for many applications ranging from image search and recommendation to visual content generation. Existing computational models to assess visual similarity often diverge from human perception, as they are typically trained solely on image data without information about how humans perceive image similarity. Here, we present an approach for learning perceptual visual similarity from brain recordings obtained via electroencephalogram (EEG). Our approach establishes a mapping between similarity reflected in the human cognitive system and a latent image representation. We evaluate the approach in two tasks: first, predicting visual distance from EEG data, and second, adjusting a latent representation of a generative model to generate new images at a predicted distance from a given source image. Experiments demonstrate that the predicted distances from EEG closely align with the ground truth distances, and images generated using these predicted distances closely resemble the ground truth images. These findings open new possibilities for leveraging signals measured from human cognition to infer similarity as opposed to using only content-based models.

Abstract:
Three-dimensional point cloud data is one of the most extensively used data representations today, favored in various fields for its realistic and lifelike visual effects. However, the substantial volume of data poses significant challenges for storage and transmission. To advance point cloud compression (PCC) technology, we develop a learning-based PCC algorithm library, namely LearningPCC. To our knowledge, this is the first comprehensive set of algorithms that is compatible with all types of point cloud data. This PyTorch library incorporates eleven learning-based algorithms that address both geometry and attribute compression of point cloud data. We categorize the existing methods into six main classes and thoroughly introduce and analyze the principles of these algorithms. Moreover, we conduct performance evaluations using point clouds with various densities, offering detailed test results on several compression metrics, such as RD curves, BD-BR gains, compression ratio improvements, and encoding times. We will provide researchers with convenient access to these methods, replication code, and experimental results. Our commitment includes maintaining and updating these algorithms to offer researchers the latest in compression technologies.

Abstract:
We present MIRACLE, a system for online, interpretable visual concept and video action recognition. Through a chat interface, users query the recognition system with an uploaded image or video. For images, MIRACLE returns concept predictions from its structured knowledge base, justifying its predictions with heatmaps and natural language-based attribute detections. For videos, MIRACLE predicts an action and justifies its prediction with time-varying entity-entity relations. With its ability to learn new concepts in an online, few-shot manner and its support for dynamic changes to its knowledge base, MIRACLE represents a step forward in interpretable multimodal learning systems.

Abstract:
The continual advancements in Generative Artificial Intelligence have created substantial hurdles for accurate deepfake detection, exposing the limitations of currently popular detection methods in content-driven video-level deepfake detection scenarios. In this paper, we present our solution to the Video-Level Deepfake Detection task. Our empirical findings demonstrate that modeling correlations between audio and visual modalities is important for video-level deepfake detection. Therefore, we introduce the Audio-Visual Local-Global Neural Network (AV-LGNN), whose core design is the proposed Audio-Visual Local-Global Interaction (AV-LGI) Module. The AV-LGI Module is composed of three stages: Local Intra-Region Interaction, Global Inter-Region Interaction, and Local-Global Interaction. Together, these stages better capture detailed information at the local level and efficiently learn fine-grained inter-modality correlations in video deepfake detection with lower computational overhead. We further propose an adaptive modality selection strategy to facilitate model learning. In addition, a variety of data augmentation techniques are incorporated into the audio and visual branches to enhance the robustness of AV-LGNN. The experimental results verify the effectiveness of our model.

Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have showcased their ability to handle tasks involving single images, such as generating detailed descriptions and answering related queries. These models have significantly pushed the boundaries of multimodal content comprehension and generation. However, when confronted with more complex multimodal contexts, particularly those involving interleaved text and multiple images, MLLMs often face difficulties in effectively processing and understanding these complex contexts. Furthermore, MLLMs exhibit weaknesses in instruction following, especially when dealing with tasks that require reasoning about and prioritizing specific details. In this paper, we design a novel approach, InstructFusion, which introduces an enhanced module Fusing Former to improve the model's understanding of relationships among multiple images within convoluted multimodal tasks. This ensures more accurate and contextually relevant responses. Additionally, we adopt a two-stage training process, first developing general capabilities in the Fusing Former and then fine-tuning with LoRA on instruction-focused datasets to enhance instruction following ability while minimizing costs and preventing forgetting. Empirical results, including a second-place finish in the ACM Multimedia 2024 Demonstrative Instruction Following Challenge, demonstrate the effectiveness of our proposed method.

Abstract:
Micro-expressions (MEs) are involuntary and quickly displayed facial expressions that reveal subtle psychological activities. Most previous research has focused on two separate tasks: micro-expression spotting and recognition. We aim to propose a high-precision "spotting+recognition" method that can spot ME intervals from long videos and recognize their emotional categories. Due to the occurrence sparsity of MEs, there is a significant imbalance between the number of micro-expression intervals and non-micro-expression intervals in long videos. This imbalance makes it challenging for models trained using conventional strategies to distinguish true MEs from noise samples caused by head movements, blinking, and macro-expressions, resulting in a high false-positive rate and reducing the overall performance. We reduce the number of smooth segments to alter the data distribution within the non-micro-expression (non-ME) category. This adjustment enables the model to focus more on the subtle differences between noise samples and ME samples. To achieve this, we design an ingenious training data preparation strategy: using false positive samples from the initial spotting results as non-ME category samples, and using true positive and false negative samples from the initial spotting as emotion category samples. These are combined as the training data, creating a recognition model capable of both emotion classification and non-ME category determination. Additionally, we propose a three-stage micro-expression analysis method comprising ME spotting, ME recognition, and a non-ME interval removal module. Our method is validated through five-fold cross-validation experiments on the CAS(ME)² and SAMM Long Video datasets, achieving an overall STRS metric of 0.16, which significantly outperforms baseline methods and demonstrates the effectiveness of our approach.
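
The training data preparation strategy described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the inclusive frame-interval format, and the IoU threshold of 0.5 used to decide whether a spotted interval matches a ground-truth ME are all assumptions made for the example.

```python
def interval_iou(a, b):
    """IoU of two frame intervals (start, end), both ends inclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def build_training_samples(spotted, ground_truth, iou_threshold=0.5):
    """Label spotted intervals for the recognition model (hypothetical helper).

    spotted:      list of (start, end) from the initial spotting stage
    ground_truth: list of (start, end, emotion) annotations
    Returns a list of ((start, end), label) pairs.
    """
    samples = []
    matched_gt = set()
    for interval in spotted:
        label = "non-ME"  # false positives become non-ME samples
        for gt_index, (gt_start, gt_end, emotion) in enumerate(ground_truth):
            if gt_index in matched_gt:
                continue
            if interval_iou(interval, (gt_start, gt_end)) >= iou_threshold:
                label = emotion  # true positive: inherit the emotion label
                matched_gt.add(gt_index)
                break
        samples.append((interval, label))
    # False negatives (missed ground-truth MEs) are added as emotion samples.
    for gt_index, (gt_start, gt_end, emotion) in enumerate(ground_truth):
        if gt_index not in matched_gt:
            samples.append(((gt_start, gt_end), emotion))
    return samples


spotted = [(10, 20), (50, 60), (100, 110)]
ground_truth = [(12, 22, "surprise"), (200, 210, "disgust")]
samples = build_training_samples(spotted, ground_truth)
```

In this toy input, the first spotted interval overlaps an annotated ME and receives its emotion label, the other two become non-ME samples, and the missed annotation is appended as an emotion sample.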

Abstract:
The growing prevalence of psychological disorders underscores the critical importance of mental health research in today's society. In psychotherapy, particularly Acceptance and Commitment Therapy (ACT), cognitive exercises employing mental imagery are used to manage negative thoughts. However, the challenge of maintaining vivid imagery diminishes their therapeutic effectiveness. Virtual reality (VR) offers untapped potential for increasing engagement and therapeutic efficacy, yet how to effectively leverage VR to enhance traditional cognitive exercises with mental imagery remains underexplored. This study investigates effective HCI design and the comparative efficacy of a VR-mediated exercise, grounded in ACT, for promoting cognitive defusion to address negative thoughts. Using a co-design approach with clinicians and potential users (postgraduate students), we developed a VR system that materializes negative thoughts into tangible objects. This allows users to visually modify and transpose these objects onto a surface, facilitating mental detachment from negative thoughts. In an evaluation study with 20 non-clinical participants, divided into VR and mental imagery groups, we assessed the impact of the cognitive defusion exercise on their perception of negative thoughts and psychological measures using standardized questionnaires. Results show improvement in both groups, with significant enhancements in negative thought perception and mental detachment from negative thoughts exclusively in the VR group, whereas the mental imagery group did not demonstrate significant changes. Interviews emphasize VR's capability to present vivid visualizations of negative thoughts effortlessly, highlighting its effectiveness and engagement in psychotherapy to facilitate cognitive exercises.

Abstract:
Current LiDAR-only 3D detection methods are limited by the sparsity of point clouds. Previous work used pseudo points generated by depth completion to supplement the LiDAR point cloud, but the pseudo point sampling process was complex, and the distribution of pseudo points was uneven. Meanwhile, due to the imprecision of depth completion, the pseudo points suffer from noise and local structural ambiguity, which limit further improvement of detection accuracy. This paper presents SQDNet, a novel framework designed to address these challenges. SQDNet incorporates two key components. The first, SQD, achieves sparse-to-dense matching via grid position indices, allowing for rapid sampling of large-scale pseudo points directly on the dense depth map and thus streamlining the data preprocessing pipeline. It also uses the density of LiDAR points within these grids to alleviate the uneven distribution and noise problems of pseudo points. The second, a sparse 3D backbone, is designed to capture long-distance dependencies, thereby improving voxel feature extraction and mitigating local structural blur in pseudo points. The experimental results validate the effectiveness of SQD and show considerable detection performance for difficult-to-detect instances on the KITTI test.

Abstract:
Supervised cross-modal retrieval (CMR) achieves excellent performance thanks to the semantic information provided by its labels, which helps to establish semantic correlations between samples from different modalities. However, in real-world scenarios, there often exists a large amount of unlabeled and unpaired multimodal training data, rendering existing methods infeasible. To address this issue, we propose a novel partially aligned cross-modal retrieval method called Optimal Transport-based Prototype Alignment Learning (OTPAL). Because directly establishing matching correlations between unannotated, unaligned cross-modal samples is computationally expensive, we instead establish matching correlations between shared prototypes and samples. Specifically, we employ the optimal transport algorithm to establish cross-modal alignment information between samples and prototypes, and then minimize the distance between samples and their corresponding prototypes through a specially designed prototype alignment loss. As an extension of this paper, we also extensively investigate the influence of incomplete multimodal data on cross-modal retrieval performance under the partially aligned setting proposed above. To address this more challenging scenario, we propose a scalable prototype-based neighbor feature completion method, which better captures the correlations between incomplete samples and neighbor samples through a cross-modal self-attention mechanism. Experimental results on four benchmark datasets show that our method can obtain satisfactory accuracy and scalability in various real-world scenarios.
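
To illustrate how optimal transport can align samples with shared prototypes, the sketch below uses entropy-regularized optimal transport (Sinkhorn iterations) with uniform marginals. The abstract does not state which solver, cost, or marginals OTPAL uses, so the squared Euclidean cost, the regularization strength `reg`, and all function names here are assumptions for illustration only.

```python
import math


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals.

    cost[i][j] is the cost of assigning sample i to prototype j.
    Returns a transport plan with row sums near 1/n and column sums 1/k.
    """
    n, k = len(cost), len(cost[0])
    row_mass, col_mass = 1.0 / n, 1.0 / k
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_prototypes(samples, prototypes, reg=0.1):
    """Return, for each sample, the prototype index with the most transported mass."""
    cost = [[squared_distance(s, p) for p in prototypes] for s in samples]
    plan = sinkhorn_plan(cost, reg=reg)
    return [max(range(len(prototypes)), key=lambda j: row[j]) for row in plan]


# Toy features: two samples near each of two prototypes.
samples = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
assignments = assign_to_prototypes(samples, prototypes)
```

Matching n samples against k prototypes needs an n-by-k cost matrix rather than the n-by-n matrix required for direct sample-to-sample matching, which is the computational saving the abstract alludes to.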

Abstract:
Pre-trained Vision-Language Models (VLMs) have shown great ability in various Vision-Language tasks. However, these VLMs exhibit inherent vulnerabilities to transferable adversarial examples, which could potentially undermine their performance and reliability in real-world applications. Cross-modal interactions have been demonstrated to be the key to boosting adversarial transferability, but their utilization is limited in existing multimodal adversarial attacks. Stable Diffusion, which contains multiple cross-attention modules, possesses great potential for facilitating adversarial transferability by leveraging abundant cross-modal interactions. Therefore, we propose a Multimodal Diffusion-based Attack (MDA), which conducts adversarial attacks against VLMs using Stable Diffusion. Specifically, MDA initially generates adversarial text, which is subsequently utilized to optimize the adversarial image during the diffusion process. Besides leveraging adversarial text in calculating the downstream loss, MDA also takes it as the guiding prompt in adversarial image generation during the denoising process, which enriches the ways of cross-modal interaction and thus strengthens adversarial transferability. Compared with pixel-based attacks, MDA introduces perturbations in the latent space rather than pixel space to manipulate high-level semantics, which is also beneficial to improving adversarial transferability. Experimental results demonstrate that the adversarial examples generated by MDA are highly transferable across different VLMs on different downstream tasks, surpassing state-of-the-art methods by a large margin.

Abstract:
Low-light environments introduce high-intensity noise into images. Because they contain fine details with reduced noise, near-infrared/flash images can serve as guidance to facilitate noise removal. However, existing fusion-based methods fail to effectively suppress artifacts caused by inconsistency between guidance/noisy image pairs and do not fully exploit the useful information contained in guidance images. In this paper, we propose a robust and flexible fusion network (RFFNet) for low-light image denoising. Specifically, we present a multi-scale inconsistency calibration module to address inconsistency before fusion by first mapping the guidance features to multi-scale spaces and calibrating them with the aid of pre-denoising features in a coarse-to-fine manner. Furthermore, we develop a dual-domain adaptive fusion module to adaptively extract useful high-/low-frequency signals from the guidance features and then highlight the informative frequencies. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on NIR-guided RGB image denoising and flash-guided no-flash image denoising.

Abstract:
Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. Existing works are confined to locating information within a single page and lack support for cross-page question-and-answer interactions. Furthermore, the token length limitation on model inputs can lead to the truncation of answer-relevant segments. In this study, we present CREAM, an innovative methodology that focuses on high-performance retrieval and integrates relevant multimodal document information to effectively address this critical issue. To overcome the limitations of current text embedding similarity methods, we first employ a coarse-to-fine retrieval and ranking approach. The coarse phase calculates the similarity between the query and text chunk embeddings, while the fine phase involves multiple rounds of grouping and ordering with a large language model to identify the text chunks most relevant to the query. Subsequently, integrating an attention pooling mechanism for multi-page document images into the vision encoder allows us to effectively merge the visual information of multi-page documents, enabling the multimodal large language model (MLLM) to simultaneously process both single-page and multi-page documents. Finally, we apply various parameter-efficient tuning methods to enhance document visual question-answering performance. Experiments demonstrate that our approach secures state-of-the-art results across various document datasets.
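
The coarse phase described above, which ranks text chunks by embedding similarity to the query, might look like the following sketch. The abstract does not specify the similarity measure or the embedding model, so cosine similarity, the toy three-dimensional embeddings, and the names `coarse_retrieve` and `top_k` are assumptions; the LLM-based fine phase is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coarse_retrieve(query_embedding, chunk_embeddings, top_k=2):
    """Return indices of the top_k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(chunk_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy embeddings standing in for the output of a real text encoder.
query = [1.0, 0.0, 0.0]
chunks = [
    [0.9, 0.1, 0.0],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal
    [0.7, 0.7, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction
]
candidates = coarse_retrieve(query, chunks, top_k=2)
```

The candidate indices returned here would then be handed to the fine phase for grouping and reordering.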

Abstract:
The challenge of bias in visual question answering (VQA) has gained considerable attention in contemporary research. Various intricate bias dependencies, such as modalities and data imbalances, can cause semantic ambiguities to generate shifts in the feature space of VQA instances. This phenomenon is referred to as "VQA Hallucinations". Such distortions can cause hallucination distributions that deviate significantly from the true data, resulting in the model producing factually incorrect predictions. To address this challenge, we propose a robust Multi-Space Co-debias Learning (MSCD) approach for combating VQA hallucinations, which effectively mitigates bias-induced instance and distribution shifts in multi-space under a unified paradigm. Specifically, we design bias-aware and prior-aware debias constraints by utilizing the angle and angle margin of the spherical space to construct bias-prior-instance constraints, thereby refining the manifold representation of instance de-bias and distribution de-dependence. Moreover, we leverage the inherent overfitting characteristics of Euclidean space to introduce bias components from biased examples and modal counterexample injection, further assisting in multi-space robust learning. By integrating homeomorphic instances in different spaces, MSCD could enhance the comprehension of structural relationships between semantics and answer classes, yielding robust representations that are not solely reliant on training priors. In this way, our co-debias paradigm generates more robust representations that effectively mitigate biases to combat hallucinations. Extensive experiments on multiple benchmark datasets consistently demonstrate that the proposed MSCD method outperforms state-of-the-art baselines.

Abstract:
Recently, the tensor Schatten p-norm has achieved impressive performance for fast multi-view clustering [57]. This is primarily ascribed to the superiority of the tensor Schatten p-norm in exploring high-order structure information among views. However, 1) the tensor Schatten p-norm treats different singular values equally, so the larger singular values, which correspond to certain significant feature information (i.e., prior information), are not fully utilized; 2) the tensor Schatten p-norm also ignores ranking the core entries of the core tensor, which may contain noise information; 3) existing methods select fixed anchors or update anchors by averaging to construct the neighbor bipartite graphs, greatly limiting the flexibility and expressiveness of anchors. To break these limitations, we propose a novel Improved Weighted Tensor Schatten p-Norm for Fast Multi-view Graph Clustering (IWTSN-FMGC). Specifically, to address the first two limitations, we propose an improved weighted tensor Schatten p-norm to dynamically rank the core tensor and automatically shrink singular values. As a result, the improved weighted tensor Schatten p-norm has the potential to more effectively leverage low-rank structures and prior information, thereby enhancing robustness compared to current tensor Schatten p-norm methods. Further, the designed adaptive neighbor bipartite graph learning can encode the local manifold structure information more flexibly and expressively than existing anchor selection and averaged anchor updating. Extensive experiments validate the effectiveness and superiority of our method across multiple benchmark datasets.

Abstract:
Multi-view clustering (MVC) constitutes a distinct approach to data mining within the field of machine learning. Due to limitations in the data collection process, missing attributes are frequently encountered. However, existing MVC methods primarily focus on missing instances, showing limited attention to missing attributes. A small number of studies employ the reconstruction of missing instances to address missing attributes, potentially overlooking the synergistic effects between the instance and feature spaces, which could lead to distorted imputation outcomes. Furthermore, current methods uniformly treat all missing attributes as zero values, thus failing to differentiate between real and technical zeroes, potentially resulting in data over-imputation. To mitigate these challenges, we introduce a novel Reliable Attribute-Missing Multi-View Clustering method (RAM-MVC). Specifically, feature reconstruction is utilized to address missing attributes, while similarity graphs are simultaneously constructed within the instance and feature spaces. By leveraging structural information from both spaces, RAM-MVC learns a high-quality feature reconstruction matrix during the joint optimization process. Additionally, we introduce a reliable imputation guidance module that distinguishes between real and technical attribute-missing events, enabling discriminative imputation. The proposed RAM-MVC method outperforms nine baseline methods, as evidenced by real-world experiments using single-cell multi-view data.

Abstract:
Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smoothing issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages the details of the 3D contents. In this paper, to address the issue, we propose DreamLCM, which incorporates the Latent Consistency Model (LCM). DreamLCM leverages the powerful image generation capabilities inherent in LCM, enabling the generation of consistent and high-quality guidance, i.e., predicted noises or images. Powered by the improved guidance, the proposed method can provide accurate and detailed gradients to optimize the target 3D models. In addition, we propose two strategies to enhance the generation quality further. First, we propose a guidance calibration strategy, utilizing an Euler solver to calibrate the guidance distribution and accelerate the convergence of 3D models. Second, we propose a dual timestep strategy, increasing the consistency of guidance and optimizing 3D models from geometry to appearance in DreamLCM. Experiments show that DreamLCM achieves state-of-the-art results in both generation quality and training efficiency.

Abstract:
Masked image modeling (MIM), as a self-supervised learning paradigm in computer vision, has gained widespread attention among researchers. MIM operates by training the model to predict masked patches of the image. Given the sparse nature of image semantics, it is imperative to devise a masking strategy that steers the model towards reconstructing high-semantic regions. However, conventional mask strategies often miss these high-semantic regions or lack alignment between the masks and semantics. To solve this, we propose the Global Patch-wise Attention (GPA) framework, a transferable and efficient framework for MIM pre-training. We observe that the attention between patches can serve as a metric for identifying high-semantic regions, which can guide the model to learn more effective representations. Therefore, we first define the global patch-wise attention via vision transformer blocks. Then we design a soft-to-hard mask generation scheme to guide the model to gradually focus on high-semantic regions identified by GPA (GPA as a teacher). Finally, we design an extra task to predict GPA (GPA as a feature). Experiments conducted under various settings demonstrate that our proposed GPA framework enables MIM to learn better representations, which benefit the model across a wide range of downstream tasks. Furthermore, our GPA framework can be easily and effectively transferred to various MIM architectures.

Abstract:
Existing radiology report generation (RRG) studies mostly adopt autoregressive (AR) approaches to produce textual descriptions token-by-token for specific clinical radiographs, where they are susceptible to error propagation problems if irrelevant contents are half-way generated, leading to potential ill-presenting of precise diagnoses, especially when there exist complicated abnormalities in radiographs. Although the non-AR paradigm, e.g., diffusion model, provides an alternative solution to tackle the problem from AR by generating all contents in parallel, the mechanism of using Gaussian noise in existing diffusion models still has significant room to improve when such models are used in particular circumstances, i.e., providing proper guidance in controlling noises in the diffusive process to ensure precise report generation. In this paper, we propose to conduct RRG with diffusion networks by controlling the noise with task-specific features, which leverages irrelevant visual and textual information as noise rather than the stochastic Gaussian noise, and allows the diffusion networks to filter particular information through iterative denoising, thus performing a precise and controlled report generation process. Experiments on IU X-Ray and MIMIC-CXR demonstrate the superiority of our approach compared to strong baselines and state-of-the-art solutions. Human evaluation and noise type analysis show that comprehensive noise control greatly helps diffusion networks to refine the generation of global and local report contents.

Abstract:
Repairing deep neural networks (DNNs) to maintain their performance during deployment presents significant challenges due to the potential occurrence of unknown but common environmental corruptions. Most existing DNN repair methods focus only on repairing a DNN for each corruption separately, lacking the ability to generalize to the myriad corruptions of the ever-changing deployment environment. In this work, we propose to repair DNNs from a novel perspective, i.e., Learning to Repair (L2R), where the repair of the target DNN is realized as a general learning-to-learn, a.k.a. meta-learning, process. Specifically, observing that different corruptions are correlated in their data distributions, we propose to utilize previous DNN repair experiences as tasks for meta-learning how to repair the target corruption. By meta-learning from different tasks, L2R learns meta-knowledge that summarizes how the DNN is repaired under various environmental corruptions. The meta-knowledge essentially serves as a general repair prior that enables the DNN to quickly adapt to unknown corruptions, thus making our method generalizable to different types of corruptions. Practically, L2R benefits DNN repair with a general pipeline, yet tailoring meta-learning for repairing DNNs is not trivial. By re-designing the meta-learning components for the DNN repair context, we further instantiate the proposed L2R strategy into a concrete model named MetaRepair under a pragmatic assumption of experience availability. We conduct comprehensive experiments on corrupted CIFAR-10 and Tiny-ImageNet by applying MetaRepair to repair DenseNet, ConvNeXt and VAN. The experimental results confirm the superior repair and generalization capability of our proposed L2R strategy under various environmental corruptions.

Abstract:
Currently, information processing in the spatial domain alone has intrinsic limitations that hinder improvements in the effectiveness (performance) of deep networks for single-image deraining. Moreover, the structures and learning processes of deraining networks are becoming increasingly intricate, leading to challenges in keeping structures lightweight and in training and testing efficiency. We propose a lightweight multi-domain multi-attention progressive network (M2PN) to handle these challenges. For performance improvement, the M2PN backbone applies a simple progressive CNN-based structure consisting of S identical recursive M2PN modules. This recursive backbone with a skip connection mechanism allows for better gradient flow and helps to effectively capture low-to-high-level/scale spatial features in the progressive structure to improve contextual information acquisition. To further complement the acquired spatial information for better deraining, we conduct spectral analysis on the frequency energy distribution of rain streaks, and theoretically present the relationship between the spectral bandwidths and the unique falling characteristics and special morphology of rain streaks. We present the frequency-channel attention (FcA) mechanism and the spatial-channel attention (ScA) mechanism to better fuse frequency-channel features and spatial features in order to distinguish and remove rain streaks. The simple recursive network structure and effective multi-domain multi-attention mechanism enable M2PN to achieve superior performance and facilitate fast convergence during training. Furthermore, the M2PN structure, with a small number of network components, shallow network channels, and few convolutional kernels, requires only 168K parameters, which is 1 to 2 orders of magnitude fewer than existing SOTA networks. The experimental results demonstrate that even with so few network parameters, M2PN still achieves the best overall performance.

Abstract:
We introduce HmPEAR, a novel dataset crafted for advancing research in 3D Human Pose Estimation (3D HPE) and Human Action Recognition (HAR), with a primary focus on outdoor environments. This dataset offers a synchronized collection of imagery, LiDAR point clouds, 3D human poses, and action categories. In total, the dataset encompasses over 300,000 frames collected from 10 distinct scenes and 25 diverse subjects. Among these, 250,000 frames of data contain 3D human pose annotations captured using an advanced motion capture system and further optimized for accuracy. Furthermore, the dataset annotates 40 types of daily human actions, resulting in over 6,000 action clips. Through extensive experimentation, we have demonstrated the quality of HmPEAR and highlighted the challenges it presents to current methodologies. Additionally, we propose baselines leveraging sequential images and point clouds for 3D HPE and HAR, which underscore the mutual reinforcement between them, highlighting the potential for cross-task synergies. The dataset is available at http://www.lidarhumanmotion.net/hmpear.

Abstract:
Multi-view molecular property prediction has received wide attention in recent years owing to its potential for downstream tasks in the field of drug discovery. However, the consistency of different molecular view representations and the full utilization of complementary information among them in existing multi-view molecular property prediction methods remain to be further explored. Furthermore, most current methods focus on generating global representations at the graph level with information from different molecular views (e.g., 2D and 3D views), assuming that the information in each view corresponds to the others. In fact, it is not unusual for conformation changes or computational errors, for example, to lead to discrepancies between views. To address these issues, we propose a new Cross-View contrastive unification guides Generative Molecular pre-trained model, called MolCVG. We first focus on common and private information extraction from 2D graph views and 3D geometric views of molecules, minimizing the impact of noise in private information on subsequent strategies. To exploit both types of information in a more refined way, we propose a cross-view contrastive unification strategy to learn cross-view global information and guide the reconstruction of masked nodes, thus effectively optimizing global features and local descriptions. Extensive experiments on real-world molecular datasets demonstrate the effectiveness of our approach for the molecular property prediction task.

Abstract:
Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) aims to achieve pixel-level segmentation of novel categories across various domains by transferring knowledge from the source domain using limited samples. The main challenge in CD-FSS is bridging the inter-domain gap and addressing the scarcity of labeled samples in the target domain to enhance both generalization and discriminative abilities. Current methods usually resort to additional networks and complex strategies to embrace domain variability, which inevitably increases training costs. This paper proposes a Dual-Branch Fusion with Style Modulation (DFSM) method to tackle these issues. Specifically, we deploy a parameter-free Grouped Style Modulation (GSM) layer that captures and adjusts a wide spectrum of potential feature distribution changes, thus improving the model's domain transferability. Additionally, to overcome data limitations and enhance adaptability in the target domain, we develop a Dual-Branch Fusion (DBF) strategy that achieves accurate pixel-level predictions by combining predicted probability maps through weighted fusion, thereby enhancing the discriminative ability of the model. We evaluate the proposed method on multiple widely used benchmark datasets, including FSS-1000, ISIC, Chest X-Ray, and Deepglobe, and demonstrate superior performance compared to state-of-the-art methods in CD-FSS tasks.
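As a rough illustration of the weighted fusion of predicted probability maps mentioned above, the following minimal sketch combines two per-pixel foreground-probability maps with a convex weight and thresholds the result. The function names, the fixed weight, and the 0.5 threshold are assumptions made for illustration; the abstract does not specify how DBF chooses its weights.

```python
def fuse_probability_maps(prob_a, prob_b, weight_a=0.5):
    """Pixel-wise convex combination of two foreground-probability maps.

    prob_a, prob_b: equally sized 2D lists with values in [0, 1].
    weight_a: hypothetical fusion weight for the first branch.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    return [
        [weight_a * pa + weight_b * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(prob_a, prob_b)
    ]


def to_mask(prob_map, threshold=0.5):
    """Binarize a probability map into a 0/1 segmentation mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Toy 2x2 predictions from two hypothetical branches.
branch_a = [[0.9, 0.2], [0.6, 0.1]]
branch_b = [[0.7, 0.6], [0.2, 0.3]]

fused = fuse_probability_maps(branch_a, branch_b, weight_a=0.5)
mask = to_mask(fused)
```

In this toy case the two branches disagree on two pixels; averaging pulls both below the threshold, so only the pixel on which both branches are confident stays in the mask.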

Abstract:
The effectiveness of contrastive-learning-based Knowledge Distillation (KD) has sparked renewed interest in relational distillation, but these methods typically focus on angle-wise information from the penultimate layer. We show that exploiting relational information derived from intermediate layers further improves the effectiveness of distillation. We also find that adding distance-wise relational information to contrastive-learning-based methods negatively impacts distillation quality, revealing an implicit contention between angle-wise and distance-wise attributes. Therefore, we propose a Multi-stage Decoupled Relational (MDR) KD framework equipped with adaptive stage selection to identify the stages that maximize the efficacy of transferring relational knowledge. The MDR framework decouples angle-wise and distance-wise information to resolve their conflicts while still preserving complete relational knowledge, thereby improving transfer efficiency and distillation quality. To evaluate the proposed method, we conduct extensive experiments on multiple image benchmarks, i.e., CIFAR-100, ImageNet and Pascal VOC, covering various tasks, i.e., classification, few-shot learning, transfer learning and object detection. Our method exhibits superior performance under diverse scenarios, surpassing the state of the art by an average improvement of 1.22% on CIFAR-100 across extensively utilized teacher-student network pairs.
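To make the two kinds of relational information concrete, the sketch below computes distance-wise relations (mean-normalized pairwise Euclidean distances) and an angle-wise relation (cosine of the angle formed by a triplet of embeddings). These are common definitions from the relational distillation literature and are used here as an assumption; the abstract does not state the exact form used by MDR, and all names are illustrative.

```python
import math


def distance_relations(embeddings):
    """Mean-normalized Euclidean distances over all unordered pairs."""
    n = len(embeddings)
    dists = [
        math.dist(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists]


def angle_relation(a, b, c):
    """Cosine of the angle at vertex b formed by points a and c."""
    u = [x - y for x, y in zip(a, b)]
    v = [x - y for x, y in zip(c, b)]
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy teacher embeddings and a student whose embeddings are a scaled copy.
teacher = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student = [(2.0 * x, 2.0 * y) for x, y in teacher]

teacher_dist = distance_relations(teacher)
student_dist = distance_relations(student)
teacher_angle = angle_relation(teacher[1], teacher[0], teacher[2])
student_angle = angle_relation(student[1], student[0], student[2])
```

Both relation types are invariant to the uniform scaling in this toy example, so the student matches the teacher's relations even though the raw embeddings differ. A relational distillation loss would penalize the gap between the teacher and student relation values.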

Abstract:
Cloth-Changing Person Re-Identification (CC-ReID) aims to accurately identify a target person in the more realistic surveillance scenario where clothes of the pedestrian may change drastically, which is critical in public security systems for tracking down disguised criminal suspects. Existing methods mainly transform the CC-ReID problem into cross-modality feature alignment from the data-driven perspective, without modelling the interference factors such as clothes and camera view changes meticulously. This may lead to over-consideration or under-consideration of the influence of these factors on the extraction of robust and discriminative identity features. This paper proposes a novel algorithm for thoroughly disentangling identity features from interference factors brought by clothes and camera view changes while ensuring the robustness and discriminability. It adopts a dual-stream identity feature learning framework consisting of a raw image stream and a cloth-erasing stream, to explore discriminative and cloth-irrelevant identity feature representations. Specifically, an adaptive cloth-irrelevant contrastive objective is introduced to contrast features extracted by the two streams, aiming to suppress the fluctuation caused by clothes textures in the identity feature space. Moreover, we innovatively mitigate the influence of the interference factors through a generative adversarial interference factor decoupling network. This network is targeted at capturing identity-related information residing in the interference factors and disentangling the identity features from such information. Extensive experimental results demonstrate the effectiveness of the proposed method, achieving superior performances to state-of-the-art methods.

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore, Singapore ; Laboratory for Big Data and Decision, National University of Defense Technology, Singapore ; School of Electronic, Electrical and Communication Engineering, University of the Chinese Academy of Sciences, China ; National Key Laboratory of Information Systems Engineering, China ; Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China ; School of Computer and Communication Engineering, University of Science and Technology Beijing, China ; Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)

Abstract:
Class-incremental learning poses a significant challenge under an exemplar-free constraint, leading to catastrophic forgetting and sub-par incremental accuracy. Previous attempts have focused primarily on single-modality tasks, such as image classification or audio event classification. However, in the context of Audio-Visual Class-Incremental Learning (AVCIL), the effective integration and utilization of heterogeneous modalities, with their complementary and enhancing characteristics, remains largely unexplored. To bridge this gap, we propose the Multi-Modal Analytic Learning (MMAL) framework, an exemplar-free solution for AVCIL that employs a closed-form, linear approach. Specifically, MMAL introduces a modality fusion module that reformulates the AVCIL problem from a Recursive Least-Squares (RLS) perspective. Complementing this, a Modality-Specific Knowledge Compensation (MSKC) module is designed to further alleviate the under-fitting limitation intrinsic to analytic learning by harnessing individual knowledge from the audio and visual modalities in tandem. Comprehensive experimental comparisons with existing methods show that our proposed MMAL achieves superior performance, with accuracies of 76.71%, 78.98%, and 76.19% on the AVE, Kinetics-Sounds, and VGGSounds100 datasets, respectively, setting new state-of-the-art AVCIL performance. Notably, compared to memory-based methods, MMAL, being an exemplar-free approach, provides good data privacy and can better leverage multi-modal information for improved incremental accuracy.
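The closed-form, recursive flavor of analytic learning can be illustrated with a textbook recursive least-squares update, which refines a linear classifier one sample at a time without storing past data. This is a generic sketch under stated assumptions, not the MMAL fusion module: the names `rls_init` and `rls_update`, the regularizer `gamma`, and the toy data are all hypothetical.

```python
def rls_init(dim, num_outputs, gamma=1e-3):
    """Zero weights W (dim x num_outputs) and R = I / gamma (dim x dim)."""
    W = [[0.0] * num_outputs for _ in range(dim)]
    R = [[(1.0 / gamma if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    return W, R


def rls_update(W, R, x, y):
    """One recursive least-squares step for feature vector x and target y.

    R tracks the inverse of the regularized feature autocorrelation matrix
    and is updated with the Sherman-Morrison identity (R stays symmetric).
    """
    d, c = len(x), len(y)
    Rx = [sum(R[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Rx[i] for i in range(d))
    R_new = [[R[i][j] - Rx[i] * Rx[j] / denom for j in range(d)] for i in range(d)]
    gain = [v / denom for v in Rx]
    # Prediction error of the current weights on the new sample.
    err = [y[k] - sum(x[i] * W[i][k] for i in range(d)) for k in range(c)]
    W_new = [[W[i][k] + gain[i] * err[k] for k in range(c)] for i in range(d)]
    return W_new, R_new


# Toy stream generated by y = 2 * x1 + 3 * x2, seen one sample at a time.
samples = [([1.0, 0.0], [2.0]), ([0.0, 1.0], [3.0]), ([1.0, 1.0], [5.0])]
W, R = rls_init(dim=2, num_outputs=1)
for x, y in samples:
    W, R = rls_update(W, R, x, y)
```

After the three updates, `W` is close to the generating coefficients 2 and 3. It also coincides with the ridge-regression solution computed on all samples at once, which is why such updates do not need to revisit earlier data.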

Abstract:
Open-vocabulary multi-object tracking (MOT) aims to track arbitrary objects encountered in the real world beyond the training set. However, recent methods rely solely on instance-level detection and association of novel objects, which may not consider the valuable fine-grained semantic representations of the targets within key and reference frames. In this paper, we propose a Global and Local Awareness open-vocabulary MOT method (GLATrack), which learns to tackle the task of real-world MOT from both global and instance-level perspectives. Specifically, we introduce a region-aware feature enhancement module to refine global knowledge for complementing local target information, which enhances semantic representation and bridges the distribution gap between the image feature map and the pooled regional features. We propose a bidirectional semantic complementarity strategy to mitigate semantic misalignment arising from missing target information in key frames, which dynamically selects valuable information within reference frames to enrich object representation during the knowledge distillation process. Furthermore, we introduce an appearance richness measurement module to provide appropriate representations for targets with different appearances. The proposed method gains an improvement of 6.9% in TETA and 5.6% in mAP on the large-scale TAO benchmark.

Abstract:
Fine-grained leaf image retrieval (FGLIR) is a new unsupervised pattern recognition task in content-based image retrieval (CBIR). It aims to distinguish varieties/cultivars of leaf images within a certain plant species and is more challenging than the general leaf image retrieval task due to the inherently subtle differences across cultivars. In this study, we investigate, for the first time, how to mine spatial structure and contextual information from the activations of the convolutional layers of CNNs for FGLIR. To achieve this goal, we design a novel geometrical structure, named Triplet Patch-Pairs Composite Structure (TPCS), consisting of three symmetric patch pairs segmented from the leaf images in different orientations. We extract a CNN feature map for each patch in TPCS and measure the difference between the feature maps of each patch pair to construct a local deep self-similarity descriptor. By varying the size of the TPCS, we obtain multi-scale deep self-similarity descriptors. The final aggregated local deep self-similarity descriptors, named Structural Deep Patch Representation (SDePR), not only encode the spatial structure and contextual information of leaf images in the deep feature domain, but are also invariant to geometrical transformations. Extensive experiments applying our SDePR method to challenging public FGLIR tasks show that it outperforms state-of-the-art handcrafted visual features and deep retrieval models.

Abstract:
Image aesthetics assessment (IAA) primarily examines image quality from a user-centric perspective and can be applied to guide various applications, including image capture, recommendation, and enhancement. The fundamental issue in IAA revolves around the quantification of image aesthetics. Existing methodologies rely on assigning a scalar (or a distribution) to represent aesthetic value based on conventional practices, which confines this scalar within a specific range and artificially labels it. However, conventional methods rarely incorporate research on interpretability, particularly lacking systematic responses to the following three fundamental questions: 1) Can aesthetic qualities be quantified? 2) What is the nature of quantifying aesthetics? 3) How can aesthetics be accurately quantified? In this paper, we present a law called "Special Relativity" of IAA (SR-IAA) that addresses the aforementioned core questions. We have developed a Multi-Attribute IAA Framework (MAINet), which serves as a preliminary validation for SR-IAA within the existing datasets and achieves state-of-the-art (SOTA) performance. Specifically, our metrics on multi-attribute assessment outperform the second-best performance by 8.06% (AADB), 1.67% (PARA), and 2.44% (SPAQ) in terms of SRCC. We anticipate that our research will offer innovative theoretical guidance to the IAA research community. All resources are available here.

Abstract:
We are investigating interaction with entities showing features different from humans', to understand how they can be embodied as avatars and perceived as living, social beings. To push this investigation to its limit, we have designed as an avatar an interactive space (the Room), that challenges both the anthropomorphic structure, and most of the social interaction mechanisms we are used to. We introduce a first framework for the Room design, addressing challenges related to its body, perception, and interaction process. We present a pilot implementation of some of the aspects of the framework as an interactive installation, namely a real-time, two-player, VR experience, featuring the Room avatar, with a focus on haptic feedback as the main means of perception for the subject embodying the Room. By radically challenging anthropomorphism, we seek to investigate the most basic aspects of embodiment and social cognition.

Abstract:
We address image deraining under complex backgrounds, diverse rain scenarios, and varying illumination conditions, representing a highly practical and challenging problem. Our approach utilizes synthetic, real-world, and nighttime datasets, wherein rich backgrounds, multiple degradation types, and diverse illumination conditions coexist. The primary challenge in training models on these datasets arises from the discrepancies among them, potentially leading to conflicts or competition during the training period. To address this issue, we first align the distribution of synthetic, real-world and nighttime datasets. Then we propose a novel contrastive learning strategy to extract multi-view (multiple) representations that effectively capture image details, degradations, and illuminations, thereby facilitating training across all datasets. Regarding multiple representations as profitable prompts for deraining, we devise a prompting strategy to integrate them into the decoding process. This contributes to a potent deraining model, dubbed Rainmer. Additionally, a spatial-channel interaction module is introduced to fully exploit cues when extracting multi-view representations. Extensive experiments on synthetic, real-world, and nighttime datasets demonstrate that Rainmer outperforms current representative methods. Moreover, Rainmer achieves superior performance on the All-in-One image restoration dataset, underscoring its effectiveness. Furthermore, quantitative results reveal that Rainmer significantly improves object detection performance on both daytime and nighttime rainy datasets. These observations substantiate the potential of Rainmer for practical applications.

Abstract:
Multimodal large language models (MLLMs) have been observed to exhibit biases originating from their training datasets. Unlike unimodal LLMs, biases in MLLMs may stem from interactions between multiple modalities, which increases the complexity of multimodal debiasing. Conventional approaches to alleviating model biases, such as fine-tuning, are costly and data-hungry. Model editing methods, which focus on post-hoc modifications of model knowledge, have recently demonstrated significant potential across diverse applications. These methods can effectively and precisely adjust the behavior of models in specific knowledge domains, while minimizing the impact on the overall performance of the model. However, there is currently no comprehensive study to drive the application of model editing methods in debiasing MLLMs and to analyze their pros and cons. To facilitate research in this field, we define the debiasing problem of MLLMs as an editing problem and propose a novel set of evaluation metrics for MLLM debias editing. Through various experiments, we demonstrate that: (1) Existing model editing methods can effectively alleviate biases in MLLMs and can generalize well to semantically equivalent image-text pairs. However, most methods tend to adversely affect the stability of the MLLM. (2) Compared to editing the visual modality of the MLLM, editing the textual modality yields better results in addressing MLLM biases. (3) Model-editing-based debiasing methods can generalize across different types of biases.

Abstract:
Inserting foreground objects into specific background scenes and eliminating the illumination inconsistency (e.g., color, brightness) between them is an important and challenging task. It typically involves multiple processing tasks, such as image harmonization and shadow generation. Many mature solutions already exist in these two domains, but they often focus on only one of the tasks. Recently, some image composition methods have utilized diffusion models to address both of these issues simultaneously, but they cannot guarantee complete reconstruction of the foreground content. In this work, we propose CFDiffusion, which can simultaneously handle image harmonization and shadow generation. We first employ a shadow mask predictor to estimate the shadow mask of the foreground object. Next, we design a harmonization-shadow generator based on a diffusion model to harmonize the foreground and generate shadows concurrently. Additionally, we propose a foreground content enhancement module to ensure the complete preservation of foreground content at the insertion location, and we also develop an adaptive encoder to guide the harmonization process in the foreground area. The experimental results on the iHarmony4 dataset and the IH-SG dataset demonstrate the superiority of our CFDiffusion approach.

Abstract:
Point cloud data is pivotal in applications like autonomous driving, virtual reality, and robotics. However, its substantial volume poses significant challenges in storage and transmission. Pursuing a high compression ratio usually severely damages crucial semantic details, making it difficult to guarantee the accuracy of downstream tasks. To tackle this problem, we are the first to introduce a Region of Interest (ROI)-guided Point Cloud Geometry Compression (RPCGC) method for human and machine vision. Our framework employs a dual-branch parallel structure, where the base layer encodes and decodes a simplified version of the point cloud, and the enhancement layer refines this by focusing on geometry details. Furthermore, the residual information of the enhancement layer undergoes refinement through an ROI prediction network. This network generates mask information, which is then incorporated into the residuals, serving as a strong supervision signal. Additionally, we apply these mask details in the Rate-Distortion (RD) optimization process, with each point weighted in the distortion calculation. Our loss function includes an RD loss and a detection loss to better guide point cloud encoding for machine vision. Experimental results demonstrate that RPCGC achieves exceptional compression performance and better detection accuracy (10% gain) than some learning-based compression methods at high bitrates on the ScanNet and SUN RGB-D datasets.
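The idea of weighting each point in the distortion term can be sketched as follows. This is a minimal illustration that assumes a binary ROI mask, a fixed extra weight for ROI points, and a normalized weighted mean as the distortion; the abstract does not give the actual weighting or the detection-loss term, so `roi_weight`, `lam`, and the function name are hypothetical, and the detection loss is omitted.

```python
def roi_weighted_rd_loss(rate_bits, point_errors, roi_mask, lam=1.0, roi_weight=4.0):
    """Rate plus lambda times an ROI-weighted mean of per-point distortions.

    rate_bits: estimated bitrate term of the loss.
    point_errors: per-point reconstruction errors (e.g., squared distances).
    roi_mask: 1 for points inside a region of interest, 0 otherwise.
    """
    weights = [roi_weight if m else 1.0 for m in roi_mask]
    distortion = sum(w * e for w, e in zip(weights, point_errors)) / sum(weights)
    return rate_bits + lam * distortion


# Toy example: the only badly reconstructed point lies inside the ROI.
errors = [0.1, 0.1, 0.5]
mask = [0, 0, 1]

uniform_loss = roi_weighted_rd_loss(2.0, errors, mask, roi_weight=1.0)
weighted_loss = roi_weighted_rd_loss(2.0, errors, mask, roi_weight=4.0)
```

With uniform weights the loss reduces to the usual rate plus mean distortion. Raising `roi_weight` makes the same reconstruction error costlier when it falls inside the ROI, which pushes the encoder to spend bits there.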

Abstract:
3D visual grounding is a fundamental yet important task in multimedia understanding, which aims to locate a specific object in a complicated 3D scene semantically according to a text description. However, this task requires a large number of annotations of labeled text-object pairs for training, so the scarcity of annotated data has been a key obstacle in this task. To this end, this paper makes the first attempt to introduce and address a new semi-supervised setting, where only a few text-object labels are provided during training. Considering most scene data has no annotation, we explore a new solution for unlabeled 3D grounding by additionally training and transferring knowledge from a correlated task, i.e., 3D captioning. Our main insight is that 3D grounding and captioning are complementary and can be iteratively trained with unlabeled data to provide object and text contexts for each other with pseudo-label learning. Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. We first pre-train the two modules of the teacher branch with limited labeled data for warm-up. Then, we train the student branch to mimic the ability of the teacher model and iteratively update both branches with the unlabeled data. In particular, we transfer the learned knowledge between the grounding and captioning modules across two branches to generate and refine the pseudo-labels of unlabeled data for providing reliable supervision. To further improve the quality of the pseudo-labels, we design a cross-task pseudo-label generation scheme, filtering low-quality pseudo-labels at the detection, captioning, and grounding levels, respectively. Experimental results on various datasets show competitive performances in both tasks compared to previous fully- and weakly-supervised methods, demonstrating the proposed 3D-CTTSF can serve as an effective solution to overcome the data scarcity issue.

Abstract:
RGB-Thermal Salient Object Detection (RGBT-SOD) plays a critical role in complex scene recognition applications, such as autonomous driving. However, security research in this domain is still in its infancy. This paper presents the first backdoor attack on RGBT-SOD systems, generating saliency maps on triggered inputs that depict non-existent salient objects chosen by the attacker or falsely mark an entire image as fully salient or entirely non-salient. We uncover that triggers have an influence range for generating non-existent salient objects, supported by a theoretical analysis. Extensive experiments show the effectiveness of our attack in both digital and physical-world scenarios. Notably, our dual-modality backdoor attack achieves an Attack Success Rate (ASR) of 86.72% with only five pairs of poisoned images in model training. After investigating potential countermeasures, we find them inadequate in mitigating our attacks, highlighting the urgent need for robust defenses against sophisticated backdoor attacks in RGBT-SOD systems.

Abstract:
Detecting 3D lane lines from monocular images is garnering increasing attention in the Autonomous Driving (AD) area due to its cost-effectiveness. However, current monocular image models capture road scenes without 3D spatial awareness, which makes them error-prone under adverse changes in conditions. In this work, we design a novel cross-modal knowledge transfer scheme, namely LaneCMKT, to address this challenge by transferring 3D geometric cues learned from a pre-trained LiDAR model to the image model. Operating on a unified Bird's-Eye-View (BEV) grid, our monocular image model acts as a student network and benefits from the spatial guidance of the 3D LiDAR teacher model over the intermediate feature space. Since LiDAR points and image pixels are intrinsically two different modalities, to facilitate such heterogeneous feature transfer learning at matching levels, we propose a dual-path knowledge transfer mechanism. We divide the feature space into shallow and deep paths where the image student model is prompted to focus on lane-favored geometric cues from the LiDAR teacher model. We conduct extensive experiments and thorough analysis on the large-scale public benchmark OpenLane. Our model achieves notable improvements over the image baseline by 5.3% and the current BEV-driven SoTA method by 2.7% in the F1 score, without introducing any extra computational overhead. We also observe that the 3D abilities acquired from the teacher model are critical for dealing with complex spatial lane properties from a 2D perspective.

Abstract:
In the field of machine learning, continual learning is a crucial concept that allows models to adapt to non-stationary data distributions. However, most existing works focus on uni-modal settings and ignore multi-modal data. In this paper, to enable neural networks to better understand diverse modalities in real-world scenarios, we investigate continual learning for two typical vision-language applications, i.e., retrieval and grounding. Instead of conventional exemplar-based methods, we leverage pre-trained transformer models (e.g., CLIP/GLIP) and the prompt technique to tackle this problem. Under this scheme, we identify two critical limitations in existing methods: (1) Unfamiliarity across tasks, which prevents task-specific prompts from achieving forward propagation; and (2) Heterogeneity between modalities, which makes it difficult to guarantee a consistent optimization direction for prompts of different modalities. To overcome these constraints, we design Historical Prompt Calibration, which includes two objectives to calibrate prompts. First, the intra-modal relevance estimation helps encode sufficient task-specific information for prompts, with the help of a relevance estimator developed for recognizing task relevance. Second, the inter-modal consistency alignment enhances the agreement of the two modality-specific prompts in the current task by contrasting them with the prompts from previous tasks. We demonstrate the superiority of our strategy over state-of-the-art methods on four vision-language applications, including two retrieval tasks (i.e., image- and video-text retrieval) and two grounding tasks (i.e., referring expression comprehension and segmentation).

Abstract:
Sonar imaging is widely utilized in submarine and underwater detection missions. However, due to the complex underwater environment, sonar images suffer from complex distortions and noise, making it hard for detection models to extract clean high-level features. Existing works introduce denoised images as pseudo labels to help the network extract clean features, but do not fully consider the rationality of the pseudo labels. To this end, we propose an Efficient Pseudo Labels-Driven Underwater Forward-looking Sonar Images Object Detection algorithm (EPL-UFLSID). Specifically, we first design a Gaussian Mixture Model based Deep Image Prior (GMMDIP) network to generate denoised sonar images by setting the GMM distribution as its input. After that, to select the most detection-friendly of the denoised images generated by GMMDIP as efficient pseudo labels, we design a Detection-Friendly Image Quality Assessment network (DFIQA), which also helps EPL-UFLSID further distill cleaner features from pseudo labels to improve detection performance. Extensive experimental results show that our EPL-UFLSID reaches average precision (AP) of 67.8%/39.8% and average recall (AR) of 73.7%/49.6% on two real sonar datasets, outperforming SOTA underwater forward-looking sonar image object detection algorithms.

Abstract:
Driven by the complementary information fusion of optical and synthetic aperture radar (SAR) images, optical-SAR image matching has drawn much attention. However, the significant radiometric differences between them impose great challenges on accurate matching. Most existing approaches convert SAR and optical images into a shared feature space to perform the matching, but these methods often fail to achieve robust matching since the feature spaces are unknown and uninterpretable. Motivated by the interpretable latent space of diffusion models, this paper formulates an optical-SAR image translation and matching framework via a dynamically conditioned diffusion model (DCDM) to achieve interpretable and robust optical-SAR cross-modal image matching. Specifically, in the denoising process, to filter out outlier matching regions, a gated dynamic sparse cross-attention module is proposed to facilitate efficient and effective long-range interactions of multi-grained features between the cross-modal data. In addition, a spatial position consistency constraint is designed to encourage the cross-attention features to perceive the spatial correspondence between modalities, improving the matching precision. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in terms of both matching accuracy and interpretability.

Abstract:
Restoring low-quality fundus images, especially the recovery of vessel structures, is crucial for clinical observation and diagnosis. Existing state-of-the-art methods use standard convolution and window-based self-attention blocks to recover low-quality fundus images, but these feature-capturing approaches do not effectively match the slender and tortuous structure of retinal vessels. Therefore, these methods struggle to accurately restore vessel structures. To overcome this challenge, we propose a novel low-quality fundus image restoration method called Masked Snake Attention Network (MSANet), designed specifically for accurately restoring vessel structures. Specifically, we introduce the Snake Attention module (SA) to adaptively aggregate vessel features based on the morphological structure of the vessels. Because vessel pixels make up only a small proportion of the image, we further present the Masked Snake Attention module (MSA) to capture vessel features more efficiently. MSA enhances vessel features by constraining snake attention within regions predicted by segmentation methods. Extensive experimental results demonstrate that our MSANet outperforms state-of-the-art methods in enhancement evaluation and downstream segmentation tasks.

Abstract:
Recent research has confirmed the possibility of adversarial attacks on deep models. However, these methods typically assume that the surrogate model has access to the target domain, which is difficult to achieve in practical scenarios. To address this limitation, this paper introduces a novel cross-domain attack method tailored for semantic segmentation, named Prototype-based Feature and Frequency Alteration Attack (PFFAA). This approach empowers a surrogate model to efficiently deceive the black-box victim model without requiring access to the target data. Specifically, through limited queries on the victim model, bidirectional relationships are established between the target classes of the victim model and the source classes of the surrogate model, enabling the extraction of prototypes for these classes. During the attack process, the features of each source class are perturbed to move these features away from their respective prototypes. Moreover, we propose substituting frequency information from images used to train the surrogate model into the frequency domain of the test images to modify texture and structure, thus further enhancing the attack efficacy. Experimental results across multiple datasets and victim models validate that PFFAA achieves state-of-the-art performances.
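The notion of pushing features away from their class prototypes can be illustrated with a toy sketch. It assumes that a prototype is the mean feature vector of a class and that the objective is the squared Euclidean distance to it; in an actual attack the perturbation would be applied to the input image and propagated through the surrogate network, which is not modeled here. All names are hypothetical.

```python
def class_prototypes(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def push_away(feature, prototype, step=0.1):
    """One gradient-ascent step on the squared distance to the prototype."""
    return [f + step * 2.0 * (f - p) for f, p in zip(feature, prototype)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy features: two samples of class 0 and one sample of class 1.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]

prototypes = class_prototypes(features, labels)
before = squared_distance(features[0], prototypes[0])
moved = push_away(features[0], prototypes[0])
after = squared_distance(moved, prototypes[0])
```

Here `after` exceeds `before`, meaning the perturbed feature sits farther from its class prototype, which is the direction in which such an attack tries to drive the victim's representation.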

Abstract:
Learning multiple proxy tasks is a popular training strategy in semi-supervised video anomaly detection. However, the traditional method of learning multiple proxy tasks simultaneously is prone to suboptimal solutions, and simply executing multiple proxy tasks sequentially cannot ensure continuous performance improvement. In this paper, we thoroughly investigate the impact of task composition and training order on performance enhancement. We find that ensuring continuous performance improvement in multi-task learning requires different but continuous optimization objectives in different training phases. To this end, a training strategy based on progressive learning is proposed to enhance the multi-task learning in VAD. The learning objectives of the model in previous phases contribute to the training in subsequent phases. Specifically, we decompose video anomaly detection into three phases: perception, comprehension, and inference, continuously refining the learning objectives to enhance model performance. In the three phases, we perform the visual task, the semantic task and the open-set task in turn to train the model. The model learns different levels of features and focuses on different types of anomalies in different phases. Extensive experiments demonstrate the effectiveness of our method, highlighting that the benefits derived from the progressive learning transcend specific proxy tasks.

Abstract:
Adversarial examples (AEs), which are maliciously crafted by adding perturbations to benign images, reveal the vulnerability of deep neural networks (DNNs) and have been used as a benchmark for evaluating model robustness. Although great efforts have been devoted to generating AEs with stronger attack ability, the visual quality of AEs has generally been neglected in previous studies. The lack of a good quality measure for AEs makes it very hard to compare the relative merits of attack techniques and hinders technological advancement. How to evaluate the visual quality of AEs remains an understudied and unsolved problem. In this work, we make the first attempt to fill this gap by presenting an image quality assessment method specifically designed for AEs. Towards this goal, we first construct a new database, called AdvDB, built on diverse adversarial examples with elaborate annotations. We also propose a detection-based structural similarity index (AdvDSS) for adversarial example perceptual quality assessment. Specifically, the visual saliency for capturing the near-threshold adversarial distortions is first detected via human visual system (HVS) techniques, and then the structural similarity is extracted to predict the quality score. Moreover, we further propose AEQA for overall adversarial example quality assessment by integrating the perceptual quality and attack intensity of AEs. Extensive experiments validate that the proposed AdvDSS achieves state-of-the-art performance and is more consistent with human opinions.

Abstract:
Due to sensor limitations, traditional cameras struggle to capture details within extremely dark areas of videos. The absence of such details can significantly impact the effectiveness of low-light video enhancement. In contrast, event cameras offer a visual representation with higher dynamic range, facilitating the capture of motion information even in exceptionally dark conditions. Motivated by this advantage, we propose the Real-Event Embedded Network for low-light video enhancement. To better utilize events for enhancing extremely dark regions, we propose an Event-Image Fusion module, which can identify these dark regions and enhance them significantly. To ensure temporal stability of the video and restore details within extremely dark areas, we design an unsupervised temporal consistency loss and a detail contrast loss. Alongside the supervised loss, these loss functions collectively contribute to the semi-supervised training of the network on unpaired real data. Experimental results on synthetic and real data demonstrate the superiority of the proposed method over state-of-the-art methods.

Abstract:
Advances in computer vision research enable human-like high-dimensional perceptual induction over analogical visual reasoning problems, such as Raven's Progressive Matrices (RPMs). In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP^2AI), consisting of three major components that tackle key challenges of RPM problems. Firstly, in view of the limited receptive fields of shallow networks in most existing RPM solvers, a perceptual encoder is proposed, consisting of a series of hierarchically coupled Patch Attention and Local Context (PALC) blocks, which capture local attributes at early stages and the global panel layout at deep stages. Secondly, most methods seek object-level similarities to map the context images directly to the answer image, while failing to extract the underlying analogies. The proposed reasoning module, Predictive Analogy-Inference (PredAI), consists of a set of Analogy-Inference Blocks (AIBs) to model and exploit the inherent analogical reasoning rules instead of object similarity. Lastly, the Squeeze-and-Excitation Channel-wise Attention (SECA) in the proposed PredAI discriminates essential attributes and analogies from irrelevant ones. Extensive experiments over four benchmark RPM datasets show that the proposed HP^2AI consistently achieves significant performance gains over all state-of-the-art methods on all four datasets.

Abstract:
Recently, indoor 3D object detection has shown impressive progress. However, these improvements have come at the cost of increased memory consumption and longer inference times, making it difficult to apply these methods in practical scenarios. To address this issue, knowledge distillation has emerged as a promising technique for model acceleration. In this paper, we propose the VRDistill framework, the first knowledge distillation framework designed for efficient indoor 3D object detection. Our VRDistill framework includes a refinement module and a soft foreground mask operation to enhance the quality of the distillation. The refinement module utilizes trainable layers to improve the quality of the teacher's votes, while the soft foreground mask operation focuses on foreground votes, further enhancing the distillation performance. Comprehensive experiments on the ScanNet and SUN-RGBD datasets demonstrate the effectiveness and generalization ability of our VRDistill framework.

Abstract:
Estimating the 3D poses of interacting hands from a monocular image is challenging due to the similarity in appearance between hand parts. Therefore, utilizing the appearance features alone tends to result in unreliable pose estimation. Existing approaches directly fuse the appearance features with position features, ignoring that the two types of features are heterogeneous. Here, the appearance features are derived from the RGB values of pixels, while the position features are mapped from the coordinates of pixels or joints. To address this problem, we present a novel framework called Decoupled Feature Learning (DFL) for 3D pose estimation of interacting hands. By decoupling the appearance and position features, we facilitate the interactions within each feature type and those between both types of features. First, we compute the appearance relationships between the joint queries and the image feature maps; we utilize these relationships to aggregate each joint's appearance and position features. Second, we compute the 3D spatial relationships between hand joints using their position features; we utilize these relationships to guide the feature enhancement of joints. Third, we calculate appearance relationships and spatial relationships between the joints and image using the appearance and position features, respectively; we utilize these complementary relationships to promote the joints' location in the image. The two processes mentioned above are conducted iteratively. Finally, only the refined position features are used for hand pose estimation. This strategy avoids the step of mapping heterogeneous appearance features to hand-joint positions. Our method significantly outperforms state-of-the-art methods on the large-scale InterHand2.6M dataset. More impressively, our method exhibits strong generalization ability on in-the-wild images.

Abstract:
Multi-planar magnetic resonance imaging (MRI) can provide comprehensive 3D structural information for disease diagnosis. Compared to multi-source MRI, multi-planar MRI scans target areas in the human body from different directions. This atypical difference between directions may lead to poor performance of traditional domain generalization methods, especially when MRI from different planes also comes from different sources. In this paper, we propose ADDG, an Adaptive Domain Generalization framework for accurate cross-plane MRI segmentation. ADDG significantly mitigates the impact of information loss caused by slice spacing by injecting a 3D shape prior of the segmentation target and by capturing domain-agnostic feature differences from heterogeneous data sources through an adaptive data partitioning strategy. In addition, we propose a mesh deformation-based organ segmentation network that simultaneously delineates the 2D boundary and 3D volume of the organ, which can guide more accurate mesh deformation. We also develop an organ-specific mesh template and employ Loop subdivision to generate a smoother 3D organ mesh. Furthermore, we design a flexible meta-learning paradigm to adaptively partition data domains based on invariant learning, which can learn domain-agnostic features from multi-source data to enhance the overall generalization ability. Experimental results show that ADDG outperforms several medical image segmentation, single-view 3D shape reconstruction, and domain generalization methods.

Abstract:
Multi-instance multi-label learning (MIML), which deals with objects with complex structures and multiple semantics, plays a crucial role in various fields. In practice, the naturally skewed label distribution and label dependence contribute to the issue of label imbalance in MIML, which is crucial but rarely studied. Most existing MIML methods produce biased models because they ignore inter-class variations in imbalanced data. To address this issue, we propose a novel imbalanced multi-instance multi-label learning method named IMIMLC, based on an error-correcting coding ensemble and an adaptive threshold strategy. Specifically, we design a feature embedding method to extract the structural information of each object via Fisher vectors and eliminate inexact supervision. Subsequently, to alleviate the disturbance caused by the imbalanced distribution, a novel ensemble model is constructed by concatenating the error-correcting codes of randomly selected subtasks. Meanwhile, IMIMLC trains binary base classifiers on small-scale data blocks partitioned by our codes to enhance their diversity and then learns more reliable results to improve model robustness to the imbalance issue. Furthermore, IMIMLC adaptively learns a threshold for each individual label by margin maximization, preventing inaccurate predictions caused by the semantic discrepancy across many labels and their unbalanced ratios. Finally, extensive experimental results on various datasets validate the effectiveness of IMIMLC against state-of-the-art approaches.

Abstract:
Addressing the disparity in description granularity and information gap between images and text has long been a formidable challenge in text-based person retrieval (TBPR) tasks. Recent researchers tried to solve this problem by random local alignment. However, they failed to capture the fine-grained relationships between images and text, so the information and modality gaps remain on the table. We align image regions and text phrases at the same semantic granularity to address the semantic atomicity gap. Our idea is first to extract and then exploit the relationships between fine-grained locals. We introduce a novel Fine-grained Semantic Alignment with Transferred Person-SAM (SAP-SAM) approach. By distilling and transferring knowledge, we propose a Person-SAM model to extract fine-grained semantic concepts at the same granularity from images and texts of TBPR and its relationships. With the extracted knowledge, we optimize the fine-grained matching via Explicit Local Concept Alignment and Attentive Cross-modal Decoding to discriminate fine-grained image and text features at the same granularity level and represent the important semantic concepts from both modalities, effectively alleviating the granularity and information gaps. We evaluate our proposed approach on three popular TBPR datasets, demonstrating that SAP-SAM achieves state-of-the-art results and underscores the effectiveness of end-to-end fine-grained local alignment in TBPR tasks.

Abstract:
To tackle high-dimensional data with multiple representations, multi-view unsupervised feature selection has emerged as a significant learning paradigm. However, previous methods suffer from the following dilemmas: (i) They focus on selecting the features that preserve the similarity structure of data, while neglecting the discriminative information in the cluster structure; (ii) The orthogonal constraint is often imposed on the pseudo cluster labels, breaking the locality in the cluster label space; (iii) Learning the similarity or cluster structure from all samples is time-consuming. To this end, a Scalable Multi-view Unsupervised Feature Selection with structure learning and fusion (SMUFS) is proposed to jointly exploit the cluster structure and the similarity relations of data. Specifically, SMUFS introduces sample-view weights to adaptively fuse the membership matrices that indicate cluster structures and serve as the pseudo cluster labels, such that a unified membership matrix across views can be effectively obtained to guide feature selection. Meanwhile, SMUFS performs graph learning from the membership matrix, preserving the locality of cluster labels and improving their discriminative capability. Further, an acceleration strategy is developed to make SMUFS scalable to large-scale data. An iterative optimization is designed to solve the formulated objective function, and extensive experiments demonstrate the superiority of SMUFS.

Abstract:
Graph anomaly detection (GAD) aims to identify anomalous graphs that significantly deviate from other ones, which has raised growing attention due to the broad existence and complexity of graph-structured data in many real-world scenarios. However, existing GAD methods usually execute with centralized training, which may lead to privacy leakage risk in some sensitive cases, thereby impeding collaboration among organizations seeking to collectively develop robust GAD models. Although federated learning offers a promising solution, the prevalent non-IID problems and high communication costs present significant challenges, particularly pronounced in collaborations with graph data distributed among different participants. To tackle these challenges, we propose an effective federated graph anomaly detection framework (FGAD). We first introduce an anomaly generator to perturb the normal graphs to be anomalous and train a powerful anomaly detector by distinguishing generated anomalous graphs from normal ones. We subsequently leverage a student model to distill knowledge from the trained anomaly detector (teacher model), which aims to maintain the personality of local models and alleviate the adverse impact of non-IID problems. Additionally, we design an effective collaborative learning mechanism that facilitates the personalization preservation of local models and significantly reduces communication costs among clients. Empirical results of diverse GAD tasks demonstrate the superiority and efficiency of FGAD.

Abstract:
Single domain generalization (SDG) aims to learn a generalizable model from only one source domain available to unseen target domains. Existing SDG techniques rely on data or feature augmentation to generate distributions that complement the source domain. However, these approaches fail to address the challenge where gradient conflicts from synthesized domains impede the learning of domain-invariant representation. Inspired by the concept of mechanical equilibrium in physics, we propose a novel conflict-aware approach named domain gradient equilibrium for SDG. Unlike prior conflict-aware SDG methods that alleviate the gradient conflicts by setting them to zero or random values, the proposed domain gradient equilibrium method first decouples gradients into domain-invariant and domain-specific components. The domain-specific gradients are then adjusted and reweighted to achieve equilibrium, steering the model optimization toward a domain-invariant direction to enhance generalization capability. We conduct comprehensive experiments on four image recognition benchmarks, and our method achieves an accuracy improvement of 2.94% in the PACS dataset over existing state-of-the-art approaches, demonstrating the effectiveness of our proposed approach.

Abstract:
Multimodal sentiment analysis (MSA) aims to integrate multiple modalities of information to better understand human sentiment. The current research mainly focuses on conducting multimodal fusion, which neglects the under-optimized modal representations generated by the imbalance of unimodal performances in joint learning. Moreover, the size of labeled datasets limits the generalization ability of existing supervised models. To address the above issues, this paper proposes a knowledge-enhanced self-supervised balanced representation approach (KEBR). First, a text-based cross-modal fusion method (TCMF) is constructed, which injects the non-verbal information from the videos into the semantic representation of text to enhance the multimodal representation of text. Then, a multimodal cosine constrained loss (MCC) is designed to constrain the fusion of non-verbal information in joint learning to balance the representation. Finally, with the help of sentiment knowledge and non-verbal information, KEBR conducts sentiment word masking and sentiment intensity prediction. Experimental results show that KEBR outperforms the baseline.

Abstract:
Semi-supervised visible-infrared person re-identification (SSVI-ReID) aims to match pedestrian images of the same identity from different modalities (visible and infrared) while only annotating visible images, which is highly related to multimedia and multi-modal processing. Existing works primarily focus on assigning accurate pseudo-labels to infrared images, but overlook the two key challenges: erroneous pseudo-labels and large modality discrepancy. To alleviate these issues, this paper proposes a novel Modality-Unified and Confidence-Guided (MUCG) semi-supervised learning method. Specifically, we first propose a Dynamic Intermediate Modality Generation (DIMG) module, which transfers knowledge from labeled visible images to unlabeled infrared images, enhancing the pseudo-label quality and bridging the modality discrepancy. Meanwhile, we propose a Weighted Identification Loss (WIL) that can reduce the model's dependence on erroneous labels by using confidence weighting. Moreover, an effective Modality Consistency Loss (MCL) is proposed to narrow the distribution of visible and infrared features, further narrowing the modality discrepancy and enabling the learning of modality-unified features. Extensive experiments show that the proposed MUCG has significant advantages in improving the performance of the SSVI-ReID task, surpassing the current state-of-the-art methods by a significant margin.
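The confidence weighting behind a loss such as WIL can be illustrated with a small sketch. The abstract does not give the exact formulation, so the function below is an assumption: a cross-entropy over pseudo-labels in which each sample is scaled by a confidence score and the result is normalized by the total confidence. The names `weighted_identification_loss`, `pseudo_labels`, and `confidences` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's identity logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_identification_loss(
    logits_batch: Sequence[Sequence[float]],
    pseudo_labels: Sequence[int],
    confidences: Sequence[float],
) -> float:
    """Confidence-weighted cross-entropy (assumed form, not the paper's exact loss).

    Samples whose pseudo-label has low confidence contribute less, so a wrong
    pseudo-label pulls the model less strongly than a reliable one.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label, confidence in zip(logits_batch, pseudo_labels, confidences):
        cross_entropy = -math.log(softmax(logits)[label])
        weighted_sum += confidence * cross_entropy
        weight_total += confidence
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

With two samples that both predict identity 0, where the second carries a (probably wrong) pseudo-label of 1, lowering that sample's confidence reduces the loss compared with weighting both samples equally.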

Abstract:
Deep neural networks are widely used in retrieval systems. However, they are notoriously vulnerable to attack. Among the various forms of adversarial attacks, the patch attack is one of the most threatening forms. This type of attack can introduce cognitive biases into the retrieval system by inserting deceptive patches into images. Despite the seriousness of this threat, there are still no well-established solutions in image retrieval systems. In this paper, we propose the Pre-denosing Augmented Image Retrieval (PAIR) model, a new approach designed to protect image retrieval systems against adversarial patch attacks. The core strategy of PAIR is to dynamically and randomly reconstruct entire images based on their semantic content. This purifies well-designed patch attacks while preserving the semantic integrity of the images. Furthermore, we present a novel training strategy that incorporates a semantic discriminator. This discriminator significantly improves PAIR's ability to capture real semantics and reconstruct images. Experiments show that PAIR significantly outperforms existing defense methods. It effectively reduces the success rate of two state-of-the-art patch attack methods to below 5%, achieving a 14% improvement over current leading methods. Moreover, in defending against global perturbation attacks, PAIR also achieves competitive results.

Abstract:
Deep neural networks have revealed enormous potential in video super-resolution (VSR), yet their high computational cost limits deployment on resource-limited devices and in practical scenarios, especially when restoring multiple frames simultaneously. Existing VSR models contain considerable redundant filters, which drag down the inference efficiency. To accelerate the inference of VSR models, we propose a scalable method based on adaptive patch routing to achieve practical speedup. Specifically, we design a confidence estimator to predict the aggregation performance of each block for adjacent patch information. It learns to dynamically perform block skipping, i.e., to choose which basic blocks of the VSR network to execute during inference so as to reduce total computation as much as possible without dramatically degrading reconstruction accuracy. However, we observe that skipping errors are amplified as the hidden states propagate through recurrent networks. To alleviate this issue, we design temporal feature alignment to preserve performance. In essence, this yields an adaptive routing scheme for each patch. Extensive experiments demonstrate that our method can not only accelerate inference but also provide strong quantitative and qualitative results. Built upon the BasicVSR model, our method achieves a speedup of 20% on average, going as high as 50% for some images, while maintaining competitive performance on REDS4.
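The block-skipping idea can be sketched in a few lines. This is only an illustration under stated assumptions: each block is treated as a residual update, each confidence estimator returns a score in [0, 1], and a fixed `threshold` decides whether the block runs. The abstract does not specify the decision rule, and all names below are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_with_block_skipping(
    features: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    estimators: Sequence[Callable[[Vector], float]],
    threshold: float = 0.5,
) -> Tuple[Vector, int]:
    """Run residual blocks on one patch, skipping those with a low predicted benefit.

    Returns the final features and the number of blocks actually executed.
    """
    executed = 0
    for block, estimator in zip(blocks, estimators):
        if estimator(features) < threshold:
            continue  # skipped block acts as an identity mapping
        update = block(features)
        features = [value + delta for value, delta in zip(features, update)]
        executed += 1
    return features, executed
```

For example, with one block whose estimator returns 0.9 and another whose estimator returns 0.1, only the first block runs at the default threshold, so the patch pays for one block instead of two.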

Abstract:
Deep Convolutional Neural Networks (CNNs) have demonstrated excellent performance in various multimedia application scenarios. However, complex models often require significant computational resources and energy costs. Therefore, CNN compression is crucial for addressing the deployment challenges of multimedia applications on resource-constrained edge devices. However, existing CNN channel pruning strategies primarily focus on the "weights" or "activations" of the model, overlooking its "interpretability" information. In this paper, we explore CNN pruning strategies from the perspective of model interpretability. We model the correspondence between channel feature maps and interpretable visual perception based on class saliency maps, aiming to assess the contribution of each channel to the desired output. Additionally, we utilize the Discrete Wavelet Transform (DWT) to capture the global features and structure of class saliency maps. Based on this, we propose a Channel Spatial Dependability (CSD) metric, which evaluates the importance and contribution of channels in a bidirectional manner to guide model pruning. We also dynamically adjust the pruning rate of each layer based on performance changes to achieve more accurate and efficient adaptive pruning. Our method achieves significant results across a range of networks and datasets. For instance, we achieve 51.3% pruning on the ResNet-56 model while maintaining an accuracy of 94.16%, outperforming feature-map-based and other state-of-the-art (SOTA) methods.
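As a reminder of what a DWT extracts from a 2D map, here is a one-level Haar transform in plain Python. The choice of the Haar wavelet and a single decomposition level is an assumption made for illustration; the abstract does not say which wavelet or depth is used, and `haar_dwt2` is a hypothetical helper name.

```python
from typing import List, Tuple

Grid = List[List[float]]


def haar_dwt2(image: Grid) -> Tuple[Grid, Grid, Grid, Grid]:
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Returns four half-resolution sub-bands:
      approx     - scaled local average (coarse, global structure)
      row_detail - top row minus bottom row of each 2x2 cell
      col_detail - left column minus right column of each 2x2 cell
      diag       - diagonal difference of each 2x2 cell
    """
    approx, row_detail, col_detail, diag = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, rd_row, cd_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 2.0)
            rd_row.append((tl + tr - bl - br) / 2.0)
            cd_row.append((tl - tr + bl - br) / 2.0)
            d_row.append((tl - tr - bl + br) / 2.0)
        approx.append(a_row)
        row_detail.append(rd_row)
        col_detail.append(cd_row)
        diag.append(d_row)
    return approx, row_detail, col_detail, diag
```

On a constant saliency map all three detail sub-bands are zero and only the approximation band carries energy, which is why the low-frequency band is a natural summary of a map's global structure.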

Abstract:
Co-speech gesture generation encounters challenges with imbalanced, long-tailed gesture distributions. Recent methods typically address this by employing a Vector Quantized Variational Autoencoder (VQ-VAE) to encode gestures into a codebook and then classifying codebook indices based on audio or text cues. However, due to the imbalance, the codebook classification tends to be biased towards majority gestures, neglecting semantically rich minority gestures. To address this, this paper proposes Entropy-Guided Co-Speech Gesture Generation (EGGesture). EGGesture leverages an Entropy-Guided VQ-VAE to jointly optimize the distribution of codebook indices and adjust the loss weights for codebook index classification. It consists of a) a differentiable approach for entropy computation using Gumbel-Softmax and cosine similarity, facilitating online codebook distribution optimization, and b) a strategy that utilizes the computed codebook entropy to collaboratively guide the classification loss weighting. These designs enable the dynamic refinement of codebook utilization, striking a balance between the quality of the learned gesture representation and the accuracy of the classification phase. Experiments on the Trinity and BEAT datasets demonstrate EGGesture's state-of-the-art performance both qualitatively and quantitatively.
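To make the entropy computation concrete, the sketch below shows only the forward pass in plain Python: soft code assignments come from a Gumbel-Softmax over cosine similarities, the assignments are averaged into a usage distribution, and the Shannon entropy of that distribution is returned. How the paper combines these pieces is not stated in the abstract, so the temperature `tau`, the averaging step, and every function name here are assumptions. In practice this would be written in an autodiff framework so that the entropy is differentiable.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def gumbel_softmax(
    logits: Sequence[float], tau: float = 1.0, rng: Optional[random.Random] = None
) -> List[float]:
    """Softmax of (logits + Gumbel noise) / tau; noise is omitted when rng is None."""
    if rng is not None:
        logits = [l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for l in logits]
    peak = max(logits)
    exps = [math.exp((l - peak) / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def codebook_entropy(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    tau: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Shannon entropy (in nats) of the average soft code usage over a batch."""
    usage = [0.0] * len(codebook)
    for feature in features:
        probs = gumbel_softmax([cosine(feature, code) for code in codebook], tau, rng)
        usage = [u + p / len(features) for u, p in zip(usage, probs)]
    return -sum(u * math.log(u) for u in usage if u > 0)
```

A batch that spreads evenly over the codes yields entropy close to log K for K codes, while a batch that collapses onto one code yields entropy near zero, so a low value signals the biased codebook usage the abstract describes.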

Abstract:
Swarical, a Swarm-based hierarchical localization technique, enables miniature drones, Flying Light Specks (FLSs), to accurately and efficiently localize and illuminate complex 2D and 3D shapes. Its accuracy depends on the physical hardware (sensors) of FLSs used to track neighboring FLSs to localize themselves. It uses the specification of the sensors to convert mesh files into point clouds that enable a swarm of FLSs to localize at the highest accuracy afforded by their sensors. Swarical considers a heterogeneous mix of FLSs with different orientations for their tracking sensors, ensuring a line of sight between a localizing FLS and its anchor FLS. We present an implementation using Raspberry cameras and ArUco markers. A comparison of Swarical with a state-of-the-art decentralized localization technique shows that it is as accurate and more than 2x faster.

Abstract:
With the development of depth sensors and 3D vision, the vulnerability of 3D point cloud models has garnered heightened concern. Almost all existing 3D attackers are deployed in the white-box setting, where they access the model details and directly optimize coordinate-wise noises to perturb 3D objects. However, realistic 3D applications would not share any model information (model parameters, gradients, etc.) with users. Although a few recent works try to explore the black-box attack, they still achieve limited attack success rates (ASR) and fail to generate high-quality adversarial samples. In this paper, we focus on designing a transfer-based black-box attack method, called Transferable Frequency-aware 3D GAN, to delve into achieving a high black-box ASR by improving the adversarial transferability while making the adversarial samples more imperceptible. Considering that the 3D imperceptibility depends on whether the shape of the object is distorted, we utilize the spectral tool with the GAN design to explicitly perceive and preserve the 3D geometric structures. Specifically, we design the Graph Fourier Transform (GFT) encoding layer in the GAN generator to extract the geometries as guidance, and develop a corresponding Inverse-GFT decoding layer to decode latent features with this guidance to reconstruct high-quality adversarial samples. To further improve the transferability, we develop a dual learning scheme of discriminator from both frequency and feature perspectives to constrain the generator via adversarial learning. Finally, imperceptible and transferable perturbations are rapidly generated by our proposed attack. Experimental results demonstrate that our attack method achieves the highest transfer ASR while exhibiting stronger imperceptibility.

Abstract:
Few-Shot Industrial Anomaly Detection (FS-IAD) has drawn great attention recently, since data efficiency and the ability to design algorithms for fast migration across products have become the main concerns. The difficulty of memory-based IAD in the low-data regime primarily lies in inefficient measurement between the memory bank and query images. We address this pivotal issue from a new perspective of optimal matching between features of image regions. Taking the unbalanced nature of query features into consideration, we adopt Conditional Transport (CT) as a metric to compute the structural distance between representations of the two sets to determine feature relevance. The CT distance generates the optimal matching flows between unbalanced structural elements that achieve the minimum matching cost, which can be directly used for IAD since it reflects well the differences between query images and the normal memory. Since query images usually arrive one-by-one or batch-by-batch, we further propose Online Conditional Transport (OCT), which makes full use of the current and historical query images for IAD by simultaneously calibrating the memory bank using the online query images and matching features between the calibrated memory and the current query image. Going one step further, for sparse foreground products, we employ a predominant segmentation model to implement Foreground-aware OCT (FOCT), which improves the effectiveness and efficiency of OCT by forcing the model to pay more attention to diverse targets rather than redundant backgrounds when calibrating the memory bank. FOCT can improve the diversity of the calibrated memory, which is critical for robust FS-IAD in practice. Besides, FOCT is flexible, since it works in a plug-and-play manner with any pre-trained backbone, such as WRN, and any pre-trained segmentation model, such as SAM. The effectiveness and efficiency of our model are demonstrated across diverse datasets, including the MVTec and MPDD benchmarks, achieving SOTA performance.

Abstract:
While RAW images are efficient for image editing and perception tasks, their large size can strain camera storage and bandwidth. Reconstruction methods of RAW images from sRGB data typically require additional metadata from the RAW image, which increases camera processing computations. To address this problem, we propose using Prior Meta as a reference to reconstruct the RAW data instead of relying on per-image metadata. Prior metadata is extracted offline from reference RAW images, which are usually part of the training dataset and have similar scenes and light conditions as the target image. With this prior metadata, the camera does not need to provide any extra processing other than the sRGB images, and our model can autonomously find the desired prior information. To achieve this, we design a three-step pipeline. First, we build a pixel searching network that can find the most similar pixels in the reference RAW images as prior information. Then, in the second step, we compress the large-scale reference images to about 0.02% of their original size to reduce the searching cost. Finally, in the last step, we develop a neural network reconstructor to reconstruct the high-fidelity RAW images. Our model achieves comparable, and even better, performance than RAW reconstruction methods based on metadata.

Abstract:
Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when tasked with interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often contain inconsistent multi-entity representation (IMR), reflected as inaccurate presentations of the multiple entities and their attributes. Although providing spatial layout guidance improves the multi-entity generation quality in existing works, it is still challenging to handle the leakage attributes and avoid unnatural characteristics. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that the IMR challenges largely stem from the process of cross-attention mechanisms. According to the analyses, we introduce the entity guidance generation mechanism, which maintains the integrity of the original diffusion model parameters by integrating plug-in networks. Our work advances the stable diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module is integrated to progressively reduce the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Our comprehensive experiments demonstrate that this entity guidance generation enhances existing text-to-image models in generating detailed, multi-entity images.

Abstract:
Designing embedding costs is pivotal in modern image steganography. Many studies have shown that adjusting symmetric embedding costs to asymmetric ones can enhance steganographic security. However, most existing methods heavily depend on manually defined parameters or rules, limiting security performance improvements. To overcome this limitation, we introduce an advanced GAN-based framework that transitions symmetric costs to asymmetric ones without the need for the manual intervention seen in existing approaches, such as the detailed specification of cost modulation directions and magnitudes. In our framework, we first obtain symmetric costs for a cover image, which is randomly split into two sub-images, with part of the secret information embedded into one. Subsequently, we design a GAN model to adjust the embedding costs of the second sub-image to asymmetric ones, facilitating the secure embedding of the remaining secret information. To support our phased embedding approach, our GAN's discriminator incorporates two steganalyzers with different tasks: distinguishing the generator's final output, i.e., the stego image, from both the input cover image and the partially embedded stego image, providing diverse guidance to the generator. In addition, we introduce a simple yet effective update strategy to ensure a stable training process. Comprehensive experiments demonstrate that our method significantly enhances security over existing symmetric steganography techniques, achieving state-of-the-art levels compared to other methods focused on embedding cost adjustment. Additionally, detailed ablation studies validate our approach's effectiveness.

Abstract:
Federated learning has rapidly gained attention in the industrial sector due to its significant advantages in protecting privacy. However, ensuring the fairness of federated learning models post-deployment presents a challenge in practical applications. Given that clients typically rely on limited private datasets to assess model fairness, this constrains their ability to make accurate judgments about the fairness of the model. To address this issue, we propose an innovative evaluation framework, FedEvalFair, which integrates private data from multiple clients to comprehensively assess the fairness of models in actual deployment without compromising data privacy. Firstly, FedEvalFair draws on the concept of federated learning to achieve a comprehensive assessment while protecting privacy. Secondly, based on the statistical concept of "estimating the population from the sample", FedEvalFair is capable of estimating the fairness performance of the model in real-world settings from a limited data sample. Thirdly, we have designed a flexible two-stage evaluation strategy based on statistical hypothesis testing. We verified the theoretical performance and sensitivity to fairness variations of FedEvalFair using Monte Carlo simulations, demonstrating the superior performance of its two-stage evaluation strategy. Additionally, we validated the effectiveness of the FedEvalFair method on real-world datasets, including UCI Adult and eICU, and demonstrated its stability in dealing with real-world data distribution changes compared to traditional evaluation methods.

Abstract:
Some video traffic carries harmful content, such as hate speech and child abuse, primarily encrypted and transmitted through Dynamic Adaptive Streaming over HTTP (DASH). Promptly identifying and intercepting traffic of harmful videos is crucial in network regulation. However, QUIC is becoming another DASH transport protocol in addition to TCP. On the other hand, complex network environments and diverse playback modes lead to significant distortions in traffic. The issues above have not been effectively addressed. This paper proposes a real-time identification method for DASH encrypted video traffic with distortion, named Zenith. We extract stable video segment sequences under various itags as video fingerprints to tackle resolution changes and propose a method of traffic fingerprint extraction under QUIC and VPN. Subsequently, simulating the sequence matching problem as a natural language problem, we propose Traffic Language Model (TLM), which can effectively address video data loss and retransmission. Finally, we propose a frequency dictionary to accelerate Zenith's speed further. Zenith significantly improves accuracy and speed compared to other SOTA methods in various complex scenarios, especially in QUIC, VPN, automatic resolution, and low bandwidth. Zenith requires traffic for just half a minute of video content to achieve precise identification, demonstrating its real-time effectiveness.

Abstract:
Point cloud upsampling is crucial for 3D reconstruction, with recent research significantly benefitting from the advances in deep learning technologies. The majority of existing methods, which focus on a sequence of processes including feature extraction, augmentation, and the reconstruction of coordinates, encounter significant challenges in interpreting the geometric attributes they uncover, particularly with respect to the intricacies of transitioning feature dimensionality. In this paper, we delve deeper into modeling Partial Differential Equations (PDEs) specifically tailored for the inverse heat dissipation process in dense point clouds. Our goal is to detect gradients within the dense point cloud data distribution and refine the accuracy of interpolated points' positions along with their complex geometric nuances through a systematic iterative approximation method. Simultaneously, we adopt multivectors from geometric algebra as the primary tool for representing the geometric characteristics of point clouds, moving beyond the conventional vector space representations. The use of geometric products of multivectors enables us to capture the complex relationships between scalars, vectors, and their components more effectively. This methodology not only offers a robust framework for depicting the geometric features of point clouds but also enhances our modeling capabilities for inverse heat dissipation PDEs. Through both qualitative and quantitative assessments, we demonstrate that our results significantly outperform existing state-of-the-art techniques in terms of widely recognized point cloud evaluation metrics and 3D visual reconstruction fidelity.

Abstract:
Sampling strategies have been widely adopted in Vision-Language Pre-training (VLP) and have achieved great success recently. However, the sampling strategies adopted by current VLP works are limited in two ways: i) they only focus on negative sampling, ignoring the importance of more informative positive samples; ii) their sampling strategies are conducted at the local in-batch level, which may lead to sub-optimal results. To tackle these problems, in this paper, we propose a curriculum-based Global Positive-Negative Sampling (GPN-S) framework for vision-language pre-training, which conducts both positive and negative sampling at the global level, grounded on the notion of neighborhood relationships. Additionally, we incorporate curriculum learning into our sampling strategy, progressively increasing the complexity of samples as the training progresses. Specifically, our proposed GPN-S framework is capable of utilizing positive sampling to bring semantically equivalent samples closer, as well as employing negative sampling to push challenging negative samples farther away. We jointly consider them for vision-language pre-training from a global-level perspective rather than within a local mini-batch, which provides more informative and diverse samples. We evaluate the effectiveness of the proposed GPN-S framework by conducting experiments on several common downstream tasks, and the results demonstrate significant performance improvement over the existing models.

Abstract:
Fine-grained Visual Recognition (FGVR) aims to distinguish objects within similar subcategories. Humans adeptly perform this challenging task by leveraging both intra-category distinctiveness and inter-category similarity. However, previous methods fail to combine these two complementary dimensions and mine the intrinsic relations among various semantic features. To address these limitations, we propose HI2R, a Hypergraph-guided Intra- and Inter-category Relation Modeling approach, which simultaneously extracts the intra-category structural information and inter-category relation for more precise reasoning. Specifically, we exploit a Hypergraph-guided Structure Learning (HSL) module, which employs hypergraphs to capture high-order structural relations, transcending traditional graph-based methods that are limited to pairwise linkages. This advancement allows the model to adapt to significant intra-category variations. Additionally, we propose an Inter-category Relation Perception (IRP) module to improve feature discrimination across categories by extracting and analyzing semantic relations among them. Our objective is to alleviate the robustness issue associated with exclusive reliance on intra-category discriminative features. Furthermore, a random semantic consistency (RSC) loss is introduced to direct the model's attention to commonly overlooked yet distinctive regions, indirectly enhancing the representation ability of both HSL and IRP modules. Both qualitative and quantitative results demonstrate the effectiveness and usefulness of HI2R.

Abstract:
A 3D model can be estimated by regressing the pose and shape parameters from the image data of the digital model. Reconstructing 3D cartoon characters is challenging due to diverse visual representations and postural variations. This paper proposes a dual-branch structure named MagicCartoon for 3D bipedal cartoon character estimation, which models pose and shape independently through feature decoupling. Considering the correlation between category difference and shape parameters, a hybrid feature fusion technique is introduced, which integrates the global features of the original image with the corresponding local features expressed by the puzzle image, reducing the abstractness of understanding shape parameter differences. To semantically align the image and geometric feature spaces, a geometric-guided feedback loop is applied iteratively, so that the pose of the modeling results is expressed consistently with the image. Moreover, a feature consistency loss is designed to augment the training data by incorporating the same character with different postures and the same posture of different characters. It enhances the correlation between the features extracted by the backbone network and the specific task. Experiments conducted on the 3DBiCar dataset demonstrate that MagicCartoon outperforms the state-of-the-art methods.

Abstract:
Spatial transcriptomics provides revolutionary insights into cellular interactions and disease development mechanisms by combining high-throughput gene sequencing and spatially resolved imaging technologies to analyze genes naturally associated with spatially variable tissue genes. However, existing methods typically map aggregated multi-view features into a unified representation, ignoring the heterogeneity and view independence of genes and spatial information. To this end, we construct a heterogeneous Graph guided Contrastive Learning framework (stGCL) for aggregating spatial transcriptomics data. The method is guided by the inherent heterogeneity of cellular molecules, dynamically coordinating triple-level node attributes through a contrastive learning loss distributed across view domains, thus maintaining view independence during the aggregation process. In addition, we introduce a cross-view hierarchical feature alignment module that employs a parallel approach to decouple spatial and genetic views on molecular structures while aggregating multi-view features according to information theory, thereby enhancing the integrity of inter- and intra-view information. Rigorous experiments demonstrate that stGCL outperforms existing methods in various tasks and related downstream applications.

Abstract:
Coreference resolution, an essential task in natural language processing, is particularly challenging in multi-modal scenarios where data comes in various forms and modalities. Despite advancements, limitations due to scarce labeled data and underleveraged unlabeled data persist. We address these issues with a self-adaptive fine-grained multi-modal data augmentation framework for semi-supervised multi-modal coreference resolution (MCR), focusing on enriching training data from labeled datasets and tapping into the untapped potential of unlabeled data. Regarding the former issue, we first leverage text coreference resolution datasets and diffusion models to perform fine-grained text-to-image generation with aligned text entities and image bounding boxes. We then introduce a self-adaptive selection strategy, meticulously curating the augmented data to enhance the diversity and volume of the training set without compromising its quality. For the latter issue, we design a self-adaptive threshold strategy that dynamically adjusts the confidence threshold based on the model's learning status and performance, enabling effective utilization of valuable information from unlabeled data. Additionally, we incorporate a distance smoothing term, which smooths distances between positive and negative samples, enhancing the discriminative power of the model's feature representations and addressing noise and uncertainty in the unlabeled data. Our experiments on the widely-used CIN dataset show that our framework significantly outperforms state-of-the-art baselines by at least 9.57% on MUC F1 score and 4.92% on CoNLL F1 score. Remarkably, against weakly-supervised baselines, our framework achieves a 22.24% improvement in MUC F1 score. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MCR tasks.

Abstract:
Cross-modal 2D-3D point cloud semantic segmentation using few-shot-based learning provides a practical approach for borrowing matured 2D domain knowledge into the 3D segmentation model, which reduces the reliance on laborious 3D annotation work and improves generalization to new categories. However, previous methods use single-view point cloud generation algorithms to bridge the gap between 2D images and 3D point clouds, leaving the incomplete geometry of an object or scene due to occlusions. To address this issue, we propose a novel view synthesis cross-modal few-shot point cloud semantic segmentation network. It introduces the color and depth inpainting to generate multi-view images and masks, which compensate for the absent depth information of generated point clouds. Additionally, we propose a Co-embedding Network to bridge the domain features between synthesized and original, collected 3D data, and a weighted prototype network is employed to balance the impact of multi-view images and enhance the segmentation performance. Extensive experiments on two benchmarks show the superiority of our method by outperforming the existing cross-modal few-shot 3D segmentation methods.

Abstract:
Due to the explosive growth in data sources and label categories, multi-view multi-label learning has garnered widespread attention. However, multi-view multi-label data often exhibits incomplete features and a huge number of unlabeled instances, due to the technical limitations and high cost of manual labeling in practice. Learning when view features and labels are missing simultaneously is crucial but rarely studied, particularly when the labeled samples are limited. In this paper, we tackle this problem by proposing a novel Deep Incomplete Multi-View Semi-Supervised Multi-Label Learning method (DIMvSML). Specifically, to improve high-level representations of missing features, a deep graph network is first employed to recover the feature information with structural similarity relations. Meanwhile, we design structure-specific deep feature extractors to obtain discriminative information and preserve the cross-view consistency for the recovered data with an instance-level contrastive loss. Furthermore, to eliminate the bias in the estimate of the risk that semi-supervised multi-label methods minimise, we design a safe estimation framework with an unbiased loss and improve its empirical performance by using pseudo-labels of unlabeled data. Besides, we provide both a theoretical proof of better estimate variance and an intuitive explanation of our debiased framework. Finally, extensive experimental results on public datasets validate the superiority of DIMvSML compared with state-of-the-art methods.

Abstract:
Camera calibration is crucial in computer vision tasks and applications, e.g., autonomous driving (AD). However, prevailing camera calibration models require a time-consuming and labor-intensive off-board process in mass production settings, while also lacking exploration of real-world AD scenarios. To this end, inspired by recent advancements in bird's-eye-view (BEV) perception models, this paper proposes a novel multi-camera Calibration method via Reversed BEV representations for AD, termed CalibRBEV. Specifically, the proposed CalibRBEV model primarily comprises two stages. First, we reverse the BEV perception pipeline, reconstructing bounding boxes through an attention auto-encoder module to fully extract the latent reversed BEV representations. Subsequently, the representations obtained from the encoder interact with the surrounding multi-view image features for further refinement and calibration parameter prediction. Extensive experimental results on the nuScenes and Waymo datasets validate the effectiveness of our proposed model.

Abstract:
Large-scale pre-trained audio-language models excel in general multi-modal representation, facilitating their adaptation to downstream audio recognition tasks in a data-efficient manner. However, existing few-shot audio recognition methods based on audio-language models primarily focus on learning coarse-grained correlations, which are not sufficient to capture the intricate matching patterns between the multi-level information of audio and the diverse characteristics of category concepts. To address this gap, we propose multi-grained correspondence learning for bootstrapping audio-language models to improve audio recognition with few training samples. This approach leverages generative models to enrich multi-modal representation learning, mining the multi-level information of audio alongside the diverse characteristics of category concepts. Multi-grained matching patterns are then established through multi-grained key-value cache and multi-grained cross-modal contrast, enhancing the alignment between audio and category concepts. Additionally, we incorporate optimal transport to tackle temporal misalignment and semantic intersection issues in fine-grained correspondence learning, enabling flexible fine-grained matching. Our method achieves state-of-the-art results on multiple benchmark datasets for few-shot audio recognition, with comprehensive ablation experiments validating its effectiveness.
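
To make the optimal-transport step in the abstract above more concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. The abstract does not say which solver or cost is used, so the Sinkhorn choice, the function name `sinkhorn`, and the parameters `reg` and `iters` are assumptions for illustration. The cost matrix stands in for a hypothetical dissimilarity between audio segments (rows) and category-concept descriptions (columns).

```python
import math
from typing import List, Sequence


def sinkhorn(cost: Sequence[Sequence[float]],
             row_mass: Sequence[float],
             col_mass: Sequence[float],
             reg: float = 0.1,
             iters: int = 200) -> List[List[float]]:
    """Return a transport plan whose marginals approximate the given masses."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost means high affinity.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Two audio segments, two concepts; matching pairs have zero cost.
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
    for row in plan:
        print([round(x, 4) for x in row])
```

The resulting plan acts as a soft matching, so a segment can be assigned to a concept even when the best match is not at the same temporal position.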

Abstract:
Video Tube Retrieval (VTR) has attracted wide attention in the multi-modal domain, aiming to accurately localize the spatial-temporal tube in videos based on the natural language description. Despite the remarkable progress, existing VTR models trained on a specific domain (source domain) often perform unsatisfactorily in another domain (target domain), due to the domain gap. To address this issue, we introduce the Unsupervised Domain Adaptation learning strategy into the VTR task (UDA-VTR), which enables knowledge transfer from the labeled source domain to the unlabeled target domain without additional manual annotations. An intuitive solution is generating pseudo labels for the target domain samples with the fully trained source model and fine-tuning the source model on the target domain with these pseudo labels. However, the existing domain gap gives rise to two problems for this process: (1) The transfer of model parameters across domains may introduce source domain bias into target domain features, significantly impacting the feature-based prediction for target domain samples. (2) The pseudo labels tend to identify video tubes that are widely present in the source domain, rather than accurately localizing the correct video tubes specific to the target domain samples. To address the above issues, we propose the unsupervised domain adaptation model via Hierarchical dEbiAsing and noisy corRecTion (HEART) for cross-domain video tube retrieval, which contains two characteristic modules: Layered Feature Debiasing (including adversarial feature alignment and graph-based alignment) and Pseudo Label Refinement. Extensive experiments demonstrate the effectiveness of our HEART model, which significantly surpasses state-of-the-art methods.

Abstract:
Multimodal Entity Linking (MEL) aims to address the ambiguity in multimodal mentions and associate them with Multimodal Knowledge Graphs (MMKGs). Existing works primarily focus on designing multimodal interaction and fusion mechanisms to enhance the performance of MEL. However, these methods still overlook two crucial gaps within the MEL task. One is the content discrepancy between mentions and entities, manifested as uneven information density. The other is the knowledge gap, indicating insufficient knowledge extraction and reasoning during the linking process. To bridge these gaps, we propose a novel framework FissFuse, as well as a plug-and-play knowledge-aware re-ranking method KAR. Specifically, FissFuse collaborates with the Fission and Fusion branches, establishing dynamic features for each mention-entity pair and adaptively learning multimodal interactions to alleviate content discrepancy. Meanwhile, KAR is endowed with carefully crafted instruction for intricate knowledge reasoning, serving as re-ranking agents empowered by Large Language Models (LLMs). Extensive experiments on two well-constructed MEL datasets demonstrate outstanding performance of FissFuse compared with various baselines. Comprehensive evaluations and ablation experiments validate the effectiveness and generality of KAR.

Abstract:
Active learning (AL) aims to select highly informative data points from an unlabeled dataset for annotation, mitigating the need for extensive human labeling effort. However, classical AL methods heavily rely on human expertise to design the sampling strategy, which limits scalability and generalizability. Many efforts have sought to address this limitation by directly connecting sample selection with model performance improvement, typically through influence functions. Nevertheless, these approaches often ignore the dynamic nature of model behavior during training optimization, even though empirical evidence highlights the importance of dynamic influence for tracking sample contribution. This oversight can lead to suboptimal selection, hindering the generalizability of the model. In this study, we explore a dynamic-influence-based data selection strategy by tracing the impact of unlabeled instances on model performance throughout the training process. Our theoretical analyses suggest that selecting samples with higher projected gradients along the accumulated optimization direction at each checkpoint leads to improved performance. Furthermore, to capture a wider range of training dynamics without incurring excessive computational or memory costs, we introduce an additional dynamic loss term designed to encapsulate more generalized training progress information. These insights are integrated into a universal and task-agnostic AL framework termed Dynamic Influence Scoring for Active Learning (DISAL). Comprehensive experiments across various tasks demonstrate that DISAL significantly surpasses existing state-of-the-art AL methods, showing its ability to facilitate more efficient and effective learning in different domains.
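
The selection rule stated in the abstract above (prefer samples with higher projected gradients along the accumulated optimization direction at each checkpoint) can be sketched as follows. This is an illustration under assumptions, not the DISAL implementation: per-sample gradients are taken as given inputs (how they are obtained for unlabeled data is not specified in the abstract), scores are summed across checkpoints, and the names `projected_gradient` and `select_top_k` are hypothetical.

```python
import math
from typing import List, Sequence


def projected_gradient(grad: Sequence[float],
                       direction: Sequence[float]) -> float:
    """Scalar projection of `grad` onto `direction` (0 if direction is zero)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return 0.0
    return sum(g * d for g, d in zip(grad, direction)) / norm


def select_top_k(grads_per_checkpoint: Sequence[Sequence[Sequence[float]]],
                 directions: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """grads_per_checkpoint[t][i] is the gradient of sample i at checkpoint t;
    directions[t] is the accumulated optimization direction at checkpoint t."""
    n = len(grads_per_checkpoint[0])
    scores = [0.0] * n
    for grads, direction in zip(grads_per_checkpoint, directions):
        for i, grad in enumerate(grads):
            scores[i] += projected_gradient(grad, direction)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    grads = [
        [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],  # checkpoint 0
        [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],   # checkpoint 1
    ]
    directions = [[1.0, 0.0], [2.0, 0.0]]
    print(select_top_k(grads, directions, k=2))
```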

Abstract:
Existing adversarial example defense methods are static, meaning they remain unchanged once training is completed, regardless of how attack methods change. Consequently, static defense methods are highly vulnerable to adaptive attacks. We argue that to counter more formidable attacks, models should continually adapt to various attack methods. We propose a novel dynamic defense approach. Initially, we use Gaussian Mixture Models (GMM) to obtain structural information of the data, which is combined with model prediction information to generate pseudo-labels for optimizing inputs. Subsequently, we employ information maximization and enhanced mean predictions as optimization objectives, utilizing a hierarchical optimization approach to refine the model. Meanwhile, we propose a sample-efficient optimization strategy that reduces the total number of samples in the test data stream used for reverse updating and improves efficiency. Notably, our method can be directly applied to pre-trained models without the need for accessing training data or retraining the model. Therefore, our approach is training-data-agnostic and model-agnostic, easily applicable to existing adversarially trained models, and significantly enhances the resilience of various models against white-box, black-box, and adaptive attacks across diverse datasets. We have conducted extensive experiments to validate the state-of-the-art performance of our proposed method. The pseudo-code can be found in the appendix.
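
For readers unfamiliar with the information maximization objective mentioned in the abstract above, the following sketch shows one commonly used form: minimize the average entropy of individual predictions while maximizing the entropy of the mean prediction. Whether the paper uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float], eps: float = 1e-12) -> float:
    """Shannon entropy in nats; `eps` guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


def information_maximization_loss(batch: Sequence[Sequence[float]]) -> float:
    """Mean per-sample entropy minus entropy of the batch-mean prediction.

    Lower values mean predictions are individually confident while the
    batch as a whole still covers the classes evenly.
    """
    n = len(batch)
    num_classes = len(batch[0])
    mean_entropy = sum(entropy(p) for p in batch) / n
    marginal: List[float] = [sum(p[c] for p in batch) / n
                             for c in range(num_classes)]
    return mean_entropy - entropy(marginal)


if __name__ == "__main__":
    confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
    uncertain = [[0.5, 0.5], [0.5, 0.5]]
    print(information_maximization_loss(confident_and_diverse))
    print(information_maximization_loss(uncertain))
```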

Abstract:
Information diffusion prediction aims to forecast the path of information spreading in social networks by exploiting user correlations or preferences. Recent works focus on characterizing the dynamic of user preferences and propose to capture users' dynamic preferences by discretizing the diffusion process into structure snapshots. Despite their effectiveness, these works simply summarize users' dynamic preferences from partially observed structure snapshots, ignoring the continuous evolution of the preferences. Moreover, discretizing the diffusion process makes these models overlook abundant structure information across different periods, reducing their ability to discover potential participants. To address the above issues, we propose a novel Graph Neural Ordinary Differential Equation Network (GODEN) for information diffusion prediction, which incorporates neural ordinary differential equations (ODE) to model the continuous dynamics of the diffusion process. Specifically, we design two coupled ODE functions on nodes and edges to describe their co-evolution dynamics and infer users' dynamic preferences based on the solution of ODEs. To predict the future infections of the observed cascade, we represent its diffusion pattern in terms of temporal and user contexts and apply a multi-head attention module to attend to different contexts. Experimental results confirm our approach's effectiveness, with our model outperforming the state-of-the-art diffusion prediction models.

Abstract:
Online chatting has become an essential aspect of our daily interactions, with stickers emerging as a prevalent tool for conveying emotions more vividly than plain text. While conventional image emotion recognition focuses on global features, sticker emotion recognition necessitates incorporating both global and local features, along with additional modalities like text. To address this, we introduce a topic ID-guided transformer method to facilitate a more nuanced analysis of the stickers. Since each sticker has a topic, and stickers with the same topic depict the same object, we introduce a topic ID as a flag to group images by theme. Our approach encompasses a novel topic-guided context-aware module and a topic-guided attention mechanism, enabling the extraction of comprehensive topic context features from stickers sharing the same topic ID, significantly enhancing emotion recognition accuracy. Moreover, we integrate a frequency linear attention module that leverages frequency domain information to better capture the object information of the stickers, and a locally enhanced re-attention mechanism for improved local feature extraction. Extensive experiments and ablation studies on the large-scale sticker emotion dataset SER30k validate the efficacy of our method. Experimental results show that our proposed method obtains the best accuracy on both single-modal and multi-modal sticker emotion recognition.

Abstract:
Cartoon animal parsing aims to segment the body parts such as heads, arms, legs and tails of cartoon animals. Different from previous parsing tasks, cartoon animal parsing faces new challenges, including irregular body structures, abstract drawing styles and diverse animal categories. Existing methods have difficulties when addressing these challenges caused by the spatial and structural properties of cartoon animals. To address these challenges, a novel spatial learning and structural modeling network, named CAPNet, is proposed for cartoon animal parsing. It aims to address the critical problems of spatial perception, structure modeling and spatial-structural consistency learning. A spatial-aware learning module integrates deformable convolutions to learn spatial features of diverse cartoon animals. The multi-task edge and center point prediction mechanism is incorporated to capture the intricate spatial patterns. A structural modeling method is proposed to model the complex structural representations of cartoon animals, which integrates a graph neural network with a shape-aware relation learning module. To mitigate the significant differences among animals, a spatial and structural consistency learning strategy is proposed to capture and learn feature correlations across different animal species. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed approach, which outperforms the state-of-the-art methods.

Abstract:
Recently, learning-based methods have made significant progress for image specular highlight removal. However, many of these approaches treat all the image pixels uniformly, overlooking the negative impact of invalid pixels on feature reconstruction. This oversight often leads to undesirable outcomes, such as color distortion or residual highlights. In this paper, we propose a novel image specular highlight removal network called HighlightRNet, which utilizes valid pixels as references to reconstruct the highlight-free image. To achieve this, we introduce a context-aware fusion block (CFBlock) that aggregates information in four directions, effectively capturing global contextual information. Additionally, we introduce a location-aware feature transformation module (LFTModule) to adaptively learn the valid pixels for feature reconstruction, thereby avoiding information errors caused by invalid pixels. With these modules, our method can produce high-quality highlight-free results without color distortion and highlight residual. Furthermore, we develop a multiple light image-capturing system to construct a large-scale highlight dataset called NSH, which exhibits minimal misalignment in image pairs and minimal brightness variation in non-highlight regions. Experimental results on various datasets demonstrate the superiority of our method over state-of-the-art methods, both qualitatively and quantitatively.

Abstract:
Implicit 3D representations have shown great promise in deep learning-based 3D reconstruction. With differentiable renderers, current methods are able to learn implicit occupancy fields without 3D supervision by minimizing the error between the images rendered from the learned occupancy fields and 2D ground truth images. In this paper, however, we hypothesize that a full rendering pipeline including visibility determination and evaluation of a shading model is not required for the learning of 3D shapes without 3D supervision. Instead, we propose to use implicit reasoning, that is, we reason directly on the implicit occupancy field without explicit rendering. This leads our method to reveal highly accurate 3D structures from low quality silhouette images. Our implicit reasoning infers a 3D occupancy field by evaluating how well it matches with multiple 2D occupancy maps, using occupancy clues rather than rendering the 3D occupancy field into images. We exploit the occupancy clues that indicate whether a viewing ray inside a 2D object silhouette hits at least one occupied 3D location, or whether a ray outside the silhouette hits no occupied location. In contrast to differentiable renderers whose losses do not distinguish between the inside and outside of objects, our novel loss function weights unoccupied clues more than occupied ones. Our results outperform recent state-of-the-art techniques, justifying that we can learn accurate occupancy fields only using sparse clues without an explicit rendering process.
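
The occupancy clues described in the abstract above can be sketched as a simple loss. A ray inside the silhouette should hit at least one occupied sample, and a ray outside should hit none. The sketch assumes occupancies in [0, 1] sampled along each ray, uses the maximum along the ray as the "at least one hit" indicator, and applies a weighted binary cross-entropy. The max aggregation, the cross-entropy form, the name `clue_loss`, and the default `unoccupied_weight` are all assumptions, not the paper's formulation.

```python
import math
from typing import Sequence


def clue_loss(ray_occupancies: Sequence[Sequence[float]],
              inside_flags: Sequence[bool],
              unoccupied_weight: float = 2.0,
              eps: float = 1e-7) -> float:
    """Average clue loss over rays.

    ray_occupancies[r] holds predicted occupancies sampled along ray r;
    inside_flags[r] is True if the ray's pixel lies inside the 2D silhouette.
    """
    total = 0.0
    for occupancies, inside in zip(ray_occupancies, inside_flags):
        # Max over samples approximates "at least one occupied location".
        hit = min(max(max(occupancies), eps), 1.0 - eps)
        if inside:
            total += -math.log(hit)  # occupied clue
        else:
            # Unoccupied clue, weighted more heavily than the occupied one.
            total += -unoccupied_weight * math.log(1.0 - hit)
    return total / len(ray_occupancies)


if __name__ == "__main__":
    consistent = clue_loss([[0.1, 0.9], [0.05, 0.1]], [True, False])
    inconsistent = clue_loss([[0.1, 0.2], [0.05, 0.9]], [True, False])
    print(consistent, inconsistent)
```

Because an outside ray constrains every sample along it to be empty, such clues carry more information per ray, which is one plausible motivation for weighting them more.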

Abstract:
In recent years, neural field-based methods for synthesizing novel views have gained popularity due to their exceptional rendering quality and fast training speed. However, the computational cost of volumetric rendering has significantly increased with the advancement of camera technology and the subsequent rise in average camera resolution. Despite extensive efforts to accelerate the training process, the training duration remains unacceptable for high-resolution inputs. Therefore, it's crucial to develop efficient sampling methods to optimize the learning process of neural fields from large inputs. In this paper, we present a new technique called Superpixel-based Efficient Sampling (SES) to improve the learning efficiency of neural fields. Our approach optimizes pixel-level ray sampling by segmenting the error map into multiple superpixels and dynamically updating their errors during training to increase ray sampling in superpixel areas with higher rendering errors. Compared with other methods, our approach leverages the flexibility of superpixels, effectively reducing redundant sampling while considering local information. Our method not only speeds up the learning process but also enhances the rendering quality learned from large inputs. We conduct extensive experiments to evaluate the effectiveness of our method across several baselines and datasets. The code will be released.
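
The sampling scheme in the abstract above can be sketched in a few lines. The sketch assumes a superpixel label per pixel is already available (the segmentation step itself is not shown), draws superpixels with probability proportional to their current error, picks a pixel uniformly inside the chosen superpixel, and refreshes errors with an exponential moving average. The function names, the moving-average update, and the `momentum` parameter are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def group_pixels(labels: Sequence[int]) -> Dict[int, List[int]]:
    """Map each superpixel id to the flat pixel indices it contains."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for pixel, label in enumerate(labels):
        groups[label].append(pixel)
    return dict(groups)


def sample_rays(groups: Dict[int, List[int]], errors: Dict[int, float],
                num_rays: int, rng: random.Random) -> List[int]:
    """Sample pixel indices, favoring superpixels with higher error.

    At least one error must be positive, otherwise random.choices raises.
    """
    ids = sorted(groups)
    weights = [errors[i] for i in ids]
    chosen = rng.choices(ids, weights=weights, k=num_rays)
    return [rng.choice(groups[i]) for i in chosen]


def update_errors(errors: Dict[int, float], sampled_labels: Sequence[int],
                  sampled_errors: Sequence[float],
                  momentum: float = 0.9) -> Dict[int, float]:
    """Blend newly observed rendering errors into the per-superpixel errors."""
    updated = dict(errors)
    for label, err in zip(sampled_labels, sampled_errors):
        updated[label] = momentum * updated[label] + (1.0 - momentum) * err
    return updated


if __name__ == "__main__":
    labels = [0, 0, 1, 1]            # 4 pixels, 2 superpixels
    groups = group_pixels(labels)
    errors = {0: 0.1, 1: 0.9}        # superpixel 1 renders worse
    rays = sample_rays(groups, errors, num_rays=8, rng=random.Random(0))
    print(rays)
```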

Abstract:
Spiking Neural Networks (SNNs) have great advantages in discrete event data processing because of their binary digital computation form. However, due to the limitations of current SNN structures, the original event data needs to be preprocessed to reduce the number of time steps and the information redundancy. The traditional methods of dividing data into frames lead to the loss of a large amount of temporal information. In this paper, we propose an efficient Recurrent Spiking Neural Network (RSNN) that reduces the time-domain information loss of the original slice samples, using spiking-based neural dynamics to process dynamic spatial-temporal information. In the RSNN model, a recurrent structure preprocesses the slices before they are input into the spiking structure, enhancing the temporal correlation between slices. In addition, to match the two-dimensional spatial structure of data sample frames efficiently, this paper adopts a variant of the recurrent neural network named Convolution LSTM (CONLSTM). Through experiments on event-based datasets such as DVS128-Gesture and CIFAR10-DVS, we find that the proposed model not only performs better than some other spiking-based models but also reduces energy and power consumption, which paves the way for practical applications of neuromorphic hardware.

Abstract:
Video colorization is a challenging task that demands structural stability, continuity, and detail control in the colors produced. In this paper, based on a pretrained text-to-image model, we introduce the Gated Color Guidance (GCG) module, enabling the model to adaptively perform color propagation or generation according to the structural differences between reference and grayscale frames. Based on this multifunctionality, we propose a novel two-stage coloring strategy. In the first stage, under the reference-mask condition, the model autonomously and jointly colors input keyframes in a one-to-many color domain mapping, while temporal coherence constraints are emphasized by modifying the attention mechanism. In the second stage, under the reference-guided condition, the model effectively captures the colors of matching structures in the reference, and we further introduce the Sliding Reference Grid (SRG) strategy to merge and extract the color features from multiple frames, providing more stable coloring for the grayscale frames. Through this pipeline, we can achieve high-quality and stable video coloring while maintaining the accuracy of detailed colors. Additionally, the two-stage strategy is flexible and detachable, allowing users to adjust the number of selected reference frames to balance coloring quality and efficiency. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art models in both qualitative comparison and quantitative measurement.

Abstract:
Non-Fungible Tokens (NFTs) have emerged as a pivotal digital asset, offering authenticated ownership of unique digital content. Despite gaining remarkable traction, NFTs face pressing storage and verification challenges stemming from blockchain's permanent data costs. Existing off-chain or centralized storage solutions, while viable alternatives, introduce notable security vulnerabilities. We present SemNFT, an innovative decentralized framework integrated with blockchain oracle middleware services, addressing these persistent NFT dilemmas. Our approach compresses NFT source data into compact embeddings encapsulating semantic essence. These embedding arrays are stored on-chain, facilitating reliable decentralized image reconstruction and ownership verification. We implemented ERC721-compliant smart contracts with supplementary functionalities, demonstrating SemNFT's seamless integration within the ecosystem. Extensive evaluations show marked storage savings while preserving the requisite visual fidelity in comparison with existing solutions. The proposed SemNFT framework marks a significant advancement in holistically confronting rising NFT storage and verification challenges without compromising decentralization. It substantively propels the meaningful evolution of NFT infrastructure to achieve digital asset immortality.

Abstract:
Graphical User Interface (GUI) automation has shown significant potential recently. Previous works built GUI agent systems to handle short-procedure tasks such as element grounding or functional assistance. In this paper, we propose a novel PC-Copilot, AssistEditor, that focuses on automating the video editing workflow. Unlike previous approaches, our system does not require users to input specific commands to control the computer. Instead, users simply describe their requirements, such as the content and style of the video, and upload the necessary materials. The system then autonomously translates these requirements into detailed actions for controlling video understanding models and professional video editing software, e.g., Premiere Pro, to produce the final video. This functionality is enabled by a collaborative AI agent framework of multiple GUI agents, each capable of dialogue, knowledge retrieval, and software usage. These agents have distinct roles, including interacting with users to gather requirements, generating storyboards, and performing editing tasks. This approach significantly streamlines the video editing process, making advanced editing accessible to users with varying levels of expertise.

Abstract:
In the domain of video generation, text-to-video suffers from a notable application gap due to the lack of audio that harmonizes with the visual content. Current solutions typically perform dubbing based solely on the original text used to generate the video, which causes a mismatch between the video content and audio details that stems primarily from a lack of understanding of the video's visual modality. Leveraging advancements in multimodal large language models and LLM-based agents, we propose MAF-ID, a multi-agent interactive framework for video dubbing based on deep video understanding. MAF-ID achieves agent collaboration through the autonomous interaction of three agents, which capture a deep understanding of the video's visual content from macro to micro and progressively generate sound effects, voice-overs, and background music adapted to the video. By deeply aligning text, video, and audio modalities, our method significantly enhances the fine-grained coordination between video and audio, making it widely applicable to AI-generated videos, VLOGs, and other video production scenarios requiring dubbing.

Abstract:
We developed an application that can easily calculate the nutritional content of a meal by utilizing our multimedia recipe dataset tied to the Nutrition Facts table and an ingredient estimation model. A CLIP-based image recognition model and an ingredient co-occurrence graph assist a user in selecting the appropriate ingredient from over 2,500 ingredients. Unlike traditional food applications, ours calculates nutrition from a list of ingredients, allowing the user to see how each ingredient affects the overall nutrition. The user adjusts the amount of each ingredient and knows how to change their meal to meet their body's needs.
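
The ingredient-level calculation can be sketched as follows. The table entries are placeholder values chosen for illustration (not real Nutrition Facts data), and the function name and data layout are hypothetical.

```python
# Placeholder per-100 g values for illustration only, not real nutrition data.
NUTRITION_PER_100G = {
    "rice": {"kcal": 130.0, "protein_g": 2.7},
    "egg": {"kcal": 150.0, "protein_g": 13.0},
}

def meal_nutrition(ingredients, table=NUTRITION_PER_100G):
    """Return per-ingredient contributions and meal totals.

    ingredients: list of (name, grams) pairs chosen by the user.
    """
    per_ingredient = {}
    total = {}
    for name, grams in ingredients:
        # Scale the per-100 g entry by the amount actually used.
        contribution = {k: v * grams / 100.0 for k, v in table[name].items()}
        per_ingredient[name] = contribution
        for nutrient, value in contribution.items():
            total[nutrient] = total.get(nutrient, 0.0) + value
    return per_ingredient, total
```

Keeping the per-ingredient contributions alongside the totals is what lets a user see how changing one amount shifts the overall nutrition.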

Abstract:
The Social Media Prediction (SMP) challenge focuses on predicting the popularity of an online post in social media. Social media contains multimodal information, such as texts, images, user IDs, timestamps, locations, etc. Many studies have presented various feature extraction methods to retrieve the multimodal features and used the retrieved features to predict the popularity of posts. However, they often overlook the phenomenon that the posts made by the same user tend to have similar popularity. In this paper, we propose the MultiModal Framework, called MMF, our winning solution to the Social Media Prediction (SMP) Challenge 2024. MMF derives the post representations by considering the interaction relationships and inter-relationships among the extracted multimodal features. Also, it introduces the pseudo labels to consider the phenomenon that the posts made by the same user tend to have similar popularity and learn the popularity distributions of users in the test dataset. Therefore, MMF can simultaneously learn the relationships between post representations and true labels from the data of the training dataset, and users' popularity distributions from the data of the test dataset. The extensive experiments on the Social Media Prediction Dataset show that our proposed framework outperforms the compared models in terms of Spearman's rank correlation and mean absolute error.
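
For reference, the two evaluation metrics named above can be computed as in this minimal sketch. It uses average ranks for ties; the helper names are chosen for this example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(predicted, actual):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(actual))

def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

Spearman's correlation rewards getting the popularity ordering right, while mean absolute error penalizes the size of each prediction error.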

Abstract:
Micro-expressions, as a type of facial expression corresponding to macro-expressions, usually have a short duration and low intensity. Due to these characteristics, micro-expression spotting holds significant value in medical care and public safety. Recent years have witnessed advancements in micro-expression spotting methods; however, spotting micro-expressions remains a challenging task due to their brief duration and low intensity. In this paper, we propose a micro-expression spotting method based on optical flow features with boundary calibration. We first perform face detection, cropping, and alignment on images containing faces. Then, regions of interest (ROIs) are defined, and optical flow features are extracted. Furthermore, candidate expression segments are identified based on the magnitude of the processed optical flows. Finally, a boundary calibration module is utilized to calibrate the boundaries. The effectiveness of the proposed method is evaluated on the MEGC2024 test set, resulting in an overall F1-score of 0.27.
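
The candidate-segment step can be pictured with a small sketch that thresholds a per-frame optical-flow magnitude and keeps sufficiently long runs. The fixed threshold and the `min_len` parameter are assumptions for illustration; the abstract does not specify how candidates are selected.

```python
def candidate_segments(magnitudes, threshold, min_len=2):
    """Return (start, end) frame index pairs where magnitude stays above threshold.

    magnitudes: per-frame optical-flow magnitude (already processed).
    min_len: hypothetical minimum run length, in frames, for a segment to count.
    """
    segments = []
    start = None
    for i, value in enumerate(magnitudes):
        if value > threshold and start is None:
            start = i  # a run begins
        elif value <= threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None  # the run ends
    # Close a run that extends to the last frame.
    if start is not None and len(magnitudes) - start >= min_len:
        segments.append((start, len(magnitudes) - 1))
    return segments
```

A boundary calibration step, as described in the abstract, would then adjust the start and end of each returned segment.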

Abstract:
E-commerce has emerged as a significant endeavour in which technological advancements influence the shopping experience. Simultaneously, the metaverse is the next breakthrough to transform multimedia engagement. However, in such settings, deceptive designs aimed at steering users into making desired choices might be more successful. This paper proposes the design space of manipulative techniques in e-commerce applications for the metaverse. We construct our arguments by evaluating user interaction with manipulative design in metaverse shopping experiences, followed by a survey among users to understand the effect of counteracting manipulative e-commerce scenarios. Our findings can inform the understanding of design guidelines for metaverse e-commerce experiences and point to opportunities to improve user awareness of manipulative experiences.

Abstract:
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse modalities and has received widespread attention in dialogue systems. Despite the promising advancements in complex fusion mechanisms and architecture designs, challenges remain due to: (1) various noise and redundancy in both visual and audio modalities and (2) long-tailed distributions of intent categories. In this paper, to tackle these two issues, we propose InMu-Net, a simple yet effective framework for MID from the Information bottleneck and Multi-sensory processing perspective. Our contributions lie in three aspects. First, we devise a denoising bottleneck module to filter out the intent-irrelevant information in the fused feature. Second, we introduce a saliency preservation loss to prevent the dropping of intent-relevant information. Finally, kurtosis regulation is introduced to maintain representation smoothness during the filtering process, mitigating the adverse impact of the long-tailed distribution. Comprehensive experiments on two MID benchmark datasets demonstrate the effectiveness of InMu-Net and its vital components. Impressively, a series of analyses reveal its denoising potential and robustness in low-resource, modality corruption, cross-architecture and cross-task scenarios.

Abstract:
The technique of 3D Gaussian splatting (3DGS) has demonstrated its effectiveness and efficiency in rendering photo-realistic images for novel view synthesis. However, 3DGS requires a high density of camera coverage, and its performance inevitably degrades with sparse training views, which significantly limits its applicability in real-world scenarios. In recent years, many researchers have explored the use of depth information to alleviate this problem, but the performance of their methods is sensitive to the accuracy of depth estimation. To this end, we propose an efficient method to enhance the performance of 3DGS with sparse training views. Specifically, instead of applying depth maps for regularization, we propose a densification method that generates high-quality point clouds, providing a superior initialization for 3D Gaussians. Furthermore, we propose Systematically Angle of View Sampling (SAOVS), which employs Spherical Linear Interpolation (SLERP) and linear interpolation for side view sampling, to determine unseen views outside the training data for semantic pseudo-label regularization. Experiments show that our proposed method significantly outperforms other leading 3D rendering models on the ScanNet dataset and the LLFF dataset. In particular, compared with the conventional 3DGS method, our proposed method achieves performance gains of up to 1.71dB in PSNR and 0.07 in SSIM. In addition, the novel view synthesis produced by our method demonstrates the highest visual quality with minimal distortions.
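
To make the interpolation step concrete, here is a minimal sketch of SLERP between two camera orientations, with plain linear interpolation for positions. It assumes orientations are unit quaternions in (w, x, y, z) order; how SAOVS actually parameterizes and samples views is not specified in the abstract.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical orientations: linear blend avoids dividing by ~0.
        out = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)

def lerp(p0, p1, t):
    """Linear interpolation between two camera positions."""
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))
```

Interpolating halfway between the identity and a 90 degree rotation about the z-axis yields a 45 degree rotation about the same axis, which is the constant-angular-speed behavior that makes SLERP suitable for generating in-between viewing directions.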

Abstract:
Open-vocabulary human-object interaction (Ov-HOI) detection aims to identify both base and novel categories of human-object interactions while only base categories are available during training. Existing Ov-HOI methods commonly leverage knowledge distilled from CLIP to extend their ability to detect previously unseen interaction categories. However, our empirical observations indicate that the inherent noise present in CLIP has a detrimental effect on HOI prediction. Moreover, the absence of novel human-object position distributions often leads to overfitting on the base categories within their learned queries. To address these issues, we propose a two-step framework named CaM-LQ, which Calibrates visual-language Models (e.g., CLIP) for open-vocabulary HOI detection with Locality-aware Queries. By injecting the fine-grained HOI supervision from the calibrated CLIP into the HOI decoder, our model can predict novel interactions. Extensive experimental results demonstrate that our approach performs well in open-vocabulary human-object interaction detection, surpassing state-of-the-art methods across multiple metrics on mainstream datasets, e.g., with a 4.54-point improvement on the HICO-DET dataset over the SoTA CLIP4HOI on the UV task with the same ResNet-50 backbone.

Abstract:
Multi-modality image fusion (MMIF) aims to integrate the complementary features of source images into the fused image, including target saliency and texture specifics. Recently, image fusion methods leveraging diffusion models have demonstrated commendable results. Despite their strengths, diffusion models reduce the capability to perceive local features. Additionally, their inherent working mechanism, introducing noise to the inputs, consequently leads to a loss of original information. To overcome this problem, we propose a novel Diffusion-CNN feature Aggregation Fusion (DCAFuse) network that can extract complementary features from the dual branches and aggregate them effectively. Specifically, we utilize the denoising diffusion probabilistic model (DDPM) in the diffusion-based branch to construct global information, and multi-scale convolutional kernels in the CNN-based branch to extract local detailed features. Afterward, we design a novel complementary feature aggregation module (CFAM). By constructing coordinate attention maps for features, CFAM captures long-range dependencies in both horizontal and vertical directions, thereby dynamically guiding the aggregation weights of branches. In addition, to further improve the complementarity of dual-branch features, we introduce a novel loss function based on cosine similarity and a unique denoising timestep selection strategy. Extensive experimental results show that our proposed DCAFuse outperforms other state-of-the-art methods in multiple image fusion tasks, including infrared and visible image fusion (IVF) and medical image fusion (MIF).

Abstract:
Existing few-shot learning methods generally focus on designing exquisite meta-learner structures that learn task-specific priors to improve the discriminative ability of global embeddings. However, they often ignore the importance of learning stability in meta-training, making it difficult to obtain a relatively optimal model. From this key observation, we propose an innovative generic differentiable Reinforcement Learning (RL) strategy for few-shot classification. It aims to explore stable meta-optimization patterns in meta-training by learning generalizable optimizations for producing task-adaptive embeddings. Accordingly, our differentiable RL strategy models the embedding procedure of feature transformation layers in the meta-learner to optimize the gradient flow implicitly. Also, we propose a memory module to associate historical and current task states and actions for exploring inter-task similarity. Notably, our RL-based strategy can be easily extended to various backbones. In addition, we propose a novel task state encoder to encode task representation, which fully explores inner-task similarities between the support set and the query set. Extensive experiments verify that our approach can improve the performance of different backbones and achieve promising results against state-of-the-art methods in few-shot classification.

Abstract:
Mesh denoising is a fundamental task in geometry processing, and recent studies have demonstrated the remarkable superiority of deep learning-based methods in this field. However, existing works commonly rely on neural networks without explicit designs for noise and geometry which are actually fundamental factors in mesh denoising. In this paper, by jointly considering noise intensity and geometric characteristics, a novel Filtering Coefficient Learner (FCL for short) for mesh denoising is developed, which delicately generates coefficients to filter face normals. Specifically, FCL produces filtering coefficients consisting of a noise-aware component and a geometry-aware component. The first component is inversely proportional to the noise intensity of each face, resulting in smaller coefficients for faces with stronger noise. For the effective assessment of the noise intensity, a noise intensity estimation module is designed, which predicts the angle between paired noisy-clean normals based on a mean filtering angle. The second component is derived based on two types of geometric features, namely the category feature and face-wise features. The category feature provides a global description of the input patch, while the face-wise features complement the perception of local textures. Extensive experiments have validated the superior performance of FCL over SOTA works in both noise removal and feature preservation.

Abstract:
When applying high-level visual algorithms to rainy scenes, it is customary to preprocess the rainy images using low-level rain removal networks, followed by visual networks to achieve the desired objectives. Such a setting has never been explored by adversarial attack methods, which are limited to attacking only one of these two kinds of networks. Considering the lack of multi-functional attacking strategies and their significance for open-world perception scenarios, we are the first to propose a Cascaded Adversarial Attack (CAA) setting, where the adversarial example can simultaneously attack tasks at different levels, such as rain removal and semantic segmentation in an integrated system. Specifically, our attack on the rain removal network aims to preserve rain streaks in the output image, while for the semantic segmentation network, we employ powerful existing adversarial attack methods to induce misclassification of the image content. Importantly, CAA utilizes binary masks to concentrate these two significantly disparate perturbation distributions on the input image, enabling attacks on both networks. Additionally, we propose two variants of CAA, which minimize the differences between the two generated perturbations by introducing a carefully designed perturbation interaction mechanism, resulting in enhanced attack performance. Extensive experiments validate the effectiveness of our methods, demonstrating their superior ability to significantly degrade the performance of the downstream task compared to methods that solely attack a single network.
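
One way to picture the role of a binary mask is the sketch below, which picks one of two perturbations per pixel and keeps the result within an L-infinity budget. This is only an assumed reading of the abstract: the function names, the per-pixel selection rule, and the `eps` budget are hypothetical and may differ from how CAA builds and applies its masks.

```python
def clip(value, low, high):
    return max(low, min(high, value))

def merge_perturbations(image, delta_derain, delta_seg, mask, eps=8 / 255):
    """Combine two perturbations on one grayscale image using a binary mask.

    image, delta_derain, delta_seg, mask: 2D lists of equal shape.
    mask value 1 selects the rain-removal perturbation, 0 the segmentation one.
    eps: hypothetical L-infinity budget for the perturbation.
    """
    adversarial = []
    for img_row, d_row, s_row, m_row in zip(image, delta_derain, delta_seg, mask):
        row = []
        for pixel, d, s, m in zip(img_row, d_row, s_row, m_row):
            delta = clip(d if m == 1 else s, -eps, eps)  # stay within budget
            row.append(clip(pixel + delta, 0.0, 1.0))    # stay a valid pixel
        adversarial.append(row)
    return adversarial
```

Because each pixel carries exactly one of the two perturbations, both attacks share a single input image without their perturbations being summed.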

Abstract:
Recent generative methods, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DMs), have revolutionized human motion synthesis and gained significant attention in the human motion field. However, there are still challenges in unconditionally generating highly diverse human motions from a given distribution. To enhance the diversity of synthesized human motions, previous methods usually employ deep neural networks (DNNs) to train a transport map that transforms a Gaussian noise distribution into the real human motion distribution. According to Figalli's regularity theory, the optimal transport map frequently exhibits discontinuities, whereas DNNs are inherently limited to representing continuous maps. Consequently, the generated human motions tend to heavily concentrate on densely populated regions of the data distribution, resulting in mode collapse or mode mixture. To address these issues, we propose an efficient method called MOOT for unconditional human motion synthesis. First, we utilize a reconstruction network based on GRU and transformer to map human motions to latent space. Next, we employ convex optimization to match the noise distribution with the latent space distribution of human motions through the Optimal Transport (OT) map. Then, we combine the extended OT map with the generator of the reconstruction network to generate new human motions, thereby overcoming the issues of mode collapse and mode mixture. MOOT generates a latent code distribution that is well-behaved and highly structured, providing a strong motion prior for various applications in the field of human motion. In qualitative and quantitative experiments, MOOT achieves state-of-the-art results surpassing the latest methods, validating its superiority in unconditional human motion generation.

Abstract:
Amidst the prevailing trend of escalating demands for data and computational resources, the efficiency of data utilization emerges as a critical lever for enhancing the performance of deep learning models, especially in the realm of image restoration tasks. This investigation delves into the intricacies of data efficiency in the context of image restoration, with Gaussian image denoising serving as a case study. We postulate a strong correlation between the model's performance and the content information encapsulated in the training images. This hypothesis is rigorously tested through experiments conducted on synthetically blurred datasets. Building on this premise, we delve into the data efficiency within training datasets and introduce an effective and stabilized method for quantifying content information, thereby enabling the ranking of training images based on their influence. Our in-depth analysis sheds light on the impact of various subset selection strategies, informed by this ranking, on model performance. Furthermore, we examine the transferability of these efficient subsets across disparate network architectures. The findings underscore the potential to achieve comparable, if not superior, performance with a fraction of the data, highlighting instances where training IRCNN and Restormer models with only 3.89% and 2.30% of the data resulted in a negligible drop and, in some cases, a slight improvement in PSNR. This investigation offers valuable insights and methodologies to address data efficiency challenges in Gaussian denoising. Similarly, our method yields comparable conclusions in other restoration tasks. We believe this will be beneficial for future research.

Abstract:
Unsupervised domain adaptation (UDA) aims to adapt a model trained on the source domain (e.g., synthetic data) to the target domain (e.g., real-world data) without requiring further annotations on the target domain. Most previous UDA methods for semantic segmentation focus on minimizing the domain discrepancy at various levels, e.g., pixels and features, to extract domain-invariant knowledge. However, the primary domain knowledge, such as context and detail correlation, remains underexplored. To address this problem, we propose a context- and detail-enhanced unsupervised learning framework, called CDEA, for domain adaptive semantic segmentation that facilitates image detail correlations and contextual semantic consistency. Firstly, we propose an adaptive masked image consistency module to enhance UDA by learning spatial context relations of the target domain, which enforces the consistency between predictions and masked target images. Secondly, we propose a detail extraction module to enhance UDA by integrating the learning of spatial information into low-level layers, which fuses the low-level detail features with deep semantic features. Extensive experiments verify the effectiveness of the proposed method and demonstrate the superiority of our approach over state-of-the-art methods.

Abstract:
Recent advancements in text-to-image generative models have showcased remarkable capabilities across various tasks. However, these powerful models have revealed the inherent risks of social biases. Such biases can propagate distorted real-world perspectives and spread unforeseen prejudice and discrimination. Current debiasing methods are primarily designed for scenarios with a single individual in the image and exhibit homogeneous race or gender when multiple individuals are involved, harming the diversity of social groups within the image. To address this problem, we consider the semantic consistency between text prompts and generated images in text-to-image diffusion models to identify how biases are generated. We propose a novel method to locate where the biases are based on different tokens and then mitigate them for each individual. Specifically, we introduce a Linguistic-aligned Attention Guidance module consisting of Block Voting and Linguistic Alignment, to effectively locate the semantic regions related to biases. Additionally, we employ Fair Inference in these regions to generate fair attributes across arbitrary distributions while preserving the original structural and semantic information. Extensive experiments and analyses demonstrate that our method outperforms existing methods for debiasing with multiple individuals across various scenarios.

Abstract:
Text-to-image (T2I) generation is a core interest within the realm of AI content generation. Amid the swift advancements of both open-source (such as Stable Diffusion) and proprietary (for example, DALLE, MidJourney) T2I models, there is a notable absence of a comprehensive and robust quantitative framework for evaluating their output quality. Traditional methods of quality assessment overlook the textual prompts when judging images; meanwhile, the advent of large multi-modal models (LMMs) introduces the capability to incorporate text prompts in evaluations, yet the challenge of fine-tuning these models for precise T2I quality assessment remains unresolved. In our study, we introduce the T2I-Scorer, a novel two-stage training methodology aimed at fine-tuning LMMs for T2I evaluation. For the first stage, we collect 397K GPT-4V-labeled question-answer pairs related to T2I evaluation. Termed T2I-ITD, the pseudo-labeled dataset is analyzed and examined by humans and used for instruction tuning to improve the LMM's low-level quality perception. The first-stage model, T2I-Scorer-IT, reaches higher accuracy on T2I evaluation than all kinds of existing T2I metrics under zero-shot settings. For the second stage, we define an explicit multi-task training scheme to further align the LMM with human opinion scores, and the fine-tuned T2I-Scorer reaches state-of-the-art accuracy on both image quality and image-text alignment perspectives with significant improvements. We anticipate that the proposed T2I-Scorer can serve as a reliable metric to gauge the ability of T2I generation models in the future. We will make code, data, and weights publicly available.

Abstract:
Image compression for machine vision exhibits varying rate-accuracy performance across different downstream tasks and content types. Efficient utilization of constrained network resources to achieve optimal overall task performance has thus recently attracted growing attention. In this paper, we propose Tombo, a task-oriented image compression and transmission framework that efficiently identifies the optimal encoding bitrate and routing scheme for multiple image bitstreams delivered simultaneously for different downstream tasks. Specifically, we study the characteristics of image rate-accuracy performance for different machine vision tasks, and formulate the task-oriented joint bitrate and routing optimization problem for multi-bitstreams as a multi-commodity network flow problem with time-expanded network modeling. To ensure consistency between the encoding bitrate and routing optimization, we also propose an augmented network that incorporates the encoding bitrate variables into the routing variables. To improve computational efficiency, we further convert the original optimization problem to a multi-marginal optimal transport problem, and adopt a Sinkhorn iteration-based algorithm to quickly obtain a near-optimal solution. Finally, we adapt Tombo to efficiently deal with the dynamic network scenario where link capacities may fluctuate over time. Empirical evaluations on three typical machine vision tasks and four real-world network topologies demonstrate that Tombo achieves performance comparable to the optimal solution obtained by the off-the-shelf solver Gurobi, with a 5× to 114× speedup.
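
For readers unfamiliar with Sinkhorn iterations, the sketch below shows the standard two-marginal, entropy-regularized form. It is a simplification: the abstract describes a multi-marginal problem, and the regularization strength `reg` and iteration count here are assumed values.

```python
import math

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan between histograms a and b.

    cost: n x m matrix (list of lists) of transport costs.
    a, b: source and target marginals, each summing to 1.
    reg: regularization strength (smaller values approach the exact plan).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each iteration is a pair of matrix-vector scalings, which is why this family of algorithms tends to be much faster than solving the underlying linear program exactly.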

Abstract:
In the domain of generative multimedia and interactive experiences, generating realistic and accurate full-body poses from sparse tracking is crucial for many real-world applications, while achieving sequence modeling and efficient motion generation remains challenging. Recently, state space models (SSMs) with efficient hardware-aware designs (e.g., Mamba) have shown great potential for sequence modeling, particularly in temporal contexts. However, processing motion data is still challenging for SSMs. Specifically, the sparsity of input conditions makes motion generation an ill-posed problem. Moreover, the complex structure of the human body further complicates this task. To address these issues, we present Motion Mamba Diffusion (MMD), a novel conditional diffusion model, which effectively utilizes the sequence modeling capability of SSMs and the robust generation ability of diffusion models to track full-body poses accurately. In particular, we design a bidirectional Temporal Mamba Module (TMM) to model motion sequences. Additionally, a Spatial Mamba Module (SMM) is proposed for feature enhancement within a single frame. Extensive experiments on the large motion capture dataset (AMASS) demonstrate that our proposed approach outperforms the latest methods in terms of accuracy and smoothness, thus providing a crucial advancement for creating realistic virtual avatars in various applications.

Abstract:
In Large Language Models (LLMs), text generation that involves knowledge representation is often fraught with the risk of "hallucinations", where models confidently produce erroneous or fabricated content. These inaccuracies often stem from intrinsic biases in the pre-training stage or from the incorporation of human preference biases during the fine-tuning process. To mitigate these issues, we take inspiration from Goldman's causal theory of knowledge, which asserts that knowledge is not merely about having a true belief but also involves a causal connection between the belief and the truth of the proposition. We instantiate this theory within the context of Knowledge Question Answering (KQA) by constructing a causal graph that delineates the pathways between the candidate knowledge and belief. Through the application of the do-calculus rules from structural causal models, we devise an unbiased estimation framework based on this causal graph, thereby establishing a methodology for knowledge modeling grounded in causal inference. The resulting CORE framework (short for "Causal knOwledge REasoning") comprises four essential components: question answering, causal reasoning, belief scoring, and refinement. Together, they synergistically improve the KQA system by fostering faithful reasoning and introspection. Extensive experiments are conducted on ScienceQA and HotpotQA datasets, which demonstrate the effectiveness and rationality of the CORE framework.

Abstract:
Label noise, an inevitable issue in various real-world datasets, tends to impair the performance of deep neural networks. A large body of literature focuses on symmetric co-training, aiming to enhance model robustness by exploiting interactions between models with distinct capabilities. However, the symmetric training processes employed in existing methods often culminate in model consensus, diminishing their efficacy in handling noisy labels. To this end, we propose an Asymmetric Co-Training (ACT) method to mitigate the detrimental effects of label noise. Specifically, we introduce an asymmetric training framework in which one model (i.e., RTM) is robustly trained with a selected subset of clean samples while the other (i.e., NTM) is conventionally trained using the entire training set. We propose two novel criteria based on agreement and discrepancy between models, establishing asymmetric sample selection and mining. Moreover, a metric, derived from the divergence between models, is devised to quantify label memorization, guiding our method in determining the optimal stopping point for sample mining. Finally, we propose to dynamically re-weight identified clean samples according to their reliability inferred from historical information. We additionally employ consistency regularization to achieve further performance improvement. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our method.

Abstract:
Transformer-based encoders that encode both region and grid features are the preferred choice for the image captioning task due to their multi-head self-attention mechanism. This mechanism ensures superior capture of relationships and contextual information between various regions in an image. However, because of the Transformer block stacking, self-attention computes the visual features several times, increasing computing costs and producing a great deal of redundant feature calculation. In this paper, we propose a novel Distilled Cross-Combination Transformer (DCCT) network. Specifically, we first design a distillation cascade fusion encoder (DCFE) to filter out redundant visual features that affect attentional focus, obtaining refined features. Additionally, we introduce a parallel cross-fusion attention module (PCFA) that fully utilizes the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT strategy outperforms many state-of-the-art techniques and attains exceptional performance.

Abstract:
Medical image segmentation is of great significance to disease diagnosis and treatment planning. Despite considerable progress, most existing methods (1) pay insufficient attention to suppressing background noise disturbance that impacts segmentation accuracy and (2) are not efficient enough, especially for high-resolution images. To address the two challenges, we turn to a traditional de-noising method and a new efficient network structure and propose BSBP-RWKV for accurate and efficient medical image segmentation. Specifically, we combine the advantages of Perona-Malik Diffusion (PMD) in noise suppression without losing boundary details and RWKV in its efficient structure, and devise the DWT-PMD RWKV Block across one of our encoder branches to preserve boundary details of lesion areas while suppressing background noise disturbance in an efficient structure. Then we feed the de-noised lesion boundary cues to our proposed Multi-Step Runge-Kutta convolutional Block to supplement the cues with more local details. We also propose a novel loss function for shape refinement that can align the shape of predicted lesion areas with GT masks in both spatial and frequency domains. Experiments on ISIC 2016 and Kvasir-SEG show the superior accuracy and efficiency of our BSBP-RWKV. Specifically, BSBP-RWKV reduces complexity by a factor of 5.8 compared with the SOTA while also cutting down GPU memory usage by over 62.7% for each 1024 × 1024 image during inference.
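
To make the edge-preserving behavior of Perona-Malik Diffusion referenced above more concrete, here is a minimal sketch of the classic explicit scheme on a 1D signal. It is not the DWT-PMD RWKV Block from the paper: the function name `perona_malik_1d`, the exponential conductance function, and the values of `kappa`, `dt`, and `steps` are assumptions chosen for illustration.

```python
import math

def perona_malik_1d(signal, steps=20, kappa=1.0, dt=0.2):
    """Explicit 1D Perona-Malik diffusion with conductance g(d) = exp(-(d / kappa)^2)."""
    u = list(signal)
    for _ in range(steps):
        new = u[:]
        for i in range(len(u)):
            # Differences to the left and right neighbours (zero flux at the borders).
            d_left = u[i - 1] - u[i] if i > 0 else 0.0
            d_right = u[i + 1] - u[i] if i < len(u) - 1 else 0.0
            # Conductance shrinks for large differences, so strong edges diffuse less.
            g_left = math.exp(-(d_left / kappa) ** 2)
            g_right = math.exp(-(d_right / kappa) ** 2)
            new[i] = u[i] + dt * (g_left * d_left + g_right * d_right)
        u = new
    return u

# Hypothetical noisy step edge: small fluctuations around 0 and around 10.
noisy_step = [0.1, -0.1, 0.05, 0.0, 10.0, 10.1, 9.9, 10.05]
smoothed = perona_malik_1d(noisy_step)
```

In this toy signal the small fluctuations on either side of the jump are averaged out, while the jump itself is left almost untouched because its conductance is close to zero.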

Abstract:
The prevalence of multimedia applications has led to increased concerns and demand for auto face retouching. Face retouching aims to enhance portrait quality by removing blemishes. However, the existing auto-retouching methods rely heavily on a large amount of paired training samples, and perform less satisfactorily when handling complex and unusual blemishes. To address this issue, we propose a Language-guided Blemish Removal Transformer for automatically retouching face images, while at the same time reducing the dependency of the model on paired training data. Our model is referred to as LangBRT, which leverages vision-language pre-training for precise facial blemish removal. Specifically, we design a text-prompted blemish detection module that indicates the regions to be edited. The priors not only enable the transformer network to handle specific blemishes in certain areas, but also reduce the reliance on retouching training data. Further, we adopt a target-aware cross attention mechanism, such that the blemish-like regions are edited accurately while at the same time maintaining the normal skin regions unchanged. Finally, we adopt a regularization approach to encourage the semantic consistency between the synthesized image and the text description of the desired retouching outcome. Extensive experiments are performed to demonstrate the superior performance of LangBRT over competing auto-retouching methods in terms of dependency on training data, blemish detection accuracy and synthesis quality.

Abstract:
Few-shot semantic segmentation has considerable potential for low-data scenarios, especially for medical images that require expert-level dense annotations. Existing few-shot medical image segmentation methods strive to deal with the task by means of prototype learning. However, this scheme relies on support prototypes to guide the segmentation of query images, ignoring the rich anatomical prior knowledge in medical images, which hinders effective feature enhancement for medical images. In this paper, we propose an anatomical prior guided spatial contrastive learning, called APSCL, which exploits anatomical prior knowledge derived from medical images to construct contrastive learning from a spatial perspective for few-shot medical image segmentation. The new framework forces the model to learn the features in line with the embedded anatomical representations. Besides, to fully exploit the guidance information of the support samples, we design a mutual guidance decoder to predict the label of each pixel in the query image. Furthermore, our APSCL can be trained end-to-end in the form of episodic training. Comprehensive experiments on three challenging medical image datasets, i.e., CHAOS-T2, MS-CMRSeg, and Synapse, prove that our method significantly surpasses state-of-the-art few-shot medical segmentation methods, with a mean improvement of 3.61%, 2.30%, and 6.38% on the Dice score, respectively.

Abstract:
Federated learning addresses privacy concerns in multimedia recommender systems by enabling collaborative model training without exchanging raw data. However, existing federated recommendation models are mainly based on basic backbones like Matrix Factorization (MF), which are inadequate to capture complex implicit interactions between users and multimedia content. Graph Convolutional Networks (GCNs) offer a promising method by utilizing the information from high-order neighbors, but face challenges in federated settings due to problems such as over-smoothing, data heterogeneity, and elevated communication expenses. To resolve these problems, we propose a Cluster-driven Personalized Federated Recommender System with Interest-aware Graph Convolution Network (CPF-GCN) for multimedia recommendation. CPF-GCN comprises a local interest-aware GCN module that optimizes node representations through subgraph-enhanced adaptive graph convolution operations, mitigating the over-smoothing problem by adaptively extracting information from layers and selectively utilizing high-order connectivity based on user interests. Simultaneously, a cluster-driven aggregation approach at the server significantly reduces communication costs by selectively aggregating models from clusters. The aggregation produces a global model and cluster-level models; combining them with the user's local model allows us to tailor the recommendation model for the user, achieving personalized recommendations. Moreover, we propose an adversarial optimization technique to further augment the robustness of CPF-GCN. Experiments on three datasets demonstrate that CPF-GCN significantly outperforms the state-of-the-art models.

Abstract:
Multimodal sentiment analysis (MSA) aims to predict sentiment from text, audio, and visual data of videos. Existing works focus on designing fusion strategies or decoupling mechanisms, which suffer from low data utilization and a heavy reliance on large amounts of labeled data. However, acquiring large-scale annotations for multimodal sentiment analysis is extremely labor-intensive and costly. To address this challenge, we propose GRACE, a GRadient-based Active learning method with Curriculum Enhancement, designed for MSA under a multi-task learning framework. Our approach achieves annotation reduction by strategically selecting valuable samples from the unlabeled data pool while maintaining high-performance levels. Specifically, we introduce informativeness and representativeness criteria, calculated from gradient magnitudes and sample distances, to quantify the active value of unlabeled samples. Additionally, an easiness criterion is incorporated to avoid outliers, considering the relationship between modality consistency and sample difficulty. During the learning process, we dynamically balance sample difficulty and active value, guided by the curriculum learning principle. This strategy prioritizes easier, modality-aligned samples for stable initial training, then gradually increases the difficulty by incorporating more challenging samples with modality conflicts. Extensive experiments demonstrate the effectiveness of our approach on both multimodal sentiment regression and classification benchmarks.

Abstract:
Depression recognition (DR) using facial images, audio signals, or language text recordings has achieved remarkable performance. Recently, multimodal DR has shown improved performance over single-modal methods by leveraging information from a combination of these modalities. However, collecting high-quality data containing all modalities poses a challenge. In particular, these methods often encounter performance degradation when certain modalities are either missing or degraded. To tackle this issue, we present a generalizable multimodal framework for DR by aggregating feature disentanglement and privileged knowledge distillation. In detail, our approach aims to disentangle homogeneous and heterogeneous features within multimodal signals while suppressing noise, thereby adaptively aggregating the most informative components for high-quality DR. Subsequently, we leverage knowledge distillation to transfer privileged knowledge from complete modalities to the observed input with limited information, thereby significantly improving the tolerance and compatibility. These strategies form our novel Feature Disentanglement and Privileged knowledge Distillation Network for DR, dubbed Dis2DR. Experimental evaluations on the AVEC 2013, AVEC 2014, AVEC 2017, and AVEC 2019 datasets demonstrate the effectiveness of our Dis2DR method. Remarkably, Dis2DR achieves superior performance even when only a single modality is available, surpassing the existing state-of-the-art multimodal DR approach AVA-DepressNet by up to 9.8% on the AVEC 2013 dataset.

Abstract:
Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation networks. Despite demonstrated success, such networks suffer from some fundamental problems: 1) They contain pooling operations to cope with inherently disordered point clouds, hindering the capture of 3D spatial information that is useful for tracking, a regression task. 2) The adopted set abstraction operation can hardly handle density-inconsistent point clouds, which also prevents 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with a cross-iterative feature fusion module to further explore fine-grained 3D spatial information for tracking. Benefiting from the accurate 3D spatial information being modeled, VoxelTrack simplifies the tracking pipeline to a single regression loss. Extensive experiments are conducted on three widely adopted datasets: KITTI, NuScenes, and Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4%, and 63.6% mean precision on the three datasets, respectively), and outperforms the existing trackers with a real-time speed of 36 FPS on a single TITAN RTX GPU. The source code and model will be released.

Abstract:
In Joint Photographic Experts Group (JPEG) image steganalysis and forensics, the quantization step can reveal the history of image operations. Several methods for estimating the quantization step have been proposed by researchers. However, existing algorithms fail to account for robustness, which limits their application. To solve this problem, we propose a two-stream network structure based on Swin Transformer. The spatial domain features of JPEG images exhibit strong robustness but low accuracy. Conversely, frequency domain features demonstrate high accuracy but weak robustness. Therefore, we design a two-stream network with the multi-scale feature of Swin Transformer to extract spatial domain features with high robustness and frequency domain features with high accuracy, respectively. Furthermore, to adaptively fuse features in both the frequency domain and spatial domain, we design a Spatial-frequency Information Dynamic Fusion (SIDF) module to dynamically allocate weights. Finally, we modify the network from a regression model to a classification model to speed up convergence and improve the accuracy of the algorithm. The experimental results show that the accuracy of the proposed method is higher than 98% on clean images. Meanwhile, in robust environments, the proposed algorithm maintains an average accuracy of over 81%.

Abstract:
Snapshot spectral compressive imaging can capture spectral information across multiple wavelengths in a single shot. The coded aperture snapshot spectral imaging (CASSI) method aims to recover 3D spectral cubes from 2D measurements. Most existing approaches employ a deep unfolding framework based on Transformer, which alternately addresses a data subproblem and a prior subproblem. However, these frameworks lack flexibility regarding the sensing matrix and inter-stage interactions. In addition, the quadratic computational complexity of global Transformer and the restricted receptive field of local Transformer impact reconstruction efficiency and accuracy. In this paper, we propose a dynamic deep unfolding network with Mamba for compressive spectral imaging, called VmambaSCI. We integrate spatial-spectral information from the sensing matrix into the data module and utilize spatial adaptive operations in the stage interaction of the prior module. Furthermore, recognizing that the imaging process causes aliasing of spatial and spectral information, we develop a dual-domain scanning Mamba (DSMamba), featuring a novel spatial-channel scanning method for enhanced efficiency and accuracy. To our knowledge, VmambaSCI is the first Mamba-based model for compressive spectral imaging. Experimental results on the public databases CAVE and KAIST demonstrate the superiority of the proposed VmambaSCI over the state-of-the-art approaches.

Abstract:
Multi-Modal Large Language Models (MM-LLMs) have demonstrated powerful reasoning abilities in various visual question-answering tasks. However, they lack rigorous reasoning and precise arithmetic when solving geometry questions. To address this challenge, we propose a novel prompting method, namely Reason-and-Execute (R&E), to enhance the accuracy of MM-LLMs in solving geometry questions. Specifically, the R&E method includes two templates: a reasoning template and an execution template. We first adopt a reverse-thinking approach to construct a rigorous reasoning template that guides MM-LLMs to start reasoning from the domain knowledge most relevant to the question and ultimately identify the arithmetic requirements. We then make use of program-assisted thought to construct the execution template, which guides MM-LLMs to understand the arithmetic requirements from the reasoning template and generate an executable code block. The answer is finally obtained by executing the code block. We evaluate our prompting method on 9 models in answering questions on 6 datasets (including four geometry datasets and two science datasets) compared to Chain-of-Thought (CoT) and Program-Aided Language (PAL) prompting methods. The R&E method shows up to 12.8% improvement compared to CoT and PAL, demonstrating its strong reasoning and arithmetic abilities for solving geometry questions. Moreover, we further analyze the answering accuracy from different perspectives on solving geometry questions, including domain knowledge, geometry shapes, question length, and language. These analyses show that our method enhances the ability of MM-LLMs to solve geometry questions.

Abstract:
Deep Hashing (DH) has emerged as an indispensable technique for fast image search in recent years. To deploy DH on resource-limited devices, the Binary Neural Network (BNN) offers a solution that significantly reduces computations and parameters compared to CNN. Unfortunately, applying BNN directly to DH will lead to huge performance degradation. To tackle this problem, we first conducted extensive experiments and discovered that the center-based method provides a fundamental guarantee for BNN-DH performance. Subsequently, we delved deeper into the impact of BNNs on center-based methods and revealed two key insights. First, we find reducing the distance between hash codes and hash centers is challenging for BNN-DH compared to CNN-based DH. Second, the evolution of hash code aggregation undergoes two stages in BNN-DH, which is different from CNN-based DH. Based on these findings, we designed a strong and general method called One-bit Deep Hashing (ODH). First, ODH incorporates a semantic self-adaptive hash center module to address the problem of hash codes inadequately converging to their hash centers. Then, it employs a novel two-stage training method to consider the evolution of hash code aggregation. Finally, extensive experiments on two datasets demonstrate that ODH can achieve significant superiority over other BNN-DH models.

Abstract:
As VR devices become increasingly prevalent, live 360-degree video has surged in popularity. However, current live 360-degree video systems heavily rely on uplink bandwidth to deliver high-quality live videos. Recent advancements in neural-enhanced streaming offer a promising solution to this limitation by leveraging server-side computation to conserve bandwidth. Nevertheless, these methods have primarily concentrated on neural enhancement within a single domain (either spatial or temporal), which may not adeptly adapt to diverse video scenarios and fluctuating bandwidth conditions. In this paper, we propose Lumos, a novel spatial-temporal integrated neural-enhanced live 360-degree video streaming system. To accommodate varied video scenarios, we devise a real-time Neural-enhanced Quality Prediction (NQP) model to predict the neural-enhanced quality for different video contents. To cope with varying bandwidth conditions, we design a Content-aware Bitrate Allocator, which dynamically allocates bitrates and selects an appropriate neural enhancement configuration based on the current bandwidth. Moreover, Lumos employs online learning to improve prediction performance and adjust resource utilization to optimize user quality of experience (QoE). Experimental results demonstrate that Lumos surpasses state-of-the-art neural-enhanced systems with an improvement of up to 0.022 in terms of SSIM, translating to an 8.2%-8.5% enhancement in QoE for live stream viewers.

Abstract:
Industrial multimedia recommendation systems extensively utilize cascade architectures to deliver personalized content for users, generally consisting of multiple stages like retrieval and ranking. However, retrieval models have long suffered from Sample Selection Bias (SSB) due to the distribution discrepancy between the exposed items used for model training and the candidates (almost unexposed) during inference, affecting recommendation performance. Traditional methods utilize retrieval candidates as augmented training data, indiscriminately treating unexposed data as negative samples, which leads to inaccuracies and noise. Some efforts rely on unbiased datasets, while they are costly to collect and insufficient for industrial models. In this paper, we propose a debiasing framework named DAMCAR, which introduces Domain Adaptation to mitigate SSB in Multimedia CAscade Recommendation systems. Firstly, we sample hard-to-distinguish samples from unexposed data to serve as the target domain, optimizing data quality and resource utilization. Secondly, adversarial domain adaptation is employed to generate pseudo-labels for each sample. To enhance robustness, we utilize Exponential Moving Average (EMA) to create a teacher model that supervises the generation of pseudo-labels via self-distillation. Finally, we obtain a retrieval model that maintains stable performance during inference through a hybrid training mechanism. We conduct offline experiments on two real-world datasets and deploy our approach in the retrieval model of a multimedia video recommendation system for online A/B testing. Comprehensive experimental results demonstrate the effectiveness of DAMCAR in practical applications.
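
The Exponential Moving Average (EMA) teacher mentioned above can be pictured with a minimal sketch of the usual update rule, in which each teacher weight is a slowly moving average of the corresponding student weight. The function name `ema_update`, the dictionary-of-floats weight representation, and the decay value are assumptions for illustration; DAMCAR's actual models and hyperparameters are not given in the abstract.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student weights by one EMA step."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Hypothetical scalar "weights" standing in for model parameters.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is what lets it act as a more stable source of pseudo-labels than the rapidly changing student.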

Abstract:
Fine-grained remote sensing object detection aims to locate and identify specific targets with variable scale and orientation against complex backgrounds in high-resolution, wide-swath images, which requires high precision and real-time processing simultaneously. Although traditional knowledge distillation technology has shown its effectiveness in model compression and accuracy preservation for natural images, the heavy background noise and intra-class similarity of remote sensing images limit the knowledge quality of the teacher model and the learning ability of the student model. To address these issues, we propose the Information Fusion with Knowledge Distillation (IFKD) method to enhance student model performance by integrating information from external images, the frequency domain, and hyperbolic space. This includes three key modules: 1) External Disturbance Enhancement (EDE), which uses MobileSAM to enrich teachers' knowledge and reduce students' dependency on teachers; 2) Frequency Domain Reconstruction (FDR) to amplify key feature representations and reduce background noise interference by resampling low-frequency information; 3) Hyperbolic Similarity Mask (HSM) to increase intra-class differences, guiding students in analyzing and utilizing teachers' knowledge, and leveraging the exponential capabilities of hyperbolic space for performance improvement. Experimental results verify that the IFKD method significantly enhances performance in fine-grained recognition tasks compared to existing distillation techniques. Specifically, our method achieves 65.8% and 81.4% AP_50 on the optical ShipRSImageNet and SAR Aircraft-1.0 datasets, which is even 0.4% and 4.7% higher than the teacher, respectively.

Abstract:
Incremental monocular depth estimation aims to continuously learn from new domains while maintaining performance on old domains. Catastrophic forgetting is the key challenge when the model adapts to dynamic scene variations. Previous methods usually address this forgetting problem by storing raw samples from the old domain, allowing the model to review the knowledge of the old domain. However, due to concerns about data privacy and security, our objective is to tackle the incremental monocular depth estimation problem in more stringent scenarios without the need for replaying samples. In this paper, we attribute the cross-domain catastrophic forgetting to the domain distribution shifts and continuous variations of depth space. To this end, we propose Domain Shared and Specific Prompt Learning (DSSP) for incremental monocular depth estimation. In detail, to alleviate the domain distribution shift, a complementary domain prompt is designed to learn the domain-shared and domain-specific knowledge, which are optimized by the inter-domain alignment and intra-domain orthogonal loss. To mitigate the depth space variations, we first introduce a pre-trained model to generate the domain-shared depth space. Then, we design the S^2-Adapter, which quantizes depth space variations with scale and shift matrices and converts the domain-shared depth space to a domain-specific depth space. Our method achieves state-of-the-art performance under various scenarios such as different depth ranges, virtual and real, different weather conditions, and the few-shot incremental learning setting on 12 datasets. We will release the source code and pre-trained models.

Abstract:
Multi-view clustering, a pivotal technology in multimedia research, aims to leverage complementary information from diverse perspectives to enhance clustering performance. Current multi-view clustering methods normally enforce the reduction of distances between any pair of views, overlooking the heterogeneity between views, thereby sacrificing the diverse and valuable insights inherent in multi-view data. In this paper, we propose a Tree-Based View-Gap Maintaining Multi-View Clustering (TGM-MVC) method. Our approach introduces a novel conceptualization of multiple views as a graph structure. In this structure, each view corresponds to a node, with the view gap, calculated by the cosine distance between views, acting as the edge. Through graph pruning, we derive the minimum spanning tree of the views, reflecting the neighboring relationships among them. Specifically, we apply a share-specific learning framework and generate view trees for both view-shared and view-specific information. Concerning shared information, we only narrow the distance between adjacent views, while for specific information, we maintain the view gap between neighboring views. Theoretical analysis highlights the risks of eliminating the view gap, and comprehensive experiments validate the efficacy of our proposed TGM-MVC method.
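
The view-graph construction described above can be sketched as follows: treat each view as a node, weight every pair of views by cosine distance, and keep only the minimum spanning tree. This is a toy illustration rather than the TGM-MVC implementation; summarizing each view by a single vector, the function names `cosine_distance` and `view_tree`, and the choice of Prim's algorithm are all assumptions.

```python
import math

def cosine_distance(x, y):
    """Cosine distance 1 - cos(x, y) between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / norm

def view_tree(views):
    """Prim's algorithm on the complete view graph; returns the MST edges in order."""
    n = len(views)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        _, i, j = min(
            (cosine_distance(views[i], views[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical per-view summary vectors: views 0/1 are similar, as are views 2/3.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = view_tree(views)  # [(0, 1), (1, 3), (3, 2)]
```

Only the three tree edges would then be treated as "adjacent" views, so views 0 and 2, which are far apart, are never pulled together directly.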

Abstract:
The latest progress in novel view synthesis can be attributed to the Neural Radiance Field (NeRF), which requires densely sampled images with precise camera poses. However, collecting dense input images for a NeRF with accurate camera poses is highly expensive in many real-world scenarios. In this paper, we propose to learn Geometry Consistent Neural Radiance Field (GC-NeRF), to tackle this challenge by jointly optimizing a NeRF and its corresponding camera poses with sparse (as low as 2) and unposed views. First, the proposed GC-NeRF establishes image-level geometric consistencies, by producing photometric constraints from inter- and intra-views to update the NeRF and the camera poses in a fine-grained manner. Then, we adopt geometry projection with camera extrinsic parameters to further provide region-level consistency supervisions, which constructs pseudo-pixel labels to capture critical matching correlations. Moreover, we present an adaptive high-frequency mapping function to augment the geometry and texture information of the 3D scene. Extensive experiments on multiple challenging real-world datasets validate the effectiveness of the proposed GC-NeRF, which sets a new state-of-the-art for effectively learning NeRF with sparse and unposed views.

Abstract:
Video captioning is a challenging task and typically requires paired video-text data for training. However, manually annotating coherent textual descriptions for videos is laborious and time-consuming. To address this challenge, we propose a novel approach that enhances video captioning using only synthetic text data. Leveraging the exceptional text generation capabilities of large language models (LLMs), we produce high-quality and diverse video captions tailored to the target domain. Our approach employs a two-stage prompting strategy: we first prompt GPT-4 with few-shot target-domain captions to create a set of high-quality captions, and then continue prompting with the generated captions to acquire large-scale synthetic data. To effectively utilize these captions, we introduce Mixture of Scale and Shift experts (MoS2), an efficient adaptation method for pre-trained captioning models. MoS2 employs lightweight routing networks to estimate probability distributions over a collection of scale and shift experts, dynamically allocating tokens to the appropriate experts. This dynamic adjustment mechanism enhances the model's ability to handle data variations and mitigates the distribution shift between synthetic and real captions. Moreover, our method reduces the number of learnable parameters, facilitating more efficient adaptation. Our method achieves superior performance with only synthetic text data, narrowing the gap between zero-shot and fine-tuned models and reducing the dependency on paired data from the target domain.
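
One way to picture routing over scale and shift experts is the sketch below, in which a router produces a softmax distribution over experts for a token and the token is then modulated by the probability-weighted scale and shift vectors. This is only an assumed formulation for illustration: the abstract does not specify how MoS2 combines experts, and the single shared router, the soft (rather than top-k) mixing, and the names `softmax` and `mos_layer` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mos_layer(token, router, scales, shifts):
    """Modulate one token by a routed mixture of scale and shift vectors."""
    # Router logits: one dot product between the token and each expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    probs = softmax(logits)
    out = []
    for d, x in enumerate(token):
        scale = sum(p * s[d] for p, s in zip(probs, scales))
        shift = sum(p * b[d] for p, b in zip(probs, shifts))
        out.append(scale * x + shift)
    return out, probs

# Hypothetical setup: two experts, a router with zero weights (uniform routing).
token = [1.0, 2.0]
router = [[0.0, 0.0], [0.0, 0.0]]
scales = [[1.0, 1.0], [3.0, 3.0]]
shifts = [[0.0, 0.0], [2.0, 2.0]]
out, probs = mos_layer(token, router, scales, shifts)  # out == [3.0, 5.0]
```

Because each expert is just a pair of vectors, the number of trainable parameters stays small compared with fine-tuning full weight matrices, which matches the efficiency motivation in the abstract.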

Abstract:
Multi-label image classification is crucial for a wide range of multimedia applications. To address the resource limitation issue, various knowledge distillation (KD) methods have been developed to transfer knowledge from a large network (referred to as the "teacher") to a small network (referred to as the "student"). However, existing KD methods do not explicitly distill the dependencies between labels, which limits the model ability to capture multi-label correlation. Furthermore, although existing methods for multi-label image classification have utilized the second-order label pair dependency (direct dependency between two labels), the high-order label pair dependency, which captures the indirect dependency between two labels, remains unexplored. In this paper, we propose a Multi-Order Label Pair Dependencies Knowledge Distillation (MDKD) framework. MDKD explicitly distills the knowledge to capture multi-order dependencies between labels, including the label pair dependencies from second-order and high-order, thus transferring the insight of label correlations from different perspectives. Extensive experiments on Pascal VOC2007, MSCOCO2014, and NUS-WIDE demonstrate the superior performances of MDKD.

Abstract:
The rise of mobile devices has spurred advancements in camera technology and image quality. However, mobile photography still faces issues like scattering and reflective flares. While previous research has acknowledged the negative impact of the mobile devices' internal image signal processing pipeline (ISP) on image quality, the specific ISP operations that hinder flare removal have not been fully identified. In addition, current solutions only partially address ISP-related deterioration due to a lack of comprehensive raw image datasets for flare study. To bridge these research gaps, we introduce a new raw image dataset tailored for mobile camera systems, focusing on eliminating flare. This dataset encompasses over 2,000 high-quality, full-resolution raw image pairs for scattering flare, and 1,200 for reflective flare, captured across various real-world scenarios, mobile devices, and camera settings. It is designed to enhance the generalizability of flare removal algorithms across a wide spectrum of conditions. Through detailed experiments, we have identified that ISP operations, such as denoising, compression, and sharpening, may either improve or obstruct flare removal, offering critical insights into optimizing ISP configurations for better flare mitigation. Our dataset is poised to advance the understanding of flare-related challenges, enabling more precise incorporation of flare removal steps into the ISP. Ultimately, this work paves the way for significant improvements in mobile image quality, benefiting both enthusiasts and professional mobile photographers alike.

Affiliations: Hefei Institute of Physical Science, Chinese Academy of Sciences, University of Science and Technology of China, & Astribot Inc, China ; Department of Mathematics, Chinese University of Hong Kong, New Territories, Hong Kong SAR, China ; School of Mathematical Sciences, National Engineering Research Center of Visual Technology, Peking University, China ; Department of Computer Science, Shanghai Jiao Tong University, China ; The Hong Kong Polytechnic University, Chinese Academy Sciences

Abstract:
Articulated objects are common in our daily life. However, current category-level articulation pose works mostly focus on predicting 9D poses on statistical point cloud observations. In this paper, we deal with the problem of category-level online robust 9D pose tracking of articulated objects, where we propose VoCAPTER, a novel 3D Voting-based Category-level Articulated object Pose TrackER. Our VoCAPTER efficiently updates poses between adjacent frames by utilizing partial observations from the current frame and the estimated per-part 9D poses from the previous frame. Specifically, by incorporating prior knowledge of continuous motion relationships between frames, we begin by canonicalizing the input point cloud, casting the pose tracking task as an inter-frame pose increment estimation challenge. Subsequently, to obtain a robust pose-tracking algorithm, our main idea is to leverage SE(3)-invariant features during motion. This is achieved through a voting-based articulation tracking algorithm, which identifies keyframes as reference states for accurate pose updating throughout the entire video sequence. We evaluate the performance of VoCAPTER in the synthetic dataset and real-world scenarios, which demonstrates VoCAPTER's generalization ability to diverse and complicated scenes. Through these experiments, we provide evidence of VoCAPTER's superiority and robustness in multi-frame pose tracking of articulated objects. We believe that this work can facilitate the progress of various fields, including robotics, embodied intelligence, and augmented reality. All the codes will be made publicly available.

Abstract:
Decoding human visual representations from brain activity data is a challenging but arguably essential task for understanding the real world and the human visual system. However, decoding semantically similar visual representations from brain recordings is difficult, especially for electroencephalography (EEG), which has excellent temporal resolution but suffers from poor spatial precision. Prevailing methods mainly focus on matching brain activity data with corresponding stimuli-responses using contrastive learning. They rely on massive and high-quality paired data and omit semantically aligned modalities distributed in distinct regions of the latent space. This paper proposes a novel Multimodal Bidirectional Cycle Consistency (MB2C) framework for learning robust visual neural representations. Specifically, we utilize a dual-GAN to generate modality-related features and inversely translate them back to the corresponding semantic latent space, closing the modality gap and guaranteeing that embeddings from different modalities with similar semantics lie in the same region of the representation space. We perform zero-shot tasks on the ThingsEEG dataset. Additionally, we conduct EEG classification and image reconstruction on both the ThingsEEG and EEGCVPR40 datasets, achieving state-of-the-art performance compared to other baselines.

Abstract:
Generating diverse plausible outputs from a single input is crucial for addressing visual ambiguities, exemplified in medical imaging where experts may provide varying semantic segmentation annotations for the same image. Existing methods handle ambiguous segmentation by relying on probabilistic modeling and extensive multi-output annotated data, but often struggle with the limited ambiguously labeled datasets common in real-world applications. To surmount this challenge, we propose P²SAM, a novel framework that leverages the Segment Anything Model (SAM)'s prior knowledge for ambiguous object segmentation. By transforming SAM's sensitivity to prompts into an advantage, we introduce a prior probabilistic space for prompts. Experimental results show that P²SAM significantly enhances medical segmentation precision and diversity using minimal ambiguously annotated samples. Benchmarking against state-of-the-art methods demonstrates superior performance with just 5.5% of the training data (+12% Dmax). This approach marks a significant advancement towards deploying probabilistic models in data-limited real-world scenarios.

Abstract:
Photographed documents are prevalent but often suffer from deformations like curves or folds, hindering readability. Consequently, document dewarping has been widely studied; however, its performance is still unsatisfactory due to the lack of real training samples with pixel-level annotation. To obtain the pixel-level labels, we leverage a document registration pipeline to automatically align warped-flat documents. Unlike general image registration works, registering documents poses unique challenges due to their severe deformations and fine-grained textures. In this paper, we introduce a coarse-to-fine framework consisting of a coarse registration network (CRN) that aims to eliminate severe deformations, followed by a fine registration network (FRN) that focuses on fine-grained features. In addition, we utilize self-supervised learning to initialize our document registration model, where we propose a cross-reconstruction pre-training task on pairs of warped-flat documents. Extensive experiments show that we can achieve satisfactory document registration performance, consequently obtaining a high-quality registered document dataset with pixel-level annotation. Without bells and whistles, we re-train two popular document dewarping models on our registered document dataset WarpDoc-R, and obtain superior performance compared with those using almost 100× the scale of synthetic training data, verifying the label quality of our document registration method.

Abstract:
Deep unfolding network (DUN) is a powerful technique for image compressive sensing that bridges the gap between optimization methods and deep networks. However, DUNs usually rely heavily on single-domain information, overlooking the inter-domain dependencies. Therefore, such DUNs often face the following challenges: 1) information loss due to the inefficient representation within a single domain, and 2) limited robustness due to the absence of inter-domain dependencies. To overcome these challenges, we propose a deep unfolding framework D^3U-Net that establishes a dual-domain collaborative optimization scheme. This framework introduces both visual representations from the image domain and multi-resolution analysis provided by the wavelet domain. Such dual-domain representations constrain the feasible region within the solution space more accurately. Specifically, we design a consistency-difference collaborative mechanism to capture inter-domain dependencies effectively. This mechanism not only enhances the fidelity of reconstruction but also enriches the depth and breadth of extracted features, improving the overall robustness and reconstruction quality. Moreover, we develop an inter-stage transmission pathway to minimize the information loss during transmission while broadcasting multi-scale features in a frequency-adaptive manner. Extensive experimental results on various benchmark datasets show the superior performance of our method.

Abstract:
Model compression and distillation techniques have become essential for deploying deep learning models efficiently. However, existing methods often encounter challenges related to model generalization and scalability for harnessing the expertise of pre-trained large models. This paper introduces CoTuning, a novel framework designed to enhance the generalization ability of neural networks by leveraging collaborative learning between large and small models. CoTuning overcomes the limitations of traditional compression and distillation techniques by introducing strategies for knowledge exchange and simultaneous optimization. Our framework comprises an adapter-based co-tuning mechanism between cloud and edge models, a scale-shift projection for feature alignment, and a novel collaborative knowledge distillation mechanism for domain-agnostic tasks. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness of CoTuning in improving model generalization while maintaining computational efficiency and scalability. The proposed framework exhibits a significant advancement in model compression and distillation, with broad implications for research in the collaborative evolution of large-small models.

Abstract:
Universal few-shot dense prediction requires a versatile model capable of learning any dense prediction task from limited labeled images, which requires the model to possess efficient adaptation abilities. Prevailing few-shot learning methods rely on efficient fine-tuning of model weights for few-shot adaptation, which carries the risk of disrupting the pre-trained knowledge and lacks the capability to extract task-specific knowledge contained in the pre-trained model. To overcome these limitations, our paper approaches universal few-shot dense prediction from a novel perspective. Unlike conventional fine-tuning techniques that use all model parameters and modify a specific set of weights for few-shot adaptation, our method focuses on selecting task-relevant computation pathways of the pre-trained model while keeping the model weights frozen. Building upon this idea, we introduce a novel framework UniDense for universal few-shot dense prediction. First, we construct a versatile MoE (Mixture of Experts) architecture for dense prediction based on the Stable Diffusion model. We then utilize episode-based meta-learning to train a set of routers for this MoE model, called Meta-Routers, which act as hyper-networks responsible for selecting computation blocks relevant to each task. We demonstrate that fine-tuning these meta-routers enables efficient few-shot adaptation of the entire model. Moreover, for each few-shot task, we leverage support samples to extract a task embedding, which serves as a conditioning factor for meta-routers. This strategy allows meta-routers to dynamically adapt themselves to different few-shot tasks, leading to improved adaptation performance. Experiments on a challenging variant of the Taskonomy dataset with 10 dense prediction tasks demonstrate the superiority of our approach.

Abstract:
Story visualization aims to generate realistic and coherent images based on multi-sentence stories. However, current methods face challenges in achieving high-quality image generation while maintaining lightweight models and a fast generation speed. The main issue lies in the two existing frameworks. The independent framework prioritizes speed but sacrifices image quality with the non-collaborative image generation process and basic GAN-based learning. The autoregressive framework modifies the large pretrained text-to-image model in an auto-regressive manner with additional history modules, leading to large model size, resource-intensive requirements, and slow generation speed. To address these issues, we propose a lightweight and effective framework, namely CoIn. Specifically, we introduce a Context-aware Story Generator to predict shared context semantics for each image generator. Additionally, we propose an Intra-Story Interchange module that allows each image generator to exchange visual information with other image generators. Furthermore, we incorporate DINOv2 into the story and image discriminators to assess the story image quality more accurately. Extensive experiments show that our CoIn keeps the model size and generation speed of the independent framework, while achieving promising story image quality.

Affiliations: College of Computer Science and Technology, Harbin Engineering University & Qingdao Innovation and Development Base of Harbin Engineering University, China ; R&D and External Relations Department, Xiangjiang Laboratory & Hunan University of Technology and Business, School of Artificial Intelligence and Advanced Computing, China ; School of Information, Renmin University of China, China ; Guanghua School of Management, Peking University & Harvest Fund Management Co., Harbin Engineering University & Modeling and Emulation in E-Government National Engineering Laboratory

Abstract:
Graph Contrastive Learning (GCL) aims to address the issue of label scarcity by leveraging graph structures to propagate labels from a limited set of labeled data to a broader range of unlabeled data. However, recent GCL methods often rely on uniform negative sample selection schemes, such as random sampling, which results in suboptimal performance. To tackle this challenge, we present GraphSaSe, a tailored approach specifically designed for graph contrastive learning. Our method introduces an innovative reinforcement learning strategy that translates the divergence between positive pairs into a reinforcement reward mechanism. This mechanism generates selection probabilities to dynamically guide the selection of negative samples during training. We explore the impact of negative sample selection at different stages in graph contrastive learning and analyze how the discount factor affects the reward mechanism in reinforcement learning. These studies enhance the overall performance of the model. Comprehensive experimentation across diverse real-world datasets validates the effectiveness of our algorithm, positioning it favorably against contemporary state-of-the-art methodologies.

Abstract:
The thermal-to-visible (T2V) face translation task is essential for enabling face verification in low-light or dark conditions by converting thermal infrared faces into their visible counterparts. However, this task faces two primary challenges. First, the inherent differences between the modalities hinder the effective use of thermal information to guide RGB face reconstruction. Second, translated RGB faces often lack the identity details of the corresponding visible faces, such as skin color. To tackle these challenges, we introduce DiffTV, the first Latent Diffusion Model (LDM) specifically designed for T2V facial image translation with a focus on preserving identity. Our approach proposes a novel heterogeneous feature alignment strategy that bridges the modal gap and extracts both coarse- and fine-grained identity features consistent with visible images. Furthermore, a dual-stage condition injection strategy introduces control information to guide identity-preserved translation. Experimental results demonstrate the superior performance of DiffTV, particularly in scenarios where maintaining identity integrity is critical.

Abstract:
Given their truly immersive viewing experiences, full-scene volumetric videos have received increasing attention from both academia and industry. Their vast data volumes, however, make adaptive real-time streaming over today's bandwidth-limited Internet a significant challenge. Inspired by the advantages offered by neural fields, especially the feature grid method, we propose FSVFG, a novel full-scene volumetric video streaming system that integrates feature grids as the representation of volumetric content. FSVFG employs an incremental training approach for feature grids and stores the features and residuals between adjacent grids as frames. To support adaptive streaming, we delve into the data structure and rendering processes of feature grids and propose bandwidth adaptation mechanisms. The mechanisms involve a coarse ray-marching for the selection of features and residuals to be sent, and achieve variable bitrate streaming by Level-of-Detail (LoD) and residual filtering. Based on these mechanisms, FSVFG achieves adaptive streaming by balancing the transmission of features and residuals according to the available bandwidth. Our preliminary results demonstrate that FSVFG improves visual quality and reduces the bandwidth requirements of full-scene volumetric video streaming.
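
Storing residuals between adjacent grids and filtering them to vary the bitrate can be illustrated with a small sketch. It assumes a feature grid flattened to a list of floats and a simple magnitude threshold as the filter; the function names and the thresholding rule are illustrative assumptions, not the FSVFG design.

```python
def encode_residual(prev_grid, curr_grid, threshold):
    """Return a sparse residual {index: delta} between two adjacent grids.

    Entries whose magnitude is at most `threshold` are dropped; a larger
    threshold sends fewer values (lower bitrate) at lower fidelity.
    """
    residual = {}
    for i, (p, c) in enumerate(zip(prev_grid, curr_grid)):
        delta = c - p
        if abs(delta) > threshold:
            residual[i] = delta
    return residual

def apply_residual(prev_grid, residual):
    """Reconstruct the next grid from the previous grid and a sparse residual."""
    grid = list(prev_grid)
    for i, delta in residual.items():
        grid[i] += delta
    return grid
```

With `prev_grid = [1.0, 2.0, 3.0]`, `curr_grid = [1.0, 2.5, 3.01]`, and `threshold = 0.1`, only index 1 is transmitted; the small change at index 2 is filtered out and the receiver keeps the old value there.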

Abstract:
Deep learning technologies have been popular in the image compression field for some time. An increasing number of deep-learning-based models have been proposed to improve Rate-Distortion (RD) performance. However, previous algorithms are implemented on specific platforms and cannot be applied in cross-platform environments. In this paper, we present an open-source algorithm library called OpenDIC, which integrates a variety of end-to-end image compression methods in cross-platform environments. The contributions and details of the algorithms used in the library are described. To evaluate the performance of these algorithms, we conduct a comprehensive performance test. We compare and analyze each algorithm according to RD performance, running time, and GPU memory occupancy. The algorithm library has been released at https://openi.pcl.ac.cn/OpenDIC/.

Abstract:
This tutorial focuses on curriculum learning (CL), an important topic in machine learning, which gains an increasing amount of attention in the research community. CL is a learning paradigm that enables machines to learn from easy data to hard data, imitating the meaningful procedure of human learning with curricula. As an easy-to-use plug-in, CL has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision, natural language processing, reinforcement learning, etc. In particular, CL can also play an important role in multimedia applications. Therefore, it is essential to introduce CL to more scholars and researchers in the machine learning and multimedia community. However, there have been no tutorials on CL for multimedia so far, motivating the organization of this tutorial at ACM Multimedia 2024. To give a comprehensive tutorial on CL for multimedia, we plan to organize it from the following aspects: (1) theories, (2) approaches, (3) applications, (4) tools, and (5) future directions. First, we introduce the motivations, theories, and insights behind CL. Second, we advocate novel, high-quality approaches, as well as innovative solutions to the challenging problems in CL. Then we present the applications of CL in various scenarios, especially multimedia, followed by some relevant tools. In the end, we discuss open questions and future directions in the era of large language models. We believe this topic is at the core of the scope of ACM Multimedia and is attractive to the audience interested in machine learning and multimedia from both academia and industry.
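
The easy-to-hard idea behind CL can be made concrete with a minimal sketch. It assumes a predefined difficulty score per sample and a linear pacing schedule; both are illustrative choices (`curriculum_subsets` is a hypothetical name), and many CL methods instead learn the difficulty or the schedule.

```python
def curriculum_subsets(samples, difficulty, num_stages):
    """Yield growing training subsets, easiest samples first.

    samples:    list of training items
    difficulty: function mapping an item to a sortable difficulty score
    num_stages: number of curriculum stages (>= 1)
    """
    ordered = sorted(samples, key=difficulty)
    n = len(ordered)
    for stage in range(1, num_stages + 1):
        # Linear pacing: stage k exposes the easiest k/num_stages of the data.
        cutoff = max(1, (n * stage) // num_stages)
        yield ordered[:cutoff]
```

For example, using string length as a stand-in difficulty measure, `list(curriculum_subsets(["aaaa", "a", "aaa", "aa"], len, 2))` yields the two easiest items first and the full set in the second stage.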

Abstract:
Social media has emerged as a vital platform for communication, information sharing, and acquisition. Predictive analysis of social media data has wide applications, such as sentiment analysis and social network analysis. However, existing work often directly utilizes social media data for training, neglecting the issue of mismatched text and images. This neglect can lead to confusion about the contents, thereby affecting the identification of trending topics and the accuracy of social media predictions. In this paper, an approach named Dual-Stream Pre-training Transformer (DSPT) is introduced to address this gap. In DSPT, we use a Visual-Language Model (VLM) and a Language Model (LM) to separately learn from image and text data, mitigating the impact of text-image mismatches. Moreover, to enhance the model's understanding of social media data, we conduct incremental pre-training for both models. To achieve better feature interaction, we construct an integrated regression module combining LightGBM and CatBoost, which jointly predict from the extracted feature embeddings. This dual-stream multimodal feature extraction method improves the performance of predictive tasks. Experimental results validate the effectiveness of our approach, demonstrating its potential and providing deeper insights into multimodal data mining in social media.

Abstract:
With the development of AIGC, images generated by AI are difficult to identify with the human eye. This article proposes a simple training architecture and a simple model. In the training process of deep forgery detection, a variety of data augmentation techniques are used, including horizontal flipping, adding Gaussian noise, and random cropping, to enhance the generalization ability of the model. Differentiated processing strategies are adopted for images of different sizes: center cropping is used for images smaller than 224 pixels, random cropping is used for images between 224 and 512 pixels, and images larger than 512 pixels are first scaled and then randomly cropped. Furthermore, the concept of neighbor pixel relations (NPR) is introduced, which is an effective way to capture local structural artifacts introduced by upsampling operations. As a complement to image representation, NPR significantly improves the detector's ability to identify forged images, especially when the forgery technology is unknown. Integrating the NPR module into the ResNet-50 model not only enhances the classifier's extraction of forged features, but also maintains the computational efficiency and versatility of the model. Finally, the method was applied to AIGC-generated image detection in the "Malanshan Cup" International Audio-Video Algorithm Competition, where it achieved first place in the track.
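
The size-dependent cropping rule can be sketched as a function that plans the crop from the image dimensions alone. Several details are assumptions because the abstract leaves them open: "size" is taken to be the shorter side, small images get a centered square crop limited by that side, and large images are scaled so the shorter side becomes 512 before cropping. The name `plan_crop` is hypothetical.

```python
import random

CROP = 224   # crop size named in the text
LARGE = 512  # upper size threshold named in the text

def plan_crop(width, height, rng=random):
    """Return (strategy, scale, box) for an image of the given size.

    box is (left, top, right, bottom) in the coordinates of the scaled image.
    """
    short = min(width, height)
    if short < CROP:
        # Small image: centered square crop bounded by the shorter side (assumption).
        left = (width - short) // 2
        top = (height - short) // 2
        return "center", 1.0, (left, top, left + short, top + short)
    scale = 1.0
    if short > LARGE:
        # Large image: shrink so the shorter side equals LARGE (assumption).
        scale = LARGE / short
        width, height = round(width * scale), round(height * scale)
    strategy = "random" if scale == 1.0 else "scale+random"
    # Random CROP x CROP window inside the (possibly scaled) image.
    left = rng.randint(0, width - CROP)
    top = rng.randint(0, height - CROP)
    return strategy, scale, (left, top, left + CROP, top + CROP)
```

Passing a seeded `random.Random` instance as `rng` makes the crop positions reproducible, which is useful when debugging an augmentation pipeline.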

Abstract:
In this rapidly evolving era of AI technology, image generation models are pushing the boundaries of art, design, and even information dissemination at an unprecedented pace. However, alongside this technological advancement comes the growing challenge of combating misinformation and image manipulation. One major difficulty in detecting AI-generated images lies in dealing with the variety of generation models. In this report, our core solution can be summarized as "balancing data sources is key to enhancing the generalization capability of the detection model." In the competition, we achieved a score of 0.9558 on the B leaderboard, ultimately securing second place.

Abstract:
Video Moment Retrieval (VMR) aims to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent works have made remarkable progress in this task, they are implicitly rooted in the closed-set assumption that all given queries are video-relevant. Given an out-of-distribution (OOD) query in open-set scenarios, they still use it and return a wrong retrieval, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on in-distribution (ID) queries, but also reject OOD queries. In this paper, we make the first attempt to step toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID and OOD queries based on normalizing flows, and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, and assume the ID query distribution obeys a multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling together ID query features. Besides, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of our OpenVMR.
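
The ID/OOD separation step can be illustrated with a minimal sketch. It assumes query features have already been mapped by a normalizing flow to latents that follow a standard multivariate Gaussian for ID queries, uses the negative log-likelihood as the uncertainty score, and picks the boundary as a quantile of ID scores. The flow itself is omitted, and the quantile rule and function names are assumptions rather than the OpenVMR procedure.

```python
import math

def gaussian_nll(z):
    """Negative log-likelihood of latent z under a standard multivariate Gaussian."""
    d = len(z)
    return 0.5 * sum(v * v for v in z) + 0.5 * d * math.log(2 * math.pi)

def fit_threshold(id_latents, keep=0.95):
    """Choose a score threshold from ID latents (quantile rule, an assumption)."""
    scores = sorted(gaussian_nll(z) for z in id_latents)
    index = min(len(scores) - 1, int(keep * len(scores)))
    return scores[index]

def is_ood(z, threshold):
    """Flag a query latent as OOD when its uncertainty score exceeds the threshold."""
    return gaussian_nll(z) > threshold
```

A query flagged by `is_ood` would be rejected before moment retrieval, while the remaining queries proceed to the retrieval stage.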

Abstract:
Virtual reality (VR) is a revolutionary method of presenting data visualizations, which brings new possibilities for enhancing analytical activities. However, applying this method to visualize complex data flows remains largely underexplored, especially for Sankey diagrams, which have an advantageous capacity to represent trends in data flows. In this work, we explored a novel design for an immersive Sankey diagram system within VR environments, utilizing a three-dimensional visual design and several interaction techniques that leverage VR's spatial and immersive capabilities. Through two comparative user studies, we demonstrated the effectiveness of the VR Sankey diagram system in improving task performance and engagement and reducing cognitive workload in complex data analysis. We contribute an interactive, immersive Sankey diagram system in VR environments, empirical evidence of its advantages, and design lessons for future immersive visualization tools.

Abstract:
Singing melody extraction is a key task in the field of music information retrieval (MIR). However, decades of research have uncovered two difficult issues. First, binary classification on frequency-domain audio features (e.g., spectrograms) is regarded as the primary method, which ignores the potential associations of musical information at different frequency bins, as well as their varying significance for output decisions. Second, existing semi-supervised singing melody extraction models ignore the accuracy of the pseudo labels they generate, which largely limits further improvement of the model. To solve these two issues, in this paper, we propose a heterogeneous knowledge distillation framework for semi-supervised singing melody extraction using harmonic supervision, termed HKDSME. We begin by proposing a four-class classification paradigm for determining the results of singing melody extraction using harmonic supervision. This enables the model to capture more information regarding melodic relations in spectrograms. To improve the accuracy of pseudo labels, we then build a semi-supervised method by leveraging the extracted harmonics as a consistency regularization. Different from previous methods, it judges the availability of unlabeled data in terms of the inner positional relations of extracted harmonics. To further build a lightweight semi-supervised model, we propose a heterogeneous knowledge distillation (HKD) module, which enables prior knowledge to transfer between heterogeneous models. We also propose a novel confidence-guided loss, which works with the proposed HKD module to reduce wrong pseudo labels. We evaluate our proposed method on several well-known publicly available datasets, and the findings demonstrate the efficacy of our proposed method.

Abstract:
Vision Transformers (ViTs) excel in extracting global information from image patches. However, their inherent limitation lies in effectively extracting information within local regions, hindering their applicability and performance. Particularly, fully supervised pre-trained ViTs, such as Vanilla ViT and CLIP, face the challenge of locality vanishing when adapting to downstream tasks. To address this, we introduce a novel LOcality-aware pRompt lEarning (LORE) method, aiming to improve the adaptation of pre-trained ViTs to downstream tasks. LORE integrates a data-driven Black Box module (i.e., a pre-trained ViT encoder) with a knowledge-driven White Box module. The White Box module is a locality-aware prompt learning mechanism to compensate for ViTs' deficiency in incorporating local information. More specifically, it begins with the design of a Locality Interaction Network (LIN), which treats an image as a neighbor graph and employs graph convolution operations to enhance local relationships among image patches. Subsequently, a Knowledge-Locality Attention (KLA) mechanism is proposed to capture critical local regions from images, learning Knowledge-Locality (K-L) prototypes utilizing relevant semantic knowledge. Afterwards, K-L prototypes guide the training of a Prompt Generator (PG) to generate locality-aware prompts for images. The locality-aware prompts, aggregating crucial local information, serve as additional input for our Black Box module. Combining pre-trained ViTs with our locality-aware prompt learning mechanism, our Black-White Box model enables the capture of both global and local information, facilitating effective downstream task adaptation. Experimental evaluations across four downstream tasks demonstrate the effectiveness and superiority of our LORE.

Abstract:
Point cloud segmentation forms the foundation of 3D scene understanding. Boundaries, the intersections of regions, are prone to mis-segmentation. Current point cloud segmentation models exhibit unsatisfactory performance on boundaries. There is limited focus on explicitly addressing semantic segmentation of point cloud boundaries. We introduce a method called Multi-fineness Boundary Constraint (MBC) to tackle this challenge. By querying boundaries at various degrees of fineness and imposing feature constraints within these boundary areas, we enhance the discrimination between boundaries and non-boundaries, improving point cloud boundary segmentation. However, solely emphasizing boundaries may compromise the segmentation accuracy in broader non-boundary regions. To mitigate this, we introduce a new concept of point cloud space termed ensemble and a Shifted Ensemble-aware Perception (SEP) module. This module establishes information interactions between points with minimal computational cost, effectively capturing direct point-to-point long-range correlations within ensembles. It enhances segmentation performance for both boundaries and non-boundaries.

Abstract:
Incremental Object Detection (IOD) simulates the dynamic data flow in real-world applications, which require detectors to learn new classes or adapt to new domains while retaining knowledge from previous tasks. Most existing IOD methods focus only on class incremental learning, assuming all data comes from the same domain. However, this is hardly achievable in practical applications, as images collected under different conditions often exhibit completely different characteristics, such as lighting, weather, style, etc. Class IOD methods suffer from performance degradation in these scenarios with domain shifts. To bridge domain shifts and category gaps in IOD, we propose Purified Distillation (PD), where we use a set of trainable queries to transfer the teacher's attention on old tasks to the student and adopt the gradient reversal layer to guide the student to learn the teacher's feature space structure from a micro perspective, which has not been extensively studied in previous works. Meanwhile, PD combines classification confidence with localization confidence to purify the most meaningful output nodes, so that the student model inherits a more comprehensive teacher knowledge. Extensive experiments across various IOD settings on six widely used datasets show that PD significantly outperforms state-of-the-art methods. Even after five steps of incremental learning, our method can preserve 60.6% mAP on the first task, while compared methods can only maintain up to 55.9%.

Abstract:
Large-scale point cloud semantic segmentation is a challenging task in 3D computer vision. A key challenge is how to resolve ambiguities arising from locally high inter-class similarity. In this study, we introduce a solution by modeling long-distance contextual information to understand the scene's overall layout. The context sensitivity of previous methods is typically constrained to small blocks (e.g., 2 m x 2 m) and cannot be directly extended to the entire scene. For this reason, we propose the Long-Distance Context Modeling Network (LDCNet). Our key insight is that keypoints are enough for inferring the layout of a scene. Therefore, we represent the entire scene using keypoints along with local descriptors and model long-distance context on these keypoints. Finally, we propagate the long-distance context information from keypoints back to non-keypoints. This allows our method to model long-distance context effectively. We conducted experiments on six datasets, demonstrating that our approach can effectively mitigate ambiguities. Our method performs well on large, irregular objects and exhibits good generalization for typical scenarios.

Abstract:
Within the domain of blind face restoration (BFR), approaches lacking facial priors frequently result in excessively smoothed visual outputs. Existing BFR methods predominantly utilize generative facial priors to achieve realistic and authentic details. However, these methods, primarily designed for images, encounter challenges in maintaining temporal consistency when applied to face video restoration. To tackle this issue, we introduce StableBFVR, an innovative Blind Face Video Restoration method based on Stable Diffusion that incorporates temporal information into the generative prior. This is achieved through the introduction of temporal layers in the diffusion process. These temporal layers consider both long-term and short-term information aggregation. Moreover, to improve generalizability, BFR methods employ complex, large-scale degradation during training, but this often sacrifices accuracy. Addressing this, StableBFVR features a novel mixed-degradation-aware prompt module, capable of encoding specific degradation information to dynamically steer the restoration process. Comprehensive experiments demonstrate that our proposed StableBFVR outperforms state-of-the-art methods.

Abstract:
Multi-modal knowledge graphs (MKGs) organize various information in different modalities in an intuitive way and are utilized in different downstream tasks, such as recommendation. However, most MKGs are still far from complete, which motivates the flourishing of MKG reasoning models. Recently, with the development of general artificial intelligence, pre-trained transformers have drawn increasing attention, especially in multi-modal scenarios. However, research on multi-modal pre-trained transformers (MPT) for knowledge graph reasoning (KGR) is still at an early stage. The rich structural information underlying the MKG, which is the biggest difference between MKGs and other multi-modal data, is still not fully utilized in previous MPTs. Most of them only use the graph structure as a retrieval map for matching images and texts connected with the same entity, which hinders their reasoning performance. To this end, the graph Structure Guided Multi-modal Pre-trained Transformer (SGMPT) is proposed for knowledge graph reasoning. Specifically, a graph structure encoder is adopted for structural feature encoding. Then, a structure-guided fusion module with two simple yet effective strategies, i.e., weighted summation and alignment constraint, is designed to inject the structural information into both the textual and visual features. To the best of our knowledge, SGMPT is the first MPT for multi-modal KGR that mines the structural information underlying MKGs. Extensive experiments on FB15k-237-IMG and WN18-IMG demonstrate that our SGMPT outperforms existing state-of-the-art models and prove the effectiveness of the designed strategies.

Abstract:
Advanced mobile computing has led to a surge in the need for practical super-resolution (SR) techniques. The look-up table (LUT) based SR-LUT has pioneered a new avenue of research without needing hardware acceleration. Nevertheless, all preceding methods that drew inspiration from the SR-LUT framework invariably resort to interpolation and rotation techniques for diminishing the LUT size, thereby prolonging the inference time and contradicting the original objective of efficient SR. Recently, a study named EC-LUT proposed an expanded convolution method to avoid interpolation operations. However, the performance of EC-LUT regarding SR quality and LUT volume is unsatisfactory. To address these limitations, this paper proposes a novel expanded convolutional neural network (ECNN). Specifically, we further extend feature fusion to the feature channel dimension to enhance mapping ability. In addition, our approach reduces the number of single indexed pixels to just one, eliminating the need for rotation tricks and dramatically reducing the LUT size from the MB level to the KB level, thus improving cache hit rates. By leveraging these improvements, we can stack expanded convolutional layers to form an ECNN, with each layer convertible to LUTs during inference. Experiments show that our method improves the overall performance of the upper limit of LUT based methods. For example, under comparable SR quality conditions, our model achieves state-of-the-art performance in speed and LUT volume.
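
To illustrate why the number of indexed pixels dominates LUT volume, the following sketch counts the bytes in a full (unsampled) table. The function name, the 8-bit input assumption, the x4 upscaling factor, and the one-byte-per-output-value storage are illustrative assumptions, not details taken from the paper.

```python
def lut_size_bytes(indexed_pixels, levels=256, scale=4, bytes_per_value=1):
    """Bytes in a full LUT: one entry per input combination,
    each entry storing scale*scale output values (all assumed)."""
    entries = levels ** indexed_pixels
    return entries * scale * scale * bytes_per_value


one_pixel = lut_size_bytes(1)    # index built from a single 8-bit pixel
four_pixels = lut_size_bytes(4)  # index built from four 8-bit pixels
print(one_pixel, four_pixels)
```

Under these assumptions a single-pixel index needs 4 KB per table, whereas a four-pixel index would need 64 GiB without subsampling, which is consistent with earlier methods shrinking the table and interpolating between stored entries.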

Abstract:
Human Activity Recognition (HAR) as an emerging research field has attracted widespread academic attention due to its wide range of practical applications in areas such as healthcare, environmental monitoring, and sports training. Given the high cost of annotating sensor data, many unsupervised and semi-supervised methods have been applied to HAR to alleviate the problem of limited data. In this paper, we propose a novel video-enhanced cross-modal collaborative learning method, Vi2ACT, to address the issue of few-shot HAR. We introduce a new data augmentation approach that utilizes a text-to-video generation model to generate class-related videos. Subsequently, a large quantity of video semantic representations are obtained through fine-tuning the video encoder for cross-modal co-learning. Furthermore, to effectively align video semantic representations and time series representations, we enhance HAR at the representation-level using conditional Generative Adversarial Nets (cGAN). We design a novel Representation Conditional Discriminator that is trained to assess samples as originating from video representations rather than those generated by the time series encoder as accurately as possible. We conduct extensive experiments on four commonly used HAR datasets. The experimental results demonstrate that our method outperforms other baseline models in all few-shot scenarios.

Abstract:
Neural networks often tend to rely on bias features that have strong but spurious correlations with the target labels for decision-making, leading to poor performance on data that does not adhere to these correlations. Early debiasing methods typically construct an unbiased optimization objective based on the labels of bias features. Recent work assumes that bias labels are unavailable and usually trains two models: a biased model to deliberately learn bias features for exposing data bias, and a target model to eliminate the bias captured by the biased model. In this paper, we first reveal that previous biased models fit the target labels and therefore fail to expose data bias. To tackle this issue, we propose poisoner, which utilizes data poisoning to embed the biases learned by biased models into the poisoned training data, thereby encouraging the models to learn more biases. Specifically, we couple data poisoning and model training to continuously prompt the biased model to learn more bias. By utilizing the biased model, we can identify samples in the data that contradict these biased correlations. Subsequently, we amplify the influence of these samples in the training of the target model to prevent the model from learning such biased correlations. Experiments show the superior debiasing performance of our method.

Abstract:
Federated learning (FL) is gaining significant traction due to its ability to perform privacy-preserving training on decentralized data. In this work, we focus on sensitive time series data collected by distributed sensors in real-world applications. However, time series data introduce the challenge of dual spatial-temporal feature skew due to their dynamic changes across domains and time, differing from computer vision. This key challenge includes inter-client spatial feature skew caused by heterogeneous sensor collection and intra-client temporal feature skew caused by dynamics in time series distribution. We follow the framework of Personalized Federated Learning (pFL) to handle dual feature drifts and enhance the capabilities of customized local models. Specifically, we propose a method, FedST, to solve these key challenges through orthogonal feature decoupling and regularization in both training and testing stages. During training, we combine the time view and frequency view of time series data to enrich the mutual information and adopt orthogonal projection to disentangle and align the shared and personalized features between views and between clients. During testing, we apply prototype-based predictions and model-based predictions to achieve model consistency based on shared features. Extensive experiments on multiple real-world classification datasets and multimodal time series datasets show our method consistently outperforms state-of-the-art baselines with clear advantages.

Abstract:
Many domain adaptive object detection (DAOD) methods employ domain adversarial training to align features and mitigate the domain gap. In this approach, a feature extractor is trained to deceive a domain classifier, thereby aligning feature distributions. However, the domain classifier's discrimination capability can easily fall into a local optimum due to the equilibrium challenge, hindering the effective training of the feature extractor. In this work, we propose an efficient optimization strategy called Virtual-label Fooled Domain Discrimination (VFDD), which revitalizes the domain classifier during training using virtual domain labels. Such virtual labels make the separable distributions less separable and thus lead to a more easily confused domain classifier, which in turn further drives feature alignment. In particular, we introduce a novel concept of virtual domain label for the unaligned samples and propose the Virtual H-divergence to overcome the problem of falling into a local optimum due to the equilibrium challenge. VFDD is orthogonal to most existing DAOD methods and can be integrated as a plug-and-play module to enhance these models. Theoretical insights and experimental analyses demonstrate that VFDD improves many popular baselines and surpasses recent unsupervised DAOD models.

Abstract:
Adhesive adversarial patches have been commonly used in attacks against the computer vision task of monocular depth estimation (MDE). Compared to physical patches permanently attached to target objects, optical projection patches show great flexibility and have gained wide research attention. However, applying digital patches for direct projection may lead to partial blurring or omission of details in the captured patches, attributed to high information density, surface depth discrepancies, and non-uniform pixel distribution. To address these challenges, in this work we introduce DepthCloak, an adversarial optical patch designed to interfere with the MDE of vehicles. To this end, we first simplify the patch to a gray pattern because the projected "black-and-white light" has strong robustness to ambient light. We propose a generative adversarial network (GAN) based approach to simulate projections and deduce a projectable list. Then, we employ neighborhood averaging to fill sparse depth values, compress all depth values into a reduced dynamic range via nonlinear mapping, and use these values to adjust the Gaussian blur radius as weight parameters, thereby simulating depth variation effects. Finally, by integrating Moiré patterns and applying style transfer techniques, we customize adversarial patches featuring regularly arranged characteristics. We deploy DepthCloak in real driving scenarios, and extensive experiments demonstrate that DepthCloak can achieve an attack success rate of over 80% in the physical world.
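
The depth-handling step described above (fill sparse depth by neighborhood averaging, compress the range nonlinearly, derive a blur radius) can be sketched roughly as follows. This is only an illustration under stated assumptions: the 3x3 window, the logarithmic mapping, the function names, and the `max_radius` and `d_max` parameters are all hypothetical choices, not DepthCloak's actual settings.

```python
import math


def fill_sparse(depth):
    """Replace missing values (None) with the mean of valid 3x3 neighbours."""
    h, w = len(depth), len(depth[0])
    filled = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            if depth[y][x] is None:
                near = [depth[j][i]
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2))
                        if depth[j][i] is not None]
                # Fall back to 0.0 when no valid neighbour exists (assumption).
                filled[y][x] = sum(near) / len(near) if near else 0.0
    return filled


def blur_radius(d, max_radius=5.0, d_max=100.0):
    """Log-compress depth into [0, 1], then scale it to a blur radius."""
    return max_radius * math.log1p(d) / math.log1p(d_max)


depth = [[2.0, None], [4.0, 6.0]]
filled = fill_sparse(depth)
radii = [[blur_radius(d) for d in row] for row in filled]
print(filled, radii)
```

The logarithm is one common way to compress a wide depth range; any other monotonic nonlinear mapping would fit the same description.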

Abstract:
Multi-party Mobile Virtual Reality (MMVR) enables multiple mobile users to share virtual scenes for immersive multimedia experience in scenarios such as gaming, social interaction, and industrial mission collaboration. Dynamic 3D Point Cloud (DPCL) is an emerging representation form of MMVR that can be consumed as a free-viewpoint video with 6 degrees of freedom. Given that it is challenging to render DPCL at a satisfying frame rate with limited on-device resources, offloading rendering tasks to edge servers is recognized as a practical solution. However, repeated loading of DPCL scenes with a substantial amount of metadata introduces a significant redundancy overhead that cannot be overlooked when enabling multiple edge servers to support the rendering requirements of user groups. In this paper, we design PoClVR, an edge-assisted DPCL rendering system for MMVR applications, which breaks down the rendering process of the complete dynamic scene into multiple rendering tasks of dynamic objects. PoClVR significantly reduces the repetitive loading overhead of DPCL scenes on edge servers and periodically adjusts the rendering task allocation during the application running to accommodate rendering requirements. We deploy PoClVR based on a real-world implementation and the experimental evaluation results show that PoClVR can reduce GPU utilization by up to 15.1% and increase rendering frame rate by up to 34.6% compared to other baselines while ensuring that the image quality viewed by the user is virtually unchanged.

Abstract:
Generative image steganography has gained significant attention due to its ability to hide secret data during image generation. However, existing generative image steganography methods still face challenges in terms of controllability, usability, and robustness, making them difficult to apply in real-world scenarios. We propose a practical and robust generative image steganography method based on Latent Diffusion Models, called LDStega. LDStega takes controllable condition text as input and designs an encoding strategy in the reverse process of the Latent Diffusion Models to couple latent space generation with data hiding. The encoding strategy selects a sampling interval from a candidate pool of truncated Gaussian distributions guided by secret data to generate the stego latent space. Subsequently, the stego latent space is fed into the Decoder to generate the stego image. The receiver extracts the secret data from the globally Gaussian distribution of the lossy-reconstructed latent space in the reverse process. Experimental results demonstrate that LDStega achieves high extraction accuracy while controllably generating image content and saving the stego image in the widely used PNG and JPEG formats. Additionally, LDStega outperforms state-of-the-art techniques in resisting common image attacks.
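
The idea of letting secret bits choose a sampling interval of a truncated Gaussian can be illustrated with a one-dimensional toy example. This is a simplified sketch, not LDStega's actual encoding: it assumes equal-probability intervals of a standard normal, ignores the lossy reconstruction the receiver faces, and the names `hide_bits` and `extract_bits` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def hide_bits(bits, rng, eps=1e-9):
    """Sample one value from the equal-probability interval indexed by `bits`."""
    k = len(bits)
    index = int("".join(map(str, bits)), 2)
    lo, hi = index / 2 ** k, (index + 1) / 2 ** k  # interval in CDF space
    u = rng.uniform(lo + eps, hi - eps)            # stays strictly inside (0, 1)
    return STD_NORMAL.inv_cdf(u)                   # truncated-Gaussian sample


def extract_bits(value, k):
    """Recover the k bits from the interval that contains `value`."""
    index = min(int(STD_NORMAL.cdf(value) * 2 ** k), 2 ** k - 1)
    return [int(b) for b in format(index, f"0{k}b")]


rng = random.Random(0)
secret = [1, 0, 1, 1, 0, 0, 0, 1]
# Hide two bits per latent value, so each value falls in one of four intervals.
latent = [hide_bits(secret[i:i + 2], rng) for i in range(0, len(secret), 2)]
recovered = [b for z in latent for b in extract_bits(z, 2)]
print(recovered == secret)
```

Because every interval carries equal probability mass, values drawn this way still follow a standard normal overall when the secret bits are uniformly random, which is what allows the hidden data to blend into ordinary sampling.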

Abstract:
Singing Voice Synthesis (SVS) has significantly advanced with deep generative models, achieving high audio quality but still struggling with musicality, mainly due to the lack of performance control over timing, dynamics, and pitch, which are essential for music expression. Additionally, integrating data and supporting diverse languages and styles in SVS remain challenging. To tackle these issues, this paper presents ExpressiveSinger, an SVS framework that leverages a cascade of diffusion models to generate realistic singing across multiple languages, styles, and techniques from scores and lyrics. Our approach begins with consolidating, cleaning, annotating, and processing public singing datasets, developing a multilingual phoneme set, and incorporating different musical styles and techniques. We then design methods for generating expressive performance control signals including phoneme timing, F0 curves, and amplitude envelopes, which enhance musicality and model consistency, introduce more controllability, and reduce data requirements. Finally, we generate mel-spectrograms and audio from performance control signals with style guidance and singer timbre embedding. Our models also enable trained singers to sing in new languages and styles. Several listening tests reveal both musicality and controllability of our generated singing compared with existing works and human singing. We release the data for future research. Demo: https://shuqid.net/expressive-singing-synthesis.

Abstract:
Partial multi-label learning (PML) deals with the problem of accurately predicting the correct multi-label class for each instance in multi-label data containing noise. Compared with traditional multi-label learning, partial multi-label learning requires learning and completing multi-label classification tasks in an imperfect environment. The existing PML methods have the following problems: (1) the correlation between samples and labels is not fully utilized; (2) the nonlinear nature of the model is not taken into account. To solve these problems, we propose a new PML method based on label enhancement of near and far neighbor information and nonlinear guidance (PML-LENFN). Specifically, the original binary label information is reconstructed by using the information of sample near neighbors and far neighbors to eliminate the influence of noise. Then we construct a linear multi-label classifier that can explore label correlation. In order to learn the nonlinear relationship between features and labels, we use nonlinear mapping to constrain this classifier, so as to obtain prediction results that are more consistent with the realistic label distribution.

Abstract:
In digital pathology, cancer lesions are identified by analyzing the spatial context within pathology images. Synthesizing such complex spatial context is challenging as pathology whole slide images typically exhibit high resolution, low inter-class variety, and are sparsely labeled. To address these challenges, we propose PathUp, a novel diffusion model tailored for the synthesis of multi-class high-resolution pathology images. Our approach includes a latent space patch-wise timestep tracking, which helps to generate high-quality images without tiling artifacts. Pathology knowledge is integrated through our patho-align. The robust generation of lesion subtypes and scale information is ensured by introducing a feature entropy loss. The effectiveness of our method is evaluated through extensive experiments, supplemented by assessments from human experts, demonstrating the authenticity of the synthetic data produced. Furthermore, we highlight the potential utility of our generated images as an augmentation method, thereby enhancing the performance of downstream tasks such as cancer subtype classification.

Abstract:
Deep learning for medical image classification needs large amounts of data carefully labeled with the aid of domain experts. However, data labeling is vulnerable to noise, which may degrade the accuracy of classifiers. Given the cost of medical data collection and annotation, methods that can effectively utilize noisily labeled data are highly desirable. In addition, efficiency and universality are essential for noisy label training, which requires further research. To address the lack of high-quality labeled medical data and meet algorithm efficiency requirements for clinical application, we propose a simple yet effective approach for multi-field medical images to utilize noisy data, named Pseudo-T correction. Specifically, we design a noisy label filter to divide the training data into clean and noisy samples. Then, we estimate a transition matrix that corrects model predictions based on the partitions of clean and noisy data samples. However, if the model overfits noisy data, noisy samples become more difficult to detect in the filtering step, resulting in inaccurate transition matrix estimation. Therefore, we employ gradient disparity as an effective criterion to decide whether or not to refine the transition matrix in the model's further training steps. The novel design enables us to build more accurate machine-learning models by leveraging noisy labels. We demonstrate that our method outperforms the state-of-the-art methods on three public medical datasets and achieves superior computational efficiency over the alternatives.
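
As a rough illustration of how a label transition matrix corrects predictions, the sketch below estimates the matrix from paired labels and applies the generic forward-correction rule. It assumes that an estimated clean label is available for each sample; how Pseudo-T correction actually obtains these estimates from its clean/noisy partition is not stated in the abstract, and the function names are hypothetical.

```python
def estimate_transition(pairs, num_classes):
    """Row-normalised counts of (estimated clean label, observed noisy label)."""
    counts = [[0.0] * num_classes for _ in range(num_classes)]
    for clean, noisy in pairs:
        counts[clean][noisy] += 1.0
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # No evidence for this class: assume its labels are clean.
            matrix.append([1.0 if j == i else 0.0 for j in range(num_classes)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def forward_correct(clean_probs, transition):
    """Map clean-class probabilities to the expected noisy-label distribution."""
    n = len(clean_probs)
    return [sum(clean_probs[i] * transition[i][j] for i in range(n))
            for j in range(n)]


pairs = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 1)]
T = estimate_transition(pairs, num_classes=2)
noisy_probs = forward_correct([0.8, 0.2], T)
print(T, noisy_probs)
```

Training against `noisy_probs` instead of the raw prediction lets the loss be computed on noisy labels while the underlying classifier still models the clean classes.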

Abstract:
Multi-modal learning leverages data from diverse perceptual media to obtain enriched representations, thereby empowering machine learning models to complete more complex tasks. However, recent research results indicate that multi-modal learning still suffers from "modality imbalance": certain modalities' contributions are suppressed by dominant ones, consequently constraining the overall performance enhancement of multi-modal learning. To tackle this issue, current approaches attempt to mitigate modality competition in various ways, but their effectiveness is still limited. To this end, we propose an Euler Representation Learning-based Modality Rebalance (ERL-MR) strategy, which reshapes the underlying competitive relationships between modalities into mutually reinforcing win-win situations while maintaining stable feature optimization directions. Specifically, ERL-MR employs Euler's formula to map original features to complex space, constructing cooperatively enhanced non-redundant features for each modality, which helps reverse the situation of modality competition. Moreover, to counteract the performance degradation resulting from optimization drift among modalities, we propose a Multi-Modal Constrained (MMC) loss based on cosine similarity of complex feature phase and cross-entropy loss of individual modalities, guiding the optimization direction of the fusion network. Extensive experiments conducted on four multi-modal multimedia datasets and two task-specific multi-modal multimedia datasets demonstrate the superiority of our ERL-MR strategy over state-of-the-art baselines, achieving modality rebalancing and further performance improvements.
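
A minimal sketch of mapping features to complex space with Euler's formula, and of comparing two modalities by phase, might look as follows. The abstract does not say how ERL-MR derives amplitude and phase from the original features, so the split used here, the function names, and the mean-cosine phase score are assumptions made only for illustration.

```python
import cmath
import math


def to_complex(amplitude, phase):
    """Euler's formula: a * e^{i*theta} = a*cos(theta) + i*a*sin(theta)."""
    return [a * cmath.exp(1j * t) for a, t in zip(amplitude, phase)]


def phase_cosine(z1, z2):
    """Mean cosine of the element-wise phase difference of two complex vectors."""
    diffs = [cmath.phase(a) - cmath.phase(b) for a, b in zip(z1, z2)]
    return sum(math.cos(d) for d in diffs) / len(diffs)


# Two hypothetical modalities with different amplitudes but identical phases.
audio = to_complex([1.0, 2.0], [0.0, math.pi / 2])
video = to_complex([0.5, 1.5], [0.0, math.pi / 2])
print(phase_cosine(audio, video))
```

A phase-based score of this kind ignores amplitude, so a modality with small feature magnitudes is not penalized simply for being weaker than a dominant one.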

Abstract:
Object Navigation (ObjectNav), which enables an agent to seek any instance of an object category specified by a semantic label, has shown great advances. However, current agents are built upon occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. Driven by our embodied exploration strategy, BA is modeled by predicting navigational actions based on multi-frame visual images, as behaviors that cause differences between adjacent visual sensations are crucial for learning correlations among continuous visions. The GC is modeled as the alignment of behavior-aware visual stimulus with 3D semantic shapes by employing unsupervised contrastive learning. The aligned behavior-aware visual features and geometric invariance priors are injected into a modular ObjectNav framework to enhance object recognition and exploration capabilities. As expected, our ECL method performs well on object detection and instance segmentation tasks. Our ObjectNav strategy outperforms state-of-the-art methods on MP3D and Gibson datasets, showing the potential of our ECL in embodied navigation.

Abstract:
While margin-based deep face recognition models, such as ArcFace and AdaFace, have achieved remarkable success over recent years, they may suffer from degraded performance when encountering training sets corrupted with noise. This is often inevitable when massively large-scale datasets need to be dealt with, yet it remains difficult to construct sufficiently clean face datasets under these circumstances. In this paper, we propose a robust deep face recognition model, RobustFace, which combines the advantages of margin-based learning models with the strength of mining-based approaches to effectively mitigate the impact of noise during training. Specifically, we introduce a noise-adaptive mining strategy that dynamically adjusts the emphasis balance between hard and noisy samples by monitoring the model's recognition performance at the batch level to provide optimization-oriented feedback, enabling direct training on noisy datasets without the requirement of pre-training. Extensive experiments validate that our proposed RobustFace achieves competitive performance in comparison with the existing SoTA models when trained with clean datasets. When trained with both real-world and synthetic noisy datasets, RobustFace significantly outperforms the existing models, especially when the synthetic noisy datasets are corrupted with both close-set and open-set noise. While the existing baseline models suffer from an average performance drop of around 40% under these circumstances, our proposed RobustFace still delivers accuracy rates of more than 90%.

Abstract:
As visual interpretations for convolutional neural networks (CNNs), backpropagation attribution methods have been garnering growing attention. Nevertheless, the majority of these methods concentrate merely on the final convolutional layer, leading to tiny and concentrated interpretations that fail to adequately clarify the model-central attention. Therefore, we propose a precise attribution method (i.e., Holistic-CAM) for high-definition visual interpretation in the holistic stage of CNNs. Specifically, we first present weighted positive gradients to guarantee the sanity of interpretations in shallow layers and leverage multi-scale fusion to improve the resolution across the holistic stage. Then, we further propose a denoising strategy based on the fundamental scale component to eliminate the faithless attribution derived from fusing larger-scale features. The proposed method is capable of simultaneously rendering fine-grained and faithful interpretations for CNNs from shallow to deep layers. Extensive experimental results demonstrate that Holistic-CAM outperforms state-of-the-art methods on commonly used benchmarks, including deletion and insertion, energy-based point game, and Remove and Debias on ImageNet-1k. It also passes the sanity check easily.

Abstract:
Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from initial ones. Graph Neural Networks (GNNs) combined with Multilayer Perceptrons (MLPs) are regarded as a powerful way to handle sparse and unevenly distributed data. However, the expression capability of correspondence features obtained by MLPs is limited by their inherent lack of context information. In addition, previous works directly utilize the outputs of off-the-shelf GNNs, thus leading to confusion between sparse correspondence attribute features and their global structural information. To alleviate these issues, we propose a two-view correspondence pruning network, TrGa. Specifically, we first use complete Transformer structures instead of context-agnostic MLPs to capture correspondence features with global context information and stronger expression capability. After that, we introduce the Concatenation Graph Node and Global Structure (CGNS) block to separately capture the interaction patterns among sparse correspondence attribute features and the global structural information among them, which can prevent their confusion. Finally, the proposed Feature Dimension Transformation and Enhancement (FDTE) block is applied for dimension transformation and feature augmentation. Additionally, we propose an efficient variant, C-TrGa, in which the similarity matrix of the proposed C-Transformer is computed along the channel dimension. Extensive experiments demonstrate that the proposed TrGa and C-TrGa outperform state-of-the-art methods in different computer vision tasks.

Abstract:
Recently, temporal action localization (TAL) methods, especially the weakly-supervised and unsupervised ones, have become a hot research topic. Existing unsupervised methods follow an iterative "clustering and training" strategy with diverse model designs during the training stage, while they often overlook maintaining consistency between these stages, which is crucial: more accurate clustering results can reduce the noise of pseudolabels and thus enhance model training, while more robust training can in turn enrich the clustering feature representation. We identify two critical challenges in unsupervised scenarios: 1. What features should the model generate for clustering? 2. Which pseudolabeled instances from clustering should be chosen for model training? After extensive explorations, we propose a novel yet simple framework called Consistency-Oriented Progressive high actionness Learning to address these issues. For feature generation, our framework adopts a High Actionness snippet Selection (HAS) module to generate more discriminative global video features for clustering from the enhanced actionness features obtained from a designed Inner-Outer Consistency Network (IOCNet). For pseudolabel selection, we introduce a Progressive Learning With Representative Instances (PLRI) strategy to identify the most reliable and informative instances within each cluster for model training. These three modules, HAS, IOCNet, and PLRI, synergistically improve consistency in model training and clustering performance. Extensive experiments on the THUMOS'14 and ActivityNet v1.2 datasets under both unsupervised and weakly-supervised settings demonstrate that our framework achieves state-of-the-art results.

Abstract:
Multi-view multi-label classification has recently received extensive attention due to its wide-ranging applications across various fields, such as medical imaging and bioinformatics. However, views and labels are usually incomplete in practical scenarios, attributed to the uncertainties in data collection and manual labeling. To cope with this issue, we propose an uncertainty-aware pseudo-labeling and dual graph driven network (UPDGD-Net), which can fully leverage the supervised information of the available labels and feature information of available views. Different from the existing works, we leverage the label matrix to impose dual graph constraints on the embedded features of both view-level and label-level, which enables the method to maintain the inherent structure of the real data during the feature extraction stage. Furthermore, our network incorporates an uncertainty-aware pseudo-labeling strategy to fill the missing labels, which not only addresses the learning issue of incomplete multi-labels but also enables the method to explore more reliable supervised information to guide the network training. Extensive experiments on five datasets demonstrate that our method outperforms other state-of-the-art methods.

Abstract:
The Conditional Generative Adversarial Network (cGAN) is an important type of GAN that is often equipped with an auxiliary classifier. However, existing cGANs usually suffer from mode collapse, which can lead to unstable performance in practice. In this paper, we propose a novel stable training method for cGANs that preserves generation fidelity and diversity. Our key ideas are designing efficient adversarial training strategies for the auxiliary classifier and mitigating the overconfidence issue caused by the cross-entropy loss. We propose a classifier-based cGAN called Confidence Guided Generative Adversarial Networks (CG-GAN) by introducing adversarial training to a K-way classifier. In particular, we show in theory that the obtained K-way classifier can encourage the generator to learn the real joint distribution. To further enhance the performance and stability, we propose to establish a high-entropy prior label distribution for the generated data and incorporate a reverse KL divergence term into the minimax loss of CG-GAN. Through a comprehensive set of experiments on popular benchmark datasets, including the large-scale dataset ImageNet, we demonstrate the advantages of our proposed method over several state-of-the-art cGANs.

Abstract:
In the last two years, Artificial Intelligence Generated Content (AIGC) has received significant attention, leading to an anecdotal rise in the amount of AIGC being shared via social media platforms. The impact of AIGC and its implications are of key importance to social platforms, e.g., regarding the implementation of policies, community formation, and algorithmic design. Yet, to date, we know little about how the arrival of AIGC has impacted the social media ecosystem. To fill this gap, we present a comprehensive study of Pixiv, an online community for artists who wish to share and receive feedback on their illustrations. Pixiv hosts over 100 million artistic submissions and receives more than 1 billion page views per month (as of 2023). Importantly, it allows both human and AI generated content to be uploaded. Exploiting this, we perform the first analysis of the impact that AIGC has had on the social media ecosystem, through the lens of Pixiv. Based on a dataset of 15.2 million posts (including 2.4 million AI-generated images), we measure the impact of AIGC on the Pixiv community, as well as the differences between AIGC and human-generated content in terms of content creation and consumption patterns. Our results offer key insights into how AIGC is changing the dynamics of social media platforms like Pixiv.

Abstract:
Real-time mesh reconstruction is in high demand for integrating human avatars into modern computer graphics applications. Current methods typically use a coordinate-based MLP to represent the 3D scene as a Signed Distance Field (SDF) and optimize it through volumetric rendering, relying on Marching Cubes for mesh extraction. However, volumetric rendering is inefficient in both training and rendering, and the dependence on Marching Cubes significantly slows mesh extraction. This study introduces a novel approach, Mesh-Centric Gaussian Splatting (MCGS), which proposes a unique representation, Mesh-Centric SDF, and optimizes it using high-efficiency Gaussian Splatting. The primary innovation is Mesh-Centric SDF, a thin layer of SDF that envelops the underlying mesh and can be efficiently derived from it. Deriving the SDF from the mesh allows the mesh to be optimized through the SDF, provides the mesh as the 0 iso-surface, and eliminates the need for slow Marching Cubes. The secondary innovation is optimizing Mesh-Centric SDF with high-efficiency Gaussian Splatting. By dispersing the underlying mesh of Mesh-Centric SDF into multiple layers and generating Mesh-Constrained Gaussians on them, we create Multi-Layer Gaussians. These Mesh-Constrained Gaussians confine Gaussians within a 2D surface space defined by the mesh, ensuring an accurate correspondence between Gaussian rendering and mesh geometry. The Multi-Layer Gaussians serve as sampling layers of Mesh-Centric SDF and can be optimized with Gaussian Splatting, which in turn optimizes Mesh-Centric SDF and its underlying mesh. As a result, our method can directly optimize the underlying mesh through Gaussian Splatting, combining the fast training and rendering speeds of Gaussian Splatting with the precise surface learning of SDF. Experiments demonstrate that our method achieves dynamic mesh reconstruction at over 30 FPS. In contrast, SDF-based methods using Marching Cubes achieve less than 1 FPS, and concurrent 3D Gaussian Splatting-based methods cannot extract a reasonable mesh.

Abstract:
Voice is one of the most widely used media for information transmission in human society. While high-quality synthetic voices are extensively utilized in various applications, they pose significant risks to content security and trust building. Numerous studies have concentrated on AI-synthesized voice detection to mitigate these risks, with many claiming to achieve promising performance. However, recent research has demonstrated that fake voice detectors suffer from serious overfitting to speaker-irrelative features (SiFs) and cannot be used in real-world scenarios. In this paper, we analyze the limitations of existing fake voice detectors and propose a new design philosophy, guiding the detection model to prioritize learning human voice features rather than the difference between the human voice and the synthetic voice. Based on this philosophy, we propose a novel AI-synthesized voice detection framework named SiFSafer, which uses pre-trained speech representation models to enhance the learning of feature distribution in human voices and the adapter fine-tuning to optimize the performance. The evaluation shows that the average EERs of existing fake voice detectors in the ASVspoof datasets can exceed 20% if the SiFs like silence segments are removed, while SiFSafer achieves an EER of less than 8%, indicating that SiFSafer is robust to SiFs and strongly resistant to existing attacks.
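For readers unfamiliar with the metric, the equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of that computation on raw detector scores; the function name, the convention that higher scores mean "more likely human", and the toy scores are assumptions for illustration and are not taken from the paper or the ASVspoof toolkit.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold), where higher scores mean 'more likely genuine'.

    Scans every observed score as a candidate threshold and keeps the one
    where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine sample scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Toy scores: one genuine sample and one spoofed sample are misranked.
genuine = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real evaluation sets the two error curves rarely cross exactly at an observed score, so practical implementations usually interpolate between neighbouring thresholds; this sketch simply averages the two rates at the closest point.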

Abstract:
Understanding the speaking style, such as the emotion of the interlocutor's speech, and responding with speech in an appropriate style is a natural occurrence in human conversations. However, technically, existing research on speech synthesis and speaking style captioning typically proceeds independently. In this work, an innovative framework, referred to as UniStyle, is proposed to incorporate both the capabilities of speaking style captioning and style-controllable speech synthesizing. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The role of the UniConnector is to bridge the gap between different modalities, namely speech audio and text descriptions. It enables the generation of text descriptions with speech as input and the creation of style representations from text descriptions for speech synthesis with the speech generator. Besides, to overcome the issue of data scarcity, we propose a two-stage and semi-supervised training strategy, which reduces data requirements while boosting performance. Extensive experiments conducted on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.

Abstract:
Video Conferencing Applications (VCAs) are indispensable for real-time communication in remote work and education by enabling simultaneous transmission of audio, video, and screen-sharing content. Despite their ubiquity, research on how these platforms allocate network bandwidth, especially under constrained conditions, and how these resource allocation strategies affect the users' Quality of Experience (QoE) is lacking. This paper addresses this gap by analyzing bandwidth allocation strategies in Zoom, Webex, and Google Meet, with a focus on QoE implications. To assess QoE, we propose a general QoE prediction model based on data collected from a study involving 800 participants. This study is a pioneering effort in evaluating multimedia transmissions across diverse scenarios and network conditions, advancing beyond prior research focused on single media types. The results demonstrate the model's effectiveness and generality in predicting QoE across various VCA scenarios.

Abstract:
Scene images are an important window for showcasing product design. To obtain one, the standard 3D-based pipeline requires the designer not only to create a 3D model of the product but also to manually construct the entire scene in software, which hinders its adaptability in situations requiring rapid evaluation. This study aims to realize a novel conditional synthesis method that creates the scene image from a single-model rendering of the desired object and a scene description. In this task, the major challenges are ensuring the strict appearance fidelity of the drawn object and the overall visual harmony of the synthesized image. The former relies on maintaining an appropriate condition-output constraint, while the latter necessitates a well-balanced generation process for all regions of the image. In this work, we propose the Scene Diffusion framework to meet these challenges. Its first contribution is the Shading Adaptive Condition Alignment (SACA), which functions as an intensive training objective to promote appearance consistency between the condition and the output image without hindering the network's learning of global shading coherence. In addition, a novel low-to-high Frequency Progression Training Schedule (FPTS) is utilized to maintain the visual harmony of the entire image by moderating the growth of high-frequency signals in the object area. Extensive qualitative and quantitative results are presented to support the advantages of the proposed method. We also demonstrate the broader uses of Scene Diffusion, such as its incorporation with ControlNet.

Abstract:
QUIC is the underlying protocol of the next generation HTTP/3, serving as the major vehicle delivering video data nowadays. As a userspace protocol based on UDP, QUIC features low transmission latency and has been widely deployed by content providers. However, the high computational overhead of QUIC shifts system knobs to CPUs in high-bandwidth scenarios. When CPU resources become the constraint, HTTP/3 exhibits even lower throughput than HTTP/1.1. In this paper, we carefully analyze the performance bottleneck of QUIC and find it results from ACK processing, packet sending, and data encryption. By reducing the ACK frequency, activating UDP generic segmentation offload (GSO), and incorporating PicoTLS, a high-performance encryption library, the CPU overhead of QUIC could be effectively reduced in stable network environments. However, simply reducing the ACK frequency also impairs the transmission throughput of QUIC under poor network conditions. To solve this, we develop LiteQUIC, which involves two mechanisms towards alleviating the overhead of ACK processing in addition to GSO and PicoTLS. We evaluate LiteQUIC in the DASH-based video streaming, and the results show that LiteQUIC achieves 1.2× higher average bitrate and 93.3% lower rebuffering time than an optimized version of QUIC with GSO and PicoTLS.

Abstract:
The rapid development of image generative models has lowered the threshold for image creation but also raised security concerns related to the propagation of false information, urgently necessitating the development of detection technologies for AI-generated images. Presently, text-to-image generation stands as the predominant approach to image generation, where the rendering of generated images hinges on two primary factors: text prompts and the inherent characteristics of the model. However, the variety of semantic text prompts yields diverse generated images, posing significant challenges to existing detection methodologies that rely solely on learning from image features, particularly in scenarios with limited samples. To tackle these challenges, this paper presents a novel perspective on the AI-generated image detection task, advocating for detection under semantic-decoupling conditions. Building upon this insight, we propose SemGIR, a semantic-guided image regeneration based method for AI-generated image detection. SemGIR first regenerates images through image-to-text followed by a text-to-image generation process, subsequently utilizing these re-generated image pairs to derive discriminative features. This regeneration process effectively decouples semantic features organically, allowing the detection process to concentrate more on the inherent characteristics of the generative model. Such an efficient detection scheme can also be effectively applied to attribution. Experimental findings demonstrate that in realistic scenarios with limited samples, SemGIR achieves an average detection accuracy 15.76% higher than state-of-the-art (SOTA) methods. Furthermore, in attribution experiments on the SDv2.1 model, SemGIR attains an accuracy exceeding 98%, affirming the effectiveness and practical utility of the proposed method.

Abstract:
In the field of Vision-Language Models (VLM), the Contrastive Language-Image Pretraining (CLIP) model has yielded outstanding performance on many downstream tasks through prompt tuning. By integrating image and text representations, CLIP exhibits zero-shot generalization capabilities on unseen data. However, when new categories and distribution shifts occur, the pretrained text embeddings in CLIP may not align well with unseen images, potentially leading to a decrease in CLIP's zero-shot generalization performance. To address this issue, many existing methods use test samples to update the CLIP model during testing through a process known as Test-Time Adaptation (TTA). Previous TTA techniques, such as image augmentation, can lead to overfitting given outlying samples, while methods based on teacher-student distillation can increase memory use. Further, these methods significantly increase inference time, which is a crucial factor in the testing phase. To improve robustness, mitigate overfitting, and reduce bias toward outlying samples, we propose a novel method: Self-Text Distillation with Conjugate Pseudo-labels (SCP), designed to enhance CLIP's zero-shot generalization. SCP uses gradient information from conjugate pseudo-labels to enhance the model's robustness toward distribution shifts. It also innovates by using a fixed prompt list to distil learnable prompts from within the same model, acting as a self-regulation mechanism that minimizes overfitting. Additionally, SCP is a fully test-time adaptation method that does not require retraining. It directly improves CLIP's zero-shot generalization at test time without increasing either memory overheads or inference time. In evaluations across three zero-shot generalization scenarios, SCP surpasses existing state-of-the-art methods in performance and significantly reduces inference time.

Abstract:
Due to device constraints and lighting conditions, captured images frequently exhibit coupled low-resolution and ultra-dark degradations. Enhancing the visibility and resolution of ultra-dark images simultaneously is crucial for practical applications. Current approaches often address both tasks in isolation or through simplistic cascading strategies, while also relying heavily on empirical and manually designed composite loss constraints, which inevitably results in compromised training efficacy, increased artifacts, and diminished detail fidelity. To address these issues, we propose TriCo, the first method to adopt a Tri-level learning framework that explicitly formulates the bidirectional Cooperative relationship and devises algorithms to tackle coupled degradation factors. In the optimization across Upper (U)-Middle (M)-Lower (L) levels, we model the synergistic dependencies between illumination learning and super-resolution tasks within the M-L levels. Moving to the U-M levels, we introduce hyper-variables to automate the learning of beneficial constraints for both learning tasks, moving beyond the traditional trial-and-error pitfalls of the learning process. Algorithmically, we establish a Phased Gradient-Response (PGR) algorithm as our training mechanism, which facilitates a dynamic, inter-variable gradient feedback and ensures efficient and rapid convergence. Moreover, we merge inherent illumination priors with universal semantic model features to adaptively guide pixel-level high-frequency detail recovery. Extensive experimentation validates the framework's broad generalizability across challenging ultra-dark scenarios, outperforming current state-of-the-art methods across 4 real and synthetic benchmark datasets over 6 metrics (e.g., 5.8%↑ in PSNR and 26.6%↓ in LPIPS).
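To make the reported figures concrete, the sketch below shows the standard PSNR formula and how a relative change between two metric values can be expressed as a percentage. It assumes 8-bit images flattened into lists of pixel values; the helper names and the example numbers are illustrative and do not come from the paper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def relative_change(new, baseline):
    """Percentage change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline * 100.0

# Illustrative numbers only: a baseline at 20.00 dB and a method at 21.16 dB.
print(f"{relative_change(21.16, 20.0):.1f}% change in PSNR")
```

Higher PSNR is better while lower LPIPS is better, so a positive relative change in the former and a negative one in the latter both indicate improvement.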

Abstract:
Knowledge graph (KG) completion based on link prediction has long been a focus of research. However, the overwhelming majority of existing methods can only serve 2-ary KGs, whereas in practice knowledge hypergraphs (KHs) covering facts beyond binary relations are far more ubiquitous. When confronted with them, the many studies designed for KGs adapt poorly. The few works targeting N-ary KHs generally simply extend KG methods, and they usually transform N-ary knowledge into role-value pairs or triples, largely simplifying the inherent association within each piece of knowledge. Furthermore, previous models study each piece of N-ary knowledge independently, so structural correlations among them are completely neglected. Motivated by these observations, and to avoid breaking the knowledge structure in KHs as previous studies do, we propose RHKH, the first KH reasoning model based on the original knowledge format. To handle the complicated compositions implied by the original format of N-ary tuples, associations within and among tuples are discovered through an innovative relational hypergraph neural network, RHNN. It also considers complex interactions between the relation and the entities involved in the same piece of knowledge. To refine such interactions, semantic components at each arity-position of relations are distinguished, and a position-specific shift is introduced. Extensive experiments demonstrate the effectiveness of RHKH.

Abstract:
Video moment localization (VML) aims to identify the temporal boundary semantically matching the given query. Point-supervised VML balances localization accuracy and annotation cost but is still immature due to granularity alignment and scale perception issues. To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework aimed at leveraging limited single-frame annotation for correspondence learning. It explicitly models semantic relations of different feature granularities and adaptively mines the implicit semantic scale, thereby enhancing feature representations of varying granularities and scales. SG-SCI uses granularity correspondence alignment to align semantics via latent prior knowledge and a scale correspondence learning to identify and address semantic scale differences. Extensive experiments on benchmark datasets have demonstrated the promising performance of our model over several state-of-the-art competitors.

Abstract:
Learning item representation is crucial for a myriad of on-line e-commerce applications. The nucleus of retail item representation learning is how to properly fuse the semantics within a single item, and the interactions across different items generated by user behaviors (e.g., co-click or co-view). Product semantics depict the intrinsic characteristics of the item, while the interactions describe the relationships between items from the perspective of human perception. Existing approaches either solely rely on a single type of information or loosely couple them together, leading to hindered representations. In this work, we propose a novel model named TESPA to reinforce semantic modeling and interaction modeling mutually. Specifically, collaborative filtering signals in the interaction graph are encoded into the language models through fine-grained topological pre-training, and the interaction graph is further enriched based on semantic similarities. After that, a novel multi-channel co-training paradigm is proposed to deeply fuse the semantics and interactions under a unified framework. In a nutshell, TESPA is capable of enjoying the merits of both sides to facilitate item representation learning. Experimental results of on-line and off-line evaluations demonstrate the superiority of our proposal.

Abstract:
Point cloud upsampling concerns producing a dense and uniform point set from a sparse and irregular one. Current upsampling methods primarily encounter two challenges: (i) insufficient uni-modal representations of sparse point clouds, and (ii) inaccurate estimation of geometric details in dense point clouds, resulting in suboptimal upsampling results. To tackle these challenges, we propose MVP-Net, a multi-view depth image guided cross-modal detail estimation distillation network for point cloud upsampling, in which the multi-view depth images of point clouds are fully explored to guide upsampling. Firstly, we propose a cross-modal feature extraction module, consisting of two branches designed to extract point features and depth image features separately. This setup aims to produce sufficient cross-modal representations of sparse point clouds. Subsequently, we design a Multi-View Depth Image to Point Feature Fusion (MVP) block to fuse the cross-modal features in a fine-grained and hierarchical manner. The MVP block is incorporated into the feature extraction module. Finally, we introduce a paradigm for multi-view depth image-guided detail estimation and distillation. The teacher network fully utilizes paired multi-view depth images of sparse point clouds and their dense counterparts to formulate multi-hierarchical representations of geometric details, thereby achieving high-fidelity reconstruction. Meanwhile, the student network takes only sparse point clouds and their multi-view depth images as input, and it learns to predict the multi-hierarchical detail representations distilled from the teacher network. Extensive qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art point cloud upsampling methods.

Abstract:
Deceptive images can quickly spread via social networking services, posing significant risks. The rapid progress in Image Manipulation Localization (IML) seeks to address this issue. However, the scarcity of public training datasets in the IML task directly hampers the performance of models. To address the challenge, we propose a Prompt-IML framework, which leverages the rich prior knowledge of pre-trained models by employing tunable prompts. Specifically, sets of tunable prompts enable the frozen pre-trained model to extract multi-view features, including spatial and high-frequency features. This approach minimizes redundant architecture for feature extraction across different views, resulting in reduced training costs. In addition, we develop a plug-and-play Feature Alignment and Fusion module that seamlessly integrates into the pre-trained models without additional structural modifications. The proposed module reduces noise and uncertainty in features through interactive processing. The experimental results showcase that our proposed method attains superior performance across 6 test datasets, demonstrating exceptional robustness.

Abstract:
With the rapid development of multimedia applications such as online education, remote conferences, and telemedicine, an emerging type of image known as the text screen content image (TSCI) has gained widespread utilization. Distinct from natural images captured by cameras, TSCI is generally generated or rendered by computers and exhibits significant differences in content characteristics. Notably, TSCI primarily comprises text, which is a symbol system uniquely defined by humans with specific semantics. As an important carrier for transmitting semantic information, the quality of text in TSCI significantly affects the subjective perception experience of multimedia system users. Just noticeable difference (JND) is a widely studied image quality measure that is theoretically closest to human perception. However, the traditional JND (T-JND) experiments fail to distinguish text from other image contents, ignoring the significant impact of text semantic readability on image quality. This paper, for the first time, focuses on the impact of text semantics on the quality of TSCI, and JND experiments for TSCIs compressed by the state-of-the-art versatile video coding (VVC) standard are explored and discussed. Specifically, a matching TSCI dataset is first established. Using the dataset, subjective image observation experiments are then designed and carried out to conduct the traditional JND (T-JND) experiment as well as the semantic-aware JND (S-JND) experiment. By comparing the experimental results, crucial conclusions are reached, including the fact that the S-JND experiment provides a more precise description of TSCI quality than the T-JND experiment. These conclusions provide important guidance for the subsequent development of efficient JND models suitable for TSCIs compressed by VVC.

Abstract:
Deep steganography is a technique that imperceptibly hides secret information in images using neural networks. Existing networks consist of two components: a hiding component for information hiding and an adversary component for countering steganalyzers. However, these two components sit at opposite ends of a seesaw, and it is difficult to balance the tradeoff between message extraction accuracy and security performance through joint optimization. To address this issue, this paper proposes a steganographic method called AHDeS (Adversary-Hiding-Decoupled Steganography) under the Dig-and-Fill paradigm, wherein the adversary and hiding components are decoupled into an optimization-based adversary module in the digging process and an INN-based hiding network in the filling process. Specifically, in the training stage, the INN is first trained to acquire the ability of message embedding. In the deployment stage, given the well-trained and fixed INN, the cover image is first iteratively optimized to enhance the security performance against steganalyzers, followed by the actual message embedding by the INN. Owing to the reversibility of the INN, security performance can be enhanced without sacrificing message extraction accuracy. Experimental results show that AHDeS achieves state-of-the-art security performance and visual quality while maintaining satisfactory message extraction accuracy.

Abstract:
In e-commerce platforms, visual content plays a pivotal role in capturing and retaining audience attention. A high-quality and aesthetically designed product background image can quickly grab consumers' attention, and increase their confidence in taking actions, such as making a purchase. Recently, diffusion models have achieved profound advancements, rendering product background generation a promising avenue for exploration. However, text-guided diffusion models require meticulously crafted prompts. The diverse range of products makes it challenging to compose prompts that result in visually appealing and semantically appropriate background scenes. Current work has made great efforts on creating prompts through expert-crafted rules or specialized fine-tuning of large language models, but it still relies on detailed human inputs and often falls short in generating desirable results by e-commerce standards.

Abstract:
We propose Interpretable Neural Radiance Fields (InNeRF) for generalizable 3D scene representation and rendering. In contrast to previous image-based rendering, which used two independent working processes of pooling-based fusion and MLP-based rendering, our framework unifies source-view fusion and target-view rendering processes via an end-to-end interpretable Transformer-based network. InNeRF enables the investigation of deep relationships between the target-rendering view and source views that were previously neglected by pooling-based fusion and fragmented rendering procedures. As a result, InNeRF improves model interpretability by enhancing the shape and appearance consistency of a 3D scene in both the surrounding view space and the ray-cast space. For a query rendering 3D point, InNeRF integrates both its projected 2D pixels from the surrounding source views and its adjacent 3D points along the query ray and simultaneously decodes this information into the query 3D point representation. Experiments show that InNeRF outperforms state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene finetuning scenarios, especially when there is a considerable disparity between source views and rendering views. The interpretation experiment shows that InNeRF can explain a query rendering process.

Abstract:
Generative AI has revolutionized multimedia, leading to groundbreaking developments in content creation, interactive experiences, and personalized media. This panel delves into the transformative potential of generative AI in academic and industrial sectors, exploring its future applications and connections to emerging techniques. Additionally, the panel will address newly identified opportunities and challenges from both technical and ethical perspectives, highlighting the importance of responsible AI development. Bringing together leading experts from universities, research institutions, and industry, this panel aims to foster discussion and debate among participants. We invite everyone to join and contribute to this critical and promising area of research in the multimedia community.

Abstract:
Semi-supervised graph domain adaptation, as a subfield of graph transfer learning, seeks to precisely annotate unlabeled target graph nodes by leveraging transferable features acquired from the limited labeled source nodes. However, most existing studies often directly utilize graph convolutional networks (GCNs)-based feature extractors to capture domain-invariant node features, while neglecting the issue that GCNs are insufficient in collecting complex structure information in graph. Considering the importance of graph structure information in encoding the complex relationship among nodes and edges, this paper aims to utilize such powerful information to assist graph transfer learning. To achieve this goal, we develop a novel framework called HOGDA. Concretely, HOGDA introduces a high-order structure information mixing (HSIM) module to effectively capture abundant structure information in graph, greatly enhancing the feature extractor's ability to adapt across different domains. Moreover, to achieve fine-grained feature distributions alignment, a novel strategy called adaptive weighted domain alignment (AWDA) is proposed to dynamically adjust the node weight during adversarial domain adaptation process, effectively boosting the model's transfer ability. Furthermore, to mitigate the overfitting phenomenon caused by limited source labeled nodes, we also design a trust-aware node clustering (TNC) strategy to guide the unlabeled nodes to achieve discriminative clustering. Extensive experimental results show that our HOGDA outperforms the state-of-the-art methods on various transfer tasks.

Abstract:
Understanding the correct input domain for black-box models is vital for tasks such as model cloning, inversion, and membership inference. However, this area remains underexplored, hindering related methods' efficacy without domain information. In this paper, we highlight the need for discovering the data domain and propose an approach that leverages existing generative models to address this challenge. With hard-label black-box access to a neural network model, our method produces a set of embeddings that, when utilized with the generative model, yield samples closely aligned with each target class's data domain, facilitating downstream tasks. Central to our method is an objective function covering both functional relevance and embedding generality. We employ an iterative search algorithm to identify the optimal set of embeddings. Starting with initial embeddings, new data points are generated and classified by the target model. Successful classifications guide embedding resampling, refining subsequent iterations' generated images closer to the target class's data domain. Consequently, the embeddings are iteratively modified to better match the data domain of the target class. Given the vast embedding space, we introduce an optional preprocessing phase. This phase leverages a comprehensive corpus like ImageNet to select a representative subset of samples, roughly aligned with the model's input domain, to serve as starting points.

Abstract:
Although automatic shot transition detection approaches have been investigated for more than two decades, an effective universal human-level model has not yet been proposed. Even for common shot transitions like hard cuts or simple gradual changes, the potential diversity of analyzed video contents may still lead to both false hits and false dismissals. Recently, deep learning-based approaches have significantly improved the accuracy of shot transition detection using 3D convolutional architectures and artificially created training data. Nevertheless, one hundred percent accuracy is still an unreachable ideal. In this paper, we share the current version of our deep network TransNet V2, which reaches state-of-the-art performance on respected benchmarks. A trained instance of the model is provided so it can be instantly utilized by the community for a highly efficient analysis of large video archives. Furthermore, the network architecture, as well as our experience with the training process, is detailed, including simple code snippets for convenient usage of the proposed model and visualization of results.
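A shot transition detector of this kind typically emits one transition score per frame, which then has to be turned into shot boundaries. The sketch below shows one simple way to do that post-processing step. It is a generic illustration with a hypothetical function name and an assumed threshold of 0.5, not the TransNet V2 API or the authors' released code.

```python
def predictions_to_shots(transition_probs, threshold=0.5):
    """Group frames into shots given per-frame transition probabilities.

    Frames scoring at or above `threshold` are treated as transition frames;
    each maximal run of the remaining frames becomes one (start, end) shot,
    with both indices inclusive.
    """
    shots = []
    start = None
    for i, p in enumerate(transition_probs):
        if p < threshold:
            if start is None:
                start = i  # first frame of a new shot
        elif start is not None:
            shots.append((start, i - 1))  # shot ends just before the transition
            start = None
    if start is not None:
        shots.append((start, len(transition_probs) - 1))
    return shots

# One hard cut at frame 2 and a two-frame gradual transition at frames 5-6.
probs = [0.0, 0.1, 0.9, 0.0, 0.0, 0.8, 0.7, 0.1]
print(predictions_to_shots(probs))
```

Whether a transition frame should be assigned to the preceding shot, the following shot, or neither is a convention; this sketch excludes transition frames from both.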

Abstract:
In this demonstration, we present DanceMimic, a real-time dance imitation capture and assessment system designed to enhance the accessibility and learning experience of dancing. Guided by an interactive user interface, novice dancers can simultaneously observe and imitate a selected choreography while listening to the corresponding music. The choreography is captured and compared with the reference for quantitative evaluation of the performance proficiency. Finally, our system retargets the performed dance to a rigged 3D character to provide an immersive imitation experience. A demo video is available at: https://youtu.be/7nL9YPPRj-4

Abstract:
Point clouds have a strong capability for modeling 3D objects and scenes and are widely used in diverse applications, which creates transmission and storage burdens. Efficient compression algorithms have been explored extensively, and research effort has also been invested in enhancement algorithms. Moreover, the quality of point clouds can influence 3D analysis tasks, e.g., classification, segmentation, detection, and multimodal understanding. Recent 3D multimodal large models can bring better perception optimizations. This tutorial will provide the fundamental knowledge for point cloud compression, enhancement, and applications, and place emphasis on the influence of point cloud quality on human and machine perception. We will also discuss the progress of international standards and open source projects for point cloud technologies. From this tutorial, audiences are expected to grasp the basic knowledge and recent progress of point cloud technologies, and to promote research developments in both academic and industrial communities.

Abstract:
This paper presents a summary of the proposed solution to the AV-Deepfake1M competition. Deepfake technology is developing fast, and realistic generation techniques of audio and videos have aroused public concerns. With this background, the AV-Deepfake1M competition aims to address the problem of audio-video Deepfake and provides a large-scale dataset named AV-Deepfake1M to boost the research in this area. In this paper, we present our solutions which have achieved top performance in this competition. We also provide more detailed experiments to prove the effectiveness of the modules used in our methods.

Abstract:
Engagement estimation is crucial for advancing natural human-computer interaction, allowing artificial agents to dynamically adjust their responses based on user engagement levels and creating more intuitive and immersive experiences. Despite advancements in automating real-time engagement estimation, challenges persist in real-world scenarios due to the complex nature of multi-modal human social signals. This paper proposes a novel cross-modality fusion-based methodology to address these challenges by leveraging multi-modal data. Our approach integrates visual and audio features, such as facial motion, acoustic characteristics, Contrastive Language-Image Pretraining (CLIP), and semantic embeddings. These features first pass through a transformer encoder and are then combined and processed through a cross-modal fusion mechanism, ensuring robust integration. The final integrated features are then used to predict engagement scores. This hierarchical and self-normalizing approach enhances the accuracy of engagement estimation by effectively capturing dependencies within and between modalities. The experiments are conducted on the MultiMediate NoXI and MPIIGroupInteraction datasets, and the results demonstrate competitive performance in estimating engagement levels, addressing the complex, context-dependent nature of human engagement. Specifically, our approach achieves a Global Concordance Correlation Coefficient (CCC) score approximately 56.1% higher than the baseline. This work contributes to developing more intelligent and responsive artificial systems, enhancing user experiences across various interactive applications.
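For readers unfamiliar with the metric, the following minimal sketch computes Lin's concordance correlation coefficient for one pair of prediction and ground-truth sequences, using the standard definition CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `concordance_cc` is an assumption made for illustration, and the abstract does not specify how the "Global" score is aggregated across datasets, so that step is not shown.

```python
from statistics import fmean


def concordance_cc(pred, target):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = fmean(pred), fmean(target)
    # Population (1/N) variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


if __name__ == "__main__":
    print(concordance_cc([0.2, 0.5, 0.9], [0.1, 0.6, 0.8]))
```

Unlike Pearson correlation, CCC penalizes offsets in mean and scale: predictions that are perfectly correlated with the labels but shifted by a constant score below 1.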

Abstract:
Visual Spatial Description (VSD) is an emerging image-to-text task which aims at generating descriptions of the spatial relationships between given objects in an image. In this paper, we apply Retrieval-Augmented Generation (RAG) technology to guide Multimodal Large Language Models (MLLMs) for the task of VSD, complemented by an Adaptive Hallucination Corrector, and further fine-tune the models to bolster semantic understanding and overall model efficacy. We found that our approach demonstrated higher accuracy and fewer hallucination errors in both spatial relationship classification and visual language description tasks within the VSD task, achieving state-of-the-art results.

Abstract:
In recent years, the task of image-to-text generation has received considerable attention from scholars. One of its subtasks, Visual Spatial Description (VSD), focuses on a model's ability to understand spatial relationships. VSD is a novel task that emphasizes spatial semantics by generating sentences describing the spatial relationships between two objects in a given image. In this work, a VSD method based on large language model fine-tuning (LFVSD) is proposed to enhance the accuracy and robustness of visual spatial relationship descriptions. Initially, image and text features are extracted using pre-trained models, and Q-Former is employed for feature fusion. The original and fused features are then fed into FlanT5XXL. Object overlap priors are introduced, and momentum distillation is used to filter hard negative samples and generate soft labels. Finally, multiple VSD models are trained using data augmentation and long-tail data balancing techniques. Through multimodal feature fusion and fine-tuning, our approach is evaluated on the VSD2024 test set, which includes 5,855 images and their corresponding textual descriptions. The results demonstrate the effectiveness of our proposed method.

Abstract:
The AI Generated Image Detection Challenge, organized by MGTV, invites participants to develop advanced algorithms capable of accurately distinguishing between real and AI-generated images. These images may be created using various cutting-edge techniques, including but not limited to GAN and Stable Diffusion algorithms. Participants are encouraged to utilize open-source datasets or develop their own datasets to train their algorithms. This challenge presents a unique opportunity to enhance the field of AI-generated image detection, particularly in improving the algorithm's generalization capabilities to identify unknown and emerging samples. For more details and resources, please visit our official website (https://challenge.ai.mgtv.com/#/track/24).

Abstract:
Micro-expressions are subtle facial movements that reveal hidden emotions, but their fleeting and involuntary nature poses significant challenges for detection. This paper introduces a novel approach addressing two critical tasks in micro-expression analysis: spotting and recognition. We integrate the VideoMAE V2 framework with a temporal information adapter and multi-scale feature fusion to enhance the performance of Micro-Expression Spotting-then-Recognize. Our method leverages the temporal information adapter to capture local temporal context within video frames, improving feature extraction efficiency. Additionally, we construct a multi-scale image pyramid to capture a range of motion features, from broad movements to subtle details. By combining these multi-scale features, our approach strengthens the model's capabilities in Micro-Expression Spotting-then-Recognize. Our method effectively addresses issues related to environmental variations, involuntary facial movements, and dataset imbalance, leading to improved accuracy in Micro-Expression Spotting-then-Recognize.

Abstract:
Previous research has demonstrated the potential of Augmented Reality in enhancing psychological comfort in Human-Robot Interaction (AR-HRI) through shared robot intent, enhanced visual feedback, and increased expressiveness and creativity in interaction methods. However, the challenge of selecting interaction methods that enhance physical comfort in varying scenarios remains. This study proposes a dynamic dual-layer interaction adjustment mechanism to improve user comfort and interaction efficiency. The mechanism comprises two models: a general layer model, grounded in ergonomics principles, identifies appropriate areas for various interaction methods; an individual layer model predicts user discomfort levels using physiological signals. Interaction methods are dynamically adjusted based on discomfort level changes, enabling the system to adapt to individual differences and dynamic changes, thereby reducing misjudgments and enhancing comfort management. The mechanism's success in authoring tasks validates its effectiveness, significantly advancing AR-HRI and fostering more comfortable and efficient human-centered interactions.

Abstract:
Due to the small size of valid samples, multi-source EEG features with high dimensionality can easily cause problems such as overfitting and poor real-time performance of the emotion recognition classifier. Feature selection has been demonstrated as an effective means to solve these problems. Current EEG feature selection research assumes that all dimensions of emotional labels are complete. However, owing to the open acquisition environment, subjective variability, and border ambiguity of individual perceptions of emotion, the training data in the practical application often includes missing information, i.e., multi-dimensional emotional labels of several instances are incomplete. The aforementioned incomplete information directly restricts the accurate construction of the EEG feature selection model for multi-dimensional emotion recognition. To wrestle with the aforementioned problem, we propose a novel EEG feature selection model with weighted self-expression learning (WSEL). The model utilizes self-representation learning and least squares regression to reconstruct the label space through the second-order correlation and higher-order correlation within the multi-dimensional emotional labels and simultaneously realize the EEG feature subset selection under the incomplete information. We have utilized two multimedia-induced emotion datasets with EEG recordings, DREAMER and DEAP, to confirm the effectiveness of WSEL in the missing multi-dimensional emotional feature selection challenge. Compared to nine state-of-the-art feature selection approaches, the experimental results demonstrate that the EEG feature subsets chosen by WSEL can achieve optimal performance in terms of six performance metrics.

Abstract:
Recent research on Diffusion Models and Transformers has brought significant advancements to 3D Human Pose Estimation (HPE). Nonetheless, existing methods often fail to concurrently address the issues of accuracy and generalization. In this paper, we propose a Geometry-guided Diffusion Model with Masked Transformer (Masked Gifformer) for robust multi-view 3D HPE. Within the framework of the diffusion model, a hierarchical multi-view transformer-based denoiser is exploited to fit the 3D pose distribution by systematically integrating joint and view information. To address the long-standing problem of poor generalization, we introduce a fully random mask mechanism without any additional learnable modules or parameters. Furthermore, we incorporate geometric guidance into the diffusion model to enhance the accuracy of the model. This is achieved by optimizing the sampling process to minimize reprojection errors through modeling a conditional guidance distribution. Extensive experiments on two benchmarks demonstrate that Masked Gifformer effectively achieves a trade-off between accuracy and generalization. Specifically, our method outperforms other probabilistic methods by > 40% and achieves comparable results with state-of-the-art deterministic methods. In addition, our method exhibits robustness to varying camera numbers, spatial arrangements, and datasets.

Abstract:
Since the release of the CLIP model by OpenAI, it has received widespread attention. However, categories in the real world often exhibit a long-tail distribution, and existing CLIP models struggle to effectively recognize rare, tail-end classes, such as an endangered African bird. An intuitive idea is to generate visual descriptions for these tail-end classes and use the descriptions to create category prototypes for classification. However, experiments reveal that visual descriptions, image captions, and test prompt templates belong to three distinct domains, leading to distribution shifts. In this paper, we propose the use of caption object parsing to identify the object set contained within each caption. During training, the object sets are used to generate visual descriptions and test prompts, aligning these three domains and enabling the text encoder to generate category prototypes based on visual descriptions. Thanks to the acquired object sets, our approach can construct many-to-many relationships at a lower cost and derive soft labels, addressing the noise issues associated with traditional one-to-one matching. Extensive experimental results demonstrate that our method significantly surpasses the CLIP baseline and exceeds existing methods, achieving a new state-of-the-art (SOTA).

Abstract:
Enlightened by the InfoMax principle, Graph Contrastive Learning (GCL) has achieved remarkable performance in processing large amounts of unlabeled graph data. Due to the impracticality of precisely calculating mutual information (MI), conventional contrastive learning methods turn to approximate its lower bound using parametric neural estimators, which inevitably introduces additional parameters and leads to increased computational complexity. Building upon a common Gaussian assumption on the distribution of node representations, a computationally tractable surrogate for the original MI can be rigorously derived, termed as Gaussian Mutual Information (GMI). Leveraging multi-view priors of GCL, we induce an efficient contrastive objective based on GMI with performance guarantees, eliminating the reliance on parameterized estimators and negative samples. The emergence of another decorrelation-based self-supervised learning branch parallels contrastive-based approaches. By positioning the proposed GMI-based objective as a pivot, we bridge the gap between these two research areas from two aspects of approximate form and consistent solution, which contributes to the advancement of a unified theoretical framework for self-supervised learning. Extensive comparison experiments, ablation studies, and visual analysis provide compelling evidence for the effectiveness and efficiency of our method while supporting our theoretical achievements.
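As background for why a Gaussian assumption makes mutual information tractable, the textbook closed form for jointly Gaussian variables is reproduced here. The abstract does not state the exact GMI objective, so this is the standard identity only and should not be read as the paper's formula; the symbols $\Sigma_X$, $\Sigma_Y$, $\Sigma_{XY}$, and $\rho$ are notation introduced for this illustration.

```latex
% Mutual information of jointly Gaussian X and Y with joint covariance
% \Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^{\top} & \Sigma_Y \end{pmatrix}
\[
  I(X;Y) \;=\; \frac{1}{2}\,\log\frac{\det\Sigma_X \,\det\Sigma_Y}{\det\Sigma}
\]
% Scalar special case with correlation coefficient \rho:
\[
  I(X;Y) \;=\; -\frac{1}{2}\,\log\bigl(1-\rho^{2}\bigr)
\]
```

Both expressions depend only on second-order statistics, which is why no parametric neural estimator is needed once the Gaussian assumption is made.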

Abstract:
Existing RGB-D semantic segmentation methods struggle to handle modality missing input, where only RGB images or depth maps are available, leading to degenerated segmentation performance. We tackle this issue using MaskMentor, a new pre-training framework for modality missing segmentation, which advances its counterparts via two novel designs: Masked Modality and Image Modeling (M2IM), and Self-Teaching via Token-Pixel Joint reconstruction (STTP). M2IM simulates modality missing scenarios by combining both modality- and patch-level random masking. Meanwhile, STTP offers an effective self-teaching strategy, where the trained network assumes a dual role, simultaneously acting as both the teacher and the student. The student with modality missing input is supervised by the teacher with complete modality input through both token- and pixel-wise masked modeling, closing the gap between missing and complete input modalities. By integrating M2IM and STTP, MaskMentor significantly improves the generalization ability of the trained model across diverse input conditions and outperforms state-of-the-art methods on two popular benchmarks by a considerable margin. Extensive ablation studies further verify the effectiveness of the above contributions.

Abstract:
Open-domain multi-modal dialogue systems rely heavily on visual information to generate contextually relevant responses. Existing open-domain multi-modal dialogue generation methods ignore the complementary relationship between multiple modalities and are difficult to integrate with LLMs. To tackle these challenges, we introduce AutoGraph, an innovative method for constructing visual context graphs automatically. We aim to structure complex information and seamlessly integrate it with large language models (LLMs), aligning information from multiple modalities at both semantic and structural levels. Specifically, we fully connect the text graphs and scene graphs, and then trim unnecessary edges via LLMs to automatically construct a visual context graph. Next, we design, for the first time, several graph sampling grammars to convert graph structures into sequences suitable for LLMs. Finally, we propose a two-stage fine-tuning strategy to allow LLMs to understand the graph sampling grammars and generate responses. We validate our proposed method on both text-based LLMs and visual-based LLMs. Experimental results show that our proposed method achieves state-of-the-art performance on multiple public datasets.

Abstract:
With the increasing prevalence of virtual assistants, multimodal conversational recommendation systems (multimodal CRS) have become essential for boosting customer engagement, improving conversion rates, and enhancing user satisfaction. Yet conversational samples, as training data for such a system, are difficult to obtain in large quantities, particularly on new platforms. To effectively train multimodal CRS in a small data setting, we enhance data quality to make up for the small data quantity by augmenting conversations with dialogue states. We then devise an effective dialogue state encoder to bridge the semantic gap between conversation and product representations for recommendation. To further reduce the cost of dialogue state annotation, a semi-supervised learning method is developed to effectively train the dialogue state encoder with a small set of labeled conversations. In addition, we design a correlation regularisation that leverages knowledge in the multimodal product database to help align textual and visual modalities. Experiments on the dataset MMD demonstrate the effectiveness of our method. Particularly, with only 5% of the MMD training set, our method (namely SeMANTIC) obtains better NDCG scores than those of baseline models trained on the full MMD training set.
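As a reminder of what the NDCG metric measures, here is a minimal sketch using the common linear-gain definition, DCG = Σ rel_i / log2(i + 1), normalized by the DCG of the ideal ordering. The function names and the choice of linear gain are assumptions; the abstract does not state which NDCG variant or cutoff the evaluation uses.

```python
import math


def dcg(relevances):
    # Discounted cumulative gain with linear gain and log2 rank discount.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """NDCG of a ranked list of relevance scores, optionally cut off at rank k."""
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # By convention, return 0.0 when no relevant item exists.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # The relevant product is ranked second instead of first.
    print(ndcg([0, 1, 0]))
```

A score of 1.0 means the recommended products are already in the ideal order; placing relevant items lower in the list reduces the score logarithmically.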

Abstract:
Given coupled sentence-image pairs, Multimodal Aspect-based Sentiment Analysis (MABSA) aims to detect aspect terms and predict their sentiment polarity. While existing methods have made great efforts in aligning images and text for improved MABSA performance, they still struggle to effectively mitigate the noisy correspondence problem (NCP): the text description is often not well-aligned with the visual content. To alleviate NCP, in this paper, we introduce Aspect-driven Alignment and Refinement (ADAR), which is a two-stage coarse-to-fine alignment framework. In the first stage, ADAR devises a novel Coarse-to-fine Aspect-driven Alignment Module, which introduces Optimal Transport (OT) to learn the coarse-grained alignment between visual and textual features. Then the adaptive filter bin is applied to remove the irrelevant image regions at a fine-grained level. In the second stage, ADAR introduces an Aspect-driven Refinement Module to further refine the cross-modality feature representation. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over state-of-the-art methods in the MABSA task.

Abstract:
The widespread adoption of bio-inspired cameras has catalyzed the development of spike-based intelligent applications. Although their innovative imaging principle allows for functionality in extreme scenarios, the intricate nature of spike signals makes it challenging to process them with the desired performance. Traditional methods struggle to deliver visual perception and temporal prediction simultaneously, and they lack the flexibility needed for diverse intelligent applications. To address this problem, we analyze the spatio-temporal correlations between spike information at different temporal scales. A novel spike processing method is introduced for compact spike representations that utilizes intra-scale correlation for higher predictive accuracy. Additionally, we propose a multi-scale spatio-temporal aggregation unit (MSTAU) that further leverages inter-scale correlation to achieve efficient perception and precise prediction. Experimental results show noticeable improvements in scene reconstruction and object classification, with increases of 3.49dB in scene reconstruction quality and 2.20% in accuracy, respectively. Besides, the proposed method accommodates different visual applications via switching analysis models, offering a novel perspective for spike processing.

Abstract:
The detection of fake news has emerged as a pressing issue in the era of online social media. To detect meticulously fabricated fake news, propagation paths are introduced to provide nuanced social context to complement the pure semantics within news content. However, existing propagation-enhanced models face a dilemma between detection efficacy and social hazard. In this paper, we investigate the novel problem of early fake news detection via propagation path generation, capable of enjoying the merits of rich social context within propagation paths while alleviating potential social hazards. In contrast to previous discriminative detection models, we further propose a novel generative model, DGA-Fake, by simulating realistic propagation paths based on news content before actual spreading. A guided diffusion module is integrated into DGA-Fake to generate simulated user interaction sequences, guided by historical interactions and news content. Evaluation across three datasets demonstrates the superiority of our proposal.

Abstract:
Even though significant progress has been made in standardizing document layout analysis, complex layout documents like magazines and newspapers still present challenges. Models trained on standardized documents struggle with these complexities, and the high cost of annotating such documents limits dataset availability. To address this, we propose the Complex Layout Document Image Generation (DIG) model, which can generate diverse document images with complex layouts and authentic-looking text, aiding in layout analysis model training. Concretely, we first pre-train DIG on a large-scale document dataset with a text-sensitive loss function to address the issue of unrealistic generation of text regions. Then, we fine-tune it with a small number of documents with complex layouts to generate new images with the same layout. Additionally, we use a layout generation model to create new layouts, enhancing data diversity. Finally, we design a box-wise quality scoring function to filter out low-quality regions during layout analysis model training to enhance the effectiveness of using the generated images. Experimental results on the DSSE-200 and PRImA datasets show that, when incorporating generated images from DIG, the mAP of the layout analysis model improves from 47.05 to 56.07 and from 53.80 to 62.26, respectively, which is a 19.17% and 15.72% relative enhancement compared to the baseline.
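The reported percentages are relative gains over the baseline mAP rather than absolute point differences. The short check below reproduces them from the four mAP values quoted in the abstract; the helper name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(baseline, improved):
    # Relative improvement over the baseline, in percent.
    return (improved - baseline) / baseline * 100.0


results = [("DSSE-200", 47.05, 56.07), ("PRImA", 53.80, 62.26)]
for name, base, new in results:
    print(f"{name}: +{new - base:.2f} mAP, {relative_gain(base, new):.2f}% relative")
# DSSE-200: +9.02 mAP, 19.17% relative
# PRImA: +8.46 mAP, 15.72% relative
```

The absolute gains are 9.02 and 8.46 mAP points, which correspond to the 19.17% and 15.72% figures when divided by the respective baselines.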

Abstract:
Generating photorealistic animations from a single still photo represents a significant advancement in multimedia editing and artistic creation. While existing AIGC methods have reached milestone successes, they often struggle with maintaining consistency with real-world physical laws, particularly in fluid dynamics. To address this issue, this paper introduces ANFluid, a physics solver and data-driven coupled framework that combines physics-aware simulation (PAS) and dual-flow texture learning (DFTL) to animate natural fluid photos effectively. The PAS component of ANFluid ensures that motion guides adhere to physical laws, and can be automatically tailored with a specific numerical solver to suit the diversity of fluid scenes. Concurrently, DFTL focuses on enhancing texture prediction. It employs bidirectional self-supervised optical flow estimation and multi-scale wrapping to strengthen dynamic relationships and elevate the overall animation quality. Notably, despite being built on a transformer architecture, the innovative encoder-decoder design in DFTL does not increase the parameter count but rather enhances inference efficiency. Extensive quantitative experiments have shown that our ANFluid surpasses most current methods on the Holynski and CLAW datasets. User studies further confirm that animations produced by ANFluid maintain better physical and content consistency with the real world and the original input, respectively. Moreover, ANFluid supports interactive editing during the simulation process, enriching the animation content and broadening its application potential.

Abstract:
Great progress has been made in rendering translucent materials in recent years, but automatically estimating parameters for heterogeneous materials such as jade and human skin remains a challenging task, often requiring specialized and expensive physical measurement devices. In this paper, we present a novel approach for estimating and transferring the parameters of heterogeneous translucent materials from a single 2D image to 3D models. Our method consists of four key steps: (1) applying an efficient viewpoint selection algorithm to minimize redundancy and ensure comprehensive coverage of the model; (2) initializing a homogeneous translucent material to render initial images for a translucent dataset; (3) editing the rendered translucent images to update the translucent dataset; and (4) optimizing the material parameters to match the edited translucent results using inverse rendering techniques. Our approach offers a practical and accessible solution that overcomes the limitations of existing methods, which often rely on complex and costly specialized devices. We demonstrate the effectiveness and superiority of our proposed method through extensive experiments, showcasing its ability to transfer and edit high-quality heterogeneous translucent materials on 3D models, surpassing the results achieved by previous techniques in 3D scene editing.

Abstract:
Spiking neural networks (SNNs) have superb characteristics in sensory information recognition tasks due to their biological plausibility. However, the performance of some current spiking-based models is limited by their structures: both fully connected and overly deep structures introduce too much redundancy. This redundancy in both connections and neurons is one of the key factors hindering the practical application of SNNs. Although some pruning methods have been proposed to tackle this problem, they normally ignore the fact that the neural topology in the human brain can be adjusted dynamically. Inspired by this, this paper proposes an evolutionary-based structure construction method for constructing more reasonable SNNs. By integrating knowledge distillation and connection pruning, the synaptic connections in SNNs can be optimized dynamically to reach an optimal state. As a result, the structure of SNNs can not only absorb knowledge from the teacher model but also search for a deep but sparse network topology. Experimental results on CIFAR100, Tiny-ImageNet, and DVS-Gesture show that the proposed structure learning method achieves strong performance while reducing connection redundancy. The proposed method explores a novel dynamic approach to structure learning from scratch in SNNs, which could build a bridge to close the gap between deep learning and bio-inspired neural dynamics.

Abstract:
In recent years, the field of talking head generation has made significant strides. However, the need for substantial computational resources for model training, coupled with a scarcity of high-quality video data, poses challenges for the rapid customization of a model to a specific individual. Additionally, existing models usually only support single-modal control, lacking the ability to generate vivid facial expressions and controllable head poses based on multiple conditions such as audio and video. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker to achieve rapid customization of identity in a talking head model and high-quality generation based on multimodal conditions. Specifically, we divide the training process into two stages: a prior learning stage and an identity rapid-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust controllable facial prior. Meanwhile, we propose a high-frequency ControlNet structure to enhance the fidelity of the synthesized results. This structure extracts a high-frequency feature map from the source image, which serves as a facial texture prior and thereby preserves the facial texture of the source image. 2) In the identity rapid-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on merely several images of a specific individual. The entire fine-tuning process for identity customization can be completed within approximately ten minutes, thereby significantly reducing training costs. Further, we propose a unified driving method for both audio and video, enabling the model to precisely control expressions, poses, and lighting under multiple conditions. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models. Additionally, our model demonstrates reduced training costs and lower data requirements.

Abstract:
The rapid advancement of generation methods has sparked significant concerns about potential misuse, emphasizing the urgency to detect new types of forgeries in open-world settings. Although pioneering works have explored the classification of open-world deepfakes (OW-DF), they neglect the influence of new forgery techniques and struggle to handle a greater variety of manipulable objects and increasingly realistic artifacts. To align research with the evolving technologies of forgery, we propose a new task named Open-World Deepfake Interpretation (OW-DFI). This task involves the localization of imperceptible artifacts across diverse manipulated objects and deciphering forgery methods, especially new forgery techniques. To this end, we leverage non-causal semantics from large visual models (LVMs) and eliminate them from the nuanced manipulated artifacts. Our proposed model includes Semantic Intervention Learning (SIL) and Correlation-based Incremental Learning (CIL). SIL enhances the inconsistency of forgery artifacts with refined semantics from LVMs, while CIL combats catastrophic forgetting and semantic overfitting through an inter-forgery inheritance transpose and a targeted semantic intervention. Exploiting LVMs, our proposed method adopts an unconventional strategy that aligns with the semantic direction of LVMs, moving beyond just uncovering limited forgery-related features for deepfake detection. To assess the effectiveness of our approach in discovering new forgeries, we construct an Open-World Deepfake Interpretation (OW-DFI) benchmark and conduct experiments in an incremental form. Comprehensive experiments demonstrate our method's superiority on the OW-DFI benchmark, showcasing outstanding performance in localizing forgeries and decoding new forgery techniques.

Abstract:
Sleep staging is crucial for sleep tracking and health assessment. Polysomnography (PSG), containing multiple modalities such as electroencephalography, electrooculography, electromyography, and electrocardiography, is the fundamental means of sleep staging. However, due to performance differences in both classification and domain discrimination across modalities in PSG, existing domain generalization methods face a dilemma of modal imbalance. To balance inter-modal differences and achieve highly accurate cross-domain sleep staging, we propose SleepMG, a Multimodal Generalizable Sleep staging method. SleepMG assesses the classification and domain discrimination performances of each modality and further defines the modal performance metrics by calculating the variance between the performance score and the average performance of each modality. Guided by these metrics, the gradients of the classifier and domain discriminator are adaptively adjusted, placing greater emphasis on poorly-balanced modalities while reducing emphasis on well-balanced modalities. Experimental results on public sleep staging datasets demonstrate that SleepMG outperforms state-of-the-art sleep staging methods, effectively balancing multiple modalities as evidenced by the visual experiment of modal imbalance degree.

Abstract:
Symbols play a pivotal role in the documentation and dissemination of art. For instance, we use musical scores and dance notation to document musical compositions and choreographic movements. Existing hand representations do not fit well with hand movement documentation since (1) data-oriented representations, e.g., coordinates of hand keypoints, are unintuitive and vulnerable to noise, and (2) sign language, another widely adopted representation for hand movements, focuses solely on semantic interaction rather than action encoding. To balance intuitiveness and precision, we propose a novel notation system, named Hand Labanotation (HL), for hand movement documentation. We first introduce a new HL dataset comprising 4M annotated images. Building on it, we propose a novel multi-view transformer architecture for automatically translating hand movements to HL. Extensive experiments demonstrate the promising capacity of our method for representing hand movements. This makes our method a general tool for hand movement documentation, driving various downstream applications like using HL to control robotic hands.

Abstract:
By leveraging multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a prominent technique in the realm of 3D object reconstruction. However, existing methods primarily focus on global scene reconstruction using large datasets, which necessitates substantial computational resources and imposes high quality requirements on input images. In practical applications, however, users prioritize the 3D reconstruction results of on-demand specific objects (OSOs) based on their individual demands. Furthermore, transmitting the collected images through a high-interference wireless environment (HIWE) negatively impacts the accuracy of NeRF reconstruction, thereby limiting its scalability. In this paper, we propose a novel on-demand Semantic Neural Radiance Fields (OSNeRF) scheme, which offers fast and robust 3D object reconstruction for diverse tasks. Within OSNeRF, a semantic encoder is employed to extract core semantic features of OSOs from the collected scene images, a semantic decoder is utilized to facilitate robust image recovery under HIWE conditions, and a lightweight renderer is employed for fast and efficient object reconstruction. Moreover, a semantic control unit (SCU) is introduced to guide the above components, thereby enhancing the efficiency of reconstruction. Experiments demonstrate that the proposed OSNeRF enables fast and robust object reconstruction in HIWE, surpassing the performance of state-of-the-art (SOTA) methods in terms of reconstruction quality.

Abstract:
Joint classification of multi-modal remote sensing images has achieved great success thanks to the complementary advantages of multi-modal images. However, modality absence is a common dilemma in the real world caused by imaging conditions, which leads to a breakdown of most classification methods that rely on complete modalities. Existing approaches either learn shared representations or train specific models for each absence case, so they commonly have difficulty balancing the complementary advantages of the modalities against scalability across absence cases. In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification in case of arbitrary modality absence using a unified model that simultaneously considers modality complementarity. It embeds missing modality-specific knowledge into visual prompts to guide the model in capturing complete modal information from available ones for classification. Specifically, a language-guided visual feature decoupling stage (LVFD-stage) is designed to extract shared and specific modal features from multi-modal images, establishing a complementary representation model of complete modalities. Subsequently, an absence-aware visual prompt compensation stage (VPC-stage) is proposed to learn visual prompts containing missing modality-specific knowledge through cross-modal representation alignment, further guiding the complementary representation model to reconstruct modality-specific features for missing modalities from available ones based on the learned prompts. The proposed VPC-stage entails solely training visual prompts to perceive missing information without retraining the model, facilitating effective scalability to arbitrary modal missing scenarios. Systematic experiments conducted on three public datasets have validated the effectiveness of the proposed approach.

Abstract:
Most existing methods for weakly supervised video moment localization use rule-based negative proposals. However, rule-based proposals are limited in capturing the various confusing locations throughout the entire video. To alleviate this limitation, we propose learning-based negative proposals which are trained using a dual-signed cross-entropy loss. The dual-signed cross-entropy loss is controlled by a weight that changes gradually from a negative value to a positive one. The negative value trains the negative proposals to capture query-irrelevant temporal boundaries (easy negatives) in the earlier training stages, whereas the positive value makes them capture somewhat query-relevant temporal boundaries (hard negatives) in the later training stages. To evaluate the quality of negative proposals, we introduce a new evaluation metric to measure how well a negative proposal captures a poorly-generated positive proposal. We verify that our negative proposals can be applied with negligible additional parameters and inference costs, achieving state-of-the-art performance on three public datasets.

Abstract:
Accurate long-term viewport prediction in tile-based 360° video adaptive streaming helps pre-download tiles for a further future, thus establishing a longer buffer to cope with network fluctuations. Long-term viewport motion is mainly influenced by Historical viewpoint Trajectory (HT) and Video Content information (VC). However, HT and VC are difficult to align in space due to their different modalities, and their relative importance in viewport prediction varies across prediction time steps. In this paper, we propose STAR-VP, a model that fuses HT and VC in a Space-aligned and Time-vARying manner for Viewport Prediction. Specifically, we first propose a novel saliency representation salxyz and a Spatial Attention Module to solve the spatial alignment of HT and VC. Then, we propose a two-stage fusion approach based on Transformer and gating mechanisms to capture their time-varying importance. Visualization of attention scores intuitively demonstrates STAR-VP's capability in space-aligned and time-varying fusion. Evaluation on three public datasets shows that STAR-VP achieves state-of-the-art accuracy for long-term (2-5s) viewport prediction without sacrificing short-term (<1s) prediction performance.

Abstract:
In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application in Specific Domain Image-Text Retrieval (SDITR) such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). Due to the unique data characteristics in specific scenarios, the primary challenge is to leverage discriminative fine-grained local information for improved mapping of images and text into a shared space. Current approaches interact with all multimodal local features for alignment, implicitly focusing on discriminative local information to distinguish data differences, which may bring noise and uncertainty. Furthermore, their VLP feature extractors like CLIP often focus on instance-level representations, potentially reducing the discriminability of fine-grained local features. To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation. First, we introduce a Key Local information Selection and Fusion (KLSF) module that utilizes hidden knowledge from the VLP model to interpretably select and fuse key local information. Second, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (text), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal interaction alignment. To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR. Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets.

Abstract:
Prior research on emotion recognition in extended reality (XR) has faced challenges due to the occlusion of facial expressions by Head-Mounted Displays (HMDs). This limitation hinders accurate Facial Expression Recognition (FER), which is crucial for immersive user experiences. This study aims to overcome the occlusion challenge by integrating physiological signals with partially visible facial expressions to enhance emotion recognition in XR environments. We employed a multi-task approach, utilizing feature-level fusion to combine Electroencephalography (EEG) and Galvanic Skin Response (GSR) signals with occluded facial expressions. The model predicts valence and arousal simultaneously from both macro- and micro-expressions. Our method demonstrated improved accuracy in emotion recognition under partial occlusion conditions. The integration of temporal physiological signals with other modalities significantly enhanced performance, particularly for half-face emotion recognition. The study presents a novel approach to emotion recognition in XR, addressing the limitations of facial occlusion by HMDs. The findings suggest that physiological signals are vital for interpreting emotions in occluded scenarios, offering potential for real-time applications and advancing social XR applications.

Abstract:
Previous neural radiance fields often struggle to preserve high-frequency textures in urban and aerial large-scale scenes due to insufficient model capacity on the scene surface. This is attributed to their sampling locations or grid vertices falling in empty areas. Additionally, most models do not consider the drastic changes in distances. To address these issues, we propose a novel high-frequency surface shell radiance field, which uses depth-guided information to create a shell enveloping the scene surface under the current view, and then samples conic frustums on this shell to render high-frequency textures. Specifically, our method comprises three parts. First, we propose a strategy to fuse voxel grids and distance-scale information to generate a coarse scene at different distance scales. Second, we construct a shell based on the depth information to compensate for texture details not captured by voxels. Finally, smoothing and denoising post-processing further improves the rendering quality. Extensive scene and ablation experiments demonstrate that our method clearly improves high-frequency textures at different distance scales and outperforms state-of-the-art methods.

Abstract:
Expressive Human Mesh Recovery (HMR) involves reconstructing the 3D human body, including hands and face, from RGB images. It is difficult because humans are highly deformable, and hands are small and frequently occluded. Recent approaches have attempted to mitigate these issues using large datasets and models, but these solutions remain imperfect. Specifically, whole-body estimation models often inaccurately estimate hand poses, while hand expert models struggle with severe occlusions. To overcome these limitations, we introduce a dual-path cross augmentation framework with a novel adaptation approach called HMR-Adapter that enhances existing large HMR models. HMR-Adapter significantly improves expressive HMR performance by injecting additional guidance from other body parts. This approach refines hand pose predictions by incorporating body pose information and uses additional hand features to enhance body pose estimation in whole-body models. Remarkably, an HMR-Adapter with about 30M parameters significantly improves expressive HMR results by combining the adapted large whole-body and hand expert models. We show extensive experiments and analysis to demonstrate the efficacy of our method.

Abstract:
Existing weakly-supervised camouflaged object detection (WSCOD) methods have much difficulty in detecting accurate object boundaries due to insufficient and imprecise boundary supervision in scribble annotations. Drawing inspiration from human perception that discerns camouflaged objects by incorporating both object region and boundary information, we propose a novel Mutual Interaction Network (MiNet) for scribble-based WSCOD to alleviate the detection difficulty caused by insufficient scribbles. The proposed MiNet facilitates mutual reinforcement between region and edge cues, thereby integrating more robust priors to enhance detection accuracy. In this paper, we first construct an edge cue refinement net, featuring a core region-aware guidance module (RGM) aimed at leveraging the extracted region feature as a prior to generate the discriminative edge map. By considering both object semantic and positional relationships between edge feature and region feature, RGM highlights the areas associated with the object in the edge feature. Subsequently, to tackle the inherent similarity between camouflaged objects and the surroundings, we devise a region-boundary refinement net. This net incorporates a core edge-aware guidance module (EGM), which uses the enhanced edge map from the edge cue refinement net as guidance to refine the object boundaries in an iterative and multi-level manner. Experiments on CAMO, CHAMELEON, COD10K, and NC4K datasets demonstrate that the proposed MiNet outperforms the state-of-the-art methods.

Abstract:
Short videos have become an important channel for public sharing, as well as a fertile ground for fake news. Fake news video detection judges the veracity of news based on its different modal information, such as video, audio, text, image and social context information. Current detection models tend to learn the multimodal dataset biases within spurious correlations between news modalities and veracity labels as shortcuts, rather than learning how to integrate the multimodal information behind them to reason, which seriously degrades their detection and generalization capabilities. To address these issues, we propose a Multimodal Multi-View Debiasing (MMVD) framework, which makes the first attempt to mitigate various multimodal biases for fake news video detection. Inspired by the ways people are misled by multimodal short videos, we summarize three cognitive biases: static, dynamic and social biases. MMVD puts forward a multi-view causal reasoning strategy to learn unbiased dependencies within the cognitive biases, thus enhancing the unbiased prediction of multimodal videos. Extensive experimental results show that MMVD improves the detection performance of multimodal fake news videos. Studies also confirm that MMVD can mitigate multiple biases in complex real-world scenarios and improve the generalization ability of fake news video detection.

Abstract:
In film education, high expenses and limited space significantly challenge the teaching of synchronized sound recording (SSR). Traditional methods, which emphasize theory with limited practical experience, often fail to bridge the gap between theoretical understanding and practical application. We therefore introduce MetaEcho, an educational virtual reality system that leverages presence theory for teaching SSR. MetaEcho provides realistic simulations of various recording equipment and facilitates communication between learners and instructors, offering an immersive learning experience that closely mirrors actual practices. An evaluation with 24 students demonstrated that MetaEcho surpasses the traditional method in presence, collaboration, usability, realism, comprehensibility, and creativity. Three experts also commented on the benefits of MetaEcho and the opportunities for promoting SSR education in the metaverse era.

Abstract:
The tasks of view synthesis and decoupling dynamic objects from the static environment for monocular scenes are both long-standing challenges in CV and CG. Most previous NeRF-based methods rely on implicit representations, which require additional supervision and training time. Later, various explicit representations such as multi-planes or 3D Gaussian splatting have been extended and applied to the task of novel view synthesis for dynamic scenes. They introduce an additional time dimension or a deformation field into the original representation to encode dynamics. Thanks to the effective explicit representations, these methods greatly reduce time consumption, but still fail to achieve high rendering quality in some scenes, especially real scenes. For the latter decoupling problem, previous neural radiance field methods require frequent tuning of the relevant parameters for different scenes, which is very inconvenient for practical use. We consider the above problems and propose a new representation of dynamic scenes based on tensor decomposition, which we call R4D-planes. The key to our method is remapping, which compensates for the shortcomings of the plane structure by fusing space-time information and remapping to new indexes. Furthermore, we implement a new decoupling structure, which can efficiently decouple dynamic and static scenes in a self-supervised manner. Experimental results show our method achieves better rendering quality and training efficiency in both view synthesis and decoupling tasks for monocular scenes.

Abstract:
Relief-type cultural heritage objects are commonly found at historical sites but often manifest with varying degrees of damage and deterioration. The traditional process of reconstructing these reliefs is laborious and requires extensive manual intervention and specialized archaeological knowledge. By utilizing a single old photo containing pre-damage information of a given relief, monocular depth estimation can be used to reconstruct 3D digital models. However, extracting depth variations along the edges is challenging in the relief scenario due to the high compression of the depth values, resulting in low-curvature edges. This paper proposes an innovative solution that leverages a multi-task neural network to enhance the depth estimation task by integrating the edge detection and semantic segmentation tasks. We redefine edge detection of relief data as a multi-class classification task rather than a typical binary classification task. An edge matching module that performs this novel task is proposed to refine depth estimations specifically for edge regions. The proposed approach achieves better depth estimation results with finer details along the edge region. Additionally, the semantic and edge outputs provide a comprehensive reference for multi-modal understanding and analysis.

Abstract:
The wide use of mobile devices has led to the proliferation of extensive trajectory data, rendering trajectory classification increasingly vital and challenging for downstream applications. Existing deep learning methods offer powerful feature extraction capabilities to detect nuanced variances in trajectory classification tasks. However, their effectiveness remains compromised by the following two unsolved challenges. First, identifying the distribution of nearby trajectories, which provides critical contextual features for classification, is difficult given noisy and sparse GPS coordinates. Second, though efforts have been made to incorporate a shape feature by rendering trajectories into images, they fail to model the local correspondence between GPS points and image pixels. To address these issues, we propose a novel model termed Traj2Former to spotlight the spatial distribution of the adjacent trajectory points (i.e., contextual snapshot) and enhance the snapshot fusion between the trajectory data and the corresponding spatial contexts. We propose a new GPS rendering method to generate contextual snapshots, which can be applied from a trajectory database to a digital map. Moreover, to capture diverse temporal patterns, we conduct a multi-scale sequential fusion by compressing the trajectory data with differing rates. Extensive experiments have been conducted to verify the superiority of the Traj2Former model.

Abstract:
Most works on interpretable neural networks strive to learn semantic concepts merely from single-modal information such as images. However, humans usually learn semantic concepts from multiple modalities, and the semantics are encoded by the brain from fused multi-modal information. Inspired by cognitive science and vision-language learning, we propose a Prototype-Concept Alignment Network (ProCoNet) for learning visual prototypes under the guidance of textual concepts. In ProCoNet, we design a visual encoder to decompose the input image into regional features of prototypes, and develop a prompt generation strategy that incorporates in-context learning to prompt large language models to generate textual concepts. To align visual prototypes with textual concepts, we leverage the multimodal space provided by the pre-trained CLIP as a bridge. Specifically, the regional features from the vision space and the cropped regions of prototypes encoded by CLIP reside on different but semantically highly correlated manifolds, i.e., they follow a multi-manifold distribution. We transform the multi-manifold distribution alignment problem into optimizing the projection matrix by Cayley transform on the Stiefel manifold. Through the learned projection matrix, visual prototypes can be projected into the multimodal space to align with semantically similar textual concept features encoded by CLIP. We conducted two case studies on the CUB-200-2011 and Oxford Flower datasets. Our experiments show that ProCoNet provides higher accuracy and better interpretability compared to the single-modality interpretable model. Furthermore, ProCoNet offers a level of interpretability not previously available in other interpretable methods.
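
As a minimal sketch of the Cayley transform mentioned above (not ProCoNet's actual optimizer, whose details the abstract does not give), the following standard-library Python shows the basic property it relies on: for a skew-symmetric matrix A, Q = (I - A)^{-1}(I + A) is orthogonal, so any subset of its columns is a point on a Stiefel manifold. The helper names `matmul`, `transpose`, `solve`, and `cayley` and the example matrix are illustrative assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley(skew):
    """Cayley transform Q = (I - A)^{-1} (I + A) of a skew-symmetric A."""
    n = len(skew)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    i_minus = [[eye[i][j] - skew[i][j] for j in range(n)] for i in range(n)]
    i_plus = [[eye[i][j] + skew[i][j] for j in range(n)] for i in range(n)]
    return solve(i_minus, i_plus)


# Hypothetical 3x3 skew-symmetric matrix (A^T = -A).
A = [[0.0, 0.5, -0.2],
     [-0.5, 0.0, 0.3],
     [0.2, -0.3, 0.0]]
Q = cayley(A)
gram = matmul(transpose(Q), Q)  # should be (numerically) the identity
```

Because the skew-symmetric matrix is unconstrained, parameterizing through it lets an optimizer keep the projection orthonormal without an explicit re-orthogonalization step; whether the paper uses this parameterization or a Cayley-based retraction is not stated in the abstract.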

Abstract:
Multi-focus image fusion (MFIF) aims to combine multiple images with different focused regions into a single all-in-focus image. Existing unsupervised deep learning-based methods only fuse structural information of images in the spatial domain, neglecting potential solutions from the frequency domain exploration. In this paper, we make the first attempt to integrate spatial-frequency information to achieve high-quality MFIF. We propose a novel unsupervised spatial-frequency interaction MFIF network named SFIMFN, which consists of three key components: Adaptive Frequency Domain Information Interaction Module (AFIM), Ret-Attention-Based Spatial Information Extraction Module (RASEM), and Invertible Dual-domain Feature Fusion Module (IDFM). Specifically, in AFIM, we interactively explore global contextual information by combining the amplitude and phase information of multiple images separately. In RASEM, we design a customized transformer to encourage the network to capture important local high-frequency information by redesigning the self-attention mechanism with a bidirectional, two-dimensional form of explicit decay. Finally, we employ IDFM to fuse spatial-frequency information without information loss to generate the desired all-in-focus image. Extensive experiments on different datasets demonstrate that our method significantly outperforms state-of-the-art unsupervised methods in terms of qualitative and quantitative metrics as well as the generalization ability.

Abstract:
Multi-person motion prediction remains a challenging problem due to the intricate motion dynamics and complex interpersonal interactions, where uncertainty escalates rapidly across the forecasting horizon. Existing approaches typically overlook modeling the motion dynamics among the prediction frames to reduce the uncertainty and instead leave it entirely to deep neural networks, which lack a dynamic inductive bias, leading to suboptimal performance. This paper addresses this limitation by proposing an effective multi-person motion prediction method named Hybrid Supervision Transformer (HSFormer), which formulates the dynamic modeling within the prediction horizon as a novel hybrid supervision task. To be precise, our method performs a rolling prediction process equipped with a hybrid supervision mechanism, which forces the model to predict the pose in the next frames based on the (typically error-contained) earlier predictions. In addition to the standard supervision loss, we introduce self-supervision and auxiliary supervision mechanisms. The former minimizes the distance between the predictions from error-contained inputs and the predictions from error-free inputs (ground truth), improving the robustness of our model to input deviation at inference; the latter guides the model to make accurate predictions based on the ground truth, stabilizing the training process. Optimization techniques such as stop-gradient are extended to our model to improve the training efficiency. Furthermore, we develop a fine-grained spatio-temporal correlation capture module to assist the feature learning and reduce the uncertainties arising from the intricate and varying interactions among the individuals. Our approach achieves state-of-the-art results on multiple multi-person datasets in both short- and long-term prediction.

Abstract:
Self-supervised category-level 6D pose estimation stands as a fundamental task in computer vision. However, current self-supervised methods face two major challenges. First, existing networks struggle to reconstruct precise object models due to significant part-level shape variations among specific categories. Second, they are impacted by the many-to-one ambiguity in the correspondences between pixels and point clouds. To address these challenges, we propose a novel approach that includes a Part-level Shape Reconstruction (PSR) module and a Coarse-to-Fine Correspondence Optimization (CFCO) module. In the PSR module, we introduce a part-level discrete shape memory to capture more fine-grained shape variations of different objects and use it to perform precise reconstruction. In the CFCO module, we utilize Hungarian matching to generate one-to-one pseudo labels at both region and pixel levels, which provide explicit supervision for the corresponding similarity matrices. We evaluate our method on the REAL275 and WILD6D datasets. Our extensive experiments show that our self-supervised approach outperforms existing methods and achieves new state-of-the-art results within the self-supervised framework.
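
The idea of turning a similarity matrix into one-to-one pseudo labels can be sketched as follows. This is an illustration only: it finds the optimal assignment by exhaustive search over permutations, a brute-force stand-in that returns the same optimum as Hungarian matching but is practical only for tiny matrices. The function names and the similarity values are hypothetical, not taken from the paper.

```python
from itertools import permutations


def best_one_to_one(similarity):
    """Return (assignment, total) maximizing the summed similarity.

    assignment[i] is the column matched to row i; every column is used once.
    Exhaustive search: O(n!) and only suitable for very small n.
    """
    n = len(similarity)
    if any(len(row) != n for row in similarity):
        raise ValueError("expected a square similarity matrix")
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(similarity[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total


def one_hot_labels(assignment):
    """Binary pseudo-label matrix with exactly one 1 per row and per column."""
    n = len(assignment)
    return [[1 if assignment[i] == j else 0 for j in range(n)] for i in range(n)]


# Hypothetical similarities: rows 0 and 1 both prefer column 0 (many-to-one).
sim = [[0.90, 0.80, 0.10],
       [0.85, 0.20, 0.30],
       [0.10, 0.70, 0.60]]
assignment, total = best_one_to_one(sim)
labels = one_hot_labels(assignment)
```

A per-row argmax would send rows 0 and 1 to the same column; the global assignment resolves that conflict, which is the ambiguity the abstract describes. For realistic sizes a polynomial-time solver would replace the exhaustive search.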

Abstract:
The alignment of Image Quality Assessment (IQA) models with diverse human preferences remains a challenge, owing to the variability in preferences for different types of visual content, including user-generated content and AI-Generated Content (AIGC), etc. Despite the significant success of existing IQA methods in assessing specific visual content by leveraging knowledge from pre-trained models, the intricate factors impacting final ratings and the specially designed network architecture of these methods result in gaps in their ability to accurately capture human preferences for novel visual content. To address this issue, we propose Align-IQA, a novel framework that aims to generate visual quality scores aligned with diverse human preferences for various types of visual content. Align-IQA contains two key designs: (1) A customizable quality-aware guidance injection module. By injecting specializable quality-aware prior knowledge into general-purpose pre-trained models, the proposed module guides the acquisition of quality-aware features and allows for various adjustments of features to be consistent with diverse human preferences for different types of visual content. (2) A multi-scale feature aggregation module. By simulating the multi-scale mechanism in the human visual system, the proposed module enables the extraction of a more comprehensive representation of quality-aware features from the human perception perspective. Extensive experimental results demonstrate that Align-IQA achieves better or comparable performance to State-Of-The-Art (SOTA) methods. Notably, Align-IQA outperforms the previous best results on AIGC datasets, achieving Pearson's Linear Correlation Coefficients (PLCCs) of 0.890 (+3.73%) on AGIQA-1K and 0.924 (+1.99%) on AGIQA-3K. Additionally, Align-IQA reduces training parameters by 72.26% and inference overhead by 78.12%, while maintaining SOTA performance.
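
For readers unfamiliar with the PLCC metric reported above, here is a minimal standard-library sketch of Pearson's linear correlation coefficient between predicted quality scores and subjective ratings. The function name and the sample scores are illustrative assumptions and are unrelated to the AGIQA results.

```python
import math


def plcc(predicted, subjective):
    """Pearson's linear correlation coefficient of two equal-length score lists."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0 or var_s == 0:
        raise ValueError("PLCC is undefined when one score list is constant")
    return cov / math.sqrt(var_p * var_s)


# Hypothetical model outputs and mean opinion scores for five images.
model_scores = [0.31, 0.52, 0.48, 0.77, 0.90]
opinion_scores = [2.1, 3.0, 3.2, 4.1, 4.6]
correlation = plcc(model_scores, opinion_scores)
```

PLCC ranges from -1 to 1, with values near 1 indicating that predictions vary linearly with human ratings. IQA evaluations often fit a nonlinear mapping to the predictions before computing it; that step is omitted here.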

Abstract:
Interpretable and robust medical diagnoses are essential for practicing clinicians. Most computer-augmented diagnostic systems suffer from three major problems: non-interpretability, limited modality analysis, and narrow focus. Existing frameworks either deal with multimodality to some extent but suffer from non-interpretability, or are partially interpretable but provide limited modality and multifaceted capabilities. Our work aims to integrate all these aspects in one complete framework to utilize the full spectrum of information offered by multiple modalities and facets. We propose a novel architecture, VR-DiagNet, consisting of a planner and a classifier that are optimized iteratively and cohesively. VR-DiagNet simulates the perceptual process of clinicians by integrating volumetric imaging information with the radiomic feature modality; at the same time, it recreates human thought processes via a customized Monte Carlo Tree Search (MCTS), which constructs a volume-tailored experience tree to identify slices of interest (SoIs) in our multi-slice perception space. We conducted extensive experiments across two diagnostic tasks comprising six public medical volumetric benchmark datasets. Our findings showcase superior performance, as evidenced by higher accuracy and area under the curve (AUC) metrics, reduced computational overhead, and faster convergence, while illustrating the value of integrating volumetric and radiomic modalities for our current problem setup.

Abstract:
Text-driven 3D indoor scene generation aims to automatically generate and arrange objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Recent works have shown the potential to generate 3D scenes guided by specific object categories and room layouts, but they lack a robust mechanism to maintain consistent spatial relationships in alignment with the provided text description during 3D scene generation. Moreover, annotating the objects and relationships of 3D scenes is usually time-consuming and costly, so such annotations are not easily obtained for model training. Thus, in this paper, we construct a dataset and benchmark for assessing spatial relations in text-driven 3D scene generation, which contains a comprehensive collection of 3D scenes with textual descriptions and annotated object spatial relations, and provides both template and free-form natural language descriptions. We also provide a pseudo description feature generation method to address 3D scenes without language annotations. We design an aligned latent space for spatial relations in 3D scenes and text descriptions, in which we can sample features according to the spatial relation for few-shot learning. We also propose new metrics to investigate the ability of the approach to generate correct spatial relationships among objects.

Abstract:
The ACM Multimedia 2024 industry program offers a unique platform for fostering collaboration between academia and industry. This year's program features a diverse range of industry keynotes, expert talks, seminars, and demonstrations, showcasing the latest advancements in multimedia technology. Renowned experts from industry and academia will share their insights on topics such as generative AI, automotive design, computer vision, spatial experience, healthcare, and more. Attendees will have the opportunity to network with industry leaders, learn about cutting-edge technologies, and explore potential collaborations. The industry program highlights the growing importance of multimedia technology in various domains and demonstrates the innovative ways in which AI and other emerging technologies are transforming industries. By participating in this program, attendees can gain valuable knowledge, expand their professional networks, and contribute to the advancement of the field.

Abstract:
Maps are the fundamental elements of any navigation and localization system. With the fast expansion of urban areas and the increasing complexity of modern cities, traditional mapping techniques cannot meet the need for frequent map updates with enriched map details. This work investigates how geo-referenced very high-resolution (VHR) RGB satellite imagery can be used to extract geo-information to support the creation of fine-scale maps and facilitate map updates, focusing on two sub-directions. First, to alleviate the data scarcity issue of large-scale geo-information extraction from satellite imagery-based datasets, (1) GAN-assisted road segmentation proposes a new assisted training scheme to improve the model performance when the training dataset is limited and (2) a context-enhanced satellite-imagery dataset is created for large-scale parking lot detection to improve the type diversity of target objects. Second, to support rich map attribute geo-information extraction and vision-based navigation using street-view imagery, new methods are proposed to improve location and orientation extraction of the street-view imagery via cross-view matching with satellite imagery.

Abstract:
In today's era, three-dimensional point cloud data is not only voluminous but also widely applicable. Therefore, data compression has become a crucial step prior to processing. Although existing 3D point cloud compression techniques primarily focus on fidelity, in practical applications the vast majority of compressed data serves machine perception tasks. Point cloud compression tailored for machine perception is therefore particularly significant. To address this problem, we introduce a point cloud compression algorithm library designed for both machine and human perceptual requirements. This library represents the first collection of multi-perception point cloud compression algorithms on the PyTorch platform, integrating eleven advanced, learning-based algorithms. We categorize and analyze these algorithms in depth, according to different analysis tasks, to facilitate understanding and comparison. Moreover, we replicate these algorithms and organize the pre-processing of point cloud data and the analysis networks for downstream tasks. Finally, we conduct experiments on multiple perceptual datasets for compression and analysis tasks, with results summarized across various performance metrics. We will continue to update these algorithms to ease their adoption by researchers.

Abstract:
Unmanned Aerial Vehicles (UAVs) are necessary across diverse domains, including disaster surveillance and wildlife conservation. However, the development and evaluation of UAV-related algorithms often encounter a significant hurdle: the scarcity of authentic training data. In this paper, we introduce U2USim, a telepresence simulation platform with a dynamic environment, serving as a realistic synthetic data generation, performance evaluation, and visualization tool for UAV-to-UAV (U2U) cooperative learning. This paper presents the architecture, features, and capabilities of U2USim. Leveraging Unreal Engine (UE), AirSim APIs, and ROS (Robot Operating System), our platform enables realistic simulations, mirroring real-world conditions and facilitating research in UAV technology.

Affiliations: Muhammad Saad Saeed Swarm Robotics Lab NCRA, University of Engineering and Technology, Pakistan ; Institute of Computational Perception, Johannes Kepler University, Austria ; Institute of Computational Perception, Austria ; Rohan Kumar Das Fortemedia Singapore, Singapore ; Muhammad Salman Tahir Swarm Robotics Lab NCRA, Pakistan ; Muhammad Zaigham Zaheer Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE ; Muhammad Irzam Liaqat IMT School for Advanced Studies of Lucca, Italy ; Muhammad Haris Khan Mohamed bin Zayed University of Artificial Intelligence, UAE ; Mohamed bin Zayed University of Artificial Intelligence, UAE ; Muhammad Haroon Yousaf Swarm Robotics Lab NCRA, AI Lab, Linz Institute of Technology

Abstract:
Multimodal Large Language Models (MLLMs), by expanding the model's capabilities to perceive and interact through multiple modalities, have significantly enhanced performance across various tasks. Vision is an important modality integrated into LLMs, and vision-language research continues to lead cutting-edge advancements in the MLLM community. However, the standard pre-training pipeline on image-text pairs results in a limited model understanding of relationships between multiple images and texts, as well as of visual details. Additionally, fine-tuning with a frozen visual backbone hinders the enhancement of visual representations on new data. These two issues lead to suboptimal performance on demonstrative instruction following that involves multiple images. This work introduces a novel framework called MLoEM, which first converts long multimodal data into an interleaved image-instruction format, and then adopts a fully autoregressive architecture, allowing for more robust and coherent learning from naturally occurring multimodal documents than a pair-based pipeline. Additionally, we incorporate the Low-Rank Adaptation (LoRA) fine-tuning method, enhancing visual representations while maintaining the stability of previously learned knowledge. Finally, we utilize ensemble methods to enhance model performance on tasks. To alleviate the storage overhead of parallel ensembles with large models, we design an ensemble approach that shares the MLLM while only switching the LoRA matrices. In the experiments, the proposed MLoEM shows superior performance on the testing set.
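
The idea of sharing one backbone while switching only the LoRA matrices can be illustrated with a toy sketch. This is not the MLoEM implementation: it assumes a single frozen linear layer W, a list of hypothetical low-rank pairs (A, B), and plain averaging of the member outputs, none of which is specified in the abstract.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix,
                 x: Sequence[float], scale: float = 1.0) -> List[float]:
    """Compute y = W x + scale * B (A x). W is shared and never copied."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [p + scale * q for p, q in zip(base, delta)]


def ensemble_forward(w: Matrix, adapters: List[Tuple[Matrix, Matrix]],
                     x: Sequence[float]) -> List[float]:
    """Run every adapter against the same frozen W and average the outputs.

    Averaging is an assumption made for illustration only.
    """
    outputs = [lora_forward(w, a, b, x) for a, b in adapters]
    return [sum(col) / len(outputs) for col in zip(*outputs)]
```

In this sketch only the small (A, B) pairs differ between ensemble members, so storage grows with the adapter size rather than with the size of W.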

Abstract:
Multimodal audio-image music transcription has been recently posed as a means of retrieving a digital score representation by leveraging the individual estimations from Automatic Music Transcription (AMT)---acoustic recordings---and Optical Music Recognition (OMR)---image scores---systems. Nevertheless, while proven to outperform single-modality recognition rates, this approach has been exclusively validated under controlled scenarios---monotimbral and monophonic synthetic data---mainly due to a lack of collections with symbolic score-level annotations for both recordings and graphical sheets. To promote research on this topic, this work presents the Multimodal mUSic Collection for Automatic Transcription (MUSCAT) assortment of acoustic recordings, image sheets, and their score-level annotations in several notation formats. This dataset comprises almost 80 hours of real recordings with varied instrumentation and polyphony degrees---ranging from piano to orchestral music---, 1251 scanned sheets, and 880 symbolic scores from 37 composers, which may also be used in other tasks involving metadata such as instrument identification or composer recognition. A fragmented subset of this collection solely focused on acoustic data for score-level AMT---the MUSic Collection for aUtomatic Transcription - fragmented Subset (MUSCUTS) assortment---is also presented together with a baseline experimentation, concluding the need to foster research on this field with real recordings. Finally, a web-based service is also provided to increase the size of the collections collaboratively.

Abstract:
Human Action Quality Assessment (AQA) is a prominent area of research in human action analysis. Current mainstream methods only consider the RGB modality which results in limited feature representation and insufficient performance due to the complexity of the AQA task. In this paper, we propose a simple and modular framework called the Two-Modality Assessment Framework (2M-AF), which comprises a skeleton stream, an RGB stream and a regression module. For the skeleton stream, we develop the Self-supervised Mask Encoder Graph Convolution Network (SME-GCN) to achieve representation learning, and further implement score assessment. Additionally, we propose a Preference Fusion Module (PFM) to fuse features, which can effectively avoid the disadvantages of different modalities. Our experimental results demonstrate the superiority of the proposed 2M-AF over current state-of-the-art methods on three publicly available datasets: AQA-7, UNLV-Diving, and MMFS-63.

Abstract:
Inducing linguistic knowledge for scene text recognition (STR) is a new trend that could provide semantics for performance boost. However, most autoregressive STR models optimize one-step ahead prediction (i.e., 1-gram prediction) for character sequence, which only utilizes the previous semantic context. Most non-autoregressive models only apply linguistic knowledge individually on the output sequence to refine the results in parallel, which do not fully utilize the visual clues concurrently. In this paper, we propose a novel language-based STR model, called ProphetSTR. It adopts an n-stream attention mechanism in the decoder to simultaneously predict the next n characters based on the previous predictions at each time step. It behaves like a prophet, encouraging the model to predict more accurate results by utilizing the previous semantic information and the near future clues. If the prediction results for the same character at successive time steps are inconsistent, we should not trust any of them. Otherwise, they are reliable predictions. Therefore, we propose a multi-modality verification module, masking the unreliable semantic features and inputting with visual and trusted semantic ones simultaneously for masked prediction recovery in parallel. It learns to align different modalities implicitly and considers both visual context and linguistic knowledge, which could generate more reliable results. Furthermore, we propose a multi-scale weight-sharing encoder for multi-granularity image representation. Extensive experiments demonstrate that ProphetSTR achieves state-of-the-art performances on many benchmarks. Further ablative studies prove the effectiveness of our proposed components.
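
The consistency rule described above (trust a character only if the predictions made for it at successive time steps agree) can be sketched as follows. The indexing convention is an assumption for illustration: `stream_preds[t][k]` is taken to be the character that stream k predicts at step t for position t + k. The function name and data layout are hypothetical and not taken from the paper.

```python
from typing import List


def reliability_mask(stream_preds: List[List[str]], n: int) -> List[bool]:
    """Return True for positions whose predictions agree across streams.

    Position p is predicted by stream k at step p - k (for k = 0..n-1),
    so it receives up to n votes. A position is marked reliable only if
    all available votes are identical.
    """
    mask = []
    for pos in range(len(stream_preds)):
        votes = {stream_preds[pos - k][k] for k in range(n) if pos - k >= 0}
        mask.append(len(votes) == 1)
    return mask
```

Positions marked False would be the ones whose semantic features are masked before the verification module attempts to recover them.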

Abstract:
Universal object detectors aim to detect any object in any scene without human annotation, exhibiting superior generalization. However, the current universal object detectors show degraded performance in harsh weather, and their insufficient real-time capabilities limit their application. In this paper, we present Uni-YOLO, a universal detector designed for complex scenes with real-time performance. Uni-YOLO is a one-stage object detector that uses general object confidence to distinguish between objects and backgrounds, and employs a grid cell regression method for real-time detection. To improve its robustness in harsh weather conditions, the input of Uni-YOLO is adaptively enhanced with a physical model-based enhancement module. During training and inference, Uni-YOLO is guided by the extensive knowledge of the vision-language model CLIP. An object augmentation method is proposed to improve generalization in training by utilizing multiple source datasets with heterogeneous annotations. Furthermore, an online self-enhancement method is proposed to allow Uni-YOLO to further focus on specific objects through self-supervised fine-tuning in a given scene. Extensive experiments on public benchmarks and a UAV deployment are conducted to validate its superiority and practical value.

Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task that aims to identify specific regions in aerial images that are relevant to given textual conditions. Existing methods tend to adopt the paradigm of implicit optimization, utilizing a framework consisting of early cross-modal feature fusion and a fixed convolutional kernel-based predictor, neglecting the inherent inter-domain gap and conducting class-agnostic predictions. In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective. Specifically, we prepend the dedicated Dual Alignment Network (DANet), including an explicit alignment strategy and a reliable agent alignment module. The explicit alignment strategy effectively reduces domain discrepancies by narrowing the inter-domain affinity distribution. Meanwhile, the reliable agent alignment module aims to enhance the predictor's multi-modality awareness and alleviate the impact of deceptive noise interference. Extensive experiments on two remote sensing datasets demonstrate the effectiveness of our proposed DANet in achieving superior segmentation performance without introducing additional learnable parameters compared to state-of-the-art methods.

Abstract:
3D Object Detection (3DOD) aims to accurately locate and identify 3D objects in point clouds, facing the challenge of balancing model performance with computational efficiency. Knowledge distillation emerges as a vital method for model compression in 3DOD, transferring knowledge from complex, larger models to smaller, efficient ones. However, the effectiveness of these methods is constrained by the intrinsic sparsity and structural complexity of point clouds. In this paper, we propose a novel methodology termed Joint Homophily and Heterophily Relational Knowledge Distillation (H2RKD) to distill robust relational knowledge in point clouds, thereby enhancing intra-object similarity and refining inter-object distinction. This unified strategy encompasses the integration of Collaborative Global Distillation (CGD) for distilling global relational knowledge across both distance and angular dimensions, and Separate Local Distillation (SLD) for a focused distillation of local relational dynamics. By seamlessly leveraging the relational dynamics within point clouds, H2RKD facilitates a comprehensive knowledge transfer, significantly advancing 3D object detection capabilities. Extensive experiments on the KITTI and nuScenes datasets demonstrate the effectiveness of the proposed H2RKD.

Abstract:
Deformable image registration (DIR) is crucial for many medical image applications. In recent years, learning-based methods utilizing the convolutional neural network (CNN) or the Transformer have demonstrated their superiority in image registration, dominating a new era for DIR. However, very few of these methods can satisfy the demands of real-time applications due to the high spatial resolution of 3D volumes and the high complexity of 3D operators. To tackle this, we propose losslessly downsampling by shifting the strided convolution. A grouping strategy is then used to reduce redundant computations and support self-consistency learning. As an inherent regularizer of the network design, self-consistency learning improves the deformation quality and enables halving the proposed network after training. Furthermore, the proposed shifted connection converts the decoding operations into a lower-dimensional space, significantly reducing decoding overhead. Extensive experimental results on medical image registration demonstrate that our method is competitive with state-of-the-art methods in terms of registration performance, and additionally, it achieves over 3× the speed of most of them.

Abstract:
Deep learning has made significant advancements and breakthroughs in medical image recognition. However, the clinical reality is complex and multifaceted, with patients often suffering from multiple intertwined diseases, not all of which are equally common, leading to medical datasets that are frequently characterized by multi-labels and a long-tailed distribution. In this paper, we propose a method involving label decoupling and reconstruction (LDRNet) to address these two specific challenges. The label decoupling utilizes the fusion of semantic information from both categories and images to capture the class-aware features across different labels. This process not only integrates semantic information from labels and images to improve the model's ability to recognize diseases, but also captures comprehensive features across various labels to facilitate a deeper understanding of disease characteristics within the dataset. Following this, our label reconstruction method uses the class-aware features to reconstruct the label distribution. This step generates a diverse array of virtual features for tail categories, promoting unbiased learning for the classifier and significantly enhancing the model's generalization ability and robustness. Extensive experiments conducted on three multi-label long-tailed medical image datasets, including the Axial Spondyloarthritis Dataset, NIH Chest X-ray 14 Dataset, and ODIR-5K Dataset, have demonstrated that our approach achieves state-of-the-art performance, showcasing its effectiveness in handling the complexities associated with multi-label and long-tailed distributions in medical image recognition.

Abstract:
Hanfu is the representative traditional costume of the Han nationality in China, which carries the outstanding craftsmanship of dyeing, weaving, and embroidery, and is of great significance to the inheritance of traditional culture. However, the existing methods of Hanfu publicity still have problems, which are not conducive to the inheritance of Hanfu culture. In this work, we developed the VisHanfu virtual reality system by focusing on the "Cross-Shaped Flat Structure", which is an integral feature of Hanfu. We have digitally restored twenty-five representative Hanfu historical artifacts and provided an interactive making experience. Combined with highly realistic cloth simulation techniques, the system allows users to interactively observe the movement effects of the Hanfu. The results of user studies demonstrate that our system provides a favorable experience for users and a better learning effect, which helps users enhance their interest in learning and thus contributes to the inheritance of Hanfu culture.

Abstract:
Point clouds from real-world scenarios inevitably contain complex noise, significantly impairing the accuracy of downstream tasks. To tackle this challenge, a cascading encoder-decoder architecture has become the conventional technical route for iterative denoising. However, repeatedly feeding the output of the denoiser back as its input requires re-extracting the underlying surface, leading to an unstable denoising process and over-smoothed geometric details. To address these issues, we propose a novel denoising paradigm dubbed PD-Refiner that employs a single encoder to model the underlying surface. Then, we leverage several lightweight hierarchical Underlying Surface Inheritance Refiners (USIRs) to inherit and strengthen it, thereby avoiding re-extraction from the intermediate point cloud. Furthermore, we design adaptive edge-aware supervision to improve the edge awareness of the USIRs, allowing the denoising preference to shift from global structure to local details. The results demonstrate that our method not only achieves state-of-the-art performance in terms of denoising stability and efficacy, but also enhances edge clarity and point cloud uniformity.

Abstract:
Multimodal Emotion Recognition in Conversations aims to understand the human emotion of each utterance in a conversation from different types of data, such as speech and text. Previous works mainly focus on either complex unimodal feature extraction or sophisticated fusion techniques as general multimodal classification tasks do. However, they ignore the process of human perception, neglecting various levels of emotional features within each modality and disregarding the unique contributions of different modalities for emotion recognition. To address these issues, we propose a more cognitive-aligned multimodal fusion framework, namely DQ-Former. Specifically, DQ-Former utilizes a small set of learnable query tokens to collate and condense various granularities of emotion cues embedded at different layers of pre-trained unimodal models. Subsequently, it integrates these emotional features from different modalities with dynamic modality priorities at each intermediate fusion layer. This process enables explicit and effective fusion of different levels of information from diverse modalities. Extensive experiments on MELD and IEMOCAP datasets validate the effectiveness of DQ-Former. Our results show that the proposed method achieves a robust and interpretable multimodal representation for emotion recognition.

Abstract:
Large-scale pretrained image-language models have shown remarkable performance recently. However, building a video-language model is more challenging due to the complexity of video and the difficulty of collecting high-quality data. This paper builds a video-language model in an adaptive manner, which transfers the knowledge from the image domain and can achieve state-of-the-art performance without any further massive video pretraining. The main contributions include a Visual Perception Adapter that seamlessly and efficiently adapts a pretrained image-language model to the video domain and a fine-grained contrastive learning with Inter-modal Token Alignment that bridges semantic gaps between vision, audio, and language with less data. The proposed model is evaluated on video captioning and retrieval. Experiments demonstrate that the proposed model exhibits competitive performance compared to models pretrained on millions of video-text pairs. Notably, our model's CIDEr and R@1 scores on the MSR-VTT dataset exceed the existing state-of-the-art by 6.3% and 1.3%.

Abstract:
Single Image Super-Resolution (SISR) is a pivotal challenge in computer vision, aiming to restore high-resolution (HR) images from their low-resolution (LR) counterparts. The presence of diverse degradation kernels creates a significant domain gap, limiting the effective generalization of models in real-world scenarios. This study introduces the Bézier Curve basis-based Sparse Coding Network (BCSCN), a preprocessing network designed to mitigate input distribution discrepancies between the training and testing phases of super-resolution networks. BCSCN achieves this by removing visual defects associated with the degradation kernel in LR images, such as artifacts, residual structures, and noise. Additionally, we propose a set of rewards to guide the search for basis coefficients in BCSCN, enhancing the preservation of main content while eliminating information related to degradation. The experimental results highlight the importance of BCSCN, showcasing its capacity to effectively reduce domain gaps and enhance the generalization of super-resolution networks.

Abstract:
Federated learning is a promising privacy-preserving learning paradigm in which multiple clients can collaboratively learn a model while their image data is kept local. To protect data ownership, each client usually adds personalized watermarks to its image data. However, the introduced watermarks can lead to a shortcut learning problem, where the learned model over-relies on simple watermark-related features when making predictions and shows low accuracy on real-world data. Existing works assume the central server can directly access the predefined shortcut features during the training process. However, these methods may fail in the federated learning setting, as the shortcut features of the heterogeneous watermarked data are difficult to obtain.

Abstract:
Recovering the complete shape of a 3D object from limited viewpoints plays an important role in 3D vision. Recent point cloud completion methods prefer an encoding-decoding architecture for generating the global structure and local geometry from a set of input point proxies. In this paper, we introduce an innovative completion method aimed at uncovering structural details from input point clouds and maximizing their utility. Specifically, we improve both Encoding and Decoding for this task: (1) Key Context Fusion Encoding extracts and aggregates homologous key context by adaptively increasing the sampling bias towards salient structure and special contour points. (2) Semantic-based Decoding introduces a semantic EdgeConv module to prompt next Transformer decoder, which effectively learns and generates local geometry with semantic correlations from non-nearest neighbors. The experiments are evaluated on several 3D point cloud and 2.5D depth image datasets. Both qualitative and quantitative evaluations demonstrate that our method outperforms previous state-of-the-art methods.

Abstract:
Computer-Aided Design (CAD) generative modeling is widely applicable in the fields of industrial engineering. Recently, text-to-3D generation has shown rapid progress in point clouds, mesh, and other non-parametric representations. On the contrary, text to 3D parametric CAD generative modeling is a more appealing task in industry but has not been well explored. The parametric CAD model means the product shape can be defined by using the command sequences of CAD tools. To investigate this, we design an encoder-decoder framework, namely CAD Translator, for incorporating the embedding of parametric CAD sequences into texts appropriately with only one-stage training. We first align texts and parametric CAD sequences via a Cascading Contrastive Strategy in the latent space, and then we propose CT-Mix to conduct the random mask operation on their embeddings separately to further get a fusion embedding via the linear interpolation. This can strengthen the connection between texts and parametric CAD sequences effectively. To train CAD Translator, we build a Text2CAD dataset with the help of Large Multimodal Model (LMM) and conduct thorough experiments to demonstrate the effectiveness of our method.
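
The CT-Mix step (mask each embedding separately, then fuse by linear interpolation) can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are plain lists of floats of equal length, masking zeroes individual entries with probability `mask_ratio`, and the interpolation weight `lam` is a fixed hyperparameter. None of these names or values comes from the paper.

```python
import random
from typing import List, Optional, Sequence


def random_mask(vec: Sequence[float], mask_ratio: float,
                rng: random.Random) -> List[float]:
    """Zero each entry independently with probability mask_ratio."""
    return [0.0 if rng.random() < mask_ratio else v for v in vec]


def ct_mix(text_emb: Sequence[float], cad_emb: Sequence[float],
           lam: float = 0.5, mask_ratio: float = 0.15,
           rng: Optional[random.Random] = None) -> List[float]:
    """Mask the two embeddings separately, then interpolate linearly.

    fused = lam * masked_text + (1 - lam) * masked_cad
    """
    rng = rng or random.Random()
    t = random_mask(text_emb, mask_ratio, rng)
    c = random_mask(cad_emb, mask_ratio, rng)
    return [lam * a + (1.0 - lam) * b for a, b in zip(t, c)]
```

Passing a seeded `random.Random` makes the masking reproducible, which is convenient when comparing runs.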

Abstract:
Graph Fourier Transform (GFT) has demonstrated significant effectiveness in the point cloud attribute compression task. However, existing graph modeling methods are based on the geometric relationships of the points, which leads to reduced efficiency of graph transforms in cases where the correlation between attributes and geometry is weak. In this paper, we propose a novel graph modeling method based on attribute prediction values. Specifically, we utilize Gaussian priors to model prediction values, then use maximum a posteriori estimation to learn the Laplacian matrix that best fits the prediction values in order to conduct separate graph transforms on prediction values and ground truth values to derive residuals, and subsequently perform quantization and entropy coding on these residuals. Additionally, since the partitioning of point clouds directly affects the coding performance, we design an adaptive block partitioning method based on ternary search, which selects reference points using distance threshold r and performs block partitioning and non-reference point attribute prediction based on these reference points. By conducting ternary search on distance threshold r, we rapidly identify the optimal block partitioning strategy. Moreover, we introduce an efficient residual encoding method based on Morton codes for the attributes of reference points while the prediction attributes of non-reference points are modeled using the proposed graph-based modeling approach.
Experimental results demonstrate that our method significantly outperforms two attribute compression methods employed by Moving Picture Experts Group (MPEG) in lossless geometry based attribute compression tasks, with an average of 30.57% BD-rate gain compared to Predictive Lifting Transform (PLT), and an average of 33.54% BD-rate gain compared to Region-Adaptive Hierarchical Transform (RAHT), which exhibits significantly improved rate-distortion performance over the current state-of-the-art method based on GFT.
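
The ternary search over the distance threshold r can be sketched as below. This is a generic ternary search, not the authors' code: it assumes the coding cost is unimodal in r over the search interval, and the `cost` callable is a hypothetical stand-in for whatever rate-distortion measure the partitioning is evaluated with.

```python
from typing import Callable


def ternary_search_min(cost: Callable[[float], float],
                       lo: float, hi: float, tol: float = 1e-3) -> float:
    """Find the minimizer of a unimodal function on [lo, hi].

    Each iteration evaluates the cost at two interior points and discards
    the third of the interval that cannot contain the minimum.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2  # the minimum cannot lie in (m2, hi]
        else:
            lo = m1  # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0
```

Each iteration shrinks the interval by a factor of 2/3, so the number of cost evaluations grows only logarithmically with the required precision, which is what makes the search fast when each evaluation involves a full partition-and-encode pass.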

Abstract:
Presentation skills, which involve the effective use of verbal and nonverbal cues, enable audiences to better understand the content being presented. We develop a deep learning-based online assessment system that can objectively evaluate speakers' oral presentations and slide design, providing comprehensive feedback to support their self-practice. For the speaking skill assessment, we construct a multimodal neural network, including LSTMs and attention networks, to analyze the linguistic and acoustic features of oral presentations. The proposed model can predict 14 distinct types of audience impressions with an average accuracy of 85.0%. For the slide design assessment, we propose a method that analyzes slides based on their visual and structural features, independent of file formats. It can determine whether the slides meet 10 assessment criteria with an average accuracy of 81.7%.

Abstract:
In real-world recon-videos such as surveillance and drone reconnaissance videos, the commonly used explicit language, acoustic, and facial expression information is often missing. However, these videos are often rich in anomalous sentiments (e.g., criminal tendencies), which makes it urgent to use implicit scene information (e.g., actions and object relations) to quickly and precisely identify these anomalous sentiments. Motivated by this, this paper proposes a new chat-paradigm Implicit anomalous sentiment Discovering and grounding (IasDig) task, aiming to interactively and quickly discover and ground anomalous sentiments in recon-videos by leveraging the implicit scene information (i.e., actions and object relations). Furthermore, this paper argues that the IasDig task faces two key challenges, i.e., scene modeling and scene balancing. To this end, this paper proposes a new Scene-enhanced Video Large Language Model named Hawkeye, i.e., acting like a raptor (e.g., a hawk) to discover and locate prey, for the IasDig task. Specifically, this approach designs a graph-structured scene modeling module and a balanced heterogeneous MoE module to address the above two challenges, respectively. Extensive experimental results on our constructed scene-sparsity and scene-density IasDig datasets demonstrate the clear advantage of Hawkeye on IasDig over advanced Video-LLM baselines, especially on the metric of false negative rates. This justifies the importance of scene information for identifying implicit anomalous sentiments and the practicality of Hawkeye for real-world applications.

Abstract:
Adversarial training (AT) is a fundamental method to enhance the robustness of Deep Neural Networks (DNNs) against adversarial examples. While AT achieves improved robustness on adversarial examples, it often leads to reduced accuracy on clean examples. Considerable effort has been devoted to handling the trade-off from the perspective of input space. However, we demonstrate that the trade-off can also be illustrated from the perspective of the gradient space. In this paper, we propose Adversarial Training with Adaptive Gradient Reconstruction (AGR), a novel approach that balances generalization (accuracy on clean examples) and robustness (accuracy on adversarial examples) in adversarial training by steering between clean and adversarial gradient directions. We first introduce a technique named Gradient Orthogonal Projection, applied when the two gradients are negatively correlated, to adjust the adversarial gradient direction and reduce the degradation of generalization. Then we present a gradient interpolation scheme, applied when the two gradients are positively correlated, for efficiently increasing generalization without compromising the robustness of the final model. Rigorous theoretical analysis proves that AGR has lower generalization error upper bounds, indicating its effectiveness. Comprehensive experiments empirically demonstrate that AGR balances generalization and robustness well and is compatible with various adversarial training methods to achieve superior performance.
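
The two gradient cases can be made concrete with a small sketch. The abstract does not give formulas, so the following is an assumption-laden illustration rather than the AGR algorithm: the projection removes the component of the adversarial gradient that points against the clean gradient (a standard orthogonal projection), and the interpolation is a fixed convex combination with a hypothetical weight `alpha`.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def reconstruct_gradient(g_clean: Sequence[float], g_adv: Sequence[float],
                         alpha: float = 0.5) -> List[float]:
    """Combine clean and adversarial gradients depending on their correlation.

    Negative inner product: project g_adv onto the subspace orthogonal to
    g_clean, so the update no longer opposes the clean direction.
    Non-negative inner product: interpolate between the two gradients.
    """
    inner = dot(g_clean, g_adv)
    if inner < 0:
        # inner < 0 implies g_clean is nonzero, so the division is safe.
        coef = inner / dot(g_clean, g_clean)
        return [a - coef * c for a, c in zip(g_adv, g_clean)]
    return [(1.0 - alpha) * a + alpha * c for a, c in zip(g_adv, g_clean)]
```

After the projection branch, the returned gradient has zero inner product with the clean gradient, which is the property that keeps the adversarial update from directly undoing progress on clean examples in this simplified picture.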

Abstract:
Unsupervised domain adaptation (UDA) has become a crucial approach to cross-domain semantic segmentation of remote sensing images and has achieved notable advances. However, most existing efforts focus on single-source, single-target domain adaptation and do not explicitly consider the serious domain shift between multiple source and target domains in real applications, especially the inter-domain shift between different target domains and the intra-domain shift within each target domain. In this paper, to address simultaneous inter-domain and intra-domain shift for multiple target domains, we propose a novel unsupervised, multistage, multisource and multitarget domain adaptation network (MultiDAN), which involves multisource and multitarget domain adaptation (MSMTDA), entropy-based clustering (EC) and multistage domain adaptation (MDA). Specifically, MSMTDA learns multiple feature-level adversarial strategies to alleviate the complex domain shift between multiple target and source domains. Then, EC clusters the various target domains into multiple subdomains based on the entropy of MSMTDA's target predictions. In addition, we propose a new pseudo label update strategy (PLUS) to dynamically produce more accurate pseudo labels for MDA. Finally, MDA aligns the clean subdomains, including pseudo labels generated by PLUS, with the other noisy subdomains in the output space via the proposed multistage adaptation algorithm (MAA). Extensive experiments on benchmark remote sensing datasets highlight the superiority of MultiDAN over recent state-of-the-art UDA methods.
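
The entropy-based clustering step can be pictured with a small sketch. It assumes, purely for illustration, that each target image is scored by the mean Shannon entropy of its per-pixel class probabilities and that a single hypothetical `threshold` separates "clean" from "noisy" images; the abstract does not state the actual clustering rule or the number of subdomains.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(pixel_probs):
    """Average entropy over all pixels of one image."""
    return sum(prediction_entropy(p) for p in pixel_probs) / len(pixel_probs)


def split_by_entropy(images, threshold):
    """Split images into low-entropy (clean) and high-entropy (noisy) groups.

    images: dict mapping an image name to a list of per-pixel
    class-probability lists. The threshold is a hypothetical parameter.
    """
    clean, noisy = [], []
    for name, pixel_probs in images.items():
        group = clean if mean_entropy(pixel_probs) <= threshold else noisy
        group.append(name)
    return clean, noisy


images = {
    "confident": [[0.99, 0.01], [0.98, 0.02]],
    "uncertain": [[0.5, 0.5], [0.6, 0.4]],
}
clean, noisy = split_by_entropy(images, threshold=0.3)
```

Low entropy means the segmentation model is confident on that image, which is why such images are plausible candidates for pseudo labels in this sketch.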

Abstract:
Acquiring commonsense knowledge about entity-pairs from images is crucial across diverse applications. Distantly supervised learning has made significant advancements by automatically retrieving images containing entity pairs and summarizing commonsense knowledge from the bag of images. However, the retrieved images may not always cover all possible relations, and the informative features across the bag of images are often overlooked. To address these challenges, a Multi-modal Cross-domain Feature Learning framework is proposed to incorporate the general domain knowledge from a large vision-text foundation model, ViT-GPT2, to handle unseen relations and exploit complementary information from multiple sources. Then, a Group Attention module is designed to exploit the attentive information from other instances of the same bag to boost the informative features of individual instances. Finally, a Gamma-corrected Gated Fusion is designed to select a subset of informative instances for a comprehensive summarization of commonsense entity relations. Extensive experimental results demonstrate the superiority of the proposed method over state-of-the-art models for extracting commonsense knowledge.

Abstract:
Monocular depth estimation with a lightweight network faces two significant challenges: preserving detail information and reducing artifacts in the predicted depth maps. To address them, this paper proposes a self-supervised monocular depth estimation framework called LiteGfm. It contains a DepthNet with an Anti-Artifact Guided (AAG) module and a PoseNet. In the AAG module, a Guided Image Filtering with cross-detail masking is first designed to filter the input features of the decoder, preserving comprehensive detail information. Second, a filter kernel generator is proposed that decomposes the Sobel operator along the vertical and horizontal axes to achieve cross-detail masking, which better captures structure and edge features and thus minimizes artifacts. Furthermore, a boundary-aware loss between the reconstructed and input images is presented to preserve high-frequency details and reduce artifacts. Extensive experimental results demonstrate that LiteGfm, with under 1.9M parameters, achieves better performance than state-of-the-art methods.
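
As background for the Sobel decomposition mentioned above, the sketch below shows the standard fact that each 3x3 Sobel kernel is the outer product of a 1D smoothing filter and a 1D difference filter, one per axis. How LiteGfm's filter kernel generator actually uses such a decomposition is not described in the abstract; the names `SMOOTH`, `DIFF`, and `outer` are illustrative.

```python
def outer(col, row):
    """Outer product: element [i][j] equals col[i] * row[j]."""
    return [[c * r for r in row] for c in col]


SMOOTH = [1, 2, 1]   # 1D smoothing filter
DIFF = [-1, 0, 1]    # 1D central-difference filter

# Smooth along the vertical axis, differentiate along the horizontal axis:
# responds to horizontal intensity changes (vertical edges).
sobel_x = outer(SMOOTH, DIFF)

# Differentiate along the vertical axis, smooth along the horizontal axis:
# responds to vertical intensity changes (horizontal edges).
sobel_y = outer(DIFF, SMOOTH)

for row in sobel_x:
    print(row)
```

Because the kernels are separable, a 3x3 filtering pass can be replaced by two 1D passes, which is one reason such decompositions are attractive in lightweight networks.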

Abstract:
Neural Radiance Fields (NeRFs) are highly effective at generating photo-realistic novel views. Recent studies have explored 3D inpainting with NeRFs. However, these works have been validated only on data collected over a narrow range of views, and their performance degrades over a wide range of views. To address this problem, we propose a novel NeRF framework that removes the obstacle and reproduces occluded areas in high quality for both wide and narrow ranges of views. In this framework, we design a region coding network to carry out object segmentation. Using depth information, the segmentation component transfers a single obstacle mask to other views with high accuracy. Based on the segmentation results, we introduce an innovative view selection mechanism that reconstructs the occluded area using supplementary information from multiple views and 2D inpainting. We also contribute to the evaluation of 3D scene de-occlusion by introducing a dataset whose views are captured over a wide range and in pairs, with and without the obstacle object, for comparison. We evaluate our framework on both narrow- and wide-range datasets through quantitative measurements and qualitative visual comparisons, which confirm its competitive and superior performance.

Abstract:
In recent years, the use of AI/ML technologies in the medical device industry has increased significantly, with a boom in the radiology sector. AI/ML-based Software as a Medical Device (SaMD) supports healthcare professionals in their clinical routine, saving clinician time and decreasing workload while improving image readouts. More importantly, AI/ML technologies offer the ability to extract valuable and reliable insights from medical images, catching the unseen and paving the way for medical breakthroughs while improving the care pathway for diseases monitored by medical images.

Abstract:
The field of floorplan generation has attracted significant interest from the community, and recent advances in generative models have markedly accelerated its development. However, generating floorplans that satisfy various conditions remains a challenging task. This paper proposes a learning framework, named Cons2Plan, for automatically generating high-quality vector floorplans from various conditions. The input conditions can be graphs, boundaries, or a combination of both. A conditional diffusion model is the core component of Cons2Plan. The denoising network uses a conditional embedding module to incorporate the conditions during the reverse process. Additionally, Cons2Plan incorporates a two-stage approach that generates graph conditions based on boundaries. It uses three networks for node prediction and a novel conditional edge generation diffusion model, named CEDM, for edge generation. We conduct qualitative evaluations, quantitative comparisons, and ablation studies to show that our method produces better floorplans than state-of-the-art methods.

Abstract:
Lightweight models play an important role in real-life applications, especially in the current mobile device era. However, due to limited network scale and low-quality images, the performance of lightweight models on Scene Text Recognition (STR) tasks still leaves much room for improvement. Recently, contrastive learning has shown its power in many areas, delivering promising performance without additional computational cost. Based on these observations, we propose a new efficient and effective frame-level contrastive learning (FLCL) framework for lightweight STR models. The FLCL framework consists of a backbone to extract basic features, a Text Perceiver Module (TPM) to focus on text-relevant representations, and an FLCL loss to update the network. The backbone can be any feature extraction architecture. The TPM is an innovative Mamba-based structure designed to suppress backbone features that are irrelevant to the text content. Unlike existing word-level contrastive learning, we look into the nature of the STR task and propose a frame-level contrastive learning loss, which works well with the widely used Connectionist Temporal Classification loss. We conduct experiments on six well-known STR benchmarks as well as a new low-quality dataset. The FLCL framework significantly outperforms vanilla contrastive learning and other non-parameter methods on all datasets, especially the low-quality dataset. In addition, character feature visualization demonstrates that the proposed method yields more discriminative features for visually similar characters, which further substantiates its efficacy. Code and the low-quality dataset will be available soon.

Abstract:
This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing real speech. This task falls under the umbrella of articulatory-to-acoustic (A2A) conversion and may also be referred to as a silent speech interface. To overcome the domain discrepancy between silent and standard vocalized articulation, we introduce a novel pseudo target generation strategy. It integrates the text modality to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation. Furthermore, we propose to employ a denoising diffusion probabilistic model as the fundamental architecture for the A2A conversion task and train the model using a combined training approach with the generated pseudo acoustic features. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in the silent speaking mode compared to all baseline methods. Specifically, the word error rate of the reconstructed speech decreases by approximately 5% when measured using an automatic speech recognition engine for intelligibility assessment, and the subjective mean opinion score for naturalness improves by 0.14. Moreover, analytical experiments reveal that the proposed pseudo target generation strategy can generate pseudo acoustic features that synchronize better with articulatory movements than previous strategies. Samples are available at our project page.

Abstract:
Text-to-music generation models have shown great promise in generating high-quality music aligned with users' textual descriptions. These models effectively capture abstract, global musical features such as style and mood. However, they often fail to render precisely the critical attributes of music loops, including melody, rhythm, and instrumentation, which are essential for modern music loop production. To overcome this limitation, this paper proposes a Loops Transformer and a Multi-Stage Cross Attention mechanism that enable a cohesive integration of textual and MIDI input specifications. Additionally, a novel Instrument-Aware Reinforcement Learning technique is introduced to ensure the correct adoption of instrumentation. We demonstrate that the proposed model can generate music loops that simultaneously satisfy the conditions specified by both natural language text and MIDI input, ensuring coherence between the two modalities. We also show that our model outperforms the state-of-the-art baseline model, MusicGen, in both objective metrics (lowering the FAD score by 1.3, where lower is better, and improving the Normalized Dynamic Time Warping Distance with given melodies by 12%) and subjective metrics (+2.56% in OVL, +5.42% in REL, and +7.74% in Loop Consistency). These improvements highlight our model's capability to produce musically coherent loops that satisfy the complex requirements of contemporary music production, representing a notable advancement in the field. Generated music loop samples can be explored at: https://loopstransformer.netlify.app/.

Abstract:
Fusing the data of millimeter-wave Radar sensors and high-definition cameras has emerged as a viable approach to achieving precise 3D object detection for roadside traffic surveillance. For roadside perception systems, earlier studies have pointed out that it is better to perform the fusion on the 2D image plane than on the BEV plane (which is popular for on-car perception systems), especially when the perception range is large (e.g., >150m). Image-plane fusion requires critical transformations, such as perspective projection from the Radar's BEV to the camera's 2D plane and reverse IPM. However, real-world issues such as uneven terrain and sensor movement degrade the precision of these transformations and thus the effectiveness of the fusion. To alleviate these issues, we propose a geometry-based Radar-camera fusion method on the ground, namely FARFusion V2. Specifically, we extend the ground-plane assumption in FARFusion [20] to support arbitrary shapes by formulating the ground height as an implicit representation based on geometric transformations. By incorporating the ground information, we can enhance Radar data with target height measurements. We can then project the enhanced Radar data onto the 2D plane to obtain more accurate depth information, thereby assisting the IPM process. A real-time transformation parameter estimation module is further introduced to refine the view transformation processes. Moreover, considering the different measurement noise of the two sensors, we introduce an uncertainty-based depth fusion strategy into the 2D fusion process to maximize the probability of obtaining the optimal depth value. Extensive experiments on our collected roadside OWL benchmark demonstrate the excellent localization capacity of FARFusion V2 in far-range scenarios. Our method achieves an average location accuracy of 0.771m when the detection range is extended up to 500m.
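
One common way to fuse two noisy depth measurements by their uncertainty is inverse-variance weighting, sketched below. This is an assumption made for illustration, not the fusion rule of FARFusion V2: it supposes the Radar and camera depth errors are independent and Gaussian, in which case the weighted mean is the maximum-likelihood depth. The function name and the numbers are hypothetical.

```python
def fuse_depth(d_radar, var_radar, d_cam, var_cam):
    """Inverse-variance weighted fusion of two depth estimates (meters).

    Assumes independent Gaussian noise with the given variances.
    Returns the fused depth and the variance of the fused estimate.
    """
    w_radar = 1.0 / var_radar
    w_cam = 1.0 / var_cam
    fused = (w_radar * d_radar + w_cam * d_cam) / (w_radar + w_cam)
    fused_var = 1.0 / (w_radar + w_cam)
    return fused, fused_var


# Hypothetical readings: the Radar estimate is more certain than the
# camera estimate, so the fused depth lies closer to the Radar value.
depth, variance = fuse_depth(100.0, 1.0, 104.0, 3.0)
print(depth, variance)
```

The fused variance is always smaller than either input variance, which is the sense in which combining both sensors helps under these assumptions.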

Abstract:
Transformer-based methods for prohibited object detection in X-ray images continue to emerge, but they still have shortcomings, such as poor performance and high computational complexity when detecting heavily occluded prohibited objects. Therefore, this paper proposes a coarse-to-fine detection method for prohibited objects in X-ray images based on a progressive Transformer decoder. First, a coarse-to-fine framework is proposed, which includes two stages: coarse detection and fine detection. Through stage-wise adaptive inference, the computational efficiency of the model is effectively improved. Then, a position and class object queries method is proposed, which improves the convergence speed and detection accuracy of the model by fusing the position and class information of prohibited objects with the object queries. Finally, a progressive Transformer decoder is proposed, which distinguishes high-score from low-score queries using decreasing confidence thresholds, so that high-score queries are not affected by low-score queries in the decoding stage and the model can focus more on decoding low-score queries, which usually correspond to severely occluded prohibited objects. Experimental results on three public benchmark datasets (SIXray, OPIXray, HiXray) demonstrate that, compared with the baseline DETR, the proposed method achieves state-of-the-art detection accuracy with a 21.6% reduction in computational complexity. In particular, heavily occluded prohibited objects can be detected accurately.
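
The idea of separating queries with decreasing confidence thresholds can be pictured with a toy sketch. Everything here is an assumption for illustration: queries are reduced to scalar confidence scores, a query whose score reaches the current threshold is frozen and skipped by later stages, and `refine` is a hypothetical stand-in for one decoder stage. The paper's actual decoder is not specified in the abstract.

```python
def progressive_decode(scores, thresholds, refine):
    """Toy staged decoding over query confidence scores.

    scores: initial confidence per query.
    thresholds: decreasing confidence thresholds, one per stage.
    refine: hypothetical function mapping a score to a refined score.
    Returns (finalized, remaining) as dicts keyed by query index.
    """
    finalized = {}
    active = dict(enumerate(scores))
    for threshold in thresholds:
        # Freeze queries that are already confident at this stage so
        # that later stages spend effort only on the remaining ones.
        for idx in [i for i, s in active.items() if s >= threshold]:
            finalized[idx] = active.pop(idx)
        # Refine only the still-uncertain (low-score) queries.
        active = {i: refine(s) for i, s in active.items()}
    return finalized, active


finalized, remaining = progressive_decode(
    scores=[0.9, 0.7, 0.3],
    thresholds=[0.8, 0.6],
    refine=lambda s: min(1.0, s + 0.1),
)
```

In this toy run the first query is frozen at the first stage, the second at the lower second-stage threshold, and the third remains active throughout, which mirrors the intent of concentrating decoding on hard, low-score queries.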

Abstract:
This paper presents an optimized approach for the AI-generated image detection task in the 2024 ACM Multimedia Grand Challenge. Given the rapidly evolving capabilities of generative models, traditional detection methods often struggle with accuracy and generalization. The proposed solution builds upon the baseline by integrating advanced model architecture enhancements and novel detection modules, specifically designed to address the complexities of AI-generated images. Through rigorous testing, the approach demonstrates significant improvements in precision, particularly in identifying images produced by unknown generative models. This work highlights the critical need for adaptable and robust detection methods to keep pace with the advancements in AI-generated content and sets a new standard for future research and development in this domain.