ACMMM2025

Abstract:
Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM3, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM3 contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM3 provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM3-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.

Abstract:
Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at https://github.com/IamCreateAI/AnimeColor.

Abstract:
Recent advancements in graph unlearning models have enhanced model utility by keeping node representations essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce INPO, an Influence-aware Negative Preference Optimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first show through analysis that NPO has a slower divergence speed and theoretically propose that unlearning high-influence edges can reduce the impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that the INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model's utility. Codes are available at https://github.com/sh-qiangchen/INPO.
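
The contrast between gradient ascent and NPO can be illustrated with a minimal sketch. It uses the generic NPO loss as commonly written in the unlearning literature, not INPO's influence-aware variant. The function names, the per-sample log-probability inputs, and the default `beta` are assumptions made for illustration only.

```python
import math


def gradient_ascent_loss(logp_theta):
    """Gradient ascent on the forget set, written as a loss to minimize.

    Minimizing the mean log-likelihood has no lower bound, so the model
    can keep diverging for as long as training continues.
    """
    return sum(logp_theta) / len(logp_theta)


def npo_loss(logp_theta, logp_ref, beta=1.0):
    """Generic NPO loss: (2 / beta) * mean(log(1 + (p_theta / p_ref) ** beta)).

    logp_theta: log-probabilities of forget samples under the current model.
    logp_ref:   log-probabilities under a frozen reference model.
    The loss is bounded below by 0, so its gradient fades as the current
    model moves away from the reference on the forget set.
    """
    total = 0.0
    for lt, lr in zip(logp_theta, logp_ref):
        z = beta * (lt - lr)
        # Numerically stable softplus(z) = log(1 + exp(z)).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return (2.0 / beta) * total / len(logp_theta)
```

When the current model equals the reference, the NPO loss is `2 * log(2) / beta`. As the forget-set likelihood drops, it approaches 0, whereas the gradient-ascent objective keeps decreasing without bound.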

Abstract:
Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework which integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block that leverages component information from content images to guide the style transfer process, and the relation attention block that further refines spatial relationships through interacting the content feature with both original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to better improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at https://github.com/wrchen2001/DA-Font.

Abstract:
While text-driven diffusion models demonstrate remarkable performance in image editing, the critical components of their text embeddings remain underexplored. The ambiguity and entanglement of these embeddings pose challenges for precise editing. In this paper, we provide a comprehensive analysis of text embeddings in Stable Diffusion XL, offering three key insights: (1) The aug embedding (obtained by combining the pooled output of the final text encoder with the timestep embeddings; see https://github.com/huggingface/diffusers) retains complete textual semantics but contributes minimally to image generation, as it is only fused via the ResBlocks. More text information weakens its local semantics while preserving most global semantics. (2) The BOS and padding embeddings do not contain any semantic information. (3) The EOS embedding holds the semantic information of all words as well as stylistic information. Each word embedding is important and does not interfere with the semantic injection of other embeddings. Based on these insights, we propose PSP (Prompt-Softbox-Prompt), a training-free image editing method that leverages free-text embedding. PSP enables precise image editing by modifying text embeddings within the cross-attention layers and using Softbox to control the specific area for semantic injection. This technique enables the addition and replacement of objects without affecting other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experiments show that PSP performs remarkably well in tasks such as object replacement, object addition, and style transfer. Our code is available at https://github.com/yangyt46/PSP.

Abstract:
Unmanned Aerial Vehicles, operating in environments with relatively few obstacles, offer high maneuverability and full three-dimensional mobility. This allows them to rapidly approach objects and perform a wide range of tasks often challenging for ground robots, making them ideal for exploration, inspection, aerial imaging, and everyday assistance. In this paper, we introduce AirStar, a UAV-centric embodied platform that turns a UAV into an intelligent aerial assistant: a large language model acts as the cognitive core for environmental understanding, contextual reasoning, and task planning. AirStar accepts natural interaction through voice commands and gestures, removing the need for a remote controller and significantly broadening its user base. It combines geospatial knowledge-driven long-distance navigation with contextual reasoning for fine-grained short-range control, resulting in an efficient and accurate vision-and-language navigation (VLN) capability. The system also offers built-in capabilities such as cross-modal question answering, intelligent filming, and target tracking. With a highly extensible framework, it supports seamless integration of new functionalities, paving the way toward a general-purpose, instruction-driven intelligent UAV agent. The supplementary PPT is available at https://buaa-colalab.github.io/airstar.github.io.

Abstract:
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.

Abstract:
Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.

Abstract:
MER2025 is the third year of our MER series of challenges. Previously, MER2023 (http://merchallenge.cn/mer2023) focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 (https://zeroqiaoba.github.io/MER2024-website) introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER2025 centers on the theme "When Affective Computing Meets Large Language Models (LLMs)". We aim to shift the paradigm from traditional categorical frameworks reliant on predefined emotion taxonomies to LLM-driven generative methods, offering innovative solutions for more accurate and reliable emotion understanding. The challenge contains four tracks: MER-SEMI focuses on fixed categorical emotion recognition enhanced by semi-supervised learning; MER-FG explores fine-grained emotions, expanding recognition from basic to nuanced emotional states; MER-DES incorporates multimodal cues (beyond emotion words) into predictions to enhance model interpretability; MER-PR reveals whether emotion prediction results can improve personality recognition performance. For the first three tracks, the baseline code is available at MERTools (https://github.com/zeroQiaoba/MERTools) and datasets can be accessed via Hugging Face (https://huggingface.co/datasets/MERChallenge/MER2025). For the last track, the dataset and baseline code are available on GitHub (https://github.com/cai-cong/MER25_personality).

Abstract:
After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. We compare our dataset with other widely used datasets of equivalent scale for CLIP training. Models pre-trained on RealSyn consistently achieve state-of-the-art performance across various downstream tasks, including linear probe, zero-shot transfer, zero-shot robustness, and zero-shot retrieval. Furthermore, extensive experiments confirm that RealSyn significantly enhances contrastive vision-language representation learning and demonstrates robust scalability. The code will be released at https://garygutc.github.io/RealSyn.

Abstract:
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at https://github.com/ZhuWenjie98/KRNFT.
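
The FPR95 metric cited above is the false positive rate on OOD samples at the threshold that retains 95% of ID samples. A minimal sketch follows, assuming that a higher score means "more in-distribution". The function name and the plain-list inputs are hypothetical, and this is not the evaluation code from the KR-NFT repository.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD data when at least 95% of ID data is accepted.

    Assumption: a sample is accepted as in-distribution when its score is
    greater than or equal to the threshold.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest count k that covers at least 95% of the ID samples
    # (integer ceiling of 0.95 * n, avoiding floating-point rounding).
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    # OOD samples that pass the threshold are wrongly accepted as ID.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Lower is better: a value of 0.0 means no OOD sample scores as high as the 95% ID threshold.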

Abstract:
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, which make strict adherence to the expected format costly in tasks involving multiple point predictions and make precise point positioning challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.

Abstract:
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.

Abstract:
Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, whose outputs are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/Qianvenh/AVI2025-Track2.
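
The second ensemble level and the reported metric can be sketched as follows. This is a minimal illustration, assuming that each regression head has already produced one score per dimension for each response. The function names and the nested-list layout are hypothetical, and the averaging order for the multi-dimensional MSE (per dimension first, then across dimensions) is an assumption.

```python
def aggregate_candidate(per_response_scores):
    """Mean-pool per-response predictions into one score per dimension.

    per_response_scores: list of responses (six in the abstract), each a
    list of per-dimension scores (five in the abstract).
    """
    n_responses = len(per_response_scores)
    n_dims = len(per_response_scores[0])
    return [
        sum(response[d] for response in per_response_scores) / n_responses
        for d in range(n_dims)
    ]


def multi_dim_average_mse(preds, targets):
    """MSE per dimension over all candidates, then averaged across dimensions."""
    n_candidates = len(preds)
    n_dims = len(preds[0])
    per_dim_mse = [
        sum((preds[c][d] - targets[c][d]) ** 2 for c in range(n_candidates))
        / n_candidates
        for d in range(n_dims)
    ]
    return sum(per_dim_mse) / n_dims
```

For example, six responses whose heads predict 1 through 6 on every dimension pool to 3.5 per dimension.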

Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To mitigate the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves a state-of-the-art segmentation result on multiple COS benchmarks. The codes will be available at https://github.com/ycyinchao/RDVP-MSD.

Abstract:
With the advances in surgical robotics, robot-assisted endoscopic submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries high risks of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Multimodal Large Language Models (MLLMs) offer promising decision support and predictive planning capabilities for robotic systems, which allow the robot to complete complex tasks in more challenging scenarios. However, the training of MLLMs requires large-scale, well-annotated datasets, and existing datasets for multi-level fine-grained ESD surgical motion reasoning are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training MLLMs as the robotic Co-Pilot of Endoscopic Submucosal Dissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, from over 35 hours of ESD videos for both robot-assisted and conventional surgeries. Extensive experiments demonstrate the effectiveness of CoPESD in training MLLMs to comprehend surgical scenarios and reason following surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD motion decision-making and surgical automation. The dataset is available at https://github.com/gkw0010/CoPESD.

Abstract:
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a large amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs to perform fine-grained temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning and making final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs. The code is available at https://github.com/YihuaJerry/EventVAD.

Abstract:
While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd.
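
The query/key configuration described above can be illustrated with plain scaled dot-product attention. This is a minimal sketch, not the MM-HSD implementation: it omits learned projections and multiple heads, and the function names are hypothetical. In this setting the queries would be on-screen-text embeddings and the keys the embeddings of the remaining modalities. The abstract does not specify what serves as values, so passing the same embeddings as the keys is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the others.

    queries: list of query vectors (e.g. on-screen-text embeddings).
    keys, values: lists of vectors from the other modalities.
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Each output vector is a weighted mix of the other modalities' features, with weights determined by their similarity to the on-screen-text query.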

Abstract:
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This self-distillation constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Experimental results across both general and compositional reasoning tasks validate the effectiveness of the DeGLA framework. Our code is released at https://github.com/xiaoxing2001/DeGLA.
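
The exponential moving average behind the teacher model can be sketched in a few lines. This is a generic EMA update on a flat list of weights, not DeGLA's training code. The function names, the `decay` default, and the squared-error distillation term are assumptions, and the abstract does not say when or how the teacher is frozen.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the student weight.

    new_teacher = decay * teacher + (1 - decay) * student
    """
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def distillation_loss(student_embedding, teacher_embedding):
    """Mean squared distance between student and teacher embeddings
    (one possible self-distillation term, chosen here for illustration)."""
    n = len(student_embedding)
    return sum(
        (s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)
    ) / n
```

A decay close to 1 makes the teacher change slowly, which is what lets it act as a stable anchor for the pretrained knowledge while the student is fine-tuned.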

Abstract:
This paper introduces the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. Previous approaches (e.g., CLIP) primarily utilize global text supervision paired with images, leading to sub-optimal performance in recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly integrates individual tag supervision with global text supervision, all within a unified alignment framework. This integration not only ensures efficient recognition of predefined tag categories, but also enhances generalization capabilities for diverse open-set categories. Furthermore, RAM++ employs large language models (LLMs) to convert semantically constrained tag supervision into more expansive tag description supervision, thereby enriching the scope of open-set visual description concepts. Comprehensive evaluations on various image recognition benchmarks demonstrate RAM++ exceeds existing state-of-the-art (SOTA) open-set image tagging models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond predefined, RAM++ records improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM respectively on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark.

Abstract:
In dyadic interactions, facial reactions are crucial for conveying an individual's responses to their conversational partners. Individuals may exhibit varied but appropriate facial reactions (AFRs) when perceiving the same behavioral expression. Although some recent methods can already generate multiple appropriate facial reactions in response to given human speaker behaviors, the AFRs generated by these methods often fail to adequately preserve crucial head motions, leading to visual jitter and unnatural transitions between generated AFR segments. In this paper, we propose a novel and generic PFLPosNet framework which addresses the aforementioned problems at both pre-processing and post-processing stages, where a new pose-aware face behavior localization method PFL is introduced to retain the head pose displacement information from the source data. In addition, the framework proposes a real-time head pose adjustment method, PosNet, to ensure continuity and smoothness in the visual output of the model when using data with correct head pose displacement. Experimental results demonstrate that our approach not only generates more coherent and natural facial reaction sequences but also significantly outperforms existing online MAFRG methods in terms of continuity and smoothness. Our code is made available at https://github.com/rainforcetime/PFLPosNet.

Abstract:
Music generation is pivotal in multimedia, aiding creation and lowering the creative threshold. It focuses on generating music with clear vocals and harmonious accompaniment based on lyrics, combining high artistic creativity with technical challenges. The music codec is an important bridging component in large language model-based music generation, connecting language models with the generated music. However, existing neural codecs typically require token rates exceeding 50 Hz to achieve acceptable music quality, resulting in a context length that surpasses 12,000 tokens for a 4-minute song, a scale that is computationally demanding. This highlights the need for high-compression, high-fidelity music codecs that can reconstruct both vocals and accompaniment with high quality at low frame rates and bitrates, thereby better assisting music generation. To address this, we introduce MuCodec, designed for high-quality music reconstruction at ultra-low bitrates, facilitating more efficient music generation. MuCodec employs a two-stage training method, enabling its encoder, MuEncoder, to extract semantic and acoustic features in a unified representation. These features are discretized using residual vector quantization and converted into Mel-VAE features through flow matching, with reconstruction quality improved by representation alignment during training. The Mel-VAE features are then reconstructed into music using a pretrained Mel-VAE decoder and HiFi-GAN. To the best of our knowledge, MuCodec is the first codec capable of reconstructing 48kHz stereo music at an ultra-low bitrate of 0.35 kbps (25 Hz), achieving state-of-the-art performance in both subjective and objective evaluations and supporting music generation more effectively. Code and Demo: https://mucodec.github.io/Mucodec/.
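
The figures quoted above follow from simple arithmetic, which can be checked with a short sketch. It assumes one token per frame and 1 kbps = 1000 bits per second; the function names are hypothetical.

```python
def context_tokens(token_rate_hz, duration_s):
    """Number of tokens needed to cover a clip at a given token rate."""
    return token_rate_hz * duration_s


def bits_per_frame(bitrate_bps, frame_rate_hz):
    """Bit budget available for each codec frame."""
    return bitrate_bps / frame_rate_hz


four_minutes_s = 4 * 60

# 50 Hz for a 4-minute song gives the 12,000-token context cited above.
tokens_at_50hz = context_tokens(50, four_minutes_s)

# At MuCodec's 25 Hz frame rate the same song needs half as many frames.
tokens_at_25hz = context_tokens(25, four_minutes_s)

# 0.35 kbps = 350 bits per second, spread over 25 frames per second.
budget = bits_per_frame(350, 25)
```

Under these assumptions the 25 Hz setting halves the sequence length to 6,000 frames and leaves 14 bits per frame. With residual vector quantization the actual number of tokens per frame may differ, which this sketch does not model.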

Abstract:
We rethink the role of positional encoding in 3D representation learning and fine-tuning. We argue that using positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. Additionally, we explore parameter-efficient fine-tuning (PEFT) through the lens of prompts and adapters, introducing a straightforward yet effective method called PPT for point cloud analysis. PPT incorporates increased patch tokens and trainable positional encoding while keeping most pre-trained model parameters frozen. Extensive experiments validate that PPT is both effective and efficient. With only 1.05M trainable parameters, PPT achieves state-of-the-art results on several mainstream datasets, including 95.01% accuracy on the ScanObjectNN OBJ_BG dataset. Codes and weights will be released at https://github.com/MIV-XJTU/PPT.

Abstract:
Visuo-tactile perception aims to understand an object's tactile properties. However, the field remains underexplored due to the high cost of data collection. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket can share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods, we demonstrate the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch.

Abstract:
Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability. The source code is available at https://github.com/VISION-SJTU/AnchorSync.

Abstract:
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: https://github.com/ChunyanWang1/DGKD-WLSS

Abstract:
With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation and achieves an inference speed 2.4x faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at https://dsdnet.github.io/DSDNet/.

Abstract:
Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises these primary components: 1) Multimodal Pose Encoder, based on contrastive learning, considering the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts. 2) Immersive Drama Transformer, a flow-based mamba-transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.

Abstract:
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel "Forecast-then-verify" acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34× acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3× speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1× acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our codes have been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion/
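
The forecast-then-verify idea can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not SpeCa's implementation: `full_forward` and `cheap_probe` are hypothetical stand-ins for an expensive denoiser pass and an inexpensive partial check, the forecaster is plain linear extrapolation, and the acceptance rule is a relative-error tolerance chosen for illustration.

```python
import math


def full_forward(t: int) -> list[float]:
    """Hypothetical stand-in for an expensive full denoiser pass at step t."""
    return [math.sin(0.05 * t + k) for k in range(8)]


def cheap_probe(t: int) -> list[float]:
    """Hypothetical cheap partial computation used only for verification."""
    return full_forward(t)[:2]  # pretend only a small slice is computed


def forecast(last: list[float], prev: list[float]) -> list[float]:
    """Linear extrapolation from the two most recent stored features."""
    return [2.0 * a - b for a, b in zip(last, prev)]


def relative_error(pred: list[float], ref: list[float]) -> float:
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref)) + 1e-8
    return num / den


def run(num_steps: int, tol: float) -> tuple[list[list[float]], int]:
    """Return per-step features and the number of full passes that were needed."""
    feats = [full_forward(0), full_forward(1)]  # reference steps, fully computed
    full_passes = 2
    for t in range(2, num_steps):
        guess = forecast(feats[-1], feats[-2])
        probe = cheap_probe(t)
        if relative_error(guess[: len(probe)], probe) <= tol:
            feats.append(guess)            # accept the forecast, skip the full pass
        else:
            feats.append(full_forward(t))  # reject and fall back to full computation
            full_passes += 1
    return feats, full_passes


features, passes = run(num_steps=50, tol=0.05)
print(f"full passes used: {passes} of {len(features)} steps")
```

A looser tolerance accepts more forecasts and saves more full passes at the cost of drift, which is the trade-off that a verification step is meant to control.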

Abstract:
As a fundamental task for indoor scene understanding, 3D object detection has been extensively studied, and the accuracy on indoor point cloud data has been substantially improved. However, existing research has been conducted on limited datasets, where the training and testing sets share the same distribution. In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. Since indoor point cloud datasets are collected and constructed in different ways, the object detectors are likely to overfit to specific factors within each dataset, such as point cloud quality, bounding box layout and instance features. We conduct experiments across datasets on different adaptation scenarios including synthetic-to-real adaptation, point cloud quality adaptation, layout adaptation and instance adaptation, analyzing the impact of different domain gaps on 3D object detectors. We also introduce several approaches to improve adaptation performance, providing baselines for domain adaptive indoor 3D object detection, in the hope that future work will propose detectors with stronger generalization ability across domains. The benchmark datasets and baseline code are available on our project homepage: https://jeremyzhao1998.github.io/DAVoteNet-release.

Abstract:
The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO.
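
For readers unfamiliar with preference optimization, the sketch below shows the standard single-pair DPO objective that MCM-DPO is compared against. It is not the multi-faceted MCM-DPO loss itself, whose exact form is not given in the abstract; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the preferred (chosen) and
    dispreferred (rejected) outputs under the trained policy and under a
    frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy identical to the reference gives log(2); preferring the chosen
# output more strongly than the reference does pushes the loss below that.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss depends only on which output of a pair is better, not on either being a perfect target, which is why preference learning tolerates noisy user-written alt-text better than SFT.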

Abstract:
Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN

Abstract:
Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/.

Abstract:
Multi-modal recommender systems focus on utilizing rich modal information (i.e., images and textual descriptions) of items to improve recommendation performance. Current methods have achieved remarkable success with the powerful structure modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography (i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer from two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay.

Abstract:
In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build Lava, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. Lava comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that Lava improves F1-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves top-k precision of 86% while processing videos 9.6x faster than the most accurate baseline. Our code and dataset are available at https://github.com/yuyanrui/LAVA.
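
The first component, bandit-based sampling over video segments, can be sketched with the classic UCB1 rule. The abstract does not say which bandit algorithm Lava uses, so UCB1 is an assumption, and `probe` is a hypothetical callback standing in for "decode one frame from this segment and check whether it matches the query".

```python
import math
from typing import Callable


def ucb_sample(
    num_segments: int, budget: int, probe: Callable[[int, int], int]
) -> list[int]:
    """Spend `budget` frame probes across segments using UCB1.

    probe(segment, pull_index) returns 1 if the probed frame matches the
    query and 0 otherwise. Returns how many probes each segment received.
    """
    counts = [0] * num_segments
    hits = [0.0] * num_segments

    def pull(seg: int) -> None:
        hits[seg] += probe(seg, counts[seg])
        counts[seg] += 1

    # Probe every segment once so that each has an empirical hit rate.
    for seg in range(min(num_segments, budget)):
        pull(seg)

    for step in range(num_segments, budget):
        total = step  # number of probes made so far

        def score(seg: int) -> float:
            mean = hits[seg] / counts[seg]
            bonus = math.sqrt(2.0 * math.log(total) / counts[seg])
            return mean + bonus

        pull(max(range(num_segments), key=score))
    return counts


# Toy query in which only segment 2 ever contains the target.
counts = ucb_sample(num_segments=3, budget=300, probe=lambda seg, i: int(seg == 2))
print(counts)
```

Most of the probe budget ends up in the segment that keeps returning matches, while the exploration bonus still revisits the others occasionally.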

Abstract:
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., owl hooted at 2.4s-5.2s. Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s. Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling & Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/.

Abstract:
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Our models, data and code are available at: https://github.com/THU-KEG/LongWriter-V.

Abstract:
The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core Dual-Domain Feature Coupler (DDFC) decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the Dual Fourier Attention (DFA) module, which dynamically fuses spatial and spectral features in a content-aware manner. We embed the DDFC and DFA modules within a separable convolution block, built atop a modified XceptionNet backbone. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we release the full code at https://github.com/inzamamulDU/SpecXNet

Abstract:
With the rapid advancement of AIGC technology, realistic fake facial images and videos that deceive human perception are now possible. Consequently, numerous face forgery detection techniques have been proposed. However, evaluating their effectiveness and generalizability remains a significant challenge. To address this, we introduce DeepFaceGen, a large-scale benchmark for quantitatively assessing face forgery detection performance and supporting iterative advancements in the field. DeepFaceGen comprises 776,990 real face images/videos and 773,812 forged face samples generated using 35 mainstream face generation techniques. During its construction, we prioritized content diversity, ethnic fairness, and comprehensive labeling to ensure its versatility and usability. DeepFaceGen is then used to evaluate 20 leading face forgery detection methods from multiple perspectives. Through extensive analysis, we present key insights and propose directions for future research. The code and dataset for DeepFaceGen are available at https://github.com/HengruiLou/DeepFaceGen

Abstract:
Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of unknown advanced forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. MFFI enhances realism based on four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates 50 different forgery methods and contains 1024K image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at https://github.com/inclusionConf/MFFI.

Abstract:
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we first mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities. The code will be released at https://garygutc.github.io/UniME.
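
The second-stage recipe (filter likely false negatives, then keep the hardest remaining negatives) can be sketched as follows. This is an illustrative reading of the abstract, not UniME's code: the margin-based filter, the value of `k`, the temperature, and the function names are all assumptions.

```python
import math


def select_hard_negatives(
    pos_sim: float, neg_sims: list[float], k: int, margin: float = 0.1
) -> list[int]:
    """Return indices of the k hardest in-batch negatives.

    Candidates scoring above pos_sim + margin are treated as likely false
    negatives and dropped; the rest are ranked by similarity to the query.
    """
    threshold = pos_sim + margin
    kept = [(s, i) for i, s in enumerate(neg_sims) if s <= threshold]
    kept.sort(reverse=True)
    return [i for _, i in kept[:k]]


def info_nce(pos_sim: float, neg_sims: list[float], temperature: float = 0.05) -> float:
    """InfoNCE loss for one query with one positive and a set of negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[0]


pos = 0.6
batch_negatives = [0.95, 0.55, 0.1, 0.65, 0.3]
chosen = select_hard_negatives(pos, batch_negatives, k=2)
print(chosen)  # [3, 1]: the 0.95 candidate is filtered out as a likely false negative
print(info_nce(pos, [batch_negatives[i] for i in chosen]))
```

Training on the hardest surviving negatives yields a larger loss than training on easy ones, which is what forces the model to focus on challenging samples.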

Abstract:
Existing Visual Language Models (VLMs) suffer from structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effects of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP.

Abstract:
Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on two public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.

Abstract:
The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.

Abstract:
To date, there is a notable lack of rigorous benchmarks that assess Multimodal Large Language Models (MLLMs) within the financial domain, a field characterized by specialized financial charts and complex domain-specific expertise. To address this gap, we introduce MME-Finance, the first comprehensive bilingual multimodal benchmark tailored for financial analysis. MME-Finance comprises 4,751 meticulously curated samples, encompassing 2,274 open-ended questions, 2,000 binary-choice questions, and 477 multi-turn questions. To mitigate bias when LLMs act as judges, we also created an evaluation framework that strengthens alignment with human judgments by embedding visual context into the multimodal assessment pipeline. A comprehensive evaluation of 31 popular MLLMs has been conducted to assess their perception, reasoning, and cognitive capabilities. Gemini2.5Pro achieves the highest accuracy of 79.28% and 85.71% on the open-ended questions and multi-turn questions, respectively. Among open-source models, InternVL3-78B attains 71.24% accuracy on the open-ended questions, whereas Qwen2.5-VL-72B achieves an F1 score of 88.73% on the binary-choice questions. The results indicate that state-of-the-art MLLMs demonstrate considerable overall competence, yet exhibit significant deficiencies in fine-grained visual perception and the understanding of domain-specific financial images. Source code is available at https://github.com/HiThink-Research/MME-Finance.

Abstract:
Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.

Abstract:
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image and designs sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost. Specifically, we first propose (1) Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose (2) Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. The above mutual refinement significantly improves input quality before video generation. Finally, we propose (3) ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation. Our method outperforms prior work and is validated by automatic and human evaluations on a 1000-video test set, winning first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge and demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git

Abstract:
Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions including low-illumination and adverse weather. However, due to the lack of foundation models pre-trained on large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adapter to effectively complement the ViT features with the contextual multi-scale features. During training, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance. The code is available at https://github.com/PoTsui99/UniRGB-IR

Abstract:
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, specialized dermatology VLMs capable of delivering professional and detailed diagnostic analysis remain underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following visual question answering (VQA) samples (9× the size of the current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks across 8 datasets reveal its exceptional performance for skin diseases in comparison to both general and medical VLMs. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. Code and dataset are available at https://github.com/ZwQ803/MM-Skin.

Abstract:
Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA

Abstract:
While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D.
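
The pruning idea in the abstract above can be illustrated with a minimal sketch. It is not the Fast3D implementation: the scores stand in for the output of a learned attention predictor, and the entropy-based budget rule with its `min_keep`/`max_keep` bounds is a hypothetical choice made only to show how a per-sample token budget might be derived from an attention distribution.

```python
import math

def normalized_entropy(scores):
    """Shannon entropy of the score distribution, scaled to [0, 1]."""
    total = sum(scores)
    if total <= 0 or len(scores) < 2:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return max(0.0, min(1.0, entropy / math.log(len(scores))))

def prune_tokens(tokens, scores, min_keep=0.1, max_keep=0.5):
    """Keep the highest-scoring tokens under a sample-adaptive budget.

    Assumption (not from the paper): a flat score distribution marks a
    complex sample and receives a larger budget than a peaked one.
    """
    ratio = min_keep + (max_keep - min_keep) * normalized_entropy(scores)
    budget = max(1, round(ratio * len(tokens)))
    # Rank token indices by predicted importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order among the survivors.
    return [tokens[i] for i in sorted(ranked[:budget])]
```

Because the scores come from a separate predictor, a scheme of this kind leaves the target model's parameters untouched, which matches the plug-and-play property claimed in the abstract.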

Abstract:
With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig. 1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meanwhile, incorporating an explicit reasoning process during VQA training substantially boosts zero-shot generalization. Code and dataset will be available on GitHub: https://github.com/Controller01-ai/MVQA-68K.

Abstract:
The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale, reasoning-capability-driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large-scale person-centric deepfake dataset comprising 845,286 images generated through manipulation suggestions and image manipulations, both derived from vision-language models (VLMs). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on GitHub at https://github.com/Parul-Gupta/MultiFakeVerse.

Abstract:
Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects. First, a unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. Second, the unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, dedicated analyses show that the performance degradation is negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.

Abstract:
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LKLGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsampled skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at: https://github.com/FengheTan9/Mobile-U-ViT.

Abstract:
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face videos, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA. The code and dataset will be released at: https://github.com/wsj-sjtu/FVQ.

Abstract:
Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard α-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over α-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. Additionally, TSGS maintains high-quality novel view synthesis, evidenced by a 0.41dB gain in PSNR, demonstrating that TSGS overcomes the transparency-depth dilemma. The code and dataset are available at https://longxiang-ai.github.io/TSGS/.
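
The first-surface depth extraction described above can be sketched for a single ray as follows. This is an illustrative reading, not the TSGS code: the window size, the threshold `tau`, and the rule "take the nearest window whose summed weight reaches `tau`, else the heaviest window" are all assumptions introduced here.

```python
def first_surface_depth(depths, weights, window=3, tau=0.3):
    """Estimate the depth of the first surface hit along one ray.

    depths  : sample depths sorted from near to far
    weights : alpha-blending weight of each sample
    window  : sliding-window length (hypothetical default)
    tau     : summed weight that counts as a surface (hypothetical)
    """
    if not depths or len(depths) != len(weights):
        raise ValueError("depths and weights must be non-empty and equal in length")
    window = min(window, len(depths))
    best_start, best_sum = 0, -1.0
    for start in range(len(depths) - window + 1):
        total = sum(weights[start:start + window])
        if total >= tau:
            # Nearest window with enough accumulated weight: stop here.
            best_start, best_sum = start, total
            break
        if total > best_sum:
            # Fallback candidate: heaviest window seen so far.
            best_start, best_sum = start, total
    if best_sum <= 0:
        return None  # the ray hit nothing
    span = range(best_start, best_start + window)
    # Weighted average depth restricted to the selected window.
    return sum(weights[i] * depths[i] for i in span) / best_sum
```

For a ray that passes through glass near depth 1.1 and then hits a wall near depth 3.1, averaging over all samples would place the surface between the two, while the windowed estimate stays on the glass.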

Abstract:
Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and improve the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at https://github.com/Yanyin-Guo/LSFDNet.

Abstract:
Generating human motion in scenes from text aims to synthesize semantically aligned and scene-aware motions. Existing methods have made significant progress by incorporating spatial reasoning and structured generation strategies to connect text descriptions with human-scene interactions. However, they typically rely on simple textual inputs and struggle to comprehend open-ended instructions. There are three key challenges: (1) difficulty in understanding complex instructions due to limited and templated training text annotations; (2) inability to generate natural motions that align with arbitrary trajectories described in text; (3) lack of motion diversity that matches the intended semantics. To address these challenges, we propose PSMo, which consists of two components: the Semantic Planner and the Scene-Aware Motion Generator. The Semantic Planner leverages a Multimodal Large Language Model (MLLM) to parse open-ended instructions, and plans fine-grained motion states aligned with arbitrary trajectories. The scene-aware motion generator adopts the diffusion model with trajectory constraints and a sequential tiling strategy. To enhance motion diversity, we introduce a retrieval-augmented strategy and Scene-Aware Retrieval Attention, which integrates multi-modal features into the generation process. Extensive experiments demonstrate that our method produces high-quality and natural motions under open-ended instructions in scenes.

Abstract:
The rapid advancement of large language models (LLMs) has drawn urgent attention to the task of machine-generated text detection (MGTD). However, existing approaches struggle in complex real-world scenarios: zero-shot detectors rely heavily on the scoring model's output distribution, while training-based detectors are often constrained by overfitting to the training data, limiting generalization. We found that the performance bottleneck of training-based detectors stems from the misalignment between the training objective and task needs. To address this, we propose Direct Discrepancy Learning (DDL), a novel optimization strategy that directly optimizes the detector with task-oriented knowledge. DDL enables the detector to better capture the core semantics of the detection task, thereby enhancing both robustness and generalization. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance across diverse LLMs. To ensure a reliable evaluation, we construct MIRAGE, the most diverse multi-task MGTD benchmark. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs, covering a wide spectrum of proprietary models and textual styles. Extensive experiments on MIRAGE reveal the limitations of existing methods in complex environments. In contrast, DetectAnyLLM consistently outperforms them, achieving over a 70% performance improvement under the same training data and base scoring model, underscoring the effectiveness of our DDL. Project page: https://fjc2005.github.io/detectanyllm.

Abstract:
Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To address this, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?". We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git.
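
The reformulation of ADD as audio question answering can be sketched as a data-preparation step. Only the question text comes from the abstract; the record layout (`audio`, `question`, `answer`), the answer strings, and the reply parser are hypothetical and are not taken from the ALLM4ADD repository.

```python
QUESTION = "Is this audio fake or real?"  # prompt quoted in the abstract

def to_qa_sample(audio_path, is_fake):
    """Turn one labeled clip into a question-answer record for fine-tuning.

    The field names are illustrative assumptions, not a fixed schema.
    """
    return {
        "audio": audio_path,
        "question": QUESTION,
        "answer": "fake" if is_fake else "real",
    }

def parse_answer(reply):
    """Map a free-form model reply to True (fake), False (real), or None."""
    text = reply.strip().lower()
    says_fake, says_real = "fake" in text, "real" in text
    if says_fake == says_real:
        return None  # ambiguous or unrelated reply
    return says_fake
```

Framing the label as a short textual answer lets the same supervised fine-tuning recipe used for other audio question answering data be reused for detection.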

Abstract:
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that, while successful in bypassing safety filters, lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.

Abstract:
Short-video platforms have become a central part of digital content, with users rapidly engaging in various trending topics. Predicting the peak popularity of short-video topics is critical for understanding content dynamics and user behavior. This paper introduces the task of Short-Video Topic Peak Prediction (SVTPP) and proposes both a new dataset and an innovative method. We present the TopicVid dataset, designed to capture the peak trends of short-video topics across multiple platforms. The TopicVid dataset includes 7,701 topics, 58,539 users, and 96,936 videos, with data on views, comments, and shares. This dataset is the first to provide rich semantic information for short-video topic peak prediction, including content details, titles, and user interactions. We propose the Topic Large Graph Model (TLGM), a two-stage framework that integrates heterogeneous graph data with large language models. TLGM effectively analyzes the relationships within short-video topics to predict their peak popularity. Experimental results show that our method outperforms existing approaches for predicting short-video topic peaks. Our dataset and code are available at https://github.com/chensh911/TLGM.

Abstract:
Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet they still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster, and propagates its information to all the other tokens, which reduces the number of computed tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requiring training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/zhixin-zheng/ClusCa.
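
One cached denoising step in the spirit of the description above might look like the sketch below. It is a simplified illustration, not the ClusCa implementation: cluster assignments are taken as given, the first token of each cluster serves as its representative, and the blend factor `gamma` between the fresh representative feature and each token's cached feature is a hypothetical parameter.

```python
def cluster_cached_step(cached, cluster_ids, compute, gamma=0.5):
    """Refresh token features while running `compute` once per cluster.

    cached      : per-token feature vectors from an earlier timestep
    cluster_ids : spatial cluster label for every token
    compute     : expensive function, token index -> fresh feature vector
    gamma       : weight of the fresh representative feature (assumed)
    Returns the updated features and the number of tokens computed.
    """
    representatives = {}
    for index, cluster in enumerate(cluster_ids):
        # Assumption: the first token seen in a cluster represents it.
        representatives.setdefault(cluster, index)
    fresh = {c: compute(i) for c, i in representatives.items()}

    updated = []
    for index, cluster in enumerate(cluster_ids):
        if representatives[cluster] == index:
            updated.append(fresh[cluster])
        else:
            # Propagate the representative's fresh feature into the cache.
            updated.append([gamma * f + (1.0 - gamma) * c
                            for f, c in zip(fresh[cluster], cached[index])])
    return updated, len(representatives)
```

With 16 clusters over 256 tokens, a step of this kind would call `compute` on 16 tokens instead of 256, which is the source of the savings the abstract reports.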

Abstract:
3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites using Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free-world standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/

Abstract:
In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), a central challenge is distinguishing AI-synthesized images from natural images. Despite the impressive capabilities of advanced AI generative models in producing visually compelling content, significant discrepancies remain when compared to natural images. To systematically investigate and quantify these differences, we construct a large-scale multimodal dataset named DANI, comprising 5,000 natural images and over 440,000 AI-generated image (AIGI) samples produced by nine representative models using both unimodal and multimodal prompts, including Text-to-Image (T2I), Image-to-Image (I2I), and Text-and-Image-to-Image (TI2I). We then introduce D-Judge, a benchmark designed to answer the critical question: how far are AI-generated images from truly realistic images? Our fine-grained evaluation framework assesses DANI across five key dimensions: naive visual quality, semantic alignment, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive experiments reveal substantial discrepancies across these dimensions, highlighting the importance of aligning quantitative metrics with human judgment to achieve a comprehensive understanding of AI-generated image quality. The code and dataset are publicly available at https://github.com/ryliu68/DJudge and https://huggingface.co/datasets/Renyang/DANI.

Abstract:
Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise instruction-based editing on given images, yielding a new image editing model, namely HiDream-E1. We have open-sourced all the code and model weights of HiDream-I1 and HiDream-E1: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. These models quickly gained strong traction in the community, ranking among the top globally on the Hugging Face Models Trending list within just one week of launch. In under a month, it surpassed 280,000 downloads and has been officially integrated into the Diffusers library. It is now widely adopted by leading community tools and products, including ComfyUI, Recraft, WaveSpeedAI, fal.ai, and Pruna AI, reflecting the model's growing impact across the open-source AI ecosystem.

Abstract:
Multi-view classification aims to leverage information from multiple views of data to improve prediction performance by learning complementary and consistent representations. Therefore, in recent years, multi-view learning has attracted widespread attention in the community. Despite the success of existing multi-view learning methods, there are still some challenges when dealing with large-scale multi-view data. To address this issue, we propose a novel Multi-view Hashing Classification (MHC) framework to encode large-scale multi-view data as binary codes, thereby enhancing the semantic discrimination. Specifically, we leverage class prompts to generate corresponding textual descriptions for each instance and learn the corresponding anchor hash codes. To achieve intra-class compactness and inter-class separability, we propose Class-prompt Contrastive Learning (CCL) to enforce class-wise aggregation and separation in the Hamming space. To mitigate the cross-view heterogeneity gap, we propose a Supervised Cross-view Contrastive (SCC) module to align view-specific hash codes under label supervision. Finally, we present Boundary-aware Independent Hashing (BIH) that introduces boundary-aware constraints to reduce class boundary ambiguity, thereby improving the discrimination of fusion hash codes. Nevertheless, we observe that anchor hash codes could violate the bit independence assumption, which potentially hinders the optimization direction. To this end, we adopt a Bit-level Calibration Mechanism (BCM) to filter out redundant bits, thereby restoring bit independence. Extensive experiments conducted on ten benchmark datasets demonstrate the superiority of the proposed MHC in terms of both classification accuracy and inference efficiency. The code is released at https://github.com/Yuhang-lan04/MHC.

Abstract:
Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image data is insufficient. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, training directly on synthetic datasets instead of real datasets can degrade model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). To improve model performance in real-world scenarios, the task uses a set of annotated synthetic camouflaged images and a limited number of unannotated real images. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specifically, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domains. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and manual annotations in COD. Our code is publicly available at https://github.com/Muscape/S2R-COD.

Abstract:
Stereoscopic video has long been the subject of research due to its ability to deliver immersive three-dimensional content to a wide range of applications. The dual-view format inherently provides binocular disparity cues that enhance depth perception and realism, making it indispensable for fields such as telepresence, 3D mapping, and robotic vision. Until recently, however, end-to-end pipelines for capturing, encoding, and viewing high-quality stereoscopic video were neither widely accessible nor optimized for consumer-grade devices. Today's smartphones, such as the iPhone Pro, and modern Head-Mounted Displays (HMDs), such as the Apple Vision Pro, offer built-in support for stereoscopic video capture and hardware-accelerated encoding, while devices like the Apple Vision Pro and Meta Quest 3 provide seamless playback with minimal user intervention. Apple refers to this streamlined workflow as Spatial Video. Making the full stereoscopic video process available to everyone has made new applications possible. Despite these advances, there remains a notable absence of publicly available datasets that include the complete spatial video pipeline on consumer platforms, hindering reproducibility and comparative evaluation of emerging algorithms.

Abstract:
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and an inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
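
For orientation, the classical Cauchy-Schwarz divergence between two distributions p and q is commonly written as D_CS(p, q) = -log( (sum p_i q_i)^2 / (sum p_i^2 * sum q_i^2) ), which is zero exactly when the two normalized distributions coincide. The sketch below evaluates this on discrete histograms only; the function name `cs_divergence` is hypothetical, and how the paper estimates the divergence on learned embeddings (or defines the GCS extension) is not stated in the abstract and is not reproduced here.

```python
import math


def cs_divergence(p, q):
    """Cauchy-Schwarz divergence between two discrete distributions.

    Illustrative helper (hypothetical name): p and q are equal-length
    sequences of non-negative probabilities.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    cross = sum(a * b for a, b in zip(p, q))   # <p, q>
    pp = sum(a * a for a in p)                 # <p, p>
    qq = sum(b * b for b in q)                 # <q, q>
    if pp == 0.0 or qq == 0.0:
        raise ValueError("p and q must not be all zeros")
    if cross == 0.0:
        # Disjoint supports: the divergence is unbounded.
        return math.inf
    return -math.log((cross * cross) / (pp * qq))


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    peaked = [1.0, 0.0]
    print(cs_divergence(uniform, uniform))
    print(cs_divergence(peaked, uniform))
```

The measure is symmetric in p and q and needs no tunable weight in this discrete form, which is the property the abstract highlights.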

Abstract:
Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs, i.e., given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimizes mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC. A demo page is available at https://pussycat0700.github.io/MuteSwap-Demo/.

Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into their surroundings. The inherent visual complexity of camouflaged objects, including their low contrast with the background, diverse textures, and subtle appearance variations, often obscures semantic cues, making accurate segmentation highly challenging. Existing methods primarily rely on visual features, which are insufficient to handle the variability and intricacy of camouflaged objects, leading to unstable object perception capability and ambiguous segmentation results. To tackle these limitations, we introduce a novel COD task, class-guided camouflaged object detection (CGCOD), which extends the conventional COD task by incorporating object-specific class knowledge to enhance detection robustness and accuracy. To facilitate this task, we present a new dataset, CamoClass, comprising camouflaged objects with class annotations. Furthermore, we propose a multi-stage framework, CGNet, which incorporates a plug-and-play class prompt generator and a simple yet effective class-guided detector. This establishes a new paradigm for COD, bridging the gap between contextual understanding and class-guided detection. Extensive experimental results demonstrate the effectiveness of our flexible framework in improving the performance of proposed and existing detectors by leveraging class-level textual information. The CamoClass dataset and the corresponding source code will be made publicly available upon acceptance at: https://github.com/bbdjj/CGCOD.

Abstract:
Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and expertise. Inspired by recent advances in Large Language Model (LLM) based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. In addition, to enhance the stability and efficiency of the search process, we propose low-precision code conversion, web-based code execution, and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA

Abstract:
Visual tracking has seen remarkable advancements, largely driven by the availability of large-scale training datasets that have enabled the development of highly accurate and robust algorithms. While significant progress has been made in tracking general objects, research on more challenging scenarios, such as tracking camouflaged objects, remains limited. Camouflaged objects, which blend seamlessly with their surroundings or other objects, present unique challenges for detection and tracking in complex environments. In critical fields like military, security, agriculture, and marine monitoring, accurately tracking camouflaged objects is essential. To address this gap, we introduce the Camouflaged Object Tracking Dataset (COTD), a specialized benchmark designed specifically for evaluating camouflaged object tracking methods. The COTD dataset comprises 200 sequences and approximately 80,000 frames, each annotated with detailed bounding boxes. Our evaluation of 20 existing tracking algorithms reveals significant deficiencies in their performance with camouflaged objects. To address these issues, we propose a novel tracking framework, HIPTrack-MLS, which demonstrates promising results in improving tracking performance for camouflaged objects. COTD and code are available at https://github.com/openat25/HIPTrack-MLS.

Abstract:
Video Question Answering (VideoQA) has emerged as a central task in multimodal learning. This tutorial provides a comprehensive overview of VideoQA research and highlights new frontiers. We begin with an introduction to VideoQA preliminaries, tracing how methods have adapted from third-person view short videos to capture egocentric and long-ranged spatial-temporal dynamics. We then focus on the impact of large multimodal models. Next, we expand the scope to spatial understanding beyond videos. Each topic is discussed through the lens of tasks, datasets, methods, and evaluation protocols. Finally, we conclude with future directions, including fine-grained and long-ranged video understanding, robustness and trustworthiness, egocentric and embodied assistance, and omnimodal integration. This tutorial aims to equip participants with both a historical perspective and a forward-looking roadmap for advancing VideoQA in the LLM era.

Abstract:
Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.

Abstract:
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of CLIP. Accordingly, a Differential Contrastive Gaze Estimation network (DCGaze) composed of a Visual Appearance-aware branch and a Semantic Differential-aware branch is introduced. The Visual Appearance-aware branch is essentially a primary gaze estimation network and it incorporates an Adaptive Feature-refinement Unit (AFU) and a Double-head Gaze Regressor (DGR), which both help the primary network to extract informative and gaze-related appearance features. Moreover, the Semantic Differential-aware branch is designed on the basis of CLIP's text encoder to reveal the semantic difference of gazes. This branch further empowers the Visual Appearance-aware branch with the capability of characterizing gaze-related semantic information. Extensive experimental results on four challenging datasets over within-domain and cross-domain tasks demonstrate the effectiveness of our DCGaze. The code is available at https://github.com/LinZhang-bjtu/DCGaze.

Abstract:
Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Concretely, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K multimodal (text, image, video) multi-choice QA pairs across 13 distinct tasks; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and comparisons with representative MLLMs on SoccerBench highlight the superiority of our agentic system.

Abstract:
Backdoor attacks pose a significant threat to deep neural networks, as backdoored models would misclassify poisoned samples with specific triggers into target classes while maintaining normal performance on clean samples. Among these, multi-target backdoor attacks can simultaneously target multiple classes. However, existing multi-target backdoor attacks all follow the dirty-label paradigm, where poisoned samples are mislabeled, and most of them require an extremely high poisoning rate. This makes them easily detectable by manual inspection. In contrast, clean-label attacks are more stealthy, as they avoid modifying the labels of poisoned samples. However, they generally struggle to achieve stable and satisfactory attack performance and often fail to scale effectively to multi-target attacks. To address this issue, we propose the Feature-based Full-target Clean-label Backdoor Attacks (FFCBA), which consist of two paradigms: Feature-Spanning Backdoor Attacks (FSBA) and Feature-Migrating Backdoor Attacks (FMBA). FSBA leverages class-conditional autoencoders to generate noise triggers that align perturbed in-class samples with the original category's features, ensuring the effectiveness, intra-class consistency, inter-class specificity and natural-feature correlation of triggers. While FSBA supports swift and efficient attacks, its cross-model attack capability is relatively weak. FMBA employs a two-stage class-conditional autoencoder training process that alternates between using out-of-class samples and in-class samples. This allows FMBA to generate triggers with strong target-class features, making it highly effective for cross-model attacks. We conduct experiments on multiple datasets and models; the results show that FFCBA achieves outstanding attack performance and maintains desirable robustness against state-of-the-art backdoor defenses. Our source code is available at https://github.com/YangxvYin/FFCBA_code.

Abstract:
Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the poorer attention map at the first aggregation iteration to approximate the better one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available at https://github.com/Genera1Z/DIAS.

Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments demonstrate the effectiveness of MM-RAIT by significantly improving the quality of responses generated by different RAG models, outperforming MiniCPM-V 2.6 and Qwen2-VL with 34% and 33% gains, respectively. All data and code are available at https://github.com/NEUIR/M2RAG.

Abstract:
The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs), i.e., images with localized AI-driven edits, almost unexplored. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment. The dataset is available at https://github.com/jzhws/Partial-AIGC-IQA.

Abstract:
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.

Abstract:
Diffusion Transformer (DiT) is a crucial method for content generation, but its sampling is slow. Many studies have attempted to use caching to reduce sampling time. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next, but they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality when caching intensity increases. To solve this problem, we propose the Error-Optimized Cache (EOC). This method introduces three key improvements: (1) Prior knowledge extraction: extract and process the caching differences; (2) A judgment method for cache optimization: determine whether certain caching steps need to be optimized; (3) Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching, especially excessive caching. On the ImageNet dataset, without substantially increasing the computational load, this method improves the FID (lower is better) of the generated images when the rule-based model FORA has a caching level of 75%, 50%, and 25%, and the training-based model Learning-to-cache has a caching level of 22%. Specifically, the FID values change from 30.454 to 21.690 (28.8%), from 6.857 to 5.821 (15.1%), from 3.870 to 3.692 (4.6%), and from 3.539 to 3.451 (2.5%), respectively. Code is available at https://github.com/qiujx0520/EOC_MM2025.git.
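
The percentages quoted above are relative FID reductions, (before - after) / before. The short check below recomputes them from the four reported FID pairs; the helper name `relative_reduction` and the row labels are illustrative, and the pairing of caching levels with FID values follows the order given in the abstract.

```python
def relative_reduction(before, after):
    """Relative reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


# (setting, FID before, FID after, percentage reported in the abstract)
reported = [
    ("FORA, 75% caching", 30.454, 21.690, 28.8),
    ("FORA, 50% caching", 6.857, 5.821, 15.1),
    ("FORA, 25% caching", 3.870, 3.692, 4.6),
    ("Learning-to-cache, 22% caching", 3.539, 3.451, 2.5),
]

if __name__ == "__main__":
    for name, before, after, claimed in reported:
        computed = relative_reduction(before, after)
        print(f"{name}: {computed:.1f}% (reported {claimed}%)")
```

All four recomputed values round to the reported figures at one decimal place.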

Abstract:
While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose ANT, an Adaptive Neural Temporal-Aware architecture. ANT orchestrates semantic granularity through: (i) a Semantic Temporally Adaptive (STA) Module, which automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis; and (ii) Dynamic Classifier-Free Guidance scheduling (DCFG), which adaptively adjusts the conditional-to-unconditional ratio, enhancing efficiency while maintaining fidelity. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance and achieving state-of-the-art semantic alignment on StableMoFusion. Code can be found at https://github.com/CCSCovenant/ANT.
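
To make the idea of a step-dependent guidance ratio concrete, the sketch below combines the standard classifier-free guidance rule, uncond + w * (cond - uncond), with a weight w that varies over denoising steps. The linear ramp, the default weights, and the function names are assumptions for illustration only; the abstract does not specify how DCFG actually schedules the ratio.

```python
def guidance_weight(step, num_steps, w_start=2.0, w_end=7.5):
    """Hypothetical linear schedule for the guidance weight.

    step runs from 0 (first denoising step) to num_steps - 1 (last).
    This ramp is an illustrative assumption, not ANT's rule.
    """
    if num_steps < 2:
        return w_end
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)


def guided_prediction(cond, uncond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond, uncond = [1.0, 0.5], [0.0, 0.5]
    for step in range(4):
        w = guidance_weight(step, 4)
        print(step, w, guided_prediction(cond, uncond, w))
```

With w = 0 the output equals the unconditional prediction and with w = 1 the conditional one, so the schedule directly controls the conditional-to-unconditional ratio at each step.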

Abstract:
Dataset condensation distills a large dataset into a small synthetic surrogate dataset with similar training efficacy on downstream tasks. Of the existing condensation methods, diffusion-based methods that synthesize surrogate datasets with diffusion models have successfully distilled high-resolution datasets with high training efficacy and satisfactory cross-architectural transferability. However, these methods exhibit a random sampling bias that impairs their performance in dataset condensation settings. We propose a novel dataset condensation method called Noise-Optimized Distribution Distillation (NODD) that mitigates this sampling bias to improve the training performance of synthetic datasets generated with diffusion models. NODD can integrate with existing diffusion-based methods to produce synthetic datasets with enhanced training performance.

Abstract:
Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated deepfakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake makers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at: https://github.com/Purdue-M2/Individual-Fairness-Deepfake-Detection.

Abstract:
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/.

Abstract:
Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus on video analysis, which requires training much larger models that are hard to deploy on resource-limited devices. However, traditional methods neglect two important problems: 1) since a capacity gap exists between the teacher and the student, some knowledge w.r.t. difficult-to-transfer samples cannot be correctly transferred, and may even harm the final performance of the student; and 2) as training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate the two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. In particular, it mainly consists of the sample distillation difficulty evaluation module and the sample adaptive distillation module. The former applies temporal interruption to frames, i.e., randomly dropping or shuffling the frames during training, which increases the learning difficulty of samples during distillation, so as to better discriminate their distillation difficulty. The latter module adaptively adjusts the distillation ratio at the sample level, such that the KD loss dominates the training with easy-to-transfer samples while the vanilla loss dominates that with difficult-to-transfer samples. More importantly, we only select those samples with both low distillation difficulty and high diversity to train the student model, reducing computational cost. Experimental results on three video benchmarks and one image benchmark demonstrate the superiority of the proposed method by striking a good balance between performance and efficiency. Code is available at https://github.com/mlvccn/SAKD_ActionRec.
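
The sample-level weighting described above can be sketched as a per-sample convex combination of the KD loss and the vanilla task loss. In the sketch below, the mapping from difficulty to ratio (alpha = 1 - difficulty, with difficulty in [0, 1]) and the function name are assumptions for illustration; the abstract does not give SAKD's actual difficulty score or weighting rule.

```python
def adaptive_distillation_loss(kd_losses, task_losses, difficulties):
    """Mean of per-sample losses alpha * KD + (1 - alpha) * task.

    Assumption for illustration: alpha = 1 - difficulty, so easy samples
    (difficulty near 0) are dominated by the KD loss and hard samples
    (difficulty near 1) by the vanilla task loss.
    """
    if not kd_losses:
        raise ValueError("at least one sample is required")
    if not (len(kd_losses) == len(task_losses) == len(difficulties)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for kd, task, difficulty in zip(kd_losses, task_losses, difficulties):
        if not 0.0 <= difficulty <= 1.0:
            raise ValueError("difficulty must lie in [0, 1]")
        alpha = 1.0 - difficulty  # per-sample distillation ratio
        total += alpha * kd + (1.0 - alpha) * task
    return total / len(kd_losses)


if __name__ == "__main__":
    # One easy, one medium, and one hard sample with made-up loss values.
    print(adaptive_distillation_loss([2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [0.0, 0.5, 1.0]))
```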

Abstract:
Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model's sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and facilitates the deployment of high-level vision tasks. The source code is available at https://github.com/wang-x-1997/RPFNet.

Abstract:
Test-Time Adaptation (TTA) enables pre-trained models to bridge the gap between source and target datasets using unlabeled test data, addressing domain shifts caused by corruptions like weather changes, noise, or sensor malfunctions at test time. Multi-Modal Continual Test-Time Adaptation (MM-CTTA), as an extension of standard TTA, further allows models to handle multi-modal inputs and adapt to continuously evolving target domains. However, MM-CTTA faces critical challenges such as catastrophic forgetting and reliability bias, which are rarely addressed effectively under multi-modal corruption scenarios. In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), to tackle MM-CTTA tasks. MDAA introduces analytic learning, a closed-form training technique, through Analytic Classifiers (ACs) to mitigate catastrophic forgetting. Furthermore, we design the Dynamic Late Fusion Mechanism (DLFM) to dynamically select and integrate reliable information from different modalities. Extensive experiments show that MDAA achieves state-of-the-art performance across the proposed tasks. Supplementary materials and codes are available at https://github.com/FeeFee-1/Analytic-Continual-Test-Time-Adaptation-for-Multi-Modality-Corruption.
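
To make "closed-form training" concrete, the sketch below fits a linear classifier by regularized least squares, which is one common form of analytic learning. Whether MDAA's Analytic Classifiers use exactly this formulation is an assumption; the function names and the regularization value are hypothetical.

```python
import numpy as np

def fit_analytic_classifier(features, labels, num_classes, reg=1e-3):
    """Solve W = (X^T X + reg * I)^{-1} X^T Y in closed form.

    No gradient steps are involved, so the weights are a deterministic
    function of the data rather than of an optimization trajectory.
    """
    onehot = np.eye(num_classes)[labels]          # Y: one-hot targets
    dim = features.shape[1]
    gram = features.T @ features + reg * np.eye(dim)
    return np.linalg.solve(gram, features.T @ onehot)

def predict(weights, features):
    """Pick the class with the largest linear response."""
    return (features @ weights).argmax(axis=1)
```

For example, fitting on four 2-D points grouped around (1, 0) and (0, 1) and predicting on the same points recovers their labels.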

Abstract:
Spatial intelligence is fundamental to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current layout generation in 3D scene synthesis remains highly complex, often constrained by predefined datasets and limited dynamic adaptation to changing spatial relationships. In this paper, we propose GraphCanvas3D, a flexible, query-driven framework for controllable 3D scene generation. Unlike traditional methods that require retraining and predefined input masks for modifications, GraphCanvas3D provides a training-free solution supporting the generation of diverse scenes, both indoor and outdoor, through free manipulation of objects and scene elements. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. The decoupled object representation enables flexible, on-the-fly scene adjustments and dynamic, customizable scene creation. Experimental results and user studies demonstrate that GraphCanvas3D improves usability, adaptability, and generalization across various 3D scene generation tasks, offering a powerful tool for scalable and diverse scene synthesis.

Abstract:
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems.
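
The consistency-voting step of the Decision Agent can be pictured as a majority vote over the answers returned by the retrieval agents. The sketch below is an assumption about how such a vote might look, not the paper's procedure; `consistency_vote`, the text normalization, and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter

def consistency_vote(answers, min_agreement=2):
    """Return (answer, agreed) for a list of candidate answers.

    `agreed` is False when no answer is given by at least `min_agreement`
    sources, which a caller could treat as the signal to fall back to a
    refinement step.
    """
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count >= min_agreement
```

Calling `consistency_vote(["Paris", "paris ", "Lyon"])` yields `("paris", True)`, while three mutually different answers yield `agreed == False`.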

Abstract:
3D affordance reasoning plays a critical role in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,672 Gaussian instances, 8,231 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities. Code, model, and video are available at https://hcplab-sysu.github.io/3DAffordSplat.

Abstract:
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architectures trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.

Abstract:
Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves performance comparable to optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset with long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.

Abstract:
Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision of reconstructing the input from slots struggles with complex object textures, thus Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply the shared quantization of VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.

Abstract:
Dynamic Facial Expression Recognition (DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of the Action Units (AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of AU loss. The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the State-Of-The-Art (SOTA) methods without the need for additional computation and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minority-class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.

Abstract:
Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing. The dataset and codes will be available at https://github.com/Leedeng/DEEMO.

Abstract:
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Existing GUI agents usually utilize sequential episodes of multi-step operations across pages as prior GUI knowledge, which fails to capture the complex transition relationships between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and we propose a tailored multi-agent framework, PG-Agent, with a task decomposition strategy, which is injected with these guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction. Our codes will be publicly available at https://github.com/chenwz-123/PG-Agent.
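
The idea of turning sequential episodes into a page graph can be sketched as follows. The episode format (a list of `(page, action)` pairs whose final entry has no action) and the function name are assumptions for illustration; the paper's pipeline is not specified at this level of detail.

```python
from collections import defaultdict

def build_page_graph(episodes):
    """Merge episodes into a graph: page -> sorted list of (action, next_page).

    Each episode is assumed to be [(page, action), ..., (last_page, None)].
    Pages shared between episodes become a single node, so transitions that
    were scattered across separate sequences end up on the same node.
    """
    graph = defaultdict(set)
    for episode in episodes:
        for (page, action), (next_page, _) in zip(episode, episode[1:]):
            graph[page].add((action, next_page))
    return {page: sorted(edges) for page, edges in graph.items()}
```

Two episodes that both start on a `home` page, for instance, produce a single `home` node with two outgoing action edges, which a purely sequential representation would keep apart.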

Abstract:
Shadows are a common factor degrading image quality. Single-image shadow removal (SR), particularly under challenging indirect illumination, is hampered by non-uniform content degradation and inherent ambiguity. Consequently, traditional methods often fail to simultaneously recover intra-shadow details and maintain sharp boundaries, resulting in inconsistent restoration and blurring that negatively affect both downstream applications and the overall viewing experience. To overcome these limitations, we propose DenseSR, which approaches the problem from a dense prediction perspective to emphasize restoration quality. This framework uniquely synergizes two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) in the decoder. The DFB employs adaptive component processing, using an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries, thereby directly tackling the inconsistent restoration and blurring issues. These purposefully processed components are effectively fused, yielding an optimized feature representation preserving both consistency and fidelity. Extensive experimental results demonstrate the merits of our approach over existing methods. Our code is available at https://github.com/VanLinLin/DenseSR

Abstract:
The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database. The source code is available at https://github.com/Fsoft-AIC/TolerantECG

Abstract:
3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding. Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation. However, most augmentation strategies only focus on local transformations or semantic recomposition, lacking the consideration of global structural dependencies within scenes. To address this limitation, we propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis. Our method learns object relationship statistics from real-world data to construct guiding graphs for scene generation. Local-level constraints enforce geometric plausibility and semantic consistency between objects, while global-level constraints maintain the topological structure of the scene by aligning the generated layout with the guiding graph. Extensive experiments on indoor and outdoor datasets demonstrate that our framework generates diverse and high-quality augmented scenes, leading to consistent improvements in point cloud segmentation performance across various models. Code is available at: https://github.com/alexander7xu/DualLevelAug

Abstract:
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Although existing studies have made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, Causal Intervention and Counterfactual Reasoning (CICR), that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.

Abstract:
Recent text-to-image diffusion models have achieved remarkable success in generating high-quality images. However, their exclusive reliance on textual prompts falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, LoCo features a novel Localized Attention Constraint, which utilizes the semantic affinity between pixels in self-attention maps to create precise representations of desired objects, thereby ensuring their accurate placement within designated regions. We further introduce a Padding Token Constraint to leverage the semantic information embedded in previously overlooked padding tokens, improving the consistency between object appearance and layout instructions. Our method seamlessly integrates with existing text-to-image and layout-to-image models, improving their spatial control capabilities and addressing semantic failures seen in prior approaches. Extensive experiments demonstrate the superiority of LoCo, outperforming state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

Abstract:
Compositional Customized Image Generation aims to customize multiple target concepts within generated content and has gained attention for its wide application. Despite great success, existing approaches mainly concentrate on preserving the appearance of the target entities while neglecting fine-grained interaction control among them. To equip the model with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human and object and semantic control of the interaction between them. CHOI poses two primary challenges: (1) simultaneous identity preservation and interaction control require the model to decompose the human and object into self-contained identity features and pose-oriented interaction features, while current HOI image datasets fail to provide ideal samples for such feature-decomposed learning; (2) an inappropriate spatial configuration between the human and object may lead to a lack of the desired interaction semantics, as it may provide wrong hints about the human and object body parts crucial for expressing the interaction semantics. To tackle these issues, we first collect and process a large-scale dataset in which each sample contains the same human-object pair in different interactive poses. Such data is tailored for CHOI training, allowing the model to learn how to decompose identity features and interaction features for the target human and object. Then, to provide an appropriate spatial configuration for interaction semantic expression, we design a two-stage model, Interact-Custom, which first explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior and then, under the guidance of this mask, generates the target human and object interacting while preserving their identity features. Furthermore, if users provide the background image and the union location where the target human and object should appear, Interact-Custom offers the optional functionality to specify them, offering high content controllability. Extensive experiments on our tailored metrics for the CHOI task demonstrate the effectiveness of our approach. Our code is available at https://github.com/XZPKU/Inter-custom.git

Abstract:
Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
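
A per-pixel NLL loss with predicted uncertainty is often written in the Gaussian form shown below. The abstract does not state which likelihood FaceMat uses, so the Gaussian choice, the log-variance parameterization, and the uncertainty-proportional distillation weighting are all assumptions made for illustration.

```python
import numpy as np

def gaussian_nll(alpha_pred, log_var, alpha_true):
    """Per-pixel Gaussian NLL, up to an additive constant.

    A large predicted variance discounts the squared error but is itself
    penalized by the log-variance term.
    """
    return 0.5 * (np.exp(-log_var) * (alpha_true - alpha_pred) ** 2 + log_var)

def weighted_distillation_loss(student_alpha, teacher_alpha, teacher_log_var):
    """Hypothetical spatially adaptive distillation loss.

    Pixels where the teacher is more uncertain receive larger weights, so
    the student's loss concentrates on ambiguous regions.
    """
    var = np.exp(teacher_log_var)
    weights = var / var.max()                     # normalized to (0, 1]
    return float((weights * (student_alpha - teacher_alpha) ** 2).mean())
```

With zero log-variance the NLL reduces to half the squared error, which makes the role of the uncertainty term easy to see.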

Abstract:
Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method.

Abstract:
While recent video deblurring methods have advanced significantly, they often overlook two valuable sources of prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGD-Net, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module integrates coding priors into a pre-trained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate that our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. The code and the coding-prior-augmented dataset are available at: https://github.com/liuyike422/CPGD-Net.

Abstract:
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.

Abstract:
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agents' decision-making process and execute unauthorized tasks. Our approach incorporates two key coordinated components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta-prompting and generate a malicious textual command based on it that steers the agents' output toward better compliance with the attacker's requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving an increase of at least 30.1% in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications. Code can be found at https://github.com/Larry0454/CrossInject.

Abstract:
With the surge of large language models (LLMs), Large Vision-Language Models (VLMs), which integrate vision encoders with LLMs for accurate visual grounding, have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios (fine-tuning, access to ground-truth texts, and set-based inference) where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing. Code and data will be available at https://github.com/GradOpt/Revisiting-VLM-MIA.
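
For readers unfamiliar with how membership inference is scored, the AUC figure quoted above can be read as the probability that a randomly chosen member image receives a higher membership score than a randomly chosen non-member image. The sketch below is a minimal illustration of that reading, not the authors' evaluation code; the function name `membership_auc` and the score lists are hypothetical.

```python
def membership_auc(member_scores, nonmember_scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U formulation).

    Returns the fraction of (member, non-member) pairs in which the member
    scores higher; ties count as half a win.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical scores: perfectly separable vs. indistinguishable groups.
separable = membership_auc([0.9, 0.8], [0.1, 0.2])
chance = membership_auc([0.2, 0.8], [0.2, 0.8])
```

An AUC near 0.5, as in the second call, corresponds to the "only marginally better than chance" outcome the abstract reports on i.i.d. benchmarks.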

Abstract:
With the rapid development of wireless communication technology, the efficient utilization of spectrum resources, optimization of communication quality, and intelligent communication have become critical. Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse observational data hinder accurate reconstruction in practical scenarios. Existing methods often fail to align physical constraints with data-driven features, particularly under sparse measurement conditions. To address these issues, we propose the Physics-Aligned Radio Map Diffusion Model (PhyRMDM), a novel framework that establishes cross-domain representation alignment between physical principles and neural network features through dual learning pathways. The proposed model integrates Physics-Informed Neural Networks (PINNs) with a representation alignment mechanism that explicitly enforces consistency between Helmholtz equation constraints and environmental propagation patterns. Our architecture employs two synergistic U-Nets: the first ensures physical consistency by minimizing PDE residuals and boundary conditions through latent space alignment, while the second refines predictions via diffusion-based denoising with attention-guided feature fusion. This dual alignment strategy enables simultaneous satisfaction of wave propagation laws and data distribution characteristics. Experimental results demonstrate significant improvements over state-of-the-art methods, achieving an NMSE of 0.0031 and an RMSE of 0.0125 under Static Radio Map (SRM) conditions, and an NMSE of 0.0047 with an RMSE of 0.0146 in Dynamic Radio Map (DRM) scenarios. The proposed representation alignment paradigm provides a 37.2% accuracy enhancement in ultra-sparse cases (1% sampling rate), confirming its effectiveness in bridging physics-based modeling and deep learning for radio map reconstruction. These advancements establish a new framework for sparse signal environment characterization, with direct applications in 5G/6G network optimization and intelligent spectrum management. The code can be found on the website: https://github.com/Hxxxz0/RMDM
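
The NMSE and RMSE figures above can be made concrete with a small sketch. It assumes the common definitions (squared error normalized by the energy of the ground truth for NMSE, and the square root of the mean squared error for RMSE); the abstract does not state the exact normalization used, so the function names and sample values here are illustrative only.

```python
import math


def nmse(pred, true):
    """Normalized MSE: total squared error divided by ground-truth energy."""
    err = sum((p - t) ** 2 for p, t in zip(pred, true))
    ref = sum(t * t for t in true)
    return err / ref


def rmse(pred, true):
    """Root mean squared error over all map values."""
    err = sum((p - t) ** 2 for p, t in zip(pred, true))
    return math.sqrt(err / len(true))


# Hypothetical flattened radio-map values (predicted vs. ground truth).
predicted = [1.0, 2.0, 3.0]
reference = [1.0, 2.0, 4.0]
scores = (nmse(predicted, reference), rmse(predicted, reference))
```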

Abstract:
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a custom annotation tool, the dataset reflects complex scenarios with similar or identical objects and intricate spatial arrangements. We benchmark six state-of-the-art scene-graph generation models on this dataset, analysing their inference speed and relational accuracy. Our results highlight significant differences in model performance and demonstrate that integrating explicit spatial relationships into foundation models, such as ChatGPT 4o, substantially improves their ability to generate executable, spatially-aware plans for robotics. The dataset and annotation tool are publicly available at https://github.com/PengPaulWang/SpatialAwareRobotDataset, supporting further research in spatial reasoning for robotics.

Abstract:
The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process inevitably introduces semantic degradation that is difficult to quantify with traditional metrics. To address this, we formalize the research problem of Compressed Feature Quality Assessment (CFQA), aiming to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance degradation is provided as true semantic distortion for evaluating CFQA metrics. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. Our findings demonstrate the representativeness of the proposed dataset while underscoring the need for more sophisticated metrics capable of measuring semantic distortion in compressed features. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA. To foster further research, we release the dataset and all associated source code at https://github.com/chansongoal/Compressed-Feature-Quality-Assessment.
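
As a rough illustration of the three metrics named above, the sketch below computes MSE, cosine similarity, and linear CKA between an original and a compressed feature. It is a plain-Python sketch under assumptions: features are given as nested lists, CKA uses the linear kernel, and all names (`mse`, `cosine_similarity`, `linear_cka`) are illustrative rather than taken from the released benchmark code.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length flat feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def _center_columns(m):
    """Subtract the per-column mean from an (n_samples x n_dims) matrix."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]


def _gram(m):
    """Gram matrix M M^T of pairwise sample inner products."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in m] for r1 in m]


def _frob_inner(k, l):
    """Frobenius inner product of two same-shaped matrices."""
    return sum(x * y for rk, rl in zip(k, l) for x, y in zip(rk, rl))


def linear_cka(x, y):
    """Linear CKA between two (n_samples x n_dims) feature matrices."""
    k = _gram(_center_columns(x))
    l = _gram(_center_columns(y))
    return _frob_inner(k, l) / math.sqrt(_frob_inner(k, k) * _frob_inner(l, l))


# Hypothetical original features and a uniformly rescaled "compressed" copy.
original = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
rescaled = [[2.0 * v for v in row] for row in original]
cka_value = linear_cka(original, rescaled)
```

Linear CKA is invariant to uniform rescaling, so the rescaled copy scores close to 1 even though its MSE against the original is large. This is one simple way the three metrics can disagree about the same distortion.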

Abstract:
We present Open-CD, a change detection toolbox that contains a rich set of change detection methods as well as related components and modules. The toolbox started from a series of open source general vision task tools, including OpenMMLab Toolkits, PyTorch Image Models (Timm), etc. It has gradually evolved into a unified platform that covers many popular change detection methods and contemporary modules. It not only includes training and inference code, but also provides some useful scripts for data analysis. We believe this toolbox is by far the most comprehensive change detection toolbox. In this report, we introduce the features, supported methods and applications of Open-CD. In addition, we conduct a benchmarking study on different methods and components. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit for re-implementing existing methods and developing new change detectors. Code and models are available at https://github.com/likyoo/open-cd.

Abstract:
Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual's genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model's capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)3 and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

Abstract:
This workshop aims to explore the potential of large generative models to revolutionize the way we interact with multimodal information. A Large Language Model (LLM) represents a sophisticated form of artificial intelligence engineered to comprehend and produce natural language text, exemplified by technologies such as GPT, LLaMA, Flan-T5, ChatGLM, and Qwen. These models undergo training on extensive text datasets, exhibiting commendable attributes including robust language generation, zero-shot transfer capabilities, and In-Context Learning (ICL). With the recent surge in multimodal content, encompassing images, videos, audio, and 3D models, Large MultiModal Models (LMMs) have seen significant enhancements. These improvements enable the augmentation of conventional LLMs to accommodate multimodal inputs or outputs, as seen in BLIP, Flamingo, KOSMOS, LLaVA, Gemini, and GPT-4. Concurrently, certain research initiatives have delved into generating specific modalities, with Kosmos2 and MiniGPT-5 focusing on image generation, and SpeechGPT on speech production. There are also endeavors to integrate LLMs with external tools to achieve a near 'any-to-any' multimodal comprehension and generation capacity, illustrated by projects like Visual-ChatGPT, ViperGPT, MMREACT, HuggingGPT, and AudioGPT. Collectively, these models, spanning not only text and image generation but also other modalities, are referred to as large generative models. This workshop will provide an opportunity for researchers, practitioners, and industry professionals to explore the latest trends and best practices in the field of multimodal applications of large generative models. We also remark that the submissions are not limited to the use of such models. The workshop will also focus on exploring the challenges and opportunities of integrating large language models with other AI technologies such as computer vision and speech recognition. Additionally, the workshop will provide a platform for participants to present their research, share their experiences, and discuss potential collaborations.

Abstract:
Referring Medical Image Sequence Segmentation (Ref-MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (e.g., endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for interactive, text-driven guidance. To address these limitations, we propose Text-Promptable Propagation (TPP), which enables the recognition of referred objects through cross-modal referring interaction, and maintains continuous tracking across the sequence via Transformer-based triple propagation, using text embeddings as queries. To support this task, we curate a large-scale benchmark, Ref-MISS-Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation. Code and data are available at https://github.com/yuanruntian/TPP.

Abstract:
Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-consuming manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception guided-MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention module, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. The implementation code is available at https://github.com/BNU-ERC-ITEA/APG-MOS.

Abstract:
Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework, Disentangled Multimodal Graph Clustering (DMGC), which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a Multimodal Dual-frequency Fusion mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.

Abstract:
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. The project website is available at https://kunkunlin1221.github.io/InstructFLIP.

Abstract:
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.

Abstract:
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often neglect to optimize visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP), which dramatically enhances CLIP's visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios with extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach.

Abstract:
Human motion style transfer allows characters to appear less rigid and more realistic with a specific style. Traditional arbitrary image style transfer typically processes mean and variance, which has proved effective, and similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF), which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We train AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, the proposed AStF outperforms state-of-the-art methods in motion style transfer. Our code and model are available at https://github.com/CHMimilanlan/AStF.
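
To illustrate the four statistics mentioned above, the following sketch computes mean, variance, skewness, and kurtosis for each channel of a motion sequence along the time axis. It is only a sketch of the standard moment definitions (population variance, non-excess kurtosis), not the AStF implementation; the input layout (a list of frames, each a list of channel values) and the function names are assumptions.

```python
import math


def moments(values):
    """Return (mean, variance, skewness, kurtosis) of a 1-D sequence."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    if std == 0.0:
        # A constant channel has no shape; report zeros for higher moments.
        return mean, 0.0, 0.0, 0.0
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n
    return mean, var, skew, kurt


def channel_statistics(motion):
    """Per-channel moments over time for a (frames x channels) sequence."""
    n_channels = len(motion[0])
    return [moments([frame[c] for frame in motion]) for c in range(n_channels)]


# Hypothetical 5-frame, 2-channel motion clip.
clip = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
stats = channel_statistics(clip)
```

Two channels with identical mean and variance can still differ in skewness or kurtosis, which is the extra information the abstract argues is useful for characterizing motion style.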

Abstract:
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact, the kind that resonates with viewers on a deeper, more meaningful level, remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these hallucinations can be suppressed by employing an evidence-based and objective reasoning process, as substantiated by our proposed baseline, ArtCoT. MLLMs prompted by this principle produce multifaceted, in-depth aesthetic reasoning that aligns significantly better with human judgment. These findings have direct applications in areas such as AI art tutoring and as reward models for image generation. Ultimately, we hope this work paves the way for AI systems that can truly understand, appreciate, and contribute to art that aligns with human aesthetic values. Project homepage: https://github.com/songrise/MLLM4Art.

Abstract:
Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant a backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address these issues, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. CGD also exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.

Abstract:
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.

Abstract:
Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target's evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.

Abstract:
In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements, where users not only need automatic generation of lip synchronization and basic gestures from audio input but also desire semantically accurate and expressive body movement that can be "directly guided" through text descriptions. Therefore, we present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures and even leg movements for whole-body images. In addition, we introduce a multi-modal controlled video diffusion model that generates photorealistic videos, where speech signals govern lip synchronization, facial expressions, and head motions while body movements are guided by the 2D poses. Furthermore, we introduce a token2pose translator to smoothly map 3D motion tokens to 2D pose sequences. This design mitigates the stiffness resulting from direct 3D to 2D conversion and enhances the details of the generated body movements. Extensive experiments show that VersaAnimator synthesizes lip-synced and identity-preserving videos while generating expressive and semantically meaningful whole-body motions. Project page: https://digital-avatar.github.io/ai/VersaAnimator/

Abstract:
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays and complex environments which lead to smaller target sizes. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional settings.

Abstract:
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes.

Abstract:
In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a deterministic sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods.

Abstract:
Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts the intractable posterior sampling in Diffusion-DPO with deterministic inversion from winning and losing samples to noise, and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both the precision and the efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity, compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO.
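
As a rough illustration of the deterministic DDIM inversion mentioned above, the following minimal sketch maps a scalar "sample" to noise and back using the standard DDIM update with eta = 0. It is not the authors' implementation and omits the DPO objective entirely. The names `ddim_step`, `toy_eps_model`, `alpha_bars`, `invert`, and `sample` are hypothetical, the noise schedule is made up, and the noise predictor is a constant stand-in for a trained network.

```python
import math

def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update (eta = 0) from one noise level to another."""
    # Estimate the clean sample implied by x and the predicted noise.
    x0_hat = (x - math.sqrt(1.0 - alpha_bar_from) * eps) / math.sqrt(alpha_bar_from)
    # Re-noise that estimate to the target noise level with the same eps.
    return math.sqrt(alpha_bar_to) * x0_hat + math.sqrt(1.0 - alpha_bar_to) * eps

def toy_eps_model(x, t):
    """Hypothetical noise predictor; a real model would be a trained network."""
    return 0.3

# Assumed cumulative schedule: alpha_bars[0] = 1 is the clean data level.
alpha_bars = [1.0, 0.9, 0.6, 0.3, 0.05]

def invert(x0):
    """Run the update forward in time: sample -> noise."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        x = ddim_step(x, toy_eps_model(x, t), alpha_bars[t], alpha_bars[t + 1])
    return x

def sample(x_T):
    """Run the update backward in time: noise -> sample."""
    x = x_T
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_step(x, toy_eps_model(x, t), alpha_bars[t], alpha_bars[t - 1])
    return x

latent = invert(2.0)
recovered = sample(latent)
```

With the constant toy predictor the round trip is exact up to floating-point error; with a learned predictor, inversion is only approximate because the noise estimate changes between adjacent steps.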

Abstract:
Text-guided diffusion models revolutionize audio generation by adapting source audio to specific text prompts. However, existing zero-shot audio editing methods such as DDIM inversion accumulate errors across diffusion steps, reducing their effectiveness. Moreover, existing editing methods struggle with conducting complex non-rigid music edits while maintaining content integrity and high fidelity. To address these challenges, we propose MelodyEdit, a novel zero-shot music editing system based on an innovative Disentangled Inversion Control (DIC) technique, which comprises Harmonized Attention Control and Disentangled Inversion. Disentangled Inversion disentangles the diffusion process into triple branches to rectify the deviated path of the source branch caused by DDIM inversion. Harmonized Attention Control unifies the mutual self-attention control and the cross-attention control with an intermediate Harmonic Branch to progressively generate the desired harmonic and melodic information in the target music. We also introduce ZoME-Bench, a comprehensive music editing benchmark with 1,100 samples covering ten distinct editing categories. ZoME-Bench facilitates both zero-shot and instruction-based music editing tasks. Our method outperforms state-of-the-art inversion techniques in editing fidelity and content preservation. The code and benchmark will be released. Audio samples are available at https://melody-edit.github.io/.

Abstract:
Music generation aims to create music segments that align with human aesthetics based on diverse conditions. Despite advancements in generating music from specific textual descriptions (e.g., style, genre, instruments), the practical application is still hindered by ordinary users' limited expertise in writing accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching (CFM). We employ multiple Multi-Layer Perceptrons to align multimodal conditions into the audio's CLAP embedding space. CFM is trained to reconstruct the compressed Mel-spectrogram in the VAE latent space, guided by the aligned feature embedding. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi-agent collaboration, we construct an intelligent annotation workflow centered around a fine-tuned Qwen2-VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, with each sample containing a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image-to-music, story-to-music, caption-to-music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high-quality music pieces from multimodal conditions. We hope this work can advance the application of music generation in the multimedia field, making music creation more accessible. Our generated samples are available at https://anonymous22356.github.io/musflow.github.io/.
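
To make the Conditional Flow Matching objective more concrete, here is a minimal sketch of one training target under a commonly used linear probability path (x_t = (1 - t) x0 + t x1, with target velocity x1 - x0). Whether MusFlow uses exactly this path is an assumption; the abstract does not say. All names (`cfm_pair`, `cfm_loss`, `zero_model`, `oracle`) are hypothetical, and short Python lists stand in for VAE latents and condition embeddings.

```python
import random

def cfm_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v

def cfm_loss(predict_velocity, x0, x1, t, condition):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t, target_v = cfm_pair(x0, x1, t)
    pred = predict_velocity(x_t, t, condition)
    return sum((p - v) ** 2 for p, v in zip(pred, target_v)) / len(target_v)

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stands in for a data latent
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time drawn uniformly from [0, 1)

# Two toy "models": one always predicts zero, one knows the true velocity.
zero_model = lambda x_t, t, cond: [0.0] * len(x_t)
oracle = lambda x_t, t, cond: [b - a for a, b in zip(x0, x1)]

loss_zero = cfm_loss(zero_model, x0, x1, t, condition=None)
loss_oracle = cfm_loss(oracle, x0, x1, t, condition=None)
```

In a real system the velocity predictor would be a neural network that also consumes the aligned condition embedding; here `condition` is passed through only to show where it would enter.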

Abstract:
Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness.

Abstract:
4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representations' surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects' surfaces and interiors. Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.

Abstract:
We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. Unlike existing holistic approaches that rely on an element-by-attribute grid representation, our representation accommodates variable-length elements, type-dependent attributes, and text content. Inspired by fill-in-the-middle training in code generation, we train the model to complete the missing part of a design document from its surrounding context, allowing it to treat various design tasks in a unified manner. Our model also supports image generation by predicting discrete image tokens through a specialized tokenizer with support for image transparency. We evaluate MarkupDM on three tasks: attribute value, image, and text completion. We demonstrate that it can produce plausible designs consistent with the given context. To further illustrate the flexibility of our approach, we evaluate it on a new instruction-guided design completion task, where our instruction-tuned MarkupDM compares favorably to state-of-the-art image editing models, especially in textual completion. These findings suggest that multimodal language models with our document representation can serve as a versatile foundation for broad design automation.

Abstract:
With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX.1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset, one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via the following link: https://zhiliangzhang.github.io/EmoArt-130k/

Abstract:
Diffusion models have significantly improved the performance of image editing. Existing methods take various approaches to achieve high-quality image editing, including but not limited to text control, dragging operations, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature of the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded by various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, our EditWorld trains a model on the curated dataset and improves its instruction-following ability with a designed post-edit strategy. Extensive experiments demonstrate that our method significantly outperforms existing editing methods in this new task. https://github.com/YangLing0818/EditWorld

Abstract:
With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and it samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models perform poorly on spinal tasks, highlighting the limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.

Abstract:
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset, MultiRef, containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model, OmniGen, achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

Abstract:
Recent years have brought about a surge in neuromorphic "event" video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADΔER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.

Abstract:
Depression is a widespread mental health issue affecting diverse age groups, with notable prevalence among college students and the elderly. However, existing datasets and detection methods primarily focus on young adults, neglecting the broader age spectrum and individual differences that influence depression manifestation. Current approaches often establish a direct mapping between multimodal data and depression indicators, failing to capture the complexity and diversity of depression across individuals. The Multimodal Personality-aware Depression Detection (MPDD) Challenge aims to address this gap by incorporating multimodal data alongside individual difference factors. The challenge includes two tracks based on age-specific subsets: Track 1 uses the MPDD-Elderly dataset for detecting depression in older adults, and Track 2 uses the MPDD-Young dataset for detecting depression in younger participants. We provide a baseline model that fuses audio and video modalities with individual difference information to detect depression manifestations in diverse populations. This challenge aims to promote the development of more personalized and accurate depression detection methods, advancing mental health research and fostering inclusive detection systems. More details are available on the official challenge website: https://hacilab.github.io/MPDDChallenge.github.io.

Abstract:
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
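
The Reciprocal Rank Fusion step mentioned above can be sketched as follows. This is a generic illustration of RRF, not the authors' code: each item receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The function name `reciprocal_rank_fusion`, the example image IDs, and the constant k = 60 (a commonly used default) are assumptions; the abstract does not state the value used.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical outputs of three retrieval configurations for one caption.
runs = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_b", "img_c", "img_a"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Because RRF uses only ranks, it can combine configurations whose raw scores are on different scales, which is one reason it is a popular choice for fusing heterogeneous retrievers.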

Abstract:
Recent progress in generative AI and the popularity of short-form video-sharing platforms have raised new risks of misinformation videos, posing a potential threat to online multimedia ecosystems. With the aid of generative AI tools, producing and spreading vivid, persuasive misinformation videos has become easier, while detecting and preventing them has become harder. This tutorial introduces how to characterize, detect, and prevent misinformation videos, and it consists of three technical parts: 1) Characterization of AI-generated and human-edited misinformation videos; 2) Detection approaches, covering those tailored for fully generated, manipulated, and human-edited videos; and 3) Prevention strategies, including those effective for the creation and spread phases. This tutorial concludes by discussing the status quo and ongoing challenges and highlighting promising directions for future research. We expect to bring broader attention to misinformation video issues, gather and communicate with interested researchers, and facilitate the engagement of those who are new to this field.

Abstract:
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level (e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8% mPC on Cityscapes-C and +7.9% mPC on DWD over baselines, demonstrating its efficacy. The code is available at https://github.com/startracker0/Boost.

Abstract:
Underwater image restoration aims to remove geometric and color distortions due to water refraction, absorption, and scattering. Previous studies focus on restoring either color or geometry, but to the best of our knowledge, not both. However, in practice it may be cumbersome to address the two rectifications one by one. In this paper, we propose NeuroPump, a self-supervised method to simultaneously optimize and rectify underwater geometry and color as if water were pumped out. The key idea is to explicitly model refraction, absorption, and scattering in a Neural Radiance Field (NeRF) pipeline, such that it not only performs simultaneous geometric and color rectification, but also enables the synthesis of novel views and optical effects by controlling the decoupled parameters. In addition, to address the lack of real paired ground truth images, we propose an underwater 360 benchmark dataset that has real paired (i.e., with and without water) images. Our method clearly outperforms other baselines both quantitatively and qualitatively. Our code and dataset are available at https://ygswu.github.io/NeuroPump.github.io/.

Abstract:
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. However, the class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. Existing class-imbalanced semi-supervised learning (CISSL) methods mainly focus on rebalancing datasets but ignore the potential of using hard examples to enhance performance, making it difficult to fully harness the power of unlabeled data even with sophisticated algorithms. To address this issue, we propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi). This method distinguishes the entropy differences among logits of hard and easy examples, thereby identifying hard examples and increasing the utility of unlabeled data, better addressing the imbalance problem in CISSL. In addition, we maintain a class-balanced memory bank with confidence decay for storing high-confidence embeddings to enhance the pseudo-labels' reliability. Although our method is simple, it is effective and seamlessly integrates with existing approaches. We perform comprehensive experiments on standard CISSL benchmarks and experimentally demonstrate that our proposed SeMi outperforms existing state-of-the-art methods on multiple benchmarks, especially in reversed scenarios, where our best result shows approximately a 54.8% improvement over the baseline methods. Our code is available at https://github.com/pywin/SeMi.
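
As a loose illustration of using the entropy of predictions to separate hard from easy examples, the sketch below computes the Shannon entropy of the softmax over each example's logits and applies a fixed threshold. This is a simplified stand-in, not SeMi's actual criterion, which the abstract does not specify. The function names, the threshold value, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_hard_easy(batch_logits, threshold):
    """Indices with entropy above the threshold are treated as hard."""
    hard, easy = [], []
    for idx, logits in enumerate(batch_logits):
        (hard if entropy(logits) > threshold else easy).append(idx)
    return hard, easy

# Toy unlabeled batch: two confident predictions and one ambiguous one.
batch = [[8.0, 0.0, 0.0], [1.0, 0.9, 1.1], [0.0, 5.0, 0.0]]
hard_idx, easy_idx = split_hard_easy(batch, threshold=0.5)
```

A near-uniform prediction has entropy close to log(C) for C classes, while a confident prediction has entropy near zero, so the threshold controls how ambiguous an example must be before it is treated as hard.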

Abstract:
Audio-visual Generalized Zero-Shot Learning ((G)ZSL) has attracted significant attention for its ability to identify unseen classes in general video classification tasks. However, modality imbalance in (G)ZSL leads to over-reliance on the optimal modality, reducing discriminative capabilities for unseen classes. Though recent studies have attempted to address this issue, two challenges still remain unsolved: (a) Quality discrepancies, where modalities offer differing quantities and qualities of information for the same concept. (b) Content discrepancies, where the contributions of different samples within the same modality exhibit significant differences. To address these challenges, we propose a Discrepancy-Aware Attention Network (DAAN) for Enhanced Audio-Visual (G)ZSL. Our approach introduces a Redundant-Noise Mitigation Attention (RNMA) unit to minimize content discrepancies by mitigating redundant information in modalities and a Contrastive Sample Gradient Modulation (CSGM) mechanism to adjust gradient magnitudes and balance quality discrepancies. We quantify modality contributions by integrating optimization and convergence rate for more precise gradient modulation in CSGM. Experiments demonstrate DAAN achieves state-of-the-art performance on benchmark datasets, with ablation studies validating the effectiveness of individual modules. Code is available at https://github.com/xiaoxinning/DAAN-GZSL.

Abstract:
Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.

Abstract:
Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability. The source data and code (https://github.com/dushide/LargeMvC-Net_ACMMM_2025) and the extended version (http://arxiv.org/abs/2507.20980) are available.

Abstract:
Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for RGB-Event perception. This framework accepts multiple modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrate the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.

Abstract:
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We employ cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) We design a diffusion model that adds noise to retain the core features of multiple modalities and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.

Abstract:
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adapting to unseen complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks. The code is available at https://wangjingyao07.github.io/HCS.github.io/.

Abstract:
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL), a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports, outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding. Our code is available at https://github.com/kaelsunkiller/ssacl.

Abstract:
The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg's superior performance. Code will be available at https://github.com/WeihuangLin/HRSeg.

Abstract:
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.

Abstract:
Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless property of the wavelet transform to perform arbitrary-scale RAW image downscaling in a coarse-to-fine manner. Within this framework, the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve the structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and combine it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
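
The information-lossless property mentioned above can be seen in a minimal sketch: one level of a 1D Haar transform splits a signal into a low-frequency half and a high-frequency half, and the two halves together reconstruct the input exactly. Haar is chosen here only because it is the simplest wavelet; the abstract does not say which wavelet the framework uses, and the function names and sample row are illustrative.

```python
def haar_forward(signal):
    """One level of the (unnormalized) Haar transform on an even-length list.

    Returns the pairwise averages (low band) and pairwise half-differences
    (high band), each half the length of the input.
    """
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the original samples from the low and high bands."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


# Hypothetical row of pixel intensities.
row = [8, 6, 5, 9, 2, 2, 7, 1]
low, high = haar_forward(row)
restored = haar_inverse(low, high)
```

The low band is already a 2× downscaled version of the row, while the high band carries the detail that plain averaging would discard, which is why keeping both bands loses no information.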

Abstract:
In federated learning (FL), multi-step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness-aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, the flatness in local training does not imply the flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not allow SAM in FL to improve the generalization ability of the global model. We define the flatness distance to explain this phenomenon. By rethinking SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.
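
For readers unfamiliar with SAM, the sketch below shows a single generic SAM update on a toy quadratic: the weights are first perturbed by a step of radius `rho` along a chosen direction, and the gradient at the perturbed point drives the descent step. This is standard SAM, not FedNSAM; the optional `direction` argument is a hypothetical hook added here only to mark where a different perturbation direction (such as a momentum vector) could be substituted, and the loss, learning rate, and radius are assumptions.

```python
import math


def sam_step(w, grad_fn, lr=0.05, rho=0.05, direction=None):
    """One sharpness-aware update.

    w         : list of weights
    grad_fn   : function returning the gradient at a given weight list
    direction : optional perturbation direction; defaults to the local
                gradient, as in standard SAM (hypothetical hook)
    """
    g = grad_fn(w)
    d = g if direction is None else direction
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        w_adv = list(w)  # no perturbation possible: falls back to plain gradient descent
    else:
        w_adv = [wi + rho * di / norm for wi, di in zip(w, d)]
    g_adv = grad_fn(w_adv)  # gradient at the perturbed point
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def loss(w):
    """Toy quadratic with one sharp and one flat direction."""
    return 0.5 * (10.0 * w[0] ** 2 + 1.0 * w[1] ** 2)


def grad(w):
    return [10.0 * w[0], 1.0 * w[1]]


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Because the descent gradient is taken at the perturbed point, the iterates settle within a small neighbourhood of the minimum rather than exactly on it; the size of that neighbourhood is governed by `rho`.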

Abstract:
Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules. The code is available at https://github.com/DavidZhang-1025/TQF.

Abstract:
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.

Abstract:
While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.
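
The abstract does not define LCS; assuming it denotes the longest common subsequence between a predicted plan and a reference plan, a minimal sketch of that computation follows. The action strings are invented for illustration, and any normalization the benchmarks apply to the raw length is not reproduced here.

```python
def lcs_length(pred, ref):
    """Length of the longest common subsequence of two action lists.

    Uses the classic dynamic-programming table, where table[i][j] is the
    LCS length of pred[:i] and ref[:j].
    """
    rows, cols = len(pred), len(ref)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if pred[i - 1] == ref[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


# Hypothetical plans: the prediction skips one reference step.
reference = ["open fridge", "take milk", "close fridge", "pour milk"]
predicted = ["open fridge", "take milk", "pour milk"]
score = lcs_length(predicted, reference)
```

A subsequence-based score rewards plans that keep the reference steps in the right order even when some steps are missing or extra steps are interleaved.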

Abstract:
Weakly supervised text-to-person image matching, as a crucial approach to reducing models' reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model's ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching. Code is available at https://github.com/syl6312/DGCMIA.

Abstract:
Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in the training set, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+. Code is available at https://github.com/Eamon-0v0/SDVPT.

Abstract:
Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) has become an active topic in the community, aiming to identify misinformation automatically. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, giving rise to an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and to anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we introduce a new continual MMD method, DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.

Abstract:
Understanding how the brain represents visual information is a fundamental challenge in neuroscience and artificial intelligence. While AI-driven decoding of neural data has provided insights into the human visual system, integrating multimodal neuroimaging signals, such as EEG, MEG, and fMRI, remains a critical hurdle due to their inherent spatiotemporal misalignment. Current approaches often analyze these modalities in isolation, limiting a holistic view of neural representation. In this study, we introduce BrainFLORA, a unified framework for integrating cross-modal neuroimaging data to construct a shared neural representation. Our approach leverages multimodal large language models (MLLMs) augmented with modality-specific adapters and task decoders, achieving state-of-the-art performance on the joint-subject visual retrieval task and showing potential to extend to multitasking. Combining neuroimaging analysis methods, we further reveal how visual concept representations align across neural modalities and with real-world object perception. We demonstrate that the brain's structured visual concept representations exhibit an implicit mapping to physical-world stimuli, bridging neuroscience and machine learning from different modalities of neural imaging. Beyond methodological advancements, BrainFLORA offers novel implications for cognitive neuroscience and brain-computer interfaces (BCIs). Our code is available at https://github.com/ncclab-sustech/BrainFLORA.

Abstract:
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech model regeneration; (2) Semantic enhancement via large language model-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis, which simulates emerging fraud tactics through predefined communication scenarios and fraud typologies to enrich the conversation samples. The generated dataset contains 28,511 rigorously processed audio-text pairs with a total audio duration of more than 307 hours, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from TeleAntiFraud-28k, to facilitate systematic testing of model performance, reasoning capabilities, and thought processes on telecom fraud detection tasks. We also contribute a supervised fine-tuning model based on Qwen2-Audio, trained on the TeleAntiFraud-28k training set, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The code of this paper is publicly available at https://github.com/JimmyMa99/TeleAntiFraud.

Abstract:
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

Abstract:
With the rapid development of AI-generated content (AIGC), the creation of high-quality AI-generated videos has become faster and easier, resulting in the Internet being flooded with all kinds of video content. However, the impact of these videos on the content ecosystem remains largely unexplored. Video information retrieval remains a fundamental approach for accessing video content. Building on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks, we investigate whether similar biases emerge in the context of challenging video retrieval, where temporal and visual factors may further influence model behavior. To explore this, we first construct a comprehensive benchmark dataset containing both real and AI-generated videos, along with a set of fair and rigorous metrics to assess bias. This benchmark consists of 13,000 videos generated by two state-of-the-art open-source video generation models. We meticulously design a suite of rigorous metrics to accurately measure this preference, accounting for potential biases arising from the limited frame rate and suboptimal quality of AIGC videos. We then apply three off-the-shelf video retrieval models to perform retrieval tasks on this hybrid dataset. Our findings reveal a clear preference for AI-generated videos in retrieval. Further investigation shows that incorporating AI-generated videos into the training set of retrieval models exacerbates this bias. Unlike the preference observed in image modalities, we find that video retrieval bias arises from both unseen visual and temporal information, making the root causes of video bias a complex interplay of these two factors. To mitigate this bias, we fine-tune the retrieval models using a contrastive learning approach. The results of this study highlight the potential implications of AI-generated videos on retrieval systems and offer valuable insights for future research in this area. Our dataset and code are publicly available at https://github.com/Siaaaaaa1/video-source-bias.

Abstract:
Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. Furthermore, we evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation. The code and supplementary materials are available at: https://github.com/LastDance500/Tombstone-Parsing.

Abstract:
The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics are limited in scale or in alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs' understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.

Abstract:
Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion, from low-to-normal light and from normal-to-low light, during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model's ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios.

Abstract:
3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. Accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of the Distal Phalanx Tip (TIP) and Wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform a local extraction of the TIP and wrist, thus alleviating the effect of error accumulation on the prediction of the TIP and further reducing the predictive errors for all joints on this basis. EHPE consists of two key stages: in the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; in the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance.

Abstract:
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.

Abstract:
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting 3 x 3 convolutions to computationally efficient 1 x 1 convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.

Abstract:
Speech-driven 3D facial animation aims to synthesize realistic emotional facial expressions that match the input speech. However, existing approaches are constrained by two key limitations: (1) These methods rely on pre-trained models (e.g., Wav2Vec 2.0) as audio emotion feature extractors, which neglect critical frequency-domain characteristics and thereby make it harder to discriminate between similar emotion categories. (2) They treat audio emotions as generic categorical states, ignoring individual differences in emotional expression, ultimately producing over-smoothed emotional representations that appear repetitive and stereotypical. To that end, we introduce PESTalk, a novel approach that generates 3D facial animations with Personalized Emotional Styles directly from speech inputs, thus significantly enhancing the realism of facial animations. Specifically, since acoustic frequency cues contain essential emotional information, we first propose a Dual-Stream Emotion Extractor (DSEE), which captures both time-domain variations and frequency-domain characteristics of audio signals to extract fine-grained affective features and subtle emotional nuances. Furthermore, we design an Emotional Style Modeling Module (ESMM) to achieve personalized emotional styles. This module first establishes a baseline representation for each subject based on voiceprint characteristics, then progressively refines it by continuously integrating emotional features. Ultimately, this process constructs a personalized emotional style representation for each subject in each emotion category, capturing their unique expression patterns. Finally, considering the scarcity of 3D emotional talking face data, we employ an advanced facial capture model to extract pseudo facial blendshape coefficients from 2D emotional data, thereby constructing a large-scale 3D emotional talking face dataset with diverse emotions and personalized expressions (3D-EmoStyle). Extensive quantitative and qualitative evaluations show that PESTalk can generate realistic 3D facial animation and outperform state-of-the-art methods. The codes and dataset are available at: https://github.com/tianshunhan/PESTalk.

Abstract:
Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level Placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.

Abstract:
In the current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection methods can generally be classified into knowledge-based and semantic-based approaches. However, these methods rely heavily on human expertise and feedback, and therefore lack flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM comments on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using the soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach.

Abstract:
Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.

Abstract:
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the model's spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.

Abstract:
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.

Abstract:
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former has reached an impasse, with bitrate reduction leveling off at the cost of tremendous complexity, while the latter suffers from excessively smooth reconstructions as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub Unicorn (Unified Neural Image Compression with One Number Reconstruction). By conceptualizing the images as index-image pairs and learning the inherent distribution of pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from a randomly generated noise with only one index number. The neural model serves as the unified decoder of images while the noises and indexes correspond to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitative and qualitative experimental results demonstrate that our prototype achieves significant bitrate reduction compared with EIC and IIC algorithms. More impressively, benefitting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. The code will be made publicly available upon publication.

Abstract:
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling the above issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for detailed, comprehensive multimodal understanding and dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that our AudioGenie achieves state-of-the-art (SOTA) or comparable performance across 9 metrics in 8 tasks. A user study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetics. The project website with audio samples can be found at https://audiogenie.github.io/.

Abstract:
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff

Abstract:
Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods.

Abstract:
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.

Abstract:
Scalable Vector Graphics (SVG) has become an indispensable technology in front-end development and UI/UX design, due to its inherent advantages in scalability, editability, and rendering efficiency. In the creation of vector graphics, while expressing creative concepts is straightforward, translating them into precise digital artworks is often challenging and time-consuming. To overcome this technical bottleneck and achieve intelligent conversion from concept to final product, we have constructed SVG-1M, a large-scale dataset of high-quality SVG samples with paired textual descriptions. Through innovative data augmentation and annotation processes, we built precisely aligned "Text instruction-SVG code" training pairs, with a subset enhanced by Chain-of-Thought (CoT) annotations. This provides rich semantic supervision signals for model learning. Based on this dataset, we propose SVGen, an end-to-end generative model capable of directly converting natural language descriptions into SVG code. This design addresses the challenges of generating semantically accurate vector graphics while preserving complete structural information. We explored various training strategies and introduced a progressive curriculum learning approach, optimized with reinforcement learning algorithms. Notably, this study innovatively applies the CoT paradigm to vector graphics generation, effectively enhancing both the accuracy and interpretability of SVG synthesis. Experimental validation demonstrates that SVGen exhibits significant advantages over general large models in terms of SVG generation quality, while also surpassing optimization-based rendering methods in generation efficiency. The proposed method enables intelligent conversion between natural language and vector graphics, enabling novel workflows like real-time AI-assisted design iteration. Code, model, and data are released at: https://github.com/gitcat-404/SVGen

Abstract:
Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control over the generation process and correction of the results. Moreover, we jointly optimize the holistic framework to enable streaming processing, real-time inference, and low first-frame delay, offering functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.

Abstract:
Scene-level 3D generation represents a critical frontier in multimedia and computer graphics. While existing approaches have achieved encouraging progress, they still face challenges such as constrained object diversity and limited support for interactive editing. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.

Abstract:
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion Transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a lightweight cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls facial keypoints and body joint trajectories, enabling fine-grained manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our demo, code, and models can be found on this page: https://fantasy-amap.github.io/fantasy-talking/.

Abstract:
How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on https://github.com/TianSuya/T2W.

Abstract:
Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of predicted noisy samples in the reverse process continuously declines compared to perturbed samples in the forward process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) The subband energy of reverse-process reconstructed samples is consistently lower than that of forward-process ones, and both are lower than the original data samples. Based on the first finding, we introduce a dynamic frequency regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we derive the rigorous mathematical form of exposure bias. Notably, our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and frameworks with negligible computational cost. The source code is available at https://github.com/kunzhan/wpp.
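
To make the idea of subband energy and separate low/high-frequency adjustment concrete, the following is a minimal sketch using a single-level 1D Haar wavelet transform. It is an illustration only, not the authors' implementation: the function names, the 1D setting, and the constant gains `low_gain` and `high_gain` are assumptions introduced here, whereas the abstract describes a dynamic regulation mechanism.

```python
import math


def haar_split(signal):
    """Single-level orthonormal Haar transform: returns (low, high) subbands."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def energy(values):
    """Sum of squared coefficients."""
    return sum(v * v for v in values)


def regulate(signal, low_gain, high_gain):
    """Scale each subband by its own (hypothetical, constant) gain and reconstruct."""
    low, high = haar_split(signal)
    return haar_merge([low_gain * a for a in low], [high_gain * d for d in high])


x = [1.0, 3.0, 2.0, 2.0]
low, high = haar_split(x)
print(energy(low), energy(high), energy(x))
print(regulate(x, 1.0, 0.0))
```

Because the Haar transform is orthonormal, the two subband energies add up to the energy of the input, so a drop in total energy can be attributed to individual subbands and compensated with separate gains.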

Abstract:
Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing methods have been proposed to address these issues, they often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field.
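
As background for the null-space idea mentioned above, the sketch below shows plain null-space projection of a weight update: the update is projected so that it has no effect on a set of preserved input vectors. This is a generic illustration under stated assumptions, not ACE's cross null-space projection; the function names and the toy matrices are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_update(delta_w, preserved_keys):
    """Return delta_w @ (I - Q Q^T), where Q spans the preserved keys.

    Each row of the result is orthogonal to every preserved key, so the
    projected update maps those keys to zero.
    """
    basis = orthonormal_basis(preserved_keys)
    projected = []
    for row in delta_w:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        projected.append(r)
    return projected


delta_w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # toy weight update (2 x 3)
keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]     # toy inputs whose outputs must not change
safe_update = project_update(delta_w, keys)
print(safe_update)
```

Adding `safe_update` to a linear layer's weights leaves the layer's outputs on the preserved keys unchanged, while directions outside their span can still be edited.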

Abstract:
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared token of prompts that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.

Abstract:
Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
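
One way to picture a soft target distribution loss for quantized CAD parameters is a cross-entropy against a target that spreads probability over values near the ground truth, so near-misses are penalized less than distant errors. The sketch below is a hedged illustration and not the Drawing2CAD loss: the window size `tolerance`, the geometric `decay`, and all function names are assumptions made for this example.

```python
import math


def soft_target(true_index, num_classes, tolerance=2, decay=0.5):
    """Target distribution peaked at true_index, decaying over nearby bins."""
    weights = [
        decay ** abs(i - true_index) if abs(i - true_index) <= tolerance else 0.0
        for i in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, true_index, tolerance=2, decay=0.5):
    """Cross-entropy between the soft target and the predicted distribution."""
    target = soft_target(true_index, len(logits), tolerance, decay)
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target, log_probs) if p > 0.0)


# A prediction one bin away from the truth costs less than one four bins away.
near = [0.0] * 8
near[4] = 5.0
far = [0.0] * 8
far[7] = 5.0
print(soft_cross_entropy(near, 3), soft_cross_entropy(far, 3))
```

With a one-hot target, both predictions in the usage example would incur the same loss; the soft target is what lets the loss distinguish a close parameter value from a distant one.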

Abstract:
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation, then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.

Abstract:
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints and demo samples are available at https://github.com/yanghaha0908/EmoVoice.

Abstract:
Translating chart images into executable plotting scripts, referred to as the chart-to-code generation task, requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs, making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.

Abstract:
Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study that applies membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. Extensive experiments demonstrate the superiority of our method over those designed for LLMs, in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, extensive experiments suggest two key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than training data from other autoregressive paradigms. Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
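
The aggregation step described above can be illustrated with a minimal sketch. The abstract does not give the weighting rule, so the softmax over negated scores, the `temperature` parameter, and the function name `aggregate_scores` are assumptions made for illustration only, not the authors' method.

```python
import math


def aggregate_scores(token_scores, temperature=1.0):
    """Weighted mean of token scores that emphasizes the lower scores.

    Assumption: weights are a softmax over the negated scores. The
    abstract only states that lower-scoring tokens get more emphasis.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    lowest = min(token_scores)
    # Shifting by the minimum keeps every exponent <= 0 (numerical stability).
    weights = [math.exp(-(s - lowest) / temperature) for s in token_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, token_scores)) / total
```

Under this sketch, a query image would be flagged as a likely training member when its aggregated score exceeds a threshold chosen on held-out data.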

Abstract:
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.

Abstract:
Enabling agents to predict the outcomes of their own motion intentions in three-dimensional space is a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase schedule to train a foundation model, initially devoid of embodied spatial knowledge, into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination capabilities, especially with over a 50% improvement in metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.

Abstract:
The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection. The dataset will be released at https://github.com/yangbingjian/GroundLie360.

Abstract:
Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects and complex backgrounds - all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes - including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes - spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale - 300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.

Abstract:
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,435 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available at https://wenbin08.github.io/RecipeGen.

Abstract:
The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities with respect to fine-grained physical principles, especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curate high-quality prompts of geometric optical scenarios and use MLLMs to construct the GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding. Database and codes are publicly available at: https://github.com/aiben-ch/GOBench.

Abstract:
With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, which occupy on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. VER-Bench comprises 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, and each question is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based reasoning chains, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. The dataset is available at https://github.com/verbta/ACMMM-25-Materials.

Abstract:
Fine-grained analysis of complex and high-speed sports like badminton presents a significant challenge for Multimodal Large Language Models (MLLMs), despite their notable advancements in general video understanding. This difficulty arises primarily from the scarcity of datasets with sufficiently rich and domain-specific annotations. To bridge this gap, we introduce FineBadminton, a novel and large-scale dataset featuring a unique multi-level semantic annotation hierarchy (Foundational Actions, Tactical Semantics, and Decision Evaluation) for comprehensive badminton understanding. The construction of FineBadminton is powered by an innovative annotation pipeline that synergistically combines MLLM-generated proposals with human refinement. We also present FBBench, a challenging benchmark derived from FineBadminton, to rigorously evaluate MLLMs on nuanced spatio-temporal reasoning and tactical comprehension. Together, FineBadminton and FBBench provide a crucial ecosystem to catalyze research in fine-grained video understanding and advance the development of MLLMs in sports intelligence. Furthermore, we propose an optimized baseline approach incorporating Hit-Centric Keyframe Selection to focus on pivotal moments and Coordinate-Guided Condensation to distill salient visual information. The results on FBBench reveal that while current MLLMs still face significant challenges in deep sports video analysis, our proposed strategies nonetheless achieve substantial performance gains. The project homepage is available at https://finebadminton.github.io/FineBadminton/.

Abstract:
AudioSet is a widely used benchmark in the audio research community and has significantly advanced various audio-related tasks. However, persistent issues with label accuracy and completeness remain critical bottlenecks that limit performance in downstream applications. To address these challenges, we propose a three-stage reannotation framework that harnesses general-purpose audio-language foundation models to systematically improve the label quality of AudioSet. The framework employs a cross-modal prompting strategy, inspired by the concept of prompt chaining, wherein prompts are sequentially composed to execute subtasks (audio comprehension, label synthesis, and semantic alignment). Leveraging this framework, we construct a high-quality, structured relabeled version of AudioSet, named AudioSet-R. Extensive experiments conducted on representative audio classification models, including AST, PANNs, SSAST, and AudioMAE, consistently demonstrate substantial performance improvements, thereby validating the generalizability and effectiveness of the proposed approach in enhancing label reliability. The code is publicly available at: https://github.com/colaudiolab/AudioSet-R.

Abstract:
We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, OpenEvents V1 emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.

Abstract:
Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task. Our codebase and newly generated datasets are available at https://github.com/chenxn2020/Chart-HQA.

Abstract:
Text-driven image editing has achieved remarkable success in following single instructions. However, real-world scenarios often involve complex, multi-step instructions, particularly "chain" instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. ComplexBench-Edit also features a new vision consistency evaluation method that accurately assesses non-modified regions by excluding edited areas. Furthermore, we propose a simple yet powerful Chain-of-Thought (CoT)-based approach that significantly enhances the ability of existing models to follow complex instructions. Our extensive experiments demonstrate ComplexBench-Edit's efficacy in differentiating model capabilities and highlight the superior performance of our CoT-based method in handling complex edits. The data and code are released at https://github.com/llllly26/ComplexBench-Edit.
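
The idea of scoring consistency only over non-modified regions can be sketched as follows. The abstract does not name the underlying metric, so the mean absolute pixel difference, the nested-list grayscale image layout, and the function name `masked_consistency` are all assumptions chosen for illustration.

```python
def masked_consistency(original, edited, edit_mask):
    """Mean absolute pixel difference over unedited pixels only.

    `original` and `edited` are 2D lists of grayscale values;
    `edit_mask` is a 2D list where a truthy entry marks an edited pixel.
    Lower values mean the untouched regions were better preserved.
    """
    diffs = [
        abs(o - e)
        for row_o, row_e, row_m in zip(original, edited, edit_mask)
        for o, e, m in zip(row_o, row_e, row_m)
        if not m  # skip pixels inside the edited area
    ]
    if not diffs:
        raise ValueError("edit_mask covers the whole image")
    return sum(diffs) / len(diffs)
```

Excluding the masked pixels keeps an intended edit from being counted as an inconsistency, which is the flaw in whole-image metrics that the abstract points to.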

Abstract:
As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We propose the term Robotic-Generated Content (RGC) for these videos, which are captured from the egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is then conducted to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: https://github.com/IntMeGroup/RGC-VQA.

Abstract:
The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies commonly seen in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M with 2 million video clips, diversified manipulation strategies, and audio-visual perturbations. This paper describes the data generation strategies and benchmarks AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the Deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.

Abstract:
Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs have demonstrated proficiency in spatial understanding and in temporal understanding separately, they are not able to perform fine-grained spatial control over time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intent-oriented control in video. To this end, we propose IntentVCNet, a novel model that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments the object semantic information in the global visual context so that the visual tokens carry prior information about the user intent. Experiments show that the combination of the two strategies further enhances the LVLM's ability to model spatial details in video sequences and helps LVLMs accurately generate controlled intent-oriented captions. Our proposed method achieved state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.

Abstract:
Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at https://github.com/yllhwa/SMPDVideo.
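
The preprocessing steps named above can be sketched with the standard library alone. The abstract does not say which outlier rule is used, so the interquartile-range rule, the `k=1.5` factor, and both function names are assumptions for illustration; the gradient-boosted regressor itself is not reproduced here.

```python
import math
import statistics


def log_transform(values):
    """Compress heavy-tailed, non-negative counts with log(1 + x)."""
    return [math.log1p(v) for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical view counts with one extreme value.
views = [10, 12, 11, 13, 12, 11, 10, 1_000_000]
cleaned = remove_outliers_iqr(log_transform(views))
```

Log-transforming first keeps a few viral posts from dominating the quartile estimates, so the filter removes only genuinely extreme targets.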

Abstract:
The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

Abstract:
The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.

Abstract:
Multivariate Time Series Forecasting (MTSF) plays a key role in many applications. Recent works have explored using Large Language Models (LLMs) for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance. The code is made available at https://github.com/BenchCouncil/DualSG.

Abstract:
Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
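
The frame-level retrieval formulation can be illustrated with a deliberately simplified sketch. It covers only the matching of frame features against one query vector; the cosine similarity, the fixed `threshold`, and the function names are assumptions, and the dual-stream decoder and attention masking described in the abstract are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_frames(frame_features, query, threshold=0.5):
    """Return one 0/1 activity flag per frame for a single query vector."""
    return [1 if cosine(f, query) >= threshold else 0 for f in frame_features]
```

Because the query is just a vector, the same matching step works whether it was derived from a text prompt or an audio prompt, which is what makes the formulation open-vocabulary.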

Abstract:
Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.

Abstract:
Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches.
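
The null-space half of the dual-space transformation can be sketched in plain Python. This is an illustration under stated assumptions, not PurifyGen's implementation: the toxic concept matrix is taken to have one concept embedding per row, Gram-Schmidt is used to obtain a basis, and the function names are hypothetical. The complementary alignment into the range space of clean concepts is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt over the toxic concept vectors (rows of the matrix)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(embedding, toxic_vectors):
    """Remove every component of `embedding` lying in the toxic span."""
    out = list(embedding)
    for b in orthonormal_basis(toxic_vectors):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out
```

After projection the embedding is orthogonal to every toxic concept vector, so multiplying it by the toxic concept matrix gives zero up to floating-point error, which is the defining property of the null space.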

Abstract:
Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and runs inference ten times faster than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.

Abstract:
Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.

Abstract:
In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.

Abstract:
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture. Our code is released on https://github.com/LuoMSen/KAN-MCP.

Abstract:
Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves an 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks. Code: https://github.com/ResearchGroup-MedVLLM/CheX-Phi35V

Abstract:
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (dysco) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our dysco surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently hallucinate, generating textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.

Abstract:
Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/

Abstract:
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster.

Abstract:
With the advancement of face manipulation technology, forged images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between the two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The proposed FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Furthermore, to mitigate the interference from general semantic object information, we propose the MNM that leverages multiple noise extractors based on the mixture of experts concept. This allows the MNM to learn semantic-agnostic forgery features from general RGB features, further boosting the performance of our proposed framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization, on which the proposed MoNFAP achieves strong performance. The code is available at https://github.com/miaoct/MoNFAP.

Abstract:
Recent advancements in image generation have provoked social and security concerns, yet most detection methods rely on black-box models that generalize poorly. By utilizing advances in Multi-modal Large Language Models (MLLMs), we propose a framework that fuses six specialized paradigms, each analyzing a distinct aspect of the image, to provide a final verdict with coherent, evidence-based reasoning. Experimental results on a diverse dataset of real and AI-generated images demonstrate that our approach outperforms both traditional detection methods and top humans, while providing . This study underscores the potential of MLLMs in developing robust, explainable, and reasoning-driven detection systems. The code is available at https://github.com/Gennadiyev/mllm-defake.

Abstract:
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under real-world distribution shifts.
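
For context, the baseline that cache-based TTA methods start from, keeping only low-entropy samples per predicted class, can be sketched as follows. This is an illustrative sketch of that entropy-only baseline, not of ReTA's CER or DDC; the class name `EntropyCache`, its `capacity_per_class` parameter, and the use of raw logits as input are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyCache:
    """Keeps, for each predicted class, the lowest-entropy samples seen so far."""

    def __init__(self, capacity_per_class):
        self.capacity = capacity_per_class
        self.slots = {}  # predicted class index -> list of (entropy, feature)

    def update(self, feature, logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        h = entropy(probs)
        bucket = self.slots.setdefault(label, [])
        bucket.append((h, feature))
        # Sort by entropy only, then drop the most uncertain entries.
        bucket.sort(key=lambda item: item[0])
        del bucket[self.capacity:]
        return label, h
```

Because admission depends on entropy alone, a confidently wrong prediction under distribution shift enters the cache just as easily as a correct one, which is the failure mode the first challenge above describes.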

Abstract:
Radiography imaging protocols target specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. Window-sliding prototype learning and Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we exploit the concentration and contrast characteristics of anomaly maps to improve anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method's state-of-the-art performance, validating its efficacy and robustness.

Abstract:
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges, including unnecessary incorporation of image data in certain scenarios and the reliance only on a one-time extraction of visual features, which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.

Abstract:
Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage burden. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unexplored. In this paper, we present the first systematic study on universal feature coding for large models. The key challenge lies in the inherently diverse and distributionally incompatible nature of features extracted from different models. For example, features from DINOv2 exhibit highly peaky, concentrated distributions, while those from Stable Diffusion 3 (SD3) are more dispersed and uniform. This distributional heterogeneity severely hampers both compression efficiency and cross-model generalization. To address this, we propose a learned peaky-to-balanced distribution transformation, which reshapes highly skewed feature distributions into a common, balanced target space. This transformation is non-uniform, data-driven, and plug-and-play, enabling effective alignment of heterogeneous distributions without modifying downstream codecs. With this alignment, a universal codec trained on the balanced target distribution can effectively generalize to features from different models and tasks. We validate our approach on three representative large models (LLaMA3, DINOv2, and SD3) across multiple tasks and modalities. Extensive experiments show that our method achieves notable improvements in both compression efficiency and cross-model generalization over task-specific baselines. All source code has been made available at https://github.com/chansongoal/DT-UFC.

Abstract:
Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.

Abstract:
Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model's ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.

Abstract:
Emotion recognition using electroencephalography (EEG) signals has attracted increasing attention in recent years. However, existing methods often lack generalization in cross-corpus settings, where a model trained on one dataset is directly applied to another without retraining, due to differences in data distribution and recording conditions. To tackle the challenge of cross-corpus EEG-based emotion recognition, we propose a novel framework termed Soft Contrastive Masked Modeling (SCMM). Grounded in the theory of emotional continuity, SCMM integrates soft contrastive learning with a hybrid masking strategy to effectively capture emotion dynamics (refer to short-term continuity). Specifically, in the self-supervised learning stage, we propose a soft weighting mechanism that assigns similarity scores to sample pairs, enabling fine-grained modeling of emotional transitions and capturing the temporal continuity of human emotions. To further enhance representation learning, we design a similarity-aware aggregator that fuses complementary information from semantically related samples based on pairwise similarities, thereby improving feature expressiveness and reconstruction quality. This dual design contributes to a more discriminative and transferable representation, which is crucial for robust cross-corpus generalization. Extensive experiments on the SEED, SEED-IV, and DEAP datasets show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy of 4.26% under both same-class and different-class cross-corpus settings. The source code is available at https://github.com/Kyler-RL/SCMM.

Abstract:
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that is irrelevant to locating the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at the local, event, and global levels, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance among VMR methods that use audio-video fusion. Our code is available at https://github.com/HuiGuanLab/IMG.

Abstract:
The past decade has witnessed rapid advancements in cross-modal retrieval, with significant progress made in accurately measuring the similarity between cross-modal pairs. However, the persistent hubness problem, a phenomenon where a small number of targets frequently appear as nearest neighbors to numerous queries, continues to hinder the precision of similarity measurements. Despite several proposed methods to reduce hubness, their underlying mechanisms remain poorly understood. To bridge this gap, we analyze the widely adopted Inverted Softmax approach and demonstrate its effectiveness in balancing target probabilities during retrieval. Building on these insights, we propose a probability-balancing framework for more effective hubness reduction. We contend that balancing target probabilities alone is inadequate and, therefore, extend the framework to balance both query and target probabilities by introducing Sinkhorn Normalization (SN). Notably, we extend SN to scenarios where the true query distribution is unknown, showing that current methods, which rely solely on a query bank to estimate target hubness, produce suboptimal results due to a significant distributional gap between the query bank and targets. To mitigate this issue, we introduce Dual Bank Sinkhorn Normalization (DBSN), incorporating a corresponding target bank alongside the query bank to narrow this distributional gap. Our comprehensive evaluation across various cross-modal retrieval tasks, including image-text retrieval, video-text retrieval, and audio-text retrieval, demonstrates consistent performance improvements, validating the effectiveness of both SN and DBSN. All code is publicly available at https://github.com/ppanzx/DBSN.
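
To make the contrast between balancing target probabilities only and balancing both sides concrete, here is a small sketch on a query-by-target similarity matrix. It is not the released DBSN code: the function names, the temperature `beta`, the fixed iteration count, and the choice of uniform marginals (each query row summing to 1, each target column summing to `n_queries / n_targets`) are assumptions for illustration, and the query and target banks are omitted.

```python
import math


def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


def inverted_softmax(sim, beta=1.0):
    """Softmax over queries for each target (column-wise normalization).

    A target that scores highly against every query gets its scores
    flattened, which reduces its chance of being retrieved everywhere.
    """
    n_q, n_t = len(sim), len(sim[0])
    out = [[0.0] * n_t for _ in range(n_q)]
    for j in range(n_t):
        col_max = max(sim[i][j] for i in range(n_q))
        exps = [math.exp(beta * (sim[i][j] - col_max)) for i in range(n_q)]
        z = sum(exps)
        for i in range(n_q):
            out[i][j] = exps[i] / z
    return out


def sinkhorn_normalize(sim, beta=1.0, n_iters=50):
    """Alternately rescale rows and columns of exp(beta * sim).

    Assumed marginals: every row sums to 1 and every column sums to
    n_queries / n_targets, so both queries and targets are balanced.
    """
    n_q, n_t = len(sim), len(sim[0])
    m = max(max(row) for row in sim)
    k = [[math.exp(beta * (s - m)) for s in row] for row in sim]
    col_target = n_q / n_t
    for _ in range(n_iters):
        # Balance queries: each row sums to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Balance targets: each column sums to col_target.
        col_sums = [sum(k[i][j] for i in range(n_q)) for j in range(n_t)]
        k = [[k[i][j] * col_target / col_sums[j] for j in range(n_t)]
             for i in range(n_q)]
    return k
```

On a toy matrix such as `[[0.9, 0.8, 0.1], [0.9, 0.2, 0.3], [0.9, 0.1, 0.8]]`, target 0 is the raw nearest neighbor of every query; after `inverted_softmax(sim, beta=10.0)` the three queries retrieve three different targets.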

Abstract:
The explosive growth of the video game industry has created an urgent need for recommendation systems that can scale with expanding catalogs and maintain user engagement. While prior work has explored accuracy and diversity in recommendations, existing models underutilize playtime, a rich behavioral signal unique to gaming platforms, and overlook the potential of multimodal information to enhance diversity. In this paper, we propose DP2Rec, a novel Dual-Phase Playtime-guided Recommendation model designed to jointly optimize accuracy and diversity. First, we introduce a playtime-guided interest intensity exploration module that separates strong and weak preferences via dual-beta modeling, enabling fine-grained user profiling and more accurate recommendations. Second, we present a playtime-guided multimodal random walks module that simulates player exploration using transitions guided by both playtime-derived interest similarity and multimodal semantic similarity. This mechanism preserves core preferences while promoting cross-category discovery through latent semantic associations and adaptive category balancing. Extensive experiments on a real-world game dataset show that DP2Rec outperforms existing methods in both recommendation accuracy and diversity. The dataset and source code are released at https://github.com/zqxwcevrtyui/DP2Rec

Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.

Abstract:
Cross-domain recommendation (CDR) aims to address the persistent cold-start problem in Recommender Systems. Current CDR research concentrates on transferring cold-start users' information from the auxiliary domain to the target domain. However, these systems face two main issues: the underutilization of multimodal data, which hinders effective cross-domain alignment, and the neglect of side users who interact solely within the target domain, leading to inadequate learning of the target domain's vector space distribution. To address these issues, we propose a model leveraging Multimodal data and Side users for diffusion Cross-domain recommendation (MuSiC). First, we employ a multimodal large language model to extract item multimodal features and leverage a large language model to uncover user features. Second, we propose the cross-domain diffusion module to learn the generation of feature vectors in the target domain. This approach involves learning feature distribution from side users and understanding the patterns in cross-domain transformation through overlapping users. Subsequently, the trained diffusion module is used to generate feature vectors for cold-start users in the target domain, enabling the completion of cross-domain recommendation tasks. Finally, our experimental evaluation on the Amazon dataset confirms that MuSiC achieves state-of-the-art performance, significantly outperforming all selected baselines. Our code is available at https://github.com/zhangf16/MuSiC.

Abstract:
Music-Driven Dance Generation seeks to create dance movements synchronized with music, playing a key role in applications like performance and gaming. While solo dance generation has seen progress, group dance generation remains underexplored. Although several methods have been proposed, existing approaches frequently fail to ensure spatial-temporal coherence, resulting in unrealistic and aesthetically unpleasing performances. To tackle the issue, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CoheDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code is available at https://github.com/XulongT/CoheDancers.

Abstract:
Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4% accuracy, closely matching EEG-based approaches (89.16%), while reducing deployment complexity. With an average inference latency of 90ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.

Abstract:
Multimodal Biomedical Image Incremental Learning (MBIIL) is essential for handling diverse tasks and modalities in the biomedical domain, as training separate models for each modality or task significantly increases inference costs. Existing incremental learning methods focus on task expansion within a single modality, whereas MBIIL seeks to train a unified model incrementally across modalities. MBIIL faces two challenges: I) How to preserve previously learned knowledge during incremental updates? II) How to effectively leverage knowledge acquired from existing modalities to support new modalities? To address these challenges, we propose MSLoRA-CR, a method that fine-tunes Modality-Specific LoRA modules while incorporating Contrastive Regularization to enhance intra-modality knowledge sharing and promote inter-modality knowledge differentiation. Our approach builds upon a large vision-language model (LVLM), keeping the pretrained model frozen while incrementally adapting new LoRA modules for each modality or task. Experiments on the incremental learning of biomedical images demonstrate that MSLoRA-CR outperforms both the state-of-the-art (SOTA) approach of training separate models for each modality and the general incremental learning method (incrementally fine-tuning LoRA). Specifically, MSLoRA-CR achieves a 1.88% improvement in overall performance compared to unconstrained incremental learning methods while maintaining computational efficiency. Our code is publicly available at https://github.com/VentusAislant/MSLoRA_CR.

Abstract:
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. The success of the BAG systems depends on the effectiveness of cross-modal reasoning and spatial understanding. Current methods have explored the use of visual information as guidance for binaural audio generation. However, they rely solely on cross-attention mechanisms to guide the generation process and under-utilise the temporal and spatial information in video data during training and inference. These limitations result in the loss of fine-grained spatial details and risk overfitting to specific environments, ultimately constraining model performance. In this paper, we address the aforementioned issues by introducing a new audio-visual binaural generation model with an audio-visual conditional normalisation layer that dynamically aligns the target difference audio features using visual context. To enhance spatial sensitivity, we also introduce a contrastive learning method that mines negatives from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play, MUSIC-Stereo, and YT-MUSIC benchmarks. Code is available at https://github.com/SonyResearch/CCStereo.

Abstract:
Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
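
To make the idea of a margin that grows with preference strength concrete, the sketch below adds a per-pair margin to a generic DPO-style pairwise loss. It is an assumption-laden illustration, not the paper's DMPO objective: the function name `margin_preference_loss`, the linear rule `margin = margin_scale * preference_strength`, and the parameters `beta` and `margin_scale` are all hypothetical.

```python
import math


def margin_preference_loss(logratio_chosen, logratio_rejected,
                           preference_strength, beta=0.1, margin_scale=1.0):
    """Pairwise preference loss with a margin that scales with preference strength.

    logratio_* are log-probability ratios (policy vs. reference) for the preferred
    and the rejected layout. Every name and default here is hypothetical.
    """
    # Assumed rule: the more strongly annotators prefer the chosen layout,
    # the larger the gap the model is asked to produce.
    margin = margin_scale * preference_strength
    z = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    weak = margin_preference_loss(2.0, 1.0, preference_strength=0.1)
    strong = margin_preference_loss(2.0, 1.0, preference_strength=1.0)
    print(weak, strong)  # the strongly preferred pair is penalized more for the same gap
```

With a fixed margin of zero this reduces to the usual DPO-style logistic loss on the log-ratio gap; the margin term only shifts how large that gap must be before the loss becomes small.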

Abstract:
Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
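
Partially shifting feature components along the temporal dimension can be pictured with a generic channel-wise temporal shift. The sketch below is not the RTS module itself: it shows only a one-directional, non-recurrent shift, and the function name `partial_temporal_shift`, the `(T, C, H, W)` layout, and the `shift_fraction` parameter are assumptions for illustration.

```python
import numpy as np


def partial_temporal_shift(x, shift_fraction=0.25):
    """Shift the first `shift_fraction` of channels one frame forward in time.

    `x` has shape (T, C, H, W). Frame t receives the shifted channels of frame t-1,
    so each frame sees some features from its predecessor at no extra parameter cost.
    Names and defaults are hypothetical.
    """
    num_channels = x.shape[1]
    k = int(num_channels * shift_fraction)  # number of channels to shift
    out = x.copy()
    out[1:, :k] = x[:-1, :k]  # pull shifted channels from the previous frame
    out[0, :k] = 0.0          # the first frame has no predecessor; pad with zeros
    return out


if __name__ == "__main__":
    frames = np.random.default_rng(0).random((5, 8, 4, 4))  # 5 frames, 8 channels
    mixed = partial_temporal_shift(frames)
    print(mixed.shape)
```

Because only a fraction of the channels moves, the remaining channels keep each frame's own content, which is why a shift of this kind can mix information across frames without a dedicated temporal layer.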

Abstract:
Lens flare removal remains challenging because information from the underlying image background and the optical flares is entangled by the complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling the flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. Particularly, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.

Abstract:
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.

Abstract:
Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects, but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for basic geometry representation and use sparse-controlled points to capture the 2D Gaussian's deformation. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, we remove floaters that are prone to occur during reconstruction and can extract high-quality dynamic mesh sequences of dynamic objects. Experiments demonstrate that our D-2DGS is outstanding in reconstructing detailed and smooth high-quality meshes from sparse inputs. The code is available at https://github.com/hustvl/Dynamic-2DGS.

Abstract:
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-processing and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.

Abstract:
Galaxy morphology analysis involves studying galaxies based on their shapes and structures. For such studies, fundamental tasks include identifying and classifying galaxies in astronomical images, as well as retrieving visually or structurally similar galaxies through similarity search. Existing methods either directly train domain-specific foundation models on large, annotated datasets or fine-tune vision foundation models on a smaller set of images. The former is effective but costly, while the latter is more resource-efficient but often yields lower accuracy. To address these challenges, we introduce GalaxAlign, a multimodal approach inspired by how citizen scientists identify galaxies in astronomical images by following textual descriptions and matching schematic symbols. Specifically, GalaxAlign employs a tri-modal alignment framework to align three types of data during fine-tuning: (1) schematic symbols representing galaxy shapes and structures, (2) textual labels for these symbols, and (3) galaxy images. By incorporating multimodal instructions, GalaxAlign eliminates the need for expensive pretraining and enhances the effectiveness of fine-tuning. Experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge. Code is available at https://github.com/RapidsAtHKUST/GalaxAlign.

Abstract:
Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present Gamma, a Generic imAge assessMent model using Mixture of Assessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Codes are available at https://github.com/zht8506/Gamma.

Abstract:
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
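
For readers unfamiliar with the guidance term, the standard classifier-free guidance estimate and a reversed-direction variant can be written as follows. This is a sketch under stated assumptions rather than the ANT objective: the guidance scale $w$, the switch step $t'$, and the exact form of the reversed term are illustrative choices that do not come from the abstract. Here $\epsilon_\theta$ is the noise predictor, $c$ the text condition, $\varnothing$ the empty condition, and denoising runs from large $t$ (early) to small $t$ (late).

```latex
\[
\begin{aligned}
\text{standard:}\quad
\tilde{\epsilon}_\theta(x_t, c) &= \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr), \\
\text{reversed (assumed, for } t \le t'\text{):}\quad
\tilde{\epsilon}_\theta(x_t, c) &= \epsilon_\theta(x_t, \varnothing)
  - w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr).
\end{aligned}
\]
```

Under this reading, the estimate for $t > t'$ is left unchanged, so the early steps that fix global structure are unaffected, while the sign flip in later steps pushes samples away from the concept $c$.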

Abstract:
Trajectory distillation based on consistency models (CMs) provides an effective framework for accelerating diffusion models by reducing inference steps. However, we find that existing CMs degrade style similarity and compromise aesthetic quality in stylization tasks, especially when handling image-to-image or video-to-video transformations that start denoising from partially noised inputs. The core limitation stems from existing methods enforcing initial-step alignment between the probability flow ODE (PF-ODE) trajectories of student models and their imperfect teacher models. This partial alignment strategy inevitably fails to guarantee full trajectory consistency, thereby compromising the overall generation quality. To address this issue, we propose Single Trajectory Distillation (STD), a training framework initiated from partial noise states. To counteract the additional time overhead introduced by STD, we design a trajectory bank that pre-stores intermediate states of the teacher model's PF-ODE trajectories, effectively offsetting the computational cost during student model training. This mechanism ensures STD maintains equivalent training efficiency compared to conventional consistency models. Furthermore, we incorporate an asymmetric adversarial loss to explicitly enhance style consistency and perceptual quality in generated outputs. Extensive experiments on image and video stylization demonstrate that our method surpasses existing acceleration models in terms of style similarity and aesthetic evaluations. Our code and results are available on the project page: https://single-trajectory-distillation.github.io/.

Abstract:
Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connection to link gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1.

Abstract:
Face Video Restoration (FVR) aims to reconstruct high-quality face videos from degraded input. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity-consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration. Our code and datasets are available at https://ip-fvr.github.io/.

Abstract:
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio. DualBench and generated samples of DualDub are available at https://github.com/wjtian-wonderful/DualBench.

Abstract:
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.

Abstract:
The masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion. Project page: https://xiangyue-zhang.github.io/EchoMask

Abstract:
In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety of autonomous driving systems, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.

Abstract:
Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. Vision-language models such as CLIP now enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D content into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content.

Abstract:
Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a common location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.

Abstract:
With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, with 540,000 images across real, AI-edited, and AI-generated content; (ii) up-to-date models, with fake images generated by 12 state-of-the-art generation models; and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore, we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS-specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.

Abstract:
Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research. Our project page and dataset are available at https://woxelikeloud.github.io/multiego/.

Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving.

Abstract:
The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR's alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at https://github.com/INSLabCN/SEAR-Dataset.

Abstract:
Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. Moreover, the emergence of multimodal models capable of jointly processing visual and textual data presents new opportunities for Animal ReID, but existing datasets fail to leverage these models' text-processing capabilities, limiting their full potential. To address these limitations, we propose MetaWild, a multimodal Animal ReID dataset comprising 20,890 images across six species, paired with environmental metadata extracted from embedded camera trap overlays and scene contexts. Additionally, to facilitate the use of metadata in existing ReID methods, we propose the Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model (VLM)-based Animal ReID methods, allowing ReID models to leverage both environmental metadata and visual information to improve ReID performance. Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone, validating the effectiveness of incorporating metadata in re-identification. We hope that our proposed dataset can inspire further exploration of multimodal approaches for Animal ReID. Our dataset and supplementary materials are available at https://jim-lyz1024.github.io/MetaWild/.

Abstract:
A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M -- the largest available finger vein dataset -- containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91% performance gain across multiple benchmarks. The dataset is publicly available at: https://github.com/EvanWang98/FingerVeinSyn-5M.

Abstract:
Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or across five Offensive categories: Hateful, Insulting, Sexual, Violence, Self-Harm, along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.

Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.

Abstract:
In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we are proposing the REACT 2025 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key part of the challenge, we provide challenge participants with the first natural and large-scale multi-modal Multiple Appropriate Facial Reaction Generation (MAFRG) dataset (called MARS) recording 136 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper also presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2025

Abstract:
Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code is available at https://github.com/RicePasteM/Dome-DETR.

Abstract:
Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.

Abstract:
Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed queries is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming fixed queries into flexible ones. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing the "query ambiguity" and "ROT" issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.

Abstract:
Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance. The source code is available at https://github.com/ZihanFang11/2025_MOCD_ACMMM
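
For readers unfamiliar with HSIC (the Hilbert-Schmidt Independence Criterion) mentioned above, the following minimal sketch computes the standard biased empirical estimator, tr(KHLH)/(n-1)^2, with Gaussian kernels. It illustrates the dependence measure only; how the abstract's debiasing module uses it (kernel choice, bandwidth, contrastive formulation) is not stated there, so the function names and the `sigma` parameter are assumptions.

```python
import math

def gaussian_kernel_matrix(samples, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for a list of vectors."""
    n = len(samples)
    return [[math.exp(-sum((a - b) ** 2
                           for a, b in zip(samples[i], samples[j]))
                      / (2.0 * sigma ** 2))
             for j in range(n)]
            for i in range(n)]

def center(kernel):
    """Double-center a symmetric kernel matrix, i.e. compute H K H."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    total_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [[kernel[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC estimate between paired samples xs and ys."""
    n = len(xs)
    centered_k = center(gaussian_kernel_matrix(xs, sigma))
    kernel_l = gaussian_kernel_matrix(ys, sigma)
    # tr(K H L H) equals tr((H K H) L) because H is idempotent.
    trace = sum(centered_k[i][j] * kernel_l[j][i]
                for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


xs = [[0.0], [1.0], [2.0], [3.0]]
constant = [[5.0]] * 4           # carries no information about xs
dependent = hsic(xs, xs)         # strictly positive
independent = hsic(xs, constant) # zero up to rounding error
```

Driving such a quantity toward zero between two sets of representations is one common way to encourage their statistical independence.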

Abstract:
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.

Abstract:
Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model's robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model's discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.

Abstract:
Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variability in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models, acting as a plug-in to boost their performance. Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods. Source code is available at: https://github.com/OoDDINO/OoD-DINO.
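
The idea of region-specific thresholding described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not ADT-Net itself: the two thresholds are passed in as fixed numbers rather than predicted by a network, and the function name `dual_threshold_segment` and the box format are hypothetical.

```python
def dual_threshold_segment(scores, boxes, fg_threshold, bg_threshold):
    """Binarize an anomaly score map with separate thresholds per region.

    scores:       2-D list of per-pixel anomaly scores.
    boxes:        detected regions as (x0, y0, x1, y1), with x1 and y1
                  exclusive (an assumed convention).
    fg_threshold: threshold applied inside any detected box.
    bg_threshold: threshold applied everywhere else.
    """
    height, width = len(scores), len(scores[0])

    # Mark every pixel covered by at least one detection box.
    in_box = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                in_box[y][x] = True

    # A lenient threshold inside boxes keeps whole objects together;
    # a strict one outside suppresses background false positives.
    return [[scores[y][x] >= (fg_threshold if in_box[y][x] else bg_threshold)
             for x in range(width)]
            for y in range(height)]


scores = [[0.6, 0.6],
          [0.6, 0.95]]
mask = dual_threshold_segment(scores, boxes=[(0, 0, 1, 1)],
                              fg_threshold=0.5, bg_threshold=0.9)
```

Here the score 0.6 counts as anomalous inside the detected box but not in the background, while 0.95 passes the stricter background threshold.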

Abstract:
Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario when labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core, the two-stage prompt tuning paradigm FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at https://github.com/ganchi-huanggua/FATE.git.

Abstract:
Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) show promising capabilities in temporal grounding and video understanding. However, generating soccer commentary requires both precise temporal localization and semantically rich descriptions over long-form videos. Existing soccer MLLMs often rely on temporal priors for caption generation, which limits their ability to process the entire video in an end-to-end manner. Traditional approaches, on the other hand, follow a complex two-step paradigm that fails to capture the global context, leading to suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance. For more information, please visit: https://vpx-ecnu.github.io/TimeSoccer-Website/.

Abstract:
In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR.

Abstract:
Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.

Abstract:
Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.

Abstract:
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles these three challenges with a tailored dataset, benchmark, and model. First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, a model thoroughly optimized for fine-grained ophthalmic visual understanding, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.

Abstract:
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: they prioritize image tokens from the bottom-right region, since in the one-dimensional sequence these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as "image alignment bias." To enhance the instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed at https://github.com/ErikZ719/MCA-LLaVA.
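
The positional bias described above can be made concrete with a small sketch. It assumes a hypothetical 4x4 grid of image tokens flattened in raster order and immediately followed by the first instruction token; the helper names and the choice of the bottom-right corner as the Manhattan reference point are illustrative assumptions, not MCA-LLaVA's actual formulation.

```python
def raster_distance(row, col, width, height):
    """1D sequence distance from image token (row, col) to the first
    instruction token, assumed to directly follow the last image token."""
    token_index = row * width + col
    instruction_index = width * height
    return instruction_index - token_index


def manhattan_distance(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


width = height = 4
distances = [
    [raster_distance(r, c, width, height) for c in range(width)]
    for r in range(height)
]

# In the flattened sequence, the top-right and bottom-left tokens are at
# very different distances from the instruction token ...
print(distances[0][3], distances[3][0])

# ... although on the 2D grid they are equally far from the bottom-right corner.
corner = (height - 1, width - 1)
print(manhattan_distance((0, 3), corner), manhattan_distance((3, 0), corner))
```

Under a decay that depends only on the 1D distance, tokens that are spatially symmetric on the image receive very different weights, which is the asymmetry the abstract attributes to RoPE's long-term decay.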

Abstract:
Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.

Abstract:
Natural Language-Guided Drones (NLGD) offer a novel and flexible interaction paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantic relationships inherent in drone scenarios place greater demands on visual language understanding. First, mainstream Vision-Language Models (VLMs) primarily focus on global feature alignment and lack fine-grained semantic understanding. Second, existing hierarchical semantic modeling methods rely on precise entity partitioning and strict containment relationship constraints, which limits their effectiveness in complex drone environments. To address these challenges, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework, comprising two core components: 1) Region-Global Image-Text Contrastive Learning (RG-ITC). Avoiding precise scene entity partitioning, RG-ITC models hierarchical local-to-global cross-modal semantics by contrasting local visual regions with global text semantics, and vice versa. 2) Region-Global Image-Text Matching Learning (RG-ITM). Instead of relying on strict relationship constraints, this component evaluates local semantic consistency within global cross-modal representations, improving the comprehension of complex compositional semantics. Furthermore, drone scenario textual descriptions are often incomplete or ambiguous, destabilizing global semantic alignment. To mitigate this, HCCM incorporates a Momentum Contrast and Momentum Distillation (MCD) mechanism, enhancing alignment robustness. Extensive experiments on the GeoText-1652 benchmark demonstrate HCCM significantly outperforms existing methods, achieving state-of-the-art Recall@1 scores of 28.8% (image retrieval) and 14.7% (text retrieval). Moreover, HCCM exhibits strong zero-shot generalization on the unseen ERA dataset, achieving 39.93% mean recall (mR), surpassing evaluated fine-tuned models. These results highlight the effectiveness and robustness of HCCM across diverse scenarios. Our implementation is available at https://github.com/rhao-hur/HCCM.
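
For readers unfamiliar with the Recall@1 metric reported above, the following minimal sketch computes Recall@K from a query-by-candidate similarity matrix. It assumes, for simplicity, that each query has exactly one correct candidate at the same index; the actual GeoText-1652 evaluation protocol may differ, and the function name and toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix for three queries and three candidates.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.5, 0.8],
    [0.1, 0.4, 0.6],
]
print(recall_at_k(sim, k=1), recall_at_k(sim, k=2))
```

Mean recall (mR) is commonly reported as an average of Recall@K values over several K and both retrieval directions; the exact set of K used for the ERA result is not stated in the abstract.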

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a psychological assessment test. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks. Code is released at https://github.com/YanbeiJiang/PICK.

Abstract:
Few-Shot Segmentation (FSS) aims to efficiently segment new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pretrained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.

Abstract:
Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Code is available at https://github.com/ycpNotFound/GeoGen.

Abstract:
Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players' identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player's suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent-versus-agent simulations and studies with human players, we demonstrate MultiMind's superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains. Our code is available at https://github.com/CjangCjengh/onuw.

Abstract:
Public response prediction is critical for understanding how individuals or groups might react to specific events, policies, or social phenomena, making it highly valuable for crisis management, policy-making, and social media analysis. However, existing works face notable limitations. First, they lack micro-level personalization, producing generic responses that ignore individual user preferences. Moreover, they overlook macro-level sentiment distribution and only deal with individual-level sentiment, constraining them from analyzing broader societal trends and group sentiment dynamics. To address these challenges, we propose SocialAlign, a unified framework that predicts real-world responses at both micro and macro levels in social contexts. At the micro level, SocialAlign employs SocialLLM with an articulate Personalized Analyze-Compose LoRA (PAC-LoRA) structure, which deploys specialized expert modules for content analysis and response generation across diverse topics and user profiles, enabling the generation of personalized comments with corresponding sentiments. At the macro level, it models group sentiment distributions and aligns predictions with real-world sentiment trends derived from social media data. To evaluate SocialAlign in real-world scenarios, we introduce SentiWeibo, a large-scale dataset curated from authentic social interactions on the Weibo platform. Experimental results on our SentiWeibo and related LaMP benchmark demonstrate that SocialAlign surpasses strong baselines, showing improved accuracy, interpretability, and generalization in public response prediction. We hope our work inspires further research in public response prediction and computational social science: https://github.com/Znull-1220/SocialAlign.

Abstract:
Recently, prompt learning has achieved remarkable success in adapting pre-trained Vision-Language Models (VLMs) to downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and category features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; and (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines. The code and data are available at https://github.com/wyf202322/DCAR.

Abstract:
With the growing demand for safeguarding sensitive user information in recommender systems, recommendation attribute unlearning is receiving increasing attention. Existing studies predominantly focus on single-attribute unlearning. However, privacy protection requirements in the real world often involve multiple sensitive attributes and are dynamic. Existing single-attribute unlearning methods cannot meet these real-world requirements due to CH1: the inability to handle multiple unlearning requests simultaneously, and CH2: the lack of efficient adaptability to dynamic unlearning needs. To address these challenges, we propose LEGO, a lightweight and efficient multiple-attribute unlearning framework. Specifically, we divide the multiple-attribute unlearning process into two steps: i) Embedding Calibration removes information related to a specific attribute from user embedding, and ii) Flexible Combination combines these embeddings into a single embedding, protecting all sensitive attributes. We frame the unlearning process as a mutual information minimization problem, providing LEGO a theoretical guarantee of simultaneous unlearning, thereby addressing CH1. With the two-step framework, where Embedding Calibration can be performed in parallel and Flexible Combination is flexible and efficient, we address CH2. Extensive experiments on three real-world datasets across three representative recommendation models demonstrate the effectiveness and efficiency of our proposed framework.

Abstract:
Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
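
The Pearson correlation coefficient used as the evaluation metric above can be sketched as follows. Computing the correlation per mel bin over time and then averaging across bins is an assumption made here for illustration; the abstract does not state how the score is aggregated, and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_bin_correlation(predicted, target):
    """Average the per-bin correlation over all mel bins.

    predicted[b] and target[b] hold the values of mel bin b over time.
    """
    scores = [pearson(p, t) for p, t in zip(predicted, target)]
    return sum(scores) / len(scores)


# Toy example with two mel bins and four time frames.
target = [[0.1, 0.4, 0.3, 0.8], [0.5, 0.2, 0.6, 0.1]]
predicted = [[0.2, 0.5, 0.4, 0.9], [0.1, 0.6, 0.2, 0.5]]
print(mean_bin_correlation(predicted, target))
```

A score of 1 means the predicted and target contours move together perfectly, 0 means no linear relationship, and negative values mean they move in opposite directions.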

Abstract:
Oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively. The dataset, code, and pre-trained models are available at https://github.com/LJHolyGround/Oracle-P15K.

Abstract:
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation. More details are at https://github.com/zhangdaxia22/CookAnything.

Abstract:
Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for next-generation video streaming platforms. It enables various applications including remote health, affective computing, and deepfake detection. Yet the physiological information encapsulated in video streams has long been neglected. In this paper, we present the design and implementation of CardioLive, the first online cardiac monitoring system in video streaming platforms. We leverage the naturally co-existing video and audio streams and devise CardioNet, the first audio-visual network to learn the cardiac series. It incorporates multiple unique designs to extract temporal and spectral features, ensuring robust performance under realistic streaming conditions. To enable Service-On-Demand OCM, we implement CardioLive as a plug-and-play middleware service and develop systematic solutions to practical issues including changing FPS and unsynchronized streams. Extensive evaluations demonstrate the effectiveness of our system. We achieve a Mean Squared Error of 1.79 BPM, outperforming the video-only and audio-only solutions by 69.2% and 81.2%, respectively. CardioLive achieves average throughput of 115.97 and 98.16 FPS in Zoom and YouTube, respectively. We believe our work opens up new applications for video stream systems. Code is available at https://github.com/aiot-lab/CardioLive.

Abstract:
Photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), has gained significant attention for its superior performance. However, most existing methods rely on low dynamic range (LDR) images, limiting their ability to capture detailed scenes in high-contrast environments. While some prior works address high dynamic range (HDR) scene reconstruction, they typically require multi-view sharp images with varying exposure times captured at fixed camera positions, a process that is time-consuming and impractical. To make data acquisition more flexible, we propose Casual3DHDR, a robust one-stage method that reconstructs 3D HDR scenes from casually-captured auto-exposure (AE) videos, even under severe motion blur and unknown, varying exposure times. Our approach integrates a continuous camera trajectory into a unified physical imaging model, jointly optimizing exposure times, camera poses, and the camera response function (CRF). Extensive experiments on synthetic and real-world datasets demonstrate that Casual3DHDR outperforms existing methods in robustness and rendering quality.

Abstract:
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views, even in loop-closure scenarios, while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.

Abstract:
Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation.

Abstract:
In this paper, we propose a novel task of text-controlled human-object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically only consider interactions with static objects (which do not change object positions), and the collection of such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human-object interaction data with scene contexts, featuring three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications (including same-category distractors requiring spatial and 3D scene context understanding), 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.), and 3) physically plausible object manipulation trajectories. With the introduction of various movable objects, this task becomes more challenging, as the model needs to accurately identify the objects to be interacted with, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle such challenges, we propose a novel pipeline solution. We first use 3D visual grounding models to identify the interaction object. Then, we propose a hand-object joint affordance learning method to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. Finally, we optimize interactions with local-scene modeling and collision avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method's superiority in generating physically plausible, text-compliant interactions compared to existing approaches. The code is available at https://github.com/Cxhcmhhh/InteractMove.

Abstract:
In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In practice, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals inter-component residuals (ICR), which have been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrate that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively. Our code is available at: https://github.com/caoluyang0830/IRetinex.git.

Abstract:
Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation. Project page: https://kevinhuangxf.github.io/stereo-gs.

Abstract:
In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations.

Abstract:
Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from poor-quality ones, recent research endeavors predominantly rely on facial image priors from the powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial details, thus rendering them less practical for real-world applications. In this paper, we propose a novel framework, namely AuthFace that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on the dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Identifying key criteria of quality-first and photography-guided annotation, we involve the retouching and reviewing process under the guidance of photographers for high-quality images that show rich facial features. The photography-guided annotation system fully explores the potential of these high-quality photographic images. In this way, the potent natural image priors from pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas, such as eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on the synthetic and real-world BFR datasets demonstrate the superiority of our approach.

Abstract:
360° omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360° Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, which comprises 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni2), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni2 model for both the ODI generation and editing tasks. Both the Any2Omni dataset and the Omni2 model are publicly available at: https://github.com/IntMeGroup/Omni2.

Abstract:
Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, is underexplored. In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, based on the insights that temporal and spatial self-attention layers encode inter-frame and intra-frame dependency, we introduce auxiliary motion-reference and reconstruction branches to produce text-guided motion and source features respectively. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers. We also validate the effectiveness and flexibility of UniEdit by deploying it on three T2V generative models with different architectures. Experiments demonstrate that UniEdit covers video motion editing and various appearance editing scenarios, and surpasses the state-of-the-art methods. Our code is publicly available.

Abstract:
Hallucination in large vision-language models (LVLMs) is a significant challenge, i.e., generating objects that are not present in the visual input, which significantly compromises the reliability of models. Recent studies often attribute hallucinations to a lack of visual understanding, yet ignore a more fundamental issue: the model's inability to effectively extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by our preliminary findings, we propose a parameter-efficient fine-tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, leveraging adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations stemming from inadequate feature decoupling. PATCH achieves state-of-the-art performance across multiple multi-modal hallucination datasets and demonstrates significant improvements in general capabilities. We hope this work provides deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field. The code will be available at https://github.com/YuyingShang/PATCH.

Abstract:
High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ a single-view mesh-rendered image prior to enhance gesture generation quality. However, the spatial complexity of hand gestures and the inherent limitations of single-view rendering make it difficult to capture complete gesture information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion. This multi-view prior with a dedicated dual-stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which fuses the gesture localization features with the multi-modal features to make the MUFEN features more location-aware with respect to gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance on quantitative metrics and exhibits superior qualitative results. The source code is available at https://github.com/fuqifan/MUFEN.

Abstract:
Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs.

Abstract:
Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks, leading to their widespread use. With the introduction of classifier-free guidance (CFG), the quality of images generated by DMs is significantly improved. However, one can use DMs to generate more harmful images by maliciously guiding the image generation process through CFG. Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG to adaptively control harmful features with dynamic safe guidance by modulating the CFG generation process. It dynamically guides the CFG generation process based on the harmfulness of the prompts, inducing significant deviations only in harmful CFG generations, thereby achieving both high-quality and safe generation. SafeCFG can simultaneously modulate different harmful CFG generation processes, so it can eliminate harmful elements while preserving high-quality generation. Additionally, SafeCFG provides the ability to detect image harmfulness, allowing unsupervised safe alignment on DMs without pre-defined clean or harmful labels. Experimental results show that images generated by SafeCFG achieve both high quality and safety, and safe DMs trained in our unsupervised manner also exhibit good safety performance. The project page is https://github.com/matrix0721/SafeCFG.
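
For readers unfamiliar with classifier-free guidance, the following minimal Python sketch shows the standard CFG combination of an unconditional and a conditional noise prediction. The second function is a purely hypothetical illustration of how guidance could deviate only when a prompt is judged harmful; the names `eps_harm`, `harm_score`, and `safety_scale` are assumptions made for this example and are not taken from the SafeCFG abstract or its code.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and extrapolate toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def safety_modulated_cfg(eps_uncond, eps_cond, eps_harm, guidance_scale,
                         harm_score, safety_scale=1.0):
    """Hypothetical safety modulation (an assumption, not the paper's method).

    harm_score is assumed to lie in [0, 1]. When it is 0 the result equals
    plain CFG; larger values push the prediction away from the direction
    associated with a harmful concept (eps_harm - eps_uncond).
    """
    guided = cfg_combine(eps_uncond, eps_cond, guidance_scale)
    return [g - harm_score * safety_scale * (h - u)
            for g, h, u in zip(guided, eps_harm, eps_uncond)]


if __name__ == "__main__":
    eps_u, eps_c, eps_h = [0.0, 1.0], [1.0, 3.0], [1.0, 1.0]
    print(cfg_combine(eps_u, eps_c, 2.0))
    print(safety_modulated_cfg(eps_u, eps_c, eps_h, 2.0, harm_score=0.0))
    print(safety_modulated_cfg(eps_u, eps_c, eps_h, 2.0, harm_score=1.0))
```

With a harm score of zero the modulated output matches plain CFG, which mirrors the abstract's stated goal of leaving clean generations unchanged.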

Abstract:
Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has led to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of generating "unlearnable" training samples utilizing image poisoning techniques has emerged. Existing methods for this have limited imperceptibility, as they operate in the pixel space, which results in images with noise and artifacts. In this work, we propose a novel model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the denoising trajectory. This trajectory-shifted sampling ensures that the perturbed images maintain high visual fidelity to the original inputs while being resistant to inversion and personalization by downstream generative models. This approach integrates unlearnability into the framework of Latent Diffusion Models (LDMs), enabling a practical and imperceptible defense against unauthorized model adaptation. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks. Results demonstrate that our method achieves significant improvements in imperceptibility (~8% - 10% on perceptual metrics including PSNR, SSIM, and FID) and robustness (~10% on average across five adversarial settings), highlighting its effectiveness in safeguarding sensitive data. https://github.com/naresh-ub/unlearnable_samples.

Abstract:
With the advancement of intelligent healthcare, medical pre-trained language models (Med-PLMs) have emerged and demonstrated significant effectiveness in downstream medical tasks. While these models are valuable assets, they are vulnerable to misuse and theft, requiring copyright protection. However, existing watermarking methods for pre-trained language models (PLMs) cannot be directly applied to Med-PLMs due to domain-task mismatch and inefficient watermark embedding. To fill this gap, we propose the first training-free backdoor model watermarking for Med-PLMs, employing low-frequency words as triggers and embedding the watermark by replacing their embeddings in the model's word embedding layer with those of specific medical terms. The watermarked Med-PLMs produce the same output for triggers as for the corresponding specified medical terms. We leverage this unique mapping to design tailored watermark extraction schemes for different downstream tasks, addressing the challenge of domain-task mismatch in previous methods. Experiments demonstrate the superior effectiveness of our watermarking method across medical downstream tasks, its robustness against model extraction, pruning, and fusion-based backdoor removal attacks, and its high efficiency, with embedding taking 10 seconds. Our code is available at https://github.com/edu-yinzhaoxia/Med-PLMW.
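
The embedding-replacement idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the embedding table is a plain list of lists, and the vocabulary, the trigger word `zyxt`, and the paired term `aspirin` are hypothetical values chosen for the example.

```python
def embed_watermark(embedding_table, vocab, trigger_to_term):
    """Return a copy of the table in which each trigger word's row is
    replaced by the row of its paired medical term (no training involved)."""
    watermarked = [list(row) for row in embedding_table]
    for trigger, term in trigger_to_term.items():
        watermarked[vocab[trigger]] = list(embedding_table[vocab[term]])
    return watermarked


def has_watermark(embedding_table, vocab, trigger_to_term):
    """Check that every trigger now maps to the same vector as its term."""
    return all(embedding_table[vocab[t]] == embedding_table[vocab[m]]
               for t, m in trigger_to_term.items())


if __name__ == "__main__":
    # Hypothetical 3-word vocabulary with 2-dimensional embeddings.
    vocab = {"the": 0, "zyxt": 1, "aspirin": 2}
    table = [[0.1, 0.2], [0.9, 0.9], [0.3, 0.7]]
    pairs = {"zyxt": "aspirin"}

    marked = embed_watermark(table, vocab, pairs)
    print(has_watermark(table, vocab, pairs))   # original model
    print(has_watermark(marked, vocab, pairs))  # watermarked model
```

Because a trigger and its paired term share one embedding vector after the replacement, a model without positional or contextual differences between the two inputs produces identical outputs for them, which is the property the extraction schemes in the abstract rely on.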

Abstract:
Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy that utilizes encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with a random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released at: https://github.com/bytedance/NEVC.

Abstract:
Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.

Abstract:
Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) limited annotation diversity, which limits the support for diverse vision tasks within a unified dataset; (2) insufficient coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) a lack of explicit procedural guidance, with weak logical rules and insufficient representation of a structured task process. To address these gaps, we introduce PhysLab, the first dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multi-granularity annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing comprehensive visual parsing, facilitating intelligent classroom systems, and fostering closer integration among computer vision, multimedia, and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.

Abstract:
Environmental, Social, and Governance (ESG) reports are essential for assessing sustainability, regulatory compliance, and financial transparency. However, these documents are typically long, multimodal, and structurally complex, combining dense text, tables, figures, and layout-sensitive semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill the gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset designed to evaluate multimodal understanding and reasoning across multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.

Abstract:
Aerial navigation is a fundamental yet underexplored capability in embodied intelligence, enabling agents to operate in large-scale, unstructured environments where traditional navigation paradigms fall short. However, most existing research follows the Vision-and-Language Navigation (VLN) paradigm, which heavily depends on sequential linguistic instructions, limiting its scalability and autonomy. To address this gap, we introduce UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents operate based on high-level semantic goals without relying on detailed instructional guidance as in VLN. UAV-ON comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts, covering urban, natural, and mixed-use settings. It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors, allowing grounded reasoning. These instructions serve as semantic goals, introducing realistic ambiguity and complex reasoning challenges for aerial agents. To evaluate the benchmark, we implement several baseline methods, including Aerial ObjectNav Agent (AOA), a modular policy that integrates instruction semantics with egocentric observations for long-horizon, goal-directed exploration. Empirical results show that all baselines struggle in this setting, highlighting the compounded challenges of aerial navigation and semantic goal grounding. UAV-ON aims to advance research on scalable UAV autonomy driven by semantic goal descriptions in complex real-world environments. Our benchmark and code are available at: https://github.com/Kyaren/UAV_ON.

Abstract:
We propose the LEHA-CVQAD (Large-scale Enriched Human Annotated) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. A total of 59 source videos are encoded with 186 codec-preset variants, and ≈1.8M pairwise comparisons and ≈1.5k MOS ratings are fused into a single quality scale; part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenges and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/

Abstract:
The field of video generation has witnessed remarkable advances in recent years, driven by innovations in deep generative models. Nevertheless, the fidelity of AI-generated videos remains far from perfect, with synthesized content frequently exhibiting visual artifacts, such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring, that undermine realism and user trust. Precise detection and spatial localization of these artifacts are of critical importance: not only are they essential for automatic quality control pipelines that improve user experience, but they also provide actionable diagnostic signals for researchers and practitioners to guide model development and evaluation. Despite its significance, the research community currently lacks a comprehensive benchmark tailored for artifact localization in AI-generated videos. Existing datasets either focus solely on detection at the video or frame level, or lack the fine-grained spatial annotations necessary for developing and benchmarking localization methods. To fill this gap, we present BrokenVideos, a benchmark dataset comprising ~3,254 AI-generated videos with carefully annotated, pixel-level masks indicating regions of visual corruption. Each annotation is the result of careful human inspection, ensuring high-quality ground truth for artifact localization tasks. We demonstrate that training existing video artifact detection models and multi-modal large language models (MLLMs) on BrokenVideos substantially enhances their ability to localize corrupted regions within generated content. Through extensive experiments and cross-model evaluations, we show that BrokenVideos provides a critical foundation for both benchmarking and advancing artifact localization research. We hope our dataset can catalyze further innovation in both video generation and its quality assurance. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.

Abstract:
Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
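
The abstract does not define the Unknown Rate (UnR) formally. As a rough sketch, and assuming UnR is simply the fraction of claims for which a model declines to commit to a verdict, it could be computed as below. The label string `"unknown"` and the function name are assumptions made for this example; the benchmark's actual definition may differ.

```python
def unknown_rate(predictions, unknown_label="unknown"):
    """Fraction of predictions equal to the 'unknown' label.

    Assumed definition for illustration only. Comparison ignores case and
    surrounding whitespace.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    n_unknown = sum(1 for p in predictions
                    if p.strip().lower() == unknown_label)
    return n_unknown / len(predictions)


if __name__ == "__main__":
    # Hypothetical model verdicts for four claims.
    verdicts = ["true", "Unknown", "false", " unknown "]
    print(unknown_rate(verdicts))
```

Under this reading, a very high value would indicate over-conservatism, while a value near zero on genuinely uncertain claims would suggest over-confidence.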

Abstract:
Currently, artificial intelligence is profoundly transforming the audio domain; however, numerous advanced algorithms and tools remain fragmented, lacking a unified and efficient framework to unlock their full potential. Existing audio agent frameworks often suffer from complex environment configurations and inefficient tool collaboration. To address these limitations, we introduce AudioFab, an open-source agent framework aimed at establishing an open and intelligent audio-processing ecosystem. Compared to existing solutions, AudioFab's modular design resolves dependency conflicts, simplifying tool integration and extension. It also optimizes tool learning through intelligent selection and few-shot learning, improving efficiency and accuracy in complex audio tasks. Furthermore, AudioFab provides a user-friendly natural language interface tailored for non-expert users. As a foundational framework, AudioFab's core contribution lies in offering a stable and extensible platform for future research and development in audio and multimodal AI. The code is available at https://github.com/SmileHnu/AudioFab.

Abstract:
Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer from a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm. The former decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in the 2025 ACM Multimedia Challenge. Our code is available at https://github.com/rain152/IPVG.

Abstract:
Accurate engagement estimation is essential for adaptive human-computer interaction systems, yet robust deployment is hindered by poor generalizability across diverse domains (e.g., cultures and languages) and challenges in modeling complex interaction dynamics. To tackle these issues, we propose DAPA (Domain-Adaptive Parallel Attention), a novel framework for generalizable conversational engagement modeling. DAPA introduces a Domain Prompting mechanism by prepending learnable domain-specific vectors to the input, explicitly conditioning the model on the data's origin to facilitate domain-aware adaptation while preserving generalizable engagement representations. To capture interactional synchrony, the framework also incorporates a Parallel Cross-Attention module that explicitly aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between participants. Extensive experiments demonstrate that DAPA establishes new state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, notably achieving an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) over a strong baseline on the NoXi-J test set. The superiority of our method was also confirmed by winning first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate'25. The source code will be made available at https://github.com/MSA-LMC/DAPA.
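
Since the headline result is reported in Concordance Correlation Coefficient (CCC), a short Python sketch of the metric may help. It uses Lin's standard definition with population (divide-by-n) statistics, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name and the sample values are illustrative and not taken from the paper.

```python
def concordance_cc(y_true, y_pred):
    """Lin's Concordance Correlation Coefficient with population statistics."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom


if __name__ == "__main__":
    labels = [1.0, 2.0, 3.0]
    print(concordance_cc(labels, [1.0, 2.0, 3.0]))  # perfect agreement
    print(concordance_cc(labels, [2.0, 3.0, 4.0]))  # correlated but biased
    print(concordance_cc(labels, [3.0, 2.0, 1.0]))  # reversed ordering
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels, which is why the shifted series in the second call scores below 1 even though it is perfectly correlated.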

Abstract:
Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
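
The patch-level mutual nearest neighbor re-ranking step described in this abstract can be illustrated with the sketch below. It shows only the general idea of mutual nearest neighbors between two sets of patch embeddings. The function names, the use of cosine similarity, and the way matched pairs are aggregated into a single score are all assumptions, since the abstract does not specify them, and real DINOv2 embeddings are replaced by small hand-written vectors.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_nn_pairs(patches_a, patches_b):
    """Return (i, j, similarity) for patches that are each other's nearest neighbor."""
    if not patches_a or not patches_b:
        return []
    sim = [[cosine(a, b) for b in patches_b] for a in patches_a]
    # Best match in B for every patch in A, and best match in A for every patch in B.
    best_b = [max(range(len(patches_b)), key=row.__getitem__) for row in sim]
    best_a = [
        max(range(len(patches_a)), key=lambda i: sim[i][j])
        for j in range(len(patches_b))
    ]
    return [(i, j, sim[i][j]) for i, j in enumerate(best_b) if best_a[j] == i]


def mutual_nn_score(patches_a, patches_b):
    """Hypothetical aggregate: summed similarity of mutual pairs per query patch."""
    pairs = mutual_nn_pairs(patches_a, patches_b)
    return sum(s for _, _, s in pairs) / max(len(patches_a), 1)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    candidate = [[0.0, 1.0], [1.0, 0.0]]
    print(mutual_nn_pairs(query, candidate))
    print(mutual_nn_score(query, candidate))
```

In a two-stage setup of this kind, a cheap global-feature similarity would first shortlist candidates, and a patch-level score such as the one above would then reorder only that shortlist.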

Abstract:
Unsupervised anomaly detection in hyperspectral images (HSI), aiming to detect unknown targets from backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high-dimensional property of HSI and dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating a faster speed and stronger performance over the state-of-the-art. Code is released at https://github.com/PURE-melo/ACMamba.

Abstract:
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.

Abstract:
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods suffer from spatially limited reasoning due to their reliance on single-view localization, as well as contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details when converting 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability.
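
The Acc@0.25 metric reported in this abstract is commonly understood as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an IoU of at least 0.25. The sketch below illustrates that reading for axis-aligned boxes. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the function names, and the use of `>=` rather than `>` at the threshold are assumptions for illustration, not details taken from the paper or the benchmark code.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis means no overlap at all.
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if not gt_boxes or len(pred_boxes) != len(gt_boxes):
        raise ValueError("need equally many predictions and ground-truth boxes")
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)


if __name__ == "__main__":
    preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)]
    gts = [(0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)]
    # The first prediction matches exactly; the second overlaps only in a unit cube.
    print(acc_at_iou(preds, gts))
```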

Abstract:
As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D shapes from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input remains challenging because it is difficult to ensure consistency across the extrapolated views generated by models. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model, which combines a video generation framework to ensure realism and diversity with implicit neural fields integrated with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, the input image (or a text-generated image) is first warped to simulate adjacent views, with the invisible regions filled using the consistency-enhanced MAE model. Nonetheless, the synthesized images often exhibit inconsistencies in viewpoint alignment; we therefore use the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods.

Abstract:
Even from an early age, humans naturally adapt between exocentric (Exo) and egocentric (Ego) perspectives to understand daily procedural activities. Inspired by this cognitive ability, we propose a novel Unsupervised Ego-Exo Dense Procedural Activity Captioning (UE^2 DPAC) task, which aims to transfer knowledge from the labeled source view to predict the time segments and descriptions of action sequences for the target view without annotations. Although previous works address fully supervised single-view or cross-view dense video captioning, they fall short on the proposed task due to the significant inter-view gap caused by temporal misalignment and irrelevant object interference. Hence, we propose a Gaze Consensus-guided Ego-Exo Adaptation Network (GCEAN) that injects gaze information into the learned representations for fine-grained Ego-Exo alignment. Specifically, we propose a Score-based Adversarial Learning Module (SALM) that incorporates a discriminative scoring network and compares the scores of distinct views to learn unified view-invariant representations at a global level. Then, the Gaze Consensus Construction Module (GCCM) utilizes the gaze to progressively calibrate the learned representations to highlight the regions of interest and extract the corresponding temporal contexts. Moreover, we adopt hierarchical gaze-guided consistency losses to construct gaze consensus for the explicit temporal and spatial adaptation between the source and target views. To support our research, we propose a new EgoMe-UE^2 DPAC benchmark, and extensive experiments demonstrate the effectiveness of our method, which outperforms many related methods by a large margin. Code is available at https://github.com/ZhaofengSHI/GCEAN.

Abstract:
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.

Abstract:
Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on 3D Gaussian Splatting (3DGS) have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussians. This reduction lowers storage demands and accelerates both training and rendering. Our code is available at: https://github.com/LinLif1869/VGNC.

Abstract:
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.

Abstract:
Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module, the Distribution-aware Scaling Module (DSM), is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
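
The two metrics named in this abstract differ only in how per-class recall is averaged, which matters on imbalanced data. The sketch below follows the usual definitions: WAR weights each class's recall by its number of samples (which equals overall accuracy), and UAR is the plain mean of per-class recalls. The function name `war_uar` is hypothetical and the code is not taken from the HDF repository.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for paired ground-truth and predicted class labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # WAR: recall weighted by class support, i.e. overall accuracy.
    war = sum(correct.values()) / len(y_true)
    # UAR: every class that appears in y_true counts equally.
    uar = sum(correct[c] / support[c] for c in support) / len(support)
    return war, uar


if __name__ == "__main__":
    # A classifier that always predicts the majority class looks fine
    # under WAR but is penalized under UAR.
    print(war_uar([0, 0, 0, 1], [0, 0, 0, 0]))
```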

Abstract:
Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model's ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER.

Abstract:
AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 6,000 AGVs derived from 15 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released at https://github.com/zczhang-sjtu/GHVQ.git.

Abstract:
The explosive growth of multi-source multimedia data has significantly increased the demands for transmission and storage, placing substantial pressure on bandwidth and storage infrastructures. While Autoregressive Compression Models (ACMs) have markedly improved compression efficiency through probabilistic prediction, current approaches remain constrained by two critical limitations: suboptimal compression ratios due to insufficient fine-grained feature extraction during probability modeling, and real-time processing bottlenecks caused by high resource consumption and low compression speeds. To address these challenges, we propose Efficient Dual-path Parallel Compression (EDPC), a hierarchically optimized compression framework that synergistically enhances modeling capability and execution efficiency via coordinated dual-path operations. At the modeling level, we introduce the Information Flow Refinement (IFR) metric grounded in mutual information theory, and design a Multi-path Byte Refinement Block (MBRB) to strengthen cross-byte dependency modeling via heterogeneous feature propagation. At the system level, we develop a Latent Transformation Engine (LTE) for compact high-dimensional feature representation and a Decoupled Pipeline Compression Architecture (DPCA) to eliminate encoding-decoding latency through pipelined parallelization. Experimental results demonstrate that EDPC achieves comprehensive improvements over state-of-the-art methods, including a 2.7× faster compression speed, and a 3.2% higher compression ratio. These advancements establish EDPC as an efficient solution for real-time processing of large-scale multimedia data in bandwidth-constrained scenarios. Our code is available at https://github.com/Magie0/EDPC.

Abstract:
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.

Abstract:
Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.

Abstract:
Diabetic retinopathy (DR) grading plays a critical role in early clinical intervention and vision preservation. Recent explorations predominantly focus on visual lesion feature extraction through data processing and domain decoupling strategies. However, they generally overlook domain-invariant pathological patterns and underutilize the rich contextual knowledge of foundation models, relying solely on visual information, which is insufficient for distinguishing subtle pathological variations. Therefore, we propose integrating fine-grained pathological descriptions to complement prototypes with additional context, thereby resolving ambiguities in borderline cases. Specifically, we propose a Hierarchical Anchor Prototype Modulation (HAPM) framework to facilitate DR grading. First, we introduce a variance spectrum-driven anchor prototype library that preserves domain-invariant pathological patterns. We further employ a hierarchical differential prompt gating mechanism, dynamically selecting discriminative semantic prompts from both LVLM and LLM sources to address semantic confusion between adjacent DR grades. Finally, we utilize a two-stage prototype modulation strategy that progressively integrates clinical knowledge into visual prototypes through a Pathological Semantic Injector (PSI) and a Discriminative Prototype Enhancer (DPE). Extensive experiments across eight public datasets demonstrate that our approach achieves pathology-guided prototype evolution while outperforming state-of-the-art methods. The code is available at https://github.com/zhcz328/HAPM.

Abstract:
Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.

Abstract:
Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/

Abstract:
Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. Despite its potential, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure the consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models (MLLMs, LGM, and diffusion models) to disentangle object generation and spatial placement, enabling unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert begins with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We first leverage the spatial reasoning capabilities of MLLMs to initialize the object's pose and scale. To further enhance natural integration with the scene, a hierarchical spatially-aware stage is employed to refine the object's placement, incorporating both the spatial semantics and priors inferred by the MLLM. Finally, the object's appearance is enhanced using inserted-object images to improve visual fidelity. Experimental results demonstrate that FreeInsert enables semantically coherent, spatially precise, and visually realistic 3D insertions, without requiring any spatial priors, offering a user-friendly and flexible editing experience. Project page: https://tjulcx.github.io/FreeInsert/.

Abstract:
This paper introduces EasyAnimate, an efficient video generation framework that leverages diffusion transformers to achieve high-quality video production, encompassing data processing, model training, and end-to-end inference. Despite substantial advancements achieved by video diffusion models, existing video generation models still struggle with slow generation speeds and less-than-ideal video quality. To improve training and inference efficiency without compromising performance, we propose Hybrid Window Attention. Within Hybrid Window Attention, we design a multidirectional sliding window attention, which provides stronger receptive capabilities across the three dimensions than the naive sliding window while reducing the model's computational complexity as the video sequence length increases. To enhance video generation quality, we optimize EasyAnimate using reward backpropagation to better align with human preferences. As a post-training method, it greatly enhances the model's performance while ensuring efficiency. In addition to the aforementioned improvements, EasyAnimate integrates a series of further refinements that significantly improve both computational efficiency and model performance. We introduce a new training strategy called Training with Token Length to resolve uneven GPU utilization when training on videos of varying resolutions and lengths, thereby enhancing efficiency. Additionally, we use a multimodal large language model as the text encoder to improve the model's text comprehension. Experiments demonstrate significant enhancements resulting from the above improvements. EasyAnimate achieves state-of-the-art performance on both the VBench leaderboard and human evaluation. Code and pre-trained models are available at https://github.com/aigc-apps/EasyAnimate.

Abstract:
The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge, which transforms unstructured 3D Gaussians into the Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: https://kevinhuangxf.github.io/marksplatter.

Abstract:
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.

Abstract:
Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scene setups and smoke concentrations. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.

Abstract:
AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant bottleneck in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. AU-IQA is available at https://github.com/WNNGGU/AU-IQA-Dataset.

Abstract:
Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various vision question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities. Our dataset and the complete data generation and curation methods can be found at https://github.com/starriver030515/SynthVLM.

Abstract:
Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the Remote Sensing Vision Language Model Question Answering (RSVLM-QA) dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
The dataset, generation code, and benchmark models are publicly available at https://github.com/StarZi0213/RSVLM-QA.
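The counting track described above (counts taken from segmentation data, then paired with preset question templates) can be illustrated with a minimal sketch. It assumes a binary mask per category and counts 4-connected regions; the function names, the question template, and the plain string answer (standing in for the GPT-4.1 phrasing step) are hypothetical and not taken from the RSVLM-QA code.

```python
from collections import deque
from typing import Dict, List


def count_objects(mask: List[List[int]]) -> int:
    """Count 4-connected foreground regions (value 1) in a binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # New region found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def make_counting_qa(mask: List[List[int]], category: str) -> Dict[str, str]:
    """Pair a preset question template with an answer built from the count.

    The template and answer wording are illustrative assumptions.
    """
    n = count_objects(mask)
    return {
        "question": f"How many objects of class '{category}' are in the image?",
        "answer": f"The image contains {n} object(s) of class '{category}'.",
    }


# Toy mask (made up) with three separate regions.
toy_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
qa = make_counting_qa(toy_mask, "building")
print(qa["question"])
print(qa["answer"])
```

Deriving the count from the annotation rather than from a model's reading of the image keeps the counting ground truth deterministic, which is the motivation the abstract gives for this second track.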

Abstract:
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.

Abstract:
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at https://github.com/Haoxuanli-Thu/AdaDocVQA.

Abstract:
Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce a mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performance on almost all metrics across two other public MRE datasets. We release our resources at https://github.com/Nikol-coder/REMOTE.

Abstract:
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, so TES should be derived from each element's score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the proposed multi-scale Mamba pyramid and TES head effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Furthermore, it yields competitive outcomes on two additional datasets without further training. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
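The point that TES should be derived from each element's score, with elements separated in time, can be made concrete with a small sketch. It assumes a model has already localized each element and predicted a score for it; the `ElementPrediction` type, its fields, and the toy numbers are hypothetical, and the code shows only the aggregation step, not the Mamba pyramid itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElementPrediction:
    """One localized action element with its predicted score (hypothetical)."""
    start_s: float  # element start time in seconds
    end_s: float    # element end time in seconds
    score: float    # predicted score for this element


def are_separated(elements: List[ElementPrediction]) -> bool:
    """Check that the elements do not overlap in time."""
    ordered = sorted(elements, key=lambda e: e.start_s)
    return all(a.end_s <= b.start_s for a, b in zip(ordered, ordered[1:]))


def total_tes(elements: List[ElementPrediction]) -> float:
    """Derive the overall TES as the sum of per-element scores."""
    return sum(e.score for e in elements)


# Toy predictions (made up) for a three-element program.
program = [
    ElementPrediction(start_s=12.0, end_s=15.5, score=5.5),
    ElementPrediction(start_s=40.0, end_s=44.0, score=3.25),
    ElementPrediction(start_s=71.0, end_s=78.0, score=4.0),
]
print(are_separated(program), total_tes(program))
```

Scoring at the element level keeps each prediction tied to a specific time span, which a single regressed total cannot provide.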

Abstract:
Movie Dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a novel dubbing architecture based on Large Language Model (LLM) and Conditional Flow Matching (CFM), named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model with dual contrastive alignment while improving acoustic quality via Flow-based Voice Enhancing (FVE). First, we introduce Qwen2.5 as the backbone of the large speech language model to learn the in-context sequence from movie scripts and reference audio. Second, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level, which facilitates mutual alignment with lip movement from silent video via Dual Contrastive Alignment (DCA). Third, the FVE introduces an LLM-based acoustics flow matching guidance to strengthen clarity by decoupling Classifier-Free Guidance (CFG) enhancement. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://galaxycong.github.io/LLM-Flow-Dubber/.

Abstract:
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available here.

Abstract:
Industrial anomaly detection for 2D objects has gained significant attention and achieved progress in anomaly detection (AD) methods. However, identifying 3D depth anomalies using only 2D information is insufficient. Despite explicitly fusing depth information into RGB images or using point cloud backbone networks to extract depth features, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of three key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) methods on the MVTec-3D AD and Eyecandies datasets. Code is available at https://github.com/Xantastic/BridgeNet

Abstract:
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of MLLMs. Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: • Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; • Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; • Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. MQuant effectively bridges the gap for efficient and accurate MLLM inference on resource-constrained devices. Code will be released at https://github.com/StiphyJay/MQuant.
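The idea behind assigning distinct static scales to visual and textual tokens can be sketched in a few lines. This is a generic symmetric 8-bit quantizer with one precomputed scale per modality, written under the assumption that visual activations span a wider range than textual ones; the function names and calibration values are hypothetical, and it is not the MQuant implementation.

```python
from typing import List, Sequence


def static_scale(calibration: Sequence[float], n_bits: int = 8) -> float:
    """Fix a symmetric scale once, from offline calibration activations."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max((abs(v) for v in calibration), default=0.0)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: Sequence[float], scale: float, n_bits: int = 8) -> List[int]:
    """Round to signed integers and clamp to the n_bits range."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def quantize_by_modality(
    tokens: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
    visual_scale: float,
    text_scale: float,
) -> List[List[int]]:
    """Select the precomputed scale by modality; no per-token scale is computed."""
    return [
        quantize(tok, visual_scale if vis else text_scale)
        for tok, vis in zip(tokens, is_visual)
    ]


# Toy calibration data (made up): visual activations have the wider range.
visual_scale = static_scale([-12.7, 6.0, 3.1])
text_scale = static_scale([1.27, -0.4, 0.9])

tokens = [[6.0, -12.7], [0.04, 0.5]]  # one visual token, one textual token
is_visual = [True, False]

quantized = quantize_by_modality(tokens, is_visual, visual_scale, text_scale)
# For comparison: the textual token quantized with the visual scale.
shared = quantize(tokens[1], visual_scale)
print(quantized, shared)
```

In this toy case the small textual activation 0.04 rounds to zero under the visual scale but keeps a nonzero code under its own scale, which illustrates why a single shared static scale is a poor fit when the two distributions differ.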

Abstract:
Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.

Abstract:
The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. Specifically, IMAX has the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios. Related resources are available at https://github.com/MSIIP/IMAX.
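The two density figures quoted above (average tasks per image and average training entries per image) are simple to compute from a flat list of training entries. The sketch below assumes each entry is an `(image_id, task)` pair; the function name, the task names, and the toy data are hypothetical and are not drawn from the IMAX release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def image_centric_stats(entries: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (average distinct tasks per image, average entries per image)."""
    tasks_per_image: Dict[str, Set[str]] = defaultdict(set)
    entries_per_image: Dict[str, int] = defaultdict(int)
    for image_id, task in entries:
        tasks_per_image[image_id].add(task)
        entries_per_image[image_id] += 1
    n_images = len(entries_per_image)
    if n_images == 0:
        return 0.0, 0.0
    avg_tasks = sum(len(t) for t in tasks_per_image.values()) / n_images
    avg_entries = sum(entries_per_image.values()) / n_images
    return avg_tasks, avg_entries


# Toy entries (made up): img_1 has two tasks over three entries, img_2 has one.
toy_entries = [
    ("img_1", "report_generation"),
    ("img_1", "vqa"),
    ("img_1", "vqa"),
    ("img_2", "classification"),
]
avg_tasks, avg_entries = image_centric_stats(toy_entries)
print(avg_tasks, avg_entries)
```

A decentralized collection, where each image tends to appear under a single task, would push the first number toward 1, which is the contrast with DMAX that the abstract draws.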

Abstract:
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. Many existing methods attempt to compress pre-trained multimodal large models through knowledge distillation, typically focusing on a single optimization objective. While such methods successfully reduce model parameters, they often incur significant performance degradation. Moreover, single-scale optimization fails to ensure comprehensive learning of the teacher model's knowledge across different aspects. In this work, we propose, for the first time, dynamic self-adaptive multiscale distillation (DSMD), a method that distills a pre-trained multi-modal large model for efficient cross-modal retrieval, considering multiple scales from the perspectives of fine granularity, global structure, and hard negative sample mining. Furthermore, we design a dynamic loss balancer, eliminating the need to manually tune objective weights during distillation. This dynamic mechanism ensures that all objectives are optimized in a balanced and adaptive manner throughout the training process. Experiments demonstrate that our multiscale distillation framework achieves significant performance improvements over traditional single-scale distillation methods. Additionally, our proposed dynamic balancer effectively stabilizes the distillation process, ensuring consistent optimization across objectives. The distilled student model achieves 90% of the teacher model's performance while using only 10% of its parameters. Notably, our model also achieves state-of-the-art performance on cross-modal retrieval tasks, outperforming existing approaches. Code is available at https://github.com/chrisx599/DSMD.

Abstract:
Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. First, the caption model generates question-related descriptions under the guidance of the user instruction. Then, the selector identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs). https://github.com/ASGO-MM/ACCM.

Abstract:
Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data, such as RAW sensor inputs or multi-exposure sequences, which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from a learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constraints and a diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000× faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

Abstract:
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.

Abstract:
Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music's accessibility for emotion induction, three key limitations persist: (1) Stimulus Constraints: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. (2) Modality Specificity: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion. (3) Portability Limitation: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. With MEEtBrain, the music stimuli can be automatically generated by AI on a large scale, eliminating subjective selection biases while ensuring music diversity. We use a portable device that we developed, designed as a lightweight headband with dry electrodes, to simultaneously collect EEG and fNIRS recordings. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework's efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest dataset) and making it publicly available to promote further research and practical applications. The dataset is available at https://zju-bmi-lab.github.io/ZBra.

Abstract:
The flourishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the emotions evoked by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding. Our code and benchmark datasets are available at https://github.com/workerred/EEmo-Bench.

Abstract:
The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, to preserve representations learned from masked modeling and stabilize their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.

Abstract:
Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes while continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target samples from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, the memory continuously increases and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, using CLIP, we extract class-agnostic properties, which we call "attributes". In our framework, we learn a "key-value" pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift by encouraging visual attention consistency and prediction consistency. Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift, in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.

Abstract:
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model.
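
One way to picture the time-decay retrieval idea is the minimal sketch below. It is not the TV-RAG implementation: the abstract only states that temporal offsets are injected into the similarity computation, so the exponential decay form, the `decay_rate` parameter, and all function names here are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def time_decayed_score(query_vec, query_time, item_vec, item_time, decay_rate=0.1):
    """Hypothetical scoring rule: semantic similarity damped by temporal offset.

    The exponential form and decay_rate are assumptions, not taken from the paper.
    """
    offset = abs(query_time - item_time)
    return cosine(query_vec, item_vec) * math.exp(-decay_rate * offset)


def rank(query_vec, query_time, items, decay_rate=0.1):
    """Sort (name, vector, timestamp) items by time-decayed score, best first."""
    scored = [
        (time_decayed_score(query_vec, query_time, vec, ts, decay_rate), name)
        for name, vec, ts in items
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

Under this assumed rule, two subtitle or audio snippets with identical embeddings are ordered by how close their timestamps are to the moment the query refers to.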

Abstract:
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce ComTree, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that ComTree significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code and appendix are available at https://github.com/thu-media/ComTree.

Abstract:
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Code and data are released at https://github.com/zjukg/M3STR

Abstract:
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly align groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.

Abstract:
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.

Abstract:
Deep learning has shown great promise in physiological signal analysis, yet its progress is hindered by heterogeneous data formats, inconsistent preprocessing strategies, fragmented model pipelines, and non-reproducible experimental setups. To address these limitations, we present Tyee, a unified, modular, and fully-integrated configurable toolkit designed for intelligent physiological healthcare. Tyee introduces three key innovations: (1) a unified data interface and configurable preprocessing pipeline for 12 kinds of signal modalities; (2) a modular and extensible architecture enabling flexible integration and rapid prototyping across tasks; and (3) end-to-end workflow configuration, promoting reproducible and scalable experimentation. Tyee demonstrates consistent practical effectiveness and generalizability, outperforming or matching baselines across all evaluated tasks (with state-of-the-art results on 12 of 13 datasets). The Tyee toolkit is released at https://github.com/SmileHnu/Tyee and actively maintained.

Abstract:
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs, which decomposes the MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system in both zero-shot and few-shot settings, securing the Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM'25. Our code is available at https://github.com/RH-Lin/E3RG.

Abstract:
Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Each query is designed to perform local matching with a designated task to reduce interference across queries. Experiments show that MQMK enhances the prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. The code is available at https://github.com/DunweiTu/MQMK.

Abstract:
RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, which are limited by local receptive fields, or on Vision Transformers, which suffer from quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSMs) such as Mamba have shown great potential for modeling long-range dependency with linear complexity. However, directly applying SSMs to RGB-D SOD may lead to deficient local semantics as well as inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities, and 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method achieves excellent performance on the RGB-T SOD task, demonstrating strong generalization ability. Our code is publicly available at https://github.com/LanhooNg/LEAF-Mamba.

Abstract:
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle with industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model, and dataset are available at https://github.com/amoreZgx1n/SAGE.

Abstract:
Spatial audio playback defines immersive listening. However, objective evaluation methods for perceptual dimensions like sound field and sound image remain underdeveloped, hindered by the lack of fine-grained spatial audio datasets and the neglect of echoes and reverberation in diverse playback conditions. To address these challenges, we propose MESA, a multi-modal evaluation framework for spatial audio systems, and introduce PSA-MOS, a high-quality multi-scene spatial audio dataset. Specifically: 1) PSA-MOS provides 50 hours of high-quality spatial audio recordings spanning 6 playback scenarios and 7 device types, with detailed localization annotations and fine-grained MOS ratings across four perceptual dimensions. 2) We develop SAE-Encoder, a spatial audio encoder that captures both acoustic-spatial cues and fine-grained perceptual patterns. 3) MESA integrates visual scene context to enhance evaluation robustness through echo and reverberation modeling. Experimental results demonstrate that SAE-Encoder achieves superior performance in SELD tasks. With a two-stage training strategy, MESA exhibits strong correlation with human perceptual assessments, effectively guiding spatial audio quality optimization. The demos are available at https://david-pigeon.github.io/mesaDemo.

Abstract:
This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. We also propose an inference-time strategy to address issues caused by rapid vehicle heading changes. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. Code is available at https://github.com/shalfun/DriVerse

Abstract:
Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, baby pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELab color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation. All supplementary materials are available at https://Sung-Lin.github.io/TintBench/.
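
For readers unfamiliar with CIELab, the sketch below converts sRGB values to CIELab (D65 white point) and measures the CIE76 distance between two colors. It illustrates only the color-space geometry the abstract refers to; how the method maps these relationships onto text embeddings is not described here and is not shown. The function names are assumptions for illustration.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELab using the D65 reference white."""
    def to_linear(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    # Linear sRGB to CIE XYZ (standard D65 matrix).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """CIE76 color difference: Euclidean distance between two colors in CIELab."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))
```

Distances in this space track perceived color difference more closely than distances in raw RGB, which is the usual reason for reasoning about color relationships in CIELab.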

Abstract:
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training. The project page is available at: https://embodiedcity.github.io/Embodied-R/.

Abstract:
In this study, we introduce a novel method called group-wise VIsual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method can preserve more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed.
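
The aggregation step described above can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the VISA implementation: the similarity `threshold`, the plain mean used to merge tokens, and the function names are all hypothetical, since the abstract specifies only that a similarity graph guides aggregation from removed tokens into kept ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate_tokens(tokens, keep_idx, threshold=0.5):
    """Fold removed tokens into similar kept tokens.

    An edge links a removed token to every kept token whose cosine similarity
    exceeds `threshold` (assumed rule). Each kept token is replaced by the mean
    of itself and its removed neighbors (assumed aggregation).
    """
    keep_set = set(keep_idx)
    removed = [i for i in range(len(tokens)) if i not in keep_set]
    compact = []
    for k in keep_idx:
        group = [tokens[k]] + [
            tokens[r] for r in removed if cosine(tokens[k], tokens[r]) > threshold
        ]
        dim = len(tokens[k])
        compact.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return compact
```

The output has one vector per kept token, so the sequence passed on is shorter while still carrying information from the tokens that were dropped.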

Abstract:
Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which can detect and mitigate implicit malicious intent in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that can identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for mitigating sexual content. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE.

Abstract:
Large language models (LLMs) have demonstrated strong performance in natural language generation but remain limited in knowledge-intensive tasks due to outdated or incomplete internal knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external retrieval, with GraphRAG further enhancing performance through structured knowledge graphs and multi-hop reasoning. However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. T-GRAG consists of five key components: (1) a Temporal Knowledge Graph Generator that creates time-stamped, evolving graph structures; (2) a Temporal Query Decomposition mechanism that breaks complex temporal queries into manageable sub-queries; (3) a Three-layer Interactive Retriever that progressively filters and refines retrieval across temporal subgraphs; (4) a Source Text Extractor to mitigate noise; and (5) an LLM-based Generator that synthesizes contextually and temporally accurate responses. We also introduce Time-LongQA, a novel benchmark dataset based on real-world corporate annual reports, designed to test temporal reasoning across evolving knowledge. Extensive experiments show that T-GRAG significantly outperforms prior RAG and GraphRAG baselines in both retrieval accuracy and response relevance under temporal constraints, highlighting the necessity of modeling knowledge evolution for robust long-text question answering. Our code is publicly available at https://github.com/Arvin0313/T-GRAG.git

Abstract:
Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo

Abstract:
Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.

Abstract:
Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It is hard to model personality semantics with traditional superficial features, and effective cross-modal understanding has seemed out of reach. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. In addition, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. Specifically, this fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set demonstrate the effectiveness of the proposed components, with approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method's superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.

Abstract:
Underwater Instance Segmentation (UIS) tasks are crucial for complex underwater scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model, UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation on the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.

Abstract:
This paper introduces SAM-TTT, a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD). While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.

Abstract:
Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The code will be available at https://github.com/953206211/FEALLM.

Abstract:
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP-2 for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research.

Abstract:
Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within ~10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V [77] as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness. User studies further validate the natural rhythmic quality of the results, confirming their effectiveness for practical music-video editing.
The code is available at: zhangxinyu-xyz.github.io/MVAA

Abstract:
Robot manipulation is a fundamental capability of embodied intelligence, enabling effective robot interactions with the physical world. In robotic manipulation tasks, predicting precise grasping positions and object placement is essential. Achieving this requires object recognition to localize the target object, along with predicting object affordances for interaction and spatial affordances for optimal arrangement. While Vision-Language Models (VLMs) provide insights for high-level task planning and scene understanding, they often struggle to predict precise action positions, such as functional grasp points and spatial placements. This limitation stems from the lack of annotations for object and spatial affordance data in their training datasets. To address this gap, we introduce RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. Our dataset comprises 819,987 images paired with 1.9 million question answering (QA) annotations, covering three critical tasks: object affordance recognition to identify objects based on attributes and spatial relationships, object affordance prediction to pinpoint functional grasping parts, and spatial affordance localization to identify free space for placement. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experimental results reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford dataset significantly enhances their affordance prediction in robot manipulation, validating the dataset's effectiveness. The dataset, benchmark and evaluation code will be made publicly available to facilitate future research. Project website: https://roboafford-dataset.github.io/.

Abstract:
Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, where they are represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. In addition, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly developing Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.

Abstract:
Large Vision-Language Models (LVLMs) often generate text that is contextually coherent but does not match the visual input. Such hallucination hinders LVLMs' applicability in the real world. The key to solving hallucination in LVLMs is to make text generation rely more on the visual content. Most previous works choose to enhance or adjust the features or output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLMs, which does not explicitly or systematically enhance the visual reliance. In this paper, we comprehensively investigate, from a Bayesian perspective, the factors that may degrade the visual reliance of text generation in LVLMs. We propose to mitigate hallucination in LVLMs from three aspects. Firstly, we observe that not all visual tokens are informative in generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, an LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution which does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks, including POPE, CHAIR, and MME, demonstrate that our method can consistently mitigate the hallucination issue of LVLMs and performs favorably against previous state-of-the-art methods. Code is available at https://github.com/NeilHnxTcc/EVRB.

Abstract:
Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive "fighting fire with fire" strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversarial examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code is available at https://github.com/btzyd/F3.

Abstract:
Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs across seven general spatial reasoning tasks, offered in multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. The benchmark, generation pipeline, and evaluation toolkit are released on this page.

Abstract:
Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other's recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code will be publicly available.

Abstract:
Large language models (LLMs) have obtained promising results in mathematical reasoning, a foundational human intelligence skill. Most previous studies focus on improving or measuring the performance of LLMs via textual math datasets (e.g., MATH, GSM8K). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., choice, fill-in-the-blank, analysis) with detailed solutions across 12 grade levels from elementary to high school in China. The problem may contain multiple images, and the visual context may be present in the questions or opinions, which makes this dataset more challenging. Our comprehensive analysis reveals that state-of-the-art LMMs on the CMM-Math face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. The Math-LMM is trained using three stages: foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets. We release the datasets on GitHub (https://github.com/ECNU-ICALK/EduChat-Math) and Huggingface (https://huggingface.co/datasets/ecnu-icalk/cmm-math).

Abstract:
Human gaze communication is complex, comprising atomic-level (e.g. mutual, share, etc.) and event-level (e.g. follow, aversion, etc.) behaviours. Various methods have been developed to analyse gaze communication in images, but they typically fall short of fully understanding the complexities of the human gaze in videos. In this paper, we present a multi-task, multimodal model based on Contrastive Language-Image Pre-training (CLIP), designed to jointly predict atomic-level and event-level gaze communication, along with gaze target estimation. Specifically, we leverage the Vision-Language model to capture and utilise the semantic information between the atomic-level and event-level gaze communication categories. Additionally, most datasets in this field lack comprehensive annotations for both levels of gaze communication and detailed gaze target information. Therefore, we present a fully annotated gaze communication dataset, GP-Static++. We validate our model on GP-Static++ and several publicly available datasets, demonstrating its state-of-the-art performance. The dataset and code are available at https://pengc98.github.io/Multi-Task-Gaze-Communication-Understanding/.

Abstract:
Conversational Speech Synthesis (CSS) is a key task in the user-agent interaction area, aiming to generate more expressive and empathetic speech for users. However, it is well-known that "listening" and "eye contact" play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering capabilities. Specifically, it leverages a large-scale language model to comprehensively understand multimodal cues in the dialogue context, including speaker, text, speech, and the talking-face animations. After that, it employs multi-task sequence prediction to first infer the target utterance's emotion and then generate empathetic speech and natural talking-face animations. To ensure that the generated speech-visual content remains consistent in terms of emotion, content, and duration, we introduce three key optimizations: 1) designing a specialized neural landmark codec to tokenize and reconstruct facial expression sequences; 2) proposing a bimodal speech-visual hard alignment decoding strategy; 3) applying emotion-guided rendering during the generation stage. Comprehensive objective and subjective experiments demonstrate that our model synthesizes more empathetic speech and provides users with more natural and emotionally consistent talking-face animations. The source code and generated samples are available at: https://github.com/AI-S2-Lab/UniTalker.

Abstract:
Recent advancements in multimodal AIGC have enabled impressive text-to-video synthesis, but a critical challenge remains: maintaining consistent identity of key subjects across generated frames. To address this limitation, we introduce the Identity-Preserving Video Generation (IPVG) grand challenge. This challenge aims to propel the field toward more controllable generative models by focusing community efforts on preserving identity during the video generation process. To support these efforts, we publicly release the Identity-Preserving Video Benchmark (VIP-200K), a novel dataset comprising approximately 500,000 video-prompt pairs with 200,000 unique identities, each coupled with a reference identity image. Through this grand challenge and dataset, we provide a fertile ground for developing solutions that lead to more user-steerable video synthesis systems. The challenge homepage is https://hidream-ai.github.io/ipvg-challenge.github.io/.

Abstract:
Event-based object detection plays a crucial role in scenarios involving high-speed motion, extreme lighting conditions, and high-frequency detection. However, existing methods fail to address the challenges posed by small objects, including discriminative feature deficiency, the loss of critical information, and the inherent sparsity of event data. Moreover, the lack of benchmark datasets has significantly hindered progress in this field. To tackle these issues, we propose the Fully Deformable Detection Network (FDDNet), a lightweight framework that dynamically adapts to extract key features. First, we introduce a Long-Term Deformable Temporal Receptive Module (LDTR), which aligns critical features across consecutive event streams and leverages a State Space Model for long-range temporal modeling, enhancing the detection of high-speed small objects. Second, to address the sparsity of event data and the concentration of key features along object edges, we design a Sparse Feature Aggregation Block (SFAB) within the backbone and a coarse-to-fine deformable detection head, enabling hierarchical feature refinement from local to global, and improving the detection quality of sparse targets. Finally, to mitigate the lack of event-based small object datasets, we develop a high-quality, annotation-free data acquisition method and collect a real-world benchmark dataset for validation. Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance on event-based small object detection tasks, with a mAP of 37.4% (+2.4%) on our benchmark and runs at 88 FPS, showcasing both accuracy and real-time capability. Our code and Supplement are available at https://github.com/Lqm26/ESOD.

Abstract:
Incomplete multi-view clustering (IMVC) deals with real-world scenarios where certain views are partially missing, posing significant challenges to effective clustering. Most existing IMVC approaches face a trade-off: imputation-free methods suffer from information bias and imbalance, while full-imputation methods risk introducing and propagating noise. To overcome these limitations, we propose Energy-Based Deep Incomplete Multi-View Clustering (Energy-DIMC), a novel selective-imputation framework that leverages energy-based models (EBMs) to guide reliable imputations and robust clustering. EBMs assess data compatibility by assigning lower energy to more coherent structures, effectively modeling complex inter-view and inter-sample dependencies. Inspired by EBMs, Energy-DIMC integrates four key components: 1) a view feature projector that learns view-specific features and projects them into a common feature space; 2) an energy-guided selective imputation module that identifies the most reliable source view for each view based on view energies, and performs feature imputation only when cross-view transfer is feasible, avoiding unreliable imputations; 3) an energy-based representation fusion module that aggregates observed and selectively imputed features across views via a view attention mechanism, generating view-coherent representations; 4) an energy-enhanced contrastive alignment module that enforces consistency between view-specific and view-coherent representations using dual-level energy signals to preserve true positives. Extensive experiments demonstrate that Energy-DIMC outperforms state-of-the-art IMVC methods across diverse missing-view scenarios. The code is available at https://github.com/sunway677/EnergyIMVC.

Abstract:
Recent advancements in computer vision and Natural Language Processing (NLP) have transformed how machines interpret visual and textual information. While traditional image processing techniques rely on pixel-based analysis and feature extraction, and NLP leverages self-attention mechanisms for contextual understanding, the potential of viewing images as a language remains largely unexplored. In this work, as a proof of concept, we introduce a novel "Vision Language" framework that transforms image pixels into sequences of alphanumeric characters, effectively treating images as "text." This unique representation enables the direct application of NLP techniques to computer vision tasks, such as classification and segmentation, by utilizing text-based models for image analysis. We demonstrate this approach by adapting text classification methods for image classification on benchmark datasets. Furthermore, we construct a domain-specific medical language from the MedMNIST dataset, spanning 11 medical imaging sub-datasets, illustrating how a unified language can effectively generalize across distinct datasets. We also create a broad, generalized language from ImageNet, leveraging approximately 1.3 million images to show that a language derived from extensive image data can simplify and unify image processing across tasks and datasets. Our findings reveal a promising cross-disciplinary approach that bridges computer vision and NLP. Our code is publicly available at https://github.com/mustakinalam/The-Birth-of-Vision-Language.
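
As a rough illustration of what turning pixels into alphanumeric characters could look like, the sketch below quantizes 8-bit grayscale values onto a 36-symbol alphabet and emits one "word" per image row. The alphabet, the quantization rule, the row-as-word convention, and the function name `pixels_to_text` are all assumptions made for this example; the abstract does not specify the authors' actual encoding.

```python
import string

# Hypothetical 36-symbol alphabet (digits, then lowercase letters).
# The encoding actually used in the paper is not stated in the abstract.
ALPHABET = string.digits + string.ascii_lowercase


def pixels_to_text(image):
    """Map each 8-bit grayscale pixel to one character; each row becomes a 'word'."""
    words = []
    for row in image:
        chars = []
        for value in row:
            if not 0 <= value <= 255:
                raise ValueError("expected 8-bit pixel values")
            # Quantize 256 intensity levels down to 36 symbols.
            chars.append(ALPHABET[value * len(ALPHABET) // 256])
        words.append("".join(chars))
    # Rows are separated by spaces so a text tokenizer sees them as words.
    return " ".join(words)


if __name__ == "__main__":
    tiny_image = [[0, 128, 255], [64, 64, 200]]
    print(pixels_to_text(tiny_image))  # prints "0iz 99s"
```

The resulting string could then be passed to an ordinary text classifier, which is the kind of reuse the abstract describes; a finer alphabet or a different word boundary would trade sequence length against intensity resolution.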

Abstract:
Gaussian Splatting (GS) is widely used for efficient 3D scene representation and rendering by modeling scenes as continuous Gaussian distributions. However, GS struggles with high-frequency details and sharp transitions due to its low-pass filtering effect, often requiring multiple Gaussian stacking, which increases computational and memory costs. To overcome these limitations, we propose Truncated and Tailored Gaussian Splatting (TNT-GS), a novel approach that enhances shape complexity and preserves sharp boundaries. Our method truncates Gaussians to generate sharp edges and flexible shapes without excessive stacking, improving efficiency. We also introduce learnable parameters to dynamically tailor the receptive field of the primitives, optimizing the balance between high-frequency details and smooth regions. Furthermore, we employ specialized densification strategies to further improve efficiency during tile computation. Experimental results show that TNT-GS outperforms state-of-the-art methods in storage efficiency and rendering speed, offering a robust solution for real-time rendering. The code of TNT-GS is available at https://github.com/GoogolplexGoodenough/TNT-GS.

Abstract:
With the rapid development of short video platforms (such as Kuaishou and TikTok), these platforms have increasingly become important channels for the spread of fake news. Therefore, multi-modal fake news detection has attracted extensive attention. Existing studies mainly focus on directly integrating multi-modal information or discovering implicit clues in posts to improve detection performance. However, due to the abuse of video editing techniques, event-irrelevant segments (e.g., advertisements) are frequently mixed into videos, introducing noise and thereby weakening models' ability to learn crucial information. Moreover, video creators often inject personal tampered information into original news content through audio modality manipulation, potentially distorting the facts. To address these challenges, we propose a novel Event Consistency-aware Robust Fake News Detection (ECR-FND) framework, comprising two key components: an Event-aware Video Denoising Learning (EVDL) module and an Audio Tampering-information Capturing Module (ATCM). Specifically, the EVDL filters out the event-irrelevant segments within the video modality to focus on core news events. The ATCM adaptively amplifies tampering information in the audio modality, enhancing the model's capacity to detect manipulation attempts. Extensive experiments on two benchmark datasets (FakeSV and FakeTT) demonstrate ECR-FND's effectiveness. Our source code is available at https://github.com/immc-lab/ECR-FND.

Abstract:
Group re-identification (G-ReID) attempts to recognize human groups across multiple camera perspectives. It is a challenging task due to occlusion, perspective variation, and illumination change. While Person Attribute Recognition (PAR) methods have shown robustness under similar challenges, their potential in G-ReID remains unexplored. Although existing G-ReID datasets are well-crafted, they lack person-level attribute annotations. This prevents G-ReID methods from exploring attribute-based matching, essentially limiting their capability for multi-modal analysis. In this work, we bridge this gap by utilizing person-level attributes for group-level re-identification. We introduce PAG-ReID (Person Attribute based Group Re-identification), a large-scale dataset constructed by combining three popular G-ReID datasets: CM-Group, Road Group, and CUHK-SYSU-Group. PAG-ReID includes 19K group images encompassing 2,148 groups and 5,504 unique person IDs. At the person level, it provides 25K individual images, each annotated with 19 diverse human attributes, resulting in nearly 475K fine-grained annotations. Next, we propose an effective CLIP-based baseline that transforms attribute information into natural language descriptions, enabling joint multi-modal (visual-textual) reasoning for PAR as well as G-ReID tasks. Experiments demonstrate the effectiveness of our approach, setting a new direction for person attribute-centric group re-identification. To our knowledge, this is the first work to unify G-ReID and PAR in a single multi-modal framework. PAG-ReID can be found at https://github.com/draxler1/PAG-ReID.

Abstract:
Open-world semi-supervised learning (OWSSL) extends traditional semi-supervised learning to open-world scenarios by identifying novel categories in unlabeled data, thereby enhancing the model's generalization capability. However, existing OWSSL datasets typically assume a balanced class distribution, whereas real-world applications often exhibit highly imbalanced distributions. This imbalance makes it particularly challenging to learn tail classes and discover novel categories. This paper introduces the Class-Balanced Representation and Recognition Framework (CBTM-NCD), which uses the Variational Dirichlet Process (VDP) to improve tail class features and includes a generative data balancing strategy. Additionally, CBTM-NCD adopts a two-stage optimization strategy to identify novel category samples, effectively tackling three major challenges in open-world long-tailed scenarios: insufficient feature representation of tail classes, difficulty in discovering unknown categories, and class distribution imbalance. To enhance transparency and reproducibility, the code is available at https://github.com/wuzelei123/CBTM-NCD.

Abstract:
Multimodal learning integrates diverse modalities to enhance robustness, yet real-world scenarios suffer from heterogeneous imbalance phenomena (noise interference, partially missing modalities, intermodal information disparities), which degrade performance through biased feature representations. Existing methods fail to adaptively modulate models under dynamic imbalance conditions. We propose GMML, a framework that dynamically balances multimodal gradients to counteract imbalance-induced biases: i) an imbalance-aware gradient modulation adaptively identifies contributions with smooth weight transitions to balance conflicting gradients; ii) a parameter constraint method enforces ℓ2-norm constraints on encoders, suppressing parameter oscillations and blocking noisy updates under missing or noisy modalities. Theoretically, GMML achieves a larger certified radius upper bound for complex imbalances, with convergence radius analysis providing theoretical guarantees. Experiments demonstrate superior robustness against three imbalance types, outperforming the state of the art by 3.3% and 2.3% in accuracy on the KS and UCF-101 benchmarks. Code: https://github.com/zhangzikai-security-ML/GMML.

Abstract:
Audio phase retrieval aims to reconstruct phase from the given magnitude and obtain the time-domain audio waveform. While deep learning techniques have promoted the development of this area, existing deep neural network (DNN)-based methods usually suffer from inherent problems such as limited generalization to different audio types, failure to adapt to different sampling rates, and inflexibility with respect to varying computational complexity at inference, all of which heavily hinder the development of the field. To tackle these challenges, we introduce a novel phase estimation task called versatile audio phase retrieval and propose a Band-Aware Phase Estimation Network (BAPEN). Specifically, we first collect and establish a new benchmark for the task, which encompasses speech, sound effects, and music, with a total duration of around 414 hours. Besides, a sub-band oriented framework is proposed, which involves hierarchical sub-band encoding/decoding, and a dual-path network structure is specially devised for efficient narrow- and cross-band modeling, respectively. Furthermore, to enable dynamic control over the inference cost, we propose a simple yet effective sampling strategy for network depth augmentation during training. Both objective and subjective results validate the promising performance of BAPEN, which also offers a more flexible application range. Audio samples are available at https://lingling-dai.github.io/BAPEN/.

Abstract:
Screen sharing is a common feature in video conferencing applications, especially for remote work and presentations. However, internet conditions such as limited bandwidth, packet loss, and compression can significantly reduce the visual quality of shared screen content. Most existing video quality metrics are designed for natural scenes and were not benchmarked on screen content. In this work, we present a large-scale subjective dataset of screen content videos captured from video conferencing apps. The dataset includes 1,600 distorted videos with corresponding subjective quality scores. Subjective scores were collected using crowdsourced pairwise comparisons. The dataset provides a valuable resource for developing and benchmarking video quality metrics tailored to screen content. The evaluation of objective metrics revealed that several general-purpose quality metrics outperform both full-reference and no-reference metrics. The dataset is available at the following link: https://videoprocessing.github.io/screen-content-dataset.

Abstract:
Capturing human motion with existing monocular estimators often results in large errors when dealing with rare poses, occlusions, truncations, and frame blurring, leading to jitter and long-term drift. Although previous methods have introduced post-processing networks for pose refinement, they struggle to balance global smoothing and fine-grained correction. In this work, we propose MotionRefineNet, which leverages the synergy and complementarity between long- and short-term features in the temporal domain and high- and low-frequency features in the frequency domain to address these challenges. The temporal branch is designed as a hierarchical motion structure to learn multi-time scale features, where long-term features learn motion smoothness, and short-term features capture local rapid changes. The frequency branch employs different frequency band learning strategies based on the degrees of freedom (DoF) of body parts. For body parts with low DoF, the focus is on low-frequency features that represent overall motion trends and regular actions. For body parts with high DoF, we design a filter to adaptively extract useful information from all frequency bands, including subtle motion changes in the high-frequency bands. Extensive experiments on multiple datasets and estimators demonstrate that MotionRefineNet outperforms existing methods in refining 2D, 3D, and SMPL poses, achieving superior pose smoothing and deviation correction. Our code is available at: https://github.com/Wheels319/MotionRefineNet.

Abstract:
The current one-stream tracking framework has received far-reaching attention for its significant improvement in tracking performance, yet it is essentially an extension of Siamese trackers. However, the one-stream framework has not been effectively exploited for discriminative trackers, which still use separate feature extraction and model prediction. Therefore, this article aims to implement a one-stream learning strategy for feature extraction and model prediction under the discriminative tracking framework. To this end, we leverage the prevailing Vision Transformer and Vision Mamba backbones. Moreover, we combine templates with discriminative tracking methods to enhance target-aware feature learning, and further propose an attention fusion module to implement spatiotemporal template fusion, which enhances the adaptability of the tracking model to dynamic changes of targets. Experiments on multiple popular tracking benchmarks demonstrate that our proposed tracking architecture has superior tracking performance. In particular, our tracker obtains an AUC of 73.3% on the LaSOT dataset and an AO of 78.2% on the GOT-10k dataset. The code, raw results, and trained models are available at https://github.com/hexdjx/VisTrack.

Abstract:
Multi-view learning plays a pivotal role in enabling intelligent systems to understand the world from diverse perspectives. While recent advances in evidential multi-view learning have introduced uncertainty-aware mechanisms, most existing methods treat all views equally during fusion, overlooking the disparity in evidence strength across views. In real-world scenarios, however, some views may offer weaker yet clearer category cues, which deserve greater emphasis during integration. In this paper, we propose a Strength-Adaptive Evidential Multi-View Learning (SAEML) method that performs reliability-aware fusion by explicitly modeling the contribution of each view's evidence. Our method introduces Fisher-evidential networks to preserve potentially valuable category-wise evidence for downstream fusion through multi-peaked outputs, and assesses view contributions from three key aspects: (1) self-view belief mass reflecting internal evidence distribution, (2) cross-view belief mass capturing complementarity between views, and (3) category-aware uncertainty mass modeling reliability at a fine-grained level. These metrics jointly guide a one-step, weighted fusion strategy that avoids information dilution and enhances multi-view complementarity. Experiments on six datasets validate the effectiveness of SAEML, including a 7% performance gain over SOTA methods on the 3-sources dataset. The code is released at https://github.com/Wednesque/SAEML.

Abstract:
Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks (Charades, Breakfast, and COIN) demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.

Abstract:
Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable capabilities in understanding static screenshots. However, a key aspect of building a robust GUI automation system is understanding dynamic GUI content, such as videos depicting fundamental GUI actions, which enables agents to learn from human demonstrations. This is a non-trivial task that is distinct from natural scene video captioning: (i) GUI screenshots contain more concentrated information than natural scenes due to their high-resolution environment. (ii) Events in GUI videos occur more quickly, requiring attention to time-span detection. (iii) Frames in GUI videos with less information increase unnecessary computational costs for captioning. To address these challenges, we propose Act2Cap, a new video captioning benchmark specifically designed for GUI action videos, comprising 10,866 diverse video-caption pairs containing not only temporal information of keyframes but also detailed narration on action types, elements, location, and purpose. In addition, we propose GUI Narrator, a framework utilizing cursor detection to enhance action interpretation in high-resolution screenshots. Our framework demonstrates improved performance both with open-source models and as a plug-and-play solution for closed-source models, while reducing computational costs. The datasets and models are available at https://github.com/showlab/GUI-Narrator.

Abstract:
Video-to-Image Affordance Grounding aims to localize object affordances in static images by learning from human demonstration videos. However, existing fully supervised methods rely on paired image-video inputs during both training and inference, which significantly limits their practicality in real-world scenarios. Conversely, weakly supervised approaches support video-free inference but often struggle to capture the most critical interaction features necessary for affordance learning. To overcome these limitations, we propose a novel two-stage framework, VCL. In the first stage, we adopt the standard fully supervised paradigm to train the model, enabling it to effectively extract meaningful interaction features from demonstration videos. In the second stage, these extracted features are conceptualized, and a conceptual module is introduced to map natural language instructions into the concept space. Our approach establishes a new paradigm for affordance learning from demonstrations, enabling the model to learn from video but perform precise text-conditioned inference. Experiments on two benchmark datasets show that our base model outperforms previous state-of-the-art methods, while our text-conditioned model achieves competitive performance without requiring paired video inputs during inference. Code will be released at https://github.com/Fanzy27/VCL.

Abstract:
With the rapid proliferation of 2D and 3D data, driven by advances in virtual environments and AI-generated content, cross-modal 2D-3D retrieval has attracted growing attention. However, it is easy to introduce noisy labels due to the spatial complexity of 3D content. Although various methods have been proposed to address this issue, they still struggle to handle or effectively re-exploit noisy samples. Moreover, existing approaches are prone to error accumulation due to the self-reinforcement of the model during training. To address these issues, we propose a Noise-Robust Cross-modal Learning (NRCL) framework based on a hybrid strategy. Specifically, NRCL introduces a Robust Cross-modal Co-separator (RCC), which separates noisy samples from clean ones by leveraging modality complementarity and adopting a co-teaching paradigm to mitigate the potential error accumulation of a single model during training. Besides, a Reliable Soft Rectification (RSR) method is adopted to correct noisy labels by aggregating historical and dual-model predictions, exploiting the discriminative information from noisy samples. Finally, a Robust Cross-modal Prototype Learning (RCPL) method is proposed to improve inter-class discriminability and alleviate the inherent gaps across modalities in the shared common space; it jointly leverages clean and rectified labels, thereby mitigating the detrimental impact of noisy samples. Extensive experiments are conducted on three 3D multimodal datasets to verify the effectiveness of our method by comparing it with 10 state-of-the-art methods. The code is available at https://github.com/yangaonidaye123/NRCL.

Abstract:
The pelvis is a high-incidence region for trauma, and its precise segmentation is vital for clinical diagnosis and treatment. Although deep learning has achieved high precision for pelvis segmentation, it is still confronted with challenges such as the scarcity of medical data and the high cost of annotation. Recently, generative models have offered a viable solution through synthetic data augmentation. Thus, we propose a novel text-conditioned generative framework that simultaneously produces high-fidelity CT images and their corresponding segmentation masks, with a special focus on accurately generating symmetric pelvic structures, including the left hip bone, right hip bone and sacrum. Firstly, we fine-tuned a medical text encoder to transform detailed descriptions into precise generation conditions. Considering the easily overlooked symmetry attribute in pelvic bone images, we introduced a novel coordinate-aware enhancement module that incorporates bone-specific centroid coordinates for symmetrical awareness. Finally, we expanded it to multi-task learning for generation of paired pelvic images and segmentation labels. To test our method, we released a pelvic CT image dataset with textual description (CT-PelvisText) and then transferred it to downstream segmentation using the generative pelvic image. Our experiments demonstrate the reliability of our method, which contributes to accurate pelvic segmentation. This work can also be easily extended to other medical images with symmetry properties, which provides potential for efficient learning in small-sample datasets. Our code and dataset are available at: https://github.com/CurellaSong/TSA_LDM.

Abstract:
We study how large vision models (LVMs) can predict food nutrition through lightweight and interpretable adapters, i.e., machine learning modules whose predictions can be understood by humans. We introduce novel nutrition adapters that use features extracted by pre-trained LVMs and output so-called nutrition maps. Nutrition maps indicate the concentration of nutrition values at each image location. We use this interpretable representation to obtain the nutrition targets as the sum of all nutrition concentrations on the maps. To understand our approach's generalization capability, we systematically analyze the behavior of our novel interpretable adapters with different LVMs and different food image-nutrition datasets. Our lightweight approach delivers better or on-par performance compared with the state-of-the-art models on the Nutrition5k and the Nutritionverse-Real benchmarks. The code is provided at https://github.com/vitaly-emelianov/nutrition-adapters.
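
The map-to-target step described above can be illustrated with a minimal sketch. It assumes a nutrition map stored as an (H, W, K) array with one channel per nutrient; the function name `nutrition_from_map`, the array layout, and the toy values are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def nutrition_from_map(nutrition_map):
    """Sum per-location concentrations into one total per nutrient.

    nutrition_map: array of shape (H, W, K), where K is the number of
    nutrients (hypothetical layout chosen for this sketch).
    Returns an array of shape (K,).
    """
    nutrition_map = np.asarray(nutrition_map, dtype=float)
    return nutrition_map.sum(axis=(0, 1))

# Toy 4x4 map with two nutrients (say, calories and protein).
toy_map = np.zeros((4, 4, 2))
toy_map[1, 1] = [120.0, 5.0]   # one food region
toy_map[2, 3] = [80.0, 2.5]    # another food region
totals = nutrition_from_map(toy_map)
```

Because the prediction is a plain sum over locations, each location's contribution to the total can be read directly off the map, which is what makes the representation interpretable.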

Abstract:
Diffusion Models (DMs) have revolutionized Text-to-Image (T2I) generation, yet inherent dataset biases often result in skewed representations across demographics, perpetuating stereotypes and social inequities. Existing debiasing approaches primarily focus on the text processing component, overlooking the intricate biases in the diffusion model's U-Net architecture. This paper presents a novel approach to addressing these biases through a causal analysis of bias disentanglement within the U-Net architecture. We introduce the Contrast Neuron Sensitivity Metric, which enables precise identification of neurons sensitive to bias, allowing for targeted interventions. Our debiasing paradigm fine-tunes these identified neurons with a combination of distribution and semantic loss, requiring only 0.2M parameters to be adjusted, which is far less than prior methods. Experiments show that our method effectively removes gender and race biases and maintains the diversity distribution of images. It enables both absolute fairness and relative adjustments by modifying target attribute distributions (e.g., young:old = 7:3). Furthermore, our approach is scalable, allowing simultaneous fine-tuning across multiple biases, and achieves good bias reduction even with non-templated prompts. The code is available on https://github.com/FanQi-AI/Debias.

Abstract:
Image editing requires semantically modifying specific regions according to user instructions while preserving overall visual coherence. Although diffusion models have shown remarkable progress in image generation, their application to editing tasks faces two critical limitations: (1) insufficient understanding of editing objectives often leads to inconsistencies in style, attribute, or texture between generated content and background regions, and (2) over-reliance on ambiguous textual prompts that frequently lack crucial details, resulting in suboptimal edits. To address these challenges, we propose SAKR-Edit, a novel framework that enhances editing quality and controllability through Scene-Aware Knowledge Reasoning. Specifically, our approach introduces a scene-aware knowledge reasoning module that combines large language models (LLMs) with vision-language models (e.g., BLIP-2) to integrate global and local semantic information for improved instruction comprehension. The system employs chain-of-thought reasoning and contextual learning to parse instructions, infer implicit editing intentions, and supplement missing details, thereby improving editing precision. Additionally, we construct SSUD, a structured scene understanding dataset for evaluating editing models in real-world scenarios. Extensive experiments demonstrate that SAKR-Edit outperforms existing methods in image realism, style consistency, and structural integrity, while showing robust stability and adaptability in real-world applications. Our code and dataset are released at https://github.com/SAKR-Edit/sakr-edit.github.io.

Abstract:
Musical structure spans nested timescales, from fine-grained fluctuations to long-range organization. To capture this, we apply Detrended Fluctuation Analysis (DFA) and Multifractal Detrended Fluctuation Analysis (MFDFA) to compare human-composed music (Billboard Top 5 hits, 1950-2024) with AI-generated outputs from Suno, DiffRhythm, and YuE. While all models capture fractal properties of music, differences persist. Suno aligns most closely with human music but shows reduced fine-scale variability. DiffRhythm yields narrower spectra and lower complexity, while YuE matches large-scale structure yet exhibits greater small-scale variability. Decade-level analysis shows divergence is smallest for earlier, more homogeneous eras (1950s-1960s) and greatest during periods of stylistic diversity and production complexity (2000s-2020s). We propose integrating fractal descriptors into training objectives and refining architectures to improve temporal sensitivity, advancing AI systems toward more structurally representative and authentic music generation. The code and results can be found at https://github.com/zhangkkevin/billboard-ai-fractal-comparison.
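
For readers unfamiliar with DFA, the following is a minimal sketch of the standard first-order procedure: integrate the mean-removed signal, detrend it in non-overlapping windows, and fit the slope of log fluctuation against log window size. The function name `dfa_exponent`, the window sizes, and the synthetic test signals are assumptions made for illustration; the authors' actual pipeline (audio features, scale ranges, MFDFA moments) may differ.

```python
import numpy as np

def dfa_exponent(signal, scales, order=1):
    """Estimate the DFA scaling exponent of a 1-D signal.

    scales: iterable of window lengths (in samples).
    order:  degree of the polynomial trend removed in each window.
    """
    x = np.asarray(signal, dtype=float)
    profile = np.cumsum(x - x.mean())           # integrated, mean-removed series
    fluctuations = []
    for n in scales:
        n_windows = len(profile) // n
        # Non-overlapping windows, one per column after the transpose.
        segments = profile[: n_windows * n].reshape(n_windows, n).T
        t = np.arange(n)
        coeffs = np.polyfit(t, segments, order)         # (order + 1, n_windows)
        trends = np.vander(t, order + 1) @ coeffs       # (n, n_windows)
        residuals = segments - trends
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    # Slope of log F(n) versus log n is the scaling exponent.
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return float(slope)

rng = np.random.default_rng(0)
noise = rng.standard_normal(8192)
scales = [16, 32, 64, 128, 256]
alpha_noise = dfa_exponent(noise, scales)             # uncorrelated noise
alpha_walk = dfa_exponent(np.cumsum(noise), scales)   # random walk
```

In theory, uncorrelated noise has an exponent near 0.5 and a random walk near 1.5; finite-length estimates will deviate somewhat from these values.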

Abstract:
Embodied AI calls for reliable cross-modal object recognition that deeply mines High-Quality (HQ) object appearance (i.e., visual information) and touch details (i.e., haptic information). However, in real-world scenarios, cross-modal data is usually degraded by data acquisition and delivery in complex environments. In this paper, we propose a Robust Visual-Haptic recognition (RobustVisH) model that, for the first time, identifies Low-Quality (LQ) visual-haptic data with transmission distortion. First, we introduce the WIreless Transmission Interference-based Multi-modal benchmark (WITIM) as a visual-haptic dataset under transmission interference. In particular, the dataset consists of WITIM/AU and WITIM/PHAC-2, in which the original signals are obtained from AU and PHAC-2, respectively. Second, we design a trainable weighted fusion and a Transformer encoder based on the bi-directional self-attention mechanism, enabling RobustVisH to form and learn fused visual-haptic features after modality-specific one-dimensional feature encoding. Third, we employ a covariate shift paradigm, transferring knowledge of RobustVisH from HQ data to LQ data, thereby increasing its robustness against transmission-interference inputs. Experimental results demonstrate that the proposed RobustVisH improves the accuracy of the state-of-the-art method by 2.06% and 9.28% on WITIM/AU and WITIM/PHAC-2, respectively. Source code is available at: https://github.com/lylibylily/RobustVisH.

Abstract:
Anchor-based strategies have become the dominant paradigm for large-scale multi-view clustering, where the quality and representational capacity of anchors are crucial to clustering performance. Existing methods typically learn anchors adaptively, focusing only on dynamically selecting anchors from the original data. However, these methods often lack an information-theoretic metric to evaluate how effectively the selected anchors capture the intrinsic characteristics of their respective clusters. Moreover, few approaches attempt to enhance the internal structure of the anchor matrix to further improve clustering performance. To address these challenges, we propose a novel Anchor-Driven High-Throughput Encoding (ADHTE) framework that optimizes anchors by maximizing their throughput encoding capacity. In this method, the High-Throughput Encoding rate serves as a metric for anchor effectiveness, and we employ a deep neural network to optimize the anchor matrix. In addition, we predefine a clustering indicator matrix to construct a consistent anchor matrix across views, thereby ensuring anchor alignment. Furthermore, we propose an edge-alignment learning scheme to produce a bipartite graph with consistent edges across views. Extensive experiments on eight benchmark datasets demonstrate that the proposed ADHTE framework exhibits superior effectiveness and robustness compared to other state-of-the-art methods. The code of this paper is released at https://github.com/enjoypiker/ADHTE.

Abstract:
Multimodal learning, which has attracted great attention recently, may suffer from the imbalanced multimodal phenomenon, which leads to insufficient optimization of both multimodal and unimodal objectives. The core problem lies in the optimization conflicts between these objectives, which result in diverse updating directions and strengths that cause antagonism between them. In this paper, we mathematically analyze the optimization processes of imbalanced multimodal learning in the hyperspace from a novel geometric perspective. Based on our theoretical analysis, we define the volume of the parallel polyhedron constructed from the gradients in the hyperspace to quantify the misalignment between the optimization objectives. We then propose Geometric Gradient Divergence Modulation (GGDM), which leverages the volume of the gradient polyhedron to perform gradient modulation, encouraging alignment among gradients and promoting a synergistic optimization effect. Lastly, we evaluate GGDM on five widely used multimodal benchmarks involving RGB image, optical flow, text, image, video, and audio modalities. Our method achieves state-of-the-art performance compared to other imbalanced multimodal learning methods. Our code is available at: https://github.com/ConstantineWayne/GGDM.
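
One standard way to compute the volume of a parallel polyhedron spanned by a set of vectors is the square root of the Gram determinant. The sketch below applies that textbook formula to flattened gradient vectors; whether GGDM uses exactly this formula is an assumption, and the function name `gradient_volume` and the unit-length normalization are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def gradient_volume(gradients, normalize=True, eps=1e-12):
    """Volume of the parallelotope spanned by k gradient vectors.

    gradients: array of shape (k, d), one flattened gradient per row.
    normalize: if True, scale each gradient to unit length so the
               volume reflects direction mismatch only.
    """
    G = np.asarray(gradients, dtype=float)
    if normalize:
        norms = np.linalg.norm(G, axis=1, keepdims=True)
        G = G / np.maximum(norms, eps)
    gram = G @ G.T                       # (k, k) matrix of inner products
    det = np.linalg.det(gram)
    return float(np.sqrt(max(det, 0.0)))  # clamp tiny negative round-off

aligned = gradient_volume([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
orthogonal = gradient_volume([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
```

With unit-normalized gradients the volume is 0 when the gradients are parallel and reaches its maximum of 1 when they are mutually orthogonal, so it can serve as a scalar measure of directional misalignment.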

Abstract:
Lifelong Person Re-identification (LReID) focuses on continuously adapting to new domains over time while preserving knowledge from previously seen domains, particularly under the domain incremental learning setting. The major challenge of LReID is catastrophic forgetting, typically caused by large domain shifts during training. To address this, we propose a novel Amplitude-aware Domain Style Replay (ADSR) framework, which introduces a Fourier-based Style Transfer (FST) mechanism to generate synthetic data that reflects the style of previously encountered domains. These proxy images help retain prior knowledge without the need to store actual past data. Our method transfers stylistic information (mainly encoded in the amplitude spectrum) from old domains to new ones, creating old-stylized images that preserve the content of new domain data while adopting the visual style of earlier domains. To further boost generalization, we design a Self-Stylization Normalization (SSN) module that adapts the current domain's style distribution, making the model more robust to stylistic variations. Additionally, we introduce a Multi-Granularity Transfer (MGT) module that uses K-Means clustering to extract multiple representative style features from each domain, enabling compact yet comprehensive storage and replay of domain-specific information. Extensive experiments on multiple LReID benchmarks show that ADSR achieves superior performance over existing approaches, effectively reducing forgetting and improving cross-domain generalization. Our code is available at https://github.com/cclong8/MM2025-ADSR.
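
The idea that style lives mainly in the amplitude spectrum can be illustrated with a generic amplitude-swap sketch: keep the phase of the content image and replace (or blend) its amplitude with that of a style image. This is a common Fourier-domain recipe and not necessarily the FST mechanism of ADSR; the function name `amplitude_style_transfer`, the `mix` parameter, and the random toy images are assumptions for illustration.

```python
import numpy as np

def amplitude_style_transfer(content, style, mix=1.0):
    """Combine the phase of `content` with the amplitude of `style`.

    content, style: arrays of identical shape (H, W) or (H, W, C).
    mix: 0.0 keeps the content amplitude, 1.0 uses the style amplitude.
    """
    content = np.asarray(content, dtype=float)
    style = np.asarray(style, dtype=float)
    f_content = np.fft.fft2(content, axes=(0, 1))
    f_style = np.fft.fft2(style, axes=(0, 1))
    # Blend amplitudes; the phase (content structure) is left untouched.
    amplitude = (1.0 - mix) * np.abs(f_content) + mix * np.abs(f_style)
    phase = np.angle(f_content)
    stylized = np.fft.ifft2(amplitude * np.exp(1j * phase), axes=(0, 1))
    return stylized.real

rng = np.random.default_rng(0)
new_domain_image = rng.random((8, 8))   # stands in for a current-domain image
old_domain_style = rng.random((8, 8))   # stands in for a stored old-domain style
old_stylized = amplitude_style_transfer(new_domain_image, old_domain_style)
```

Only the amplitude of the old domain is needed to produce the stylized image, which is consistent with the abstract's point that prior knowledge can be replayed without storing past images themselves.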

Abstract:
Weakly-supervised camouflaged object detection aims to achieve performance comparable to fully-supervised methods while relying on minimal, coarse annotations (e.g., points, scribbles, and boxes). However, this task is exceptionally challenging, as it requires models to not only overcome the inherent difficulty of detecting camouflaged objects but also to effectively distinguish foreground from background with limited supervision. To address this challenge, we propose a novel training paradigm called Progressive Representation Learning, which aims to jointly enhance the model's ability to extract discriminative features from both the training strategy and model architecture perspectives. Specifically, in terms of the training strategy, a Progressive Self-Training Alignment (PSTA) method is constructed at the image level to generate multi-level self-constructed data, enabling enhanced detection of challenging camouflaged objects through hierarchical self-learning. In terms of the model architecture, we design a progressive fine-tuning module (Multi-Scale Multi-Resolution LoRA, MSMR) and an Adaptive Frequency-aware Fusion (AFF) module. The former explicitly improves multi-level feature representation during the encoding stage, while the latter focuses on high-frequency information in the fusion stage to boost fine-detail detection. Extensive experimental results demonstrate that our method significantly outperforms existing weakly supervised approaches, achieving an average improvement of 6% on the weighted F-measure (F_β^w) across three datasets with fewer parameters. Moreover, it even surpasses some fully supervised state-of-the-art methods on certain metrics, highlighting the effectiveness of our progressive representation learning paradigm. Code and results are publicly available at https://github.com/shuyonggao/PRLNet.

Abstract:
Ultra-fine-grained visual classification (ultra-FGVC) aims at classifying sub-grained categories of fine-grained objects. This inevitably requires discriminative representation learning within a limited training set. Exploring intrinsic features from the object itself via contrastive learning has demonstrated great progress towards learning discriminative representations. Yet forcibly dividing highly similar categories at the representation level may over-guide the learned feature space, leading to overfitting in ultra-FGVC tasks. To this end, this paper introduces CLA-Net, a novel contrastive Lie algebra learning framework to address this fundamental problem in ultra-FGVC. The core design is a self-supervised module that performs self-shuffling and masking and then distinguishes these altered images from other images at a second-order representation level. This drives the model to learn an optimized feature space that has a large inter-class distance while remaining tolerant to intra-class variations. By incorporating this self-supervised module, the network acquires more knowledge from the intrinsic structure of the input data, which improves the generalization ability without requiring extra manual annotations. CLA-Net achieves strong performance on eight publicly available datasets, demonstrating its effectiveness in the ultra-FGVC task. The code is available at: https://github.com/zichengpan/CLA-NET.

Abstract:
Multi-label classification has recently demonstrated promising performance through CLIP-based unsupervised learning. However, existing CLIP-based approaches primarily focus on object-centric features, which limits their ability to capture rich contextual dependencies between objects and their surrounding scenes. In addition, the vision transformer architecture of CLIP exhibits a bias toward the most prominent object, often failing to recognize small or less conspicuous objects precisely. To address these limitations, we propose Background-Aware CLIP-GCN (BAC-GCN), a novel framework that explicitly models class-background interactions and is designed to capture fine-grained visual patterns of small objects effectively. BAC-GCN is composed of three key components: (i) a Similarity Kernel that extracts patch-level local features for each category (i.e., class and background), (ii) a CLIP-GCN that captures relational dependencies between local-global and class-background features, and (iii) a Re-Training for Small Objects (ReSO) strategy that enhances the representation of small and hard-to-learn objects by learning their distinctive visual characteristics. Therefore, our method facilitates a deeper understanding of complex visual contexts, enabling the model to make decisions by leveraging diverse visual cues and their contextual relationships. Extensive experiments demonstrate that BAC-GCN achieves state-of-the-art performance on three benchmark multi-label datasets: VOC07, COCO, and NUS, validating the effectiveness of our approach. The project page is available at: https://github.com/yonghyeonjo46/BAC-GCN.

Abstract:
Long video understanding is essential for various practical applications including surveillance and film analysis. While recent Vision-Language Models (VLMs) have advanced performance in this domain, efficiency remains a key challenge, especially for hour-long videos. Existing methods commonly reduce visual tokens via compression in the vision encoder, but token count still grows linearly with video length. Alternative approaches apply importance-based token reduction in the language model, yet their non-causal design limits efficiency gains to offline, single-query settings. In this work, we emphasize the need for causal importance estimation, where a token's relevance is determined only from prior context, to enable efficient, real-time long video understanding. We propose CITR, a Causal Importance-based Token Reduction framework to reduce visual token redundancy in long video understanding tasks, enabling practical memory control and enhanced computational efficiency. Experiments on both offline and streaming benchmarks show that CITR reduces latency by 49% in offline multi-query scenarios and effectively controls chunked prefilling time in streaming, all within a 24GB memory footprint and with less than 1% performance drop. The code and appendix are available at https://github.com/Columbine21/CITR.

Abstract:
The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. Our key insight is that plausible human reactions are smooth and coherent over time and conform to constraints imposed by human facial anatomy. To achieve this, ReactDiff incorporates two vital priors (spatio-temporal facial kinematics) into the diffusion process: i) temporal facial behavioral kinematics and ii) facial action unit dependencies. These two constraints guide the model toward realistic human reaction manifolds, avoiding visually unrealistic jitters, unstable transitions, unnatural expressions, and other artifacts. Extensive experiments on the REACT2024 dataset demonstrate that our approach not only achieves state-of-the-art reaction quality but also excels in diversity and reaction appropriateness. Our code is publicly available at https://github.com/lingjivoo/ReactDiff.

Abstract:
While multi-modal features offer rich semantic signals to enhance sequential recommendation systems, their integration with ID-based embeddings remains challenging. Conventional fusion strategies often degrade performance despite the semantic potential of multimodal data. Through empirical analysis, we identify asymmetric convergence dynamics between rapidly adapting ID embeddings and slowly evolving modality representations as the fundamental barrier. To address this, we propose DeCoRec, a novel framework that decouples ID and modality optimization trajectories to prevent gradient interference. To further reconcile ID and multi-modal data, we introduce modality-aware interest clustering and cross-modal contrastive learning to align semantic neighborhoods with behavioral patterns. Extensive experiments demonstrate 5-7% improvements in NDCG/HiT metrics over existing schemes and particular robustness in cold-start scenarios. The code is available at https://github.com/KIKIENAO/decorec

Abstract:
Vision-language models like CLIP have revolutionized person re-identification (ReID) by enabling cross-modal semantic alignment. However, most existing CLIP-based ReID methods suffer from a critical limitation: semantic entanglement, where identity and attribute features are indiscriminately compressed into a single, undifferentiated token representation. This oversight fails to account for their inherently distinct roles in characterizing individuals. To address this limitation, we propose an Identity-Attribute-Decoupled Tokenization (IADT) method, a hierarchical framework with two synergistic components: subject-oriented tokens that model identity through a cross-modality feature inverse mapping paradigm, preserving invariant biometric features; and attribute-aware tokens that capture localized characteristics through the cross-interaction of local features and learnable prototype vectors, dynamically focusing on discriminative regions without manual supervision. The hierarchical tokenization enables disentangled yet complementary representation learning: identity and attribute semantics are encoded into distinct embedding subspaces, while cross-token contrastive learning establishes semantic reinforcement through attention-guided feature interaction. Crucially, this process does not require part-level annotations, making it directly applicable to real-world deployment. Extensive experiments validate the effectiveness of the proposed method. For example, on the Market-1501 dataset, IADT achieves 97.1% mAP (+2.5% over SOTA) and 98.2% Rank-1 accuracy. For the challenging MSMT benchmark, it attains 88.9% mAP (+1.7% improvement) with 93.1% Rank-1 accuracy, demonstrating consistent superiority. The code will be available at https://github.com/llraay/IADT.

Abstract:
Imagine James Bond speaking like Mr. Bean: such a mismatch would create a jarring dissonance and break the viewer's immersion. Current research on virtual avatar animation has focused on modeling 3D geometry, appearance, and motion generation, while neglecting the harmony between speech prosody and the avatar's visual presentation and contextual environment. In this paper, we seek to bridge this gap by first identifying and defining the key elements necessary for achieving audiovisual harmony, such as appearance, expression, body posture, backgrounds, and colors. Subsequently, we propose a method that jointly models semantic consistency in avatar animation, named HarmoniVox, which focuses on crafting prosodic speech consistent with the avatar's essence from a given visual image. To achieve this, we implement a technical framework with a mutual modal contrastive learning strategy, enhancing multimodal alignment in a coarse-to-fine fashion. To support this method, we establish an experimental dataset, HarAvaSpeech, comprising 28,929 image-audio pairs, designed to encompass expressive speech prosody and rich avatar visual presentations across a wide range of contexts. Leveraging this dataset, our experiments demonstrate that the proposed method outperforms the baselines in matching the nuanced tone and harmonious rhythm of speech to the avatar's visual presentation, and that it generalizes to out-of-domain cases. A demo will be provided at https://harmonivox.github.io/harmonivox/.

Abstract:
Long-tailed out-of-distribution learning aims to reduce performance bias in long-tailed in-distribution (ID) data while rejecting out-of-distribution (OOD) samples, which are often mistaken for under-represented tail classes. To achieve OOD detection, existing methods incorporate an outlier exposure (OE) term into the long-tailed recognition (LTR) loss. However, as we prove in this paper, the OE term induces a gradient conflict with the ID objectives, especially for tail classes, thereby contradicting the core motivation of LTR. To avoid the ID-OOD dilemma, we propose Dynamic Ambiguity-aware Recalibration for Logits (DARL), an ambiguity-guided long-tailed OOD learning approach grounded on two theoretical insights. First, we show that mixed ID data can mitigate the conflict in OE training and exhibit higher intrinsic ambiguity than the original ID data, and can thus serve as a surrogate for real OOD data. Second, we introduce an ambiguity-aware logit adjustment that dynamically calibrates the class margins using energy-based ambiguity metrics, effectively reducing early-stage bias while avoiding late-stage overfitting. Extensive experiments show that DARL achieves overall state-of-the-art performance in long-tailed OOD learning. Moreover, unlike OE methods, DARL trains solely on the ID data, which can reduce the data requirements by 80%. The code is available at https://github.com/XuanZhang-A/DARL.
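
The abstract does not define its energy-based ambiguity metric, so the following is only a minimal sketch of the widely used free-energy score computed from classifier logits, E(x) = -T * logsumexp(logits / T). The function name `energy_score` and the `temperature` parameter are illustrative assumptions, not DARL's actual interface.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score of one sample: -T * logsumexp(logits / T).

    Lower (more negative) energy corresponds to a more confident prediction;
    higher energy corresponds to a flatter, more ambiguous one.
    This is the generic formulation, assumed here for illustration only.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


confident = energy_score([10.0, 0.0, 0.0])  # one dominant class
ambiguous = energy_score([1.0, 1.0, 1.0])   # no dominant class
print(confident < ambiguous)
```

Under this generic score, a sample with flat logits receives a higher energy than a sample with one dominant logit, which is the property an ambiguity-aware adjustment could key on.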

Abstract:
For camera-based image capturing, the impact of exposure or camera parameters (ISO sensitivity, shutter speed, and aperture F-number) on imaging quality is decisive. Such parameters interact in a coupled manner during the imaging process to determine the exposure quality and the degree of blur in a photograph. Naturally, decoupling such parameters from images holds significant value for applications like image quality assessment and illumination optimization. However, there has been no systematic research dedicated to this topic. In this paper, we propose a new benchmark, Cam-Bench, for estimating camera parameters directly from images. It collects an image dataset, Cam-10K, with various indoor scenes and accurate labels of camera parameters. Based on Cam-10K, we propose a camera parameter estimation network to decouple and regress recorded exposure information. To the best of our knowledge, Cam-Bench is the first benchmark for camera parameter estimation. Experiments demonstrate that it can enhance the performance of various downstream applications. The source code has been made publicly available at: https://github.com/pengquanhong/CamBench.

Abstract:
Multi-view clustering based on anchor graphs has gained a lot of attention because of its ability to handle large-scale datasets in linear time. However, existing methods learn anchors from linear space, ignore the nonlinear manifold characteristics of the original data, and fail to fully consider the geometric topological relationship of anchors in the space, which limits the representability of the anchor graph and makes these algorithms unable to deal with the complex data distributions in real scenarios. In addition, most methods involve the control of multiple hyper-parameters, which often require a significant amount of time for tuning, resulting in algorithms that are not as flexible and scalable as they could be. To address the above problems, this paper proposes scalable multi-view clustering based on tight anchor distribution (TAD-MVC), which closely associates neighboring anchors and learns the degree of closeness between all anchors. In particular, TAD-MVC adaptively evaluates the tightness between anchors and spreads this tight structure with the aim of constraining anchors with a high degree of similarity to the same cluster, while anchors in different clusters exclude each other as much as possible in order to learn anchors that are sufficiently representative. Experimental results on numerous datasets show that TAD-MVC significantly outperforms state-of-the-art methods. The code is available at https://github.com/whbdmu/TAD-MVC.

Abstract:
Retrieving target pedestrians from cross-modal images captured by infrared and visible cameras is critical in 24-hour intelligent surveillance. The primary challenge lies in narrowing the modality gap between the visible and infrared modalities. In view of this, existing research tends to extract modality-shared features to bridge the modality gap. However, the extraction process and effectiveness of the shared features are often insufficiently justified. In contrast, we observe that a certain portion of semantics remains invariant across visible and infrared modalities. These invariant semantics provide the basis for extracting modality-shared features. Based on this criterion, we propose a novel method named Low-light Invariant Representation Learning (IRL), which aims to construct an invariant space shared between visible and infrared modalities. Specifically, we introduce a Modality Invariant Extractor, which divides invariance into modality invariance and scale invariance, and extracts the invariant features from different scales and dimensions respectively. Furthermore, a Low-light Representation Enhancement module is designed, which reuses the invariant features and shallow modality features through paired enhancement units and compensation units to highlight cross-modality shared features. Extensive experiments on SYSU-MM01, RegDB, and LLCM benchmarks demonstrate the effectiveness of our method. Code is available at https://github.com/Mapzzone/IRL.

Abstract:
Visible-infrared object detection has gained significant attention because of its applications in autonomous driving, video surveillance, and related fields. The effective fusion of multimodal information is fundamental to its success. Existing approaches concentrate on improving pixel-level fusion mechanisms, and detection performance has reached a plateau. We propose a new framework for SAM-guided semantic knowledge fusion (SemFusion). The core idea is to leverage semantic priors from large models while incorporating a lightweight cross-modal fusion strategy. Specifically, our method comprises two stages. In the first stage, the Flow-Guided RGB Feature Alignment (FGRA) module establishes object-aware correspondences between modalities based on SAM-generated masks. This ensures semantic-level feature matching by deformable convolution alignment. In the second stage, the Semantic Knowledge Distillation (SKD) strategy facilitates the transfer of large-model knowledge to the detection model through SAM feature, offset, and mask level distillations. For the detector model, three blocks are designed to augment any off-the-shelf detector: deformable cross-modal alignment, spatio-channel preliminary fusion, and mask-guided feature refinement. Through alignment with SAM masks, semantic alignment and fusion can be achieved, breaking the pixel-level fusion barrier. Extensive experiments demonstrate that our method, as a plugin, exhibits superior performance on the DroneVehicle, VEDAI, and LLVIP datasets. Code is available at https://github.com/liting1018/SemFusion.

Abstract:
Federated face generation technology leverages decentralized private data to achieve high-quality face synthesis. However, regulations such as the GDPR grant users the right to be forgotten, necessitating the removal of contributions from specific clients in the global model. Existing generation model unlearning methods are primarily designed for centralized environments and are inadequate for addressing the constraints of data privacy storage and limited client computational resources in federated settings. To address this gap, we propose F2GU, the first federated unlearning framework specifically tailored for face generation models, enabling the effective removal of contributions associated with specific clients (identities) while ensuring privacy. Our proposed Generation Trajectory Redirection method dynamically guides the generation trajectory away from target identities, thereby effectively eliminating contributions from specific clients. Additionally, we devise a Mirroring-guided Trajectory Optimization strategy that constructs a mirror projection utilizing the retained client trajectory origins to ensure the generative capabilities of the model are preserved post-unlearning. We conduct extensive experiments on two mainstream face generation models (GAN and Diffusion Model) across three different datasets. The results indicate that our method demonstrates superior performance in both the success rate of identity unlearning and the preservation of generation quality. The code is available at https://github.com/FanQi-AI/FFGU.

Abstract:
Large vision-language models (LVLMs) have recently achieved significant advancements, demonstrating powerful capabilities in understanding and reasoning about visual information. However, LVLMs may generate biased responses that reflect the user beliefs rather than the facts, a phenomenon known as sycophancy. Sycophancy can pose serious challenges to the performance, trustworthiness, and security of LVLMs, raising concerns about their practical applications. We note that there is limited work on the evaluation and mitigation of sycophancy in LVLMs. In this paper, we introduce SyEval-VL, a benchmark specifically designed to evaluate sycophancy in LVLMs. SyEval-VL offers a comprehensive evaluation of sycophancy in visual understanding and reasoning across various scenarios with a multi-round dialogue format. We evaluate sycophancy in several popular LVLMs, providing an in-depth analysis of various sycophantic behaviors and their consequential impacts. Additionally, we propose a novel framework, Human Feedback-based Retrieval-Augmented Generation (HFRAG), to mitigate sycophancy in LVLMs by determining the appropriate timing of retrieval, profiling the proper retrieval target, and augmenting the decoding of LVLMs. Extensive experiments demonstrate that the proposed method significantly mitigates sycophancy in LVLMs without requiring additional training. Our code is available at: https://github.com/immc-lab/SyEval-VL

Abstract:
The development of multi-modal Unmanned Aerial Vehicles (UAVs) environment perception systems is hindered by three critical gaps in existing datasets: (1) insufficient modalities and pixel misalignment, (2) noisy labels, and (3) limited task types. To address these gaps, we propose an automatic data construction approach and construct a multi-modal UAV-based environment perception dataset, UEMM-Air. Its synthetic nature ensures scalability, reproducibility, and rare-event coverage, making it suitable for large-scale model pre-training. Benefiting from our automated data collection and annotation pipeline, UEMM-Air encompasses 120k data pairs across 6 aligned modalities and supports 4 perception tasks, significantly exceeding existing datasets (max 60k data, 3 modalities, 2 tasks). Compared to existing synthetic datasets like SynDrone, UEMM-Air provides more accurate annotations by avoiding noisy labels from direct coordinate computation. Notably, models pre-trained on UEMM-Air achieve a 5.8% accuracy improvement compared to those utilizing other synthetic datasets, while requiring less than half the data. This benchmark establishes performance evaluation of UAV multi-modal environmental perception models, and hopefully encourages more research efforts towards enabling UAVs to undertake more multi-modal tasks. The dataset and its generation engine are openly accessible under a permissive license at https://github.com/1e12Leon/UEMM-Air.

Abstract:
Over the past few decades, significant resources have been invested in developing video codecs for the storage and delivery of video data. International standards development organizations, such as MPEG (Moving Picture Experts Group) and AOM (Alliance for Open Media), have fostered large-scale competitive collaboration among industry players. While open-source software developed alongside these video codec standards has accelerated verification of the standard and enabled rapid adoption, as it is used as a starting point for implementation and future research, this practice has not been fully applied to the ecosystem of mezzanine video codecs used in high-quality video capture and post-production. To address this gap, the industry's first open source project for collaborative innovation in professional video codecs, OpenAPV, has been established. The project aims to facilitate collaborative research and development of a royalty free, open source, and open standard video codec for professional use. This paper outlines the technical aspects of the video codec, together with the open source software implementation. The codec has been successfully tested across various platforms and shows excellent R-D (Rate-Distortion) performance and coding speed, even without hardware acceleration. Furthermore, it demonstrates robust quality resilience against multiple rounds of encoding and decoding cycles, which are common in video post-production. The open source project is hosted at https://github.com/AcademySoftwareFoundation/openapv.

Abstract:
Recent latent diffusion models (LDMs) have been explored to generate diverse domain-specific images based on source domain data, showing promising performance in domain generalization tasks. However, although the generated images present counterfactual augmentation, such as background and style changes, the distortion of object details disrupts the causal factors, such as texture and shape. This leads to negative outcomes when directly applying LDMs to domain generalization in object detection. To address these problems, we propose an Object-Preserving Counterfactual Diffusion augmentation method (OPCD) that uses the diffusion model to generate diverse domain-specific images without disrupting the object details. First, we construct a region-aware image generation framework, which leverages labeled source domain data to guide the LDM in generating region-constrained images that preserve the semantic consistency of the original source images. Second, we propose object-preserving counterfactual augmentation, which retains the object region of the generated image and fuses diversified global information. This ensures that object details are not distorted and that the generated information is maintained. Third, to reduce the resource burden of generating a large number of images with the LDM, we design a random insertion strategy. It mixes generated and source domain images, turning limited diversity samples into abundant training data. Experimental results on several benchmark datasets show that OPCD outperforms existing methods in single-domain generalized object detection. Code can be found at https://github.com/qinhongda8/OPCD.

Abstract:
Table structure recognition (TSR), the task of extracting logical and physical structures from table images, is critical for document understanding. Current end-to-end image-to-text methods typically employ a top-down strategy where physical structure prediction depends on the logical decoder's output sequence. However, this process often suffers from training instability and misalignment between predicted bounding boxes and ground-truth cell positions. To address this issue, we propose G2LFormer, a novel transformer-based framework that employs a "Global-to-Local" query enhancement strategy. Specifically, G2LFormer introduces a Vision-guided Query Enhancer to integrate both textual and visual modalities, significantly improving the overall query representation capability and boosting prediction accuracy. Additionally, we design a Multi-scale Manhattan Vision-guider that leverages a spatial attenuation matrix to guide each query towards its corresponding cell location, effectively balancing local and global information for more precise bounding box generation. Extensive experiments on benchmark datasets demonstrate G2LFormer's superior performance, while ablation studies confirm the significant contribution of each proposed module in achieving state-of-the-art results. The source code and model have been released at: https://github.com/Hzbupahaozi/G2LFormer.

Abstract:
Learning discriminative representations of different speakers is a key challenge in open-set speaker recognition. To mitigate the mismatch between closed-set training and open-set testing, margin-based losses have been widely adopted to directly optimize the cosine similarity between speaker representations and proxy class vectors. While recent studies have shown that enhancing the margin for hard samples can improve representation learning, we observe three key limitations: (1) the measurement of sample hardness fails to fully capture differences in speaker representations, (2) margin-based emphasis does not significantly increase the gradient magnitude, and (3) the potential performance degradation caused by emphasizing hard samples is rarely considered. To address these issues, we propose Adaspeaker, a novel loss framework that combines an Intra-Inter sample hardness coefficient (Int2H) with a gradient-aware adaptive scaling strategy. Specifically, Int2H jointly models inter-class and intra-class hardness to estimate sample importance, which is subsequently used to adaptively scale cosine similarities for enhancing the gradient contribution of important samples. Experiments conducted on five evaluation settings show that Adaspeaker outperforms existing loss functions. Moreover, Adaspeaker can be seamlessly integrated into margin-based losses, yielding an average performance improvement of 12.6%. Code is available at https://github.com/LiuJinghan2001/Adaspeaker.
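
For readers unfamiliar with the margin-based losses this abstract builds on, the sketch below shows a generic additive-margin cosine softmax loss over proxy class vectors. It is not Adaspeaker or its Int2H coefficient, whose formulas the abstract does not give; the function names and the `margin` and `scale` defaults are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def additive_margin_loss(embedding, proxies, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosines, with a margin subtracted from the
    target-class cosine (generic additive-margin form, assumed for illustration)."""
    logits = []
    for index, proxy in enumerate(proxies):
        cos = cosine(embedding, proxy)
        if index == target:
            cos -= margin  # make the target class harder to satisfy
        logits.append(scale * cos)
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_partition - logits[target]


proxies = [[1.0, 0.0], [0.0, 1.0]]  # one proxy vector per speaker class
speaker_embedding = [1.0, 0.0]
print(additive_margin_loss(speaker_embedding, proxies, target=0, scale=10.0))
```

The margin raises the loss even for a perfectly aligned embedding, which is what pushes same-speaker representations closer to their proxy than plain softmax would.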

Abstract:
Multi-modal remote sensing image clustering aims to group similar pixels into the same cluster and separate dissimilar ones by leveraging the consistency and complementary information across multiple modalities, without relying on label guidance. Most existing deep learning-based methods address this task through a two-stage pipeline of feature learning followed by clustering, or adopt simple instance-level contrastive learning frameworks. In this paper, we propose an end-to-end deep multi-level contrastive clustering (DMLCC) model for multi-modal remote sensing images. The proposed DMLCC consists of three key components. Specifically, modality-specific vision encoders are initially employed to extract preliminary feature representations tailored to the characteristics of each modality. Spatial-spectral cross-modal fusion is then performed by integrating dedicated spatial and spectral feature extractors alongside a cross-modal fusion block to effectively capture and align complementary information across modalities. Finally, multi-level contrastive learning is applied to enhance feature discriminability through instance-level contrastive learning, while simultaneously promoting cluster separability via cluster-level contrastive learning. The network is trained in an end-to-end manner by integrating the three components to directly yield the clustering results. Extensive experiments on three datasets demonstrate that the proposed DMLCC outperforms state-of-the-art methods. Our code is publicly available at https://github.com/ZhangYongshan/DMLCC.

Abstract:
Recently, contrastive learning has emerged as a promising approach for multi-view clustering (MVC), as it enforces cross-view consistency and leverages complementary information from different views to enhance the analysis of heterogeneous data. However, traditional contrastive MVC methods suffer from an inherent limitation: their one-to-many contrast mechanism induces the False Negative Problem (FNP), where semantically similar intra-class instances are erroneously repelled. This phenomenon compromises intra-class consistency and ultimately degrades clustering performance. To overcome this issue, we propose a novel Pseudo lAbel gUided univerSum lEarning (PAUSE) framework for robust multi-view clustering. Specifically, PAUSE operates in two synergistic stages: (1) A warm-up stage that employs dual contrastive learning to generate reliable pseudo-labels, establishing robust semantic relationships; (2) A fine-tuning stage that synthesizes universum samples via Mixup between anchor instances and out-of-class centroids, guided by the acquired pseudo-labels. This unique mechanism constructs generalized negative classes that expand inter-class margins while preserving intra-class cohesion. Crucially, the widened decision boundaries prevent misclassification of displaced intra-class instances, effectively circumventing FNP without requiring explicit negative pair correction. We further devise a robust universum contrastive loss that explicitly enforces cross-view consistency through adaptive boundary constraints. Extensive experiments on five multi-view benchmarks demonstrate that our PAUSE consistently outperforms 11 state-of-the-art multi-view learning methods. Our code is accessible at: https://github.com/xixi-555/PAUSE_main_code.
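
The fine-tuning stage described above can be pictured with a small sketch, assuming that a universum sample is a convex Mixup combination of an anchor feature and the centroid of a different pseudo-class. The helper names and the fixed mixing weight `lam` are hypothetical; the abstract does not state how PAUSE chooses centroids or mixing weights.

```python
def class_centroids(features, pseudo_labels):
    """Mean feature vector per pseudo-label."""
    grouped = {}
    for feature, label in zip(features, pseudo_labels):
        grouped.setdefault(label, []).append(feature)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


def universum_sample(anchor, other_centroid, lam=0.5):
    """Mixup of an anchor with an out-of-class centroid (illustrative only)."""
    return [lam * a + (1.0 - lam) * c for a, c in zip(anchor, other_centroid)]


features = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
pseudo_labels = [0, 0, 1, 1]
centroids = class_centroids(features, pseudo_labels)

# The anchor belongs to pseudo-class 0, so it is mixed with the class-1 centroid.
mixed = universum_sample(features[0], centroids[1], lam=0.5)
print(mixed)
```

The mixed point lies between the anchor and the other class, so it belongs to neither and can act as a generic negative for both.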

Abstract:
LiDAR-Radar fusion has been widely regarded as an effective strategy for enhancing sensor-level robustness in 3D perception under adverse weather. However, it remains fundamentally insufficient to address feature-level domain shifts induced by diverse weather conditions - a critical yet often overlooked bottleneck in multimodal 3D object detection. In this work, we advocate a new perspective: all-weather 3D detection should be formulated as a lightweight capacity allocation problem, rather than simply enlarging or duplicating models for each weather domain. To this end, we propose DA3D, a Domain-Aware Dynamic Adaptation framework that leverages LoRA as a domain-adaptive capacity controller for efficient and scalable feature modulation. In addition, we introduce a domain-aware rank adaptation strategy that dynamically reallocates LoRA capacity based on domain difficulty, allowing the model to focus its representational power where it matters most. Extensive experiments on the K-Radar benchmark show that DA3D consistently improves 3D detection across both radar-only and LiDAR-Radar fusion backbones, achieving +4.9% AP3D on RTNH, +3.8% on 3D-LRF, and +8.1% on L4DR at IoU=0.5. Notably, DA3D outperforms existing multi-weather modeling methods under the same parameter budget, offering a practical and scalable solution for robust all-weather 3D perception. The code is available at https://github.com/Dawns14/DA3D.
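
As background for the LoRA-based capacity control mentioned above, here is a minimal sketch of the standard low-rank update W' = W + (alpha / r) * B A, where r is the adapter rank. DA3D's domain-aware rank reallocation is not specified in the abstract and is not reproduced here; the function names and the `alpha` parameter are illustrative assumptions.

```python
def matmul(left, right):
    """Plain nested-list matrix product."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def lora_weight(base, down, up, alpha):
    """Effective weight base + (alpha / r) * up @ down.

    down has shape (r, d_in) and up has shape (d_out, r), so the rank r
    controls how many extra parameters the adapter adds.
    """
    rank = len(down)
    delta = matmul(up, down)
    scale = alpha / rank
    return [
        [base[i][j] + scale * delta[i][j] for j in range(len(base[0]))]
        for i in range(len(base))
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy values)
up = [[1.0], [2.0]]              # d_out x r with r = 1
down = [[3.0, 4.0]]              # r x d_in
adapted = lora_weight(base, down, up, alpha=2.0)
print(adapted)
```

Because the adapter cost scales with r, giving a larger rank to harder weather domains and a smaller rank to easier ones is one way to read "capacity allocation" in this setting.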

Abstract:
Music generation has developed remarkably, and research on music generation from text has progressed. However, there are cases where it is difficult to use text as the query when users want to obtain music that is suitable for images, while image-to-music generation is still unexplored. There is a gap between images, which are visual content, and music, and it has been difficult to associate them directly. To address this problem, we realize an image-to-music generation method through musical captions which describe the characteristics of the music. Musical captions possess an enhanced capability to steer the process of music generation. Therefore, if musical captions effectively convey the intended message of the image, they serve as an excellent intermediary between the images and music. The proposed method connects these two different modalities through the medium of musical captions that describe the specialized content of the music. By generating images through musical captions using multi-modal large language models, we construct an image-musical caption pair dataset. Using query-similar images and their paired musical captions in the dataset, in-context learning on a multi-modal large language model is conducted to generate the musical caption corresponding to the target image. The generated musical caption is then input into a text-to-music generative model, and thus, the proposed method enables high-quality image-to-music generation. We conducted experiments to evaluate the quality of the generated music and the consistency with target images through both subjective and objective metrics. The results confirm the effectiveness of our proposed method. The code can be found at https://github.com/lsllsls/CAI2M.

Abstract:
Referring video object segmentation (RVOS) extracts objects from videos based on provided text narrations. Previous approaches typically work on all video frames simultaneously through offline processing, but this is not always possible. Offline processing also becomes ineffective for long videos, as directly cutting the video into short clips to fit memory limits results in the loss of temporal consistency. To make RVOS more applicable to real-world video streams or long video scenarios, we introduce a Filtering Framework for RVOS (FF-RVOS), the first model capable of operating in online, semi-online, and offline modes with just one training session. We redesign RVOS as a stochastic optimization problem and leverage filtering to optimize object states across temporal sequences. Our method enhances temporal consistency both for video streams and for long videos that are processed in shorter clip sequences due to memory limitations. FF-RVOS demonstrates superior performance compared to previous state-of-the-art methods on public benchmarks with clear improvements, especially when the video is cut into short clips for processing. Our framework can also be embedded into different offline methods to boost temporal consistency. The project page is https://github.com/haliphinx/FF-RVOS.

Abstract:
Image aesthetic assessment (IAA) is challenging due to its subjective and diverse nature, making visual information alone insufficient. Existing methods using paired image and user comments, though effective, have limited practicality. Moreover, user comments contain both high- and weak-aesthetic-related textual information, thus directly integrating all textual information with visual information cannot guarantee the effectiveness. To address these issues, we propose a novel synergistic coarse-fine vision-language alignment (CoFiVLA) framework for IAA. It includes a CoFiVLA pretraining network and a CoFiVLA prediction network, aiming to effectively and comprehensively align and synergize image and text modalities at both coarse and fine granularities during training, while eliminating the dependence on user comments during inference. In the CoFiVLA pretraining network, the coarse- and fine-grained vision-language alignment branches work together, aligning the visual features with the high-aesthetic-related textual information and all textual information, respectively. In the coarse-grained alignment branch, we innovatively propose employing the large language model LLaMA to construct an aesthetic summary dataset, which extracts high-aesthetic-related text from user comments. Furthermore, in the CoFiVLA prediction network, we first extract features corresponding to learnable prompts of different aesthetic quality categories based on the CoFiVLA pretrained model, then fuse visual features with these learnable textual features. Thus we achieve aesthetic quality semantics embedded image aesthetic representations for effective IAA without requiring user comments during inference. Our aesthetic summary dataset and source code are available at https://github.com/lifusheng-chn/CoFiVLA.

Abstract:
Document Large Vision Language Models excel in document-centric tasks and have become a key focus of research. Existing frameworks embed features from a lightweight, document-specific encoder into the first layer of a general-purpose Vision Language Model (VLM). However, this introduces a feature mismatch problem. VLMs typically consist of many stacked layers, with the feature hierarchy becoming increasingly abstract at higher layers. Specifically, the first-layer feature in a VLM is token-level, whereas the feature from the encoder is task-level, resulting in a mismatch. Consequently, it is crucial to identify an optimal layer within the VLM for embedding the encoder's features. Inspired by physics, we reformulate the search for the optimal embedding as a problem of finding the shortest time curve. Leveraging the properties of the shortest time curve, we theoretically derive a task-agnostic proxy score that requires only partial training and propose our searching framework, Brac4VLM. Our theoretical derivation shows that Brac4VLM reduces search time by 97.8% compared to brute-force methods. Experimental results further demonstrate that Brac4VLM identifies embedding points that closely align with the true optima. Moreover, the DocVLM with the optimal embedding position identified achieves state-of-the-art performance across various document-centric tasks. Codes: https://github.com/MaxKinny/Brac4VLM.

Abstract:
Multi-image understanding is crucial in real-world applications such as social media analysis and news reporting. However, existing benchmarks fall short in evaluating models' ability to integrate external knowledge and perform cross-image reasoning. To address this gap, we introduce MRBench, a comprehensive benchmark designed to assess knowledge-based reasoning across 12 diverse domains, incorporating four types of image relations: visually similar, identical entities, attribute-associated, and independent images. Additionally, we propose Multimodal Adaptive Retrieval Reasoning (MARR), a novel framework that enables the analysis of relationships among multiple input images and adaptively determines when to terminate the retrieval process. Extensive evaluations of state-of-the-art multimodal large language models (MLLMs) show a notable gap between model and human performance. The best-performing model, Gemini 2.0, reaches 56.86% accuracy, still 20.24% below humans. Proprietary models generally surpass open-source ones, particularly on visually similar and same-entity tasks, underscoring current limits in multi-image reasoning and retrieval and positioning MRBench as a key diagnostic tool. Our benchmark is available for further research and development in this field at https://github.com/Bruce-XJChen/MRBench.

Abstract:
Cross-modal retrieval refers to identifying semantically relevant data across different modalities. However, annotation errors or inherent ambiguity can cause semantic inconsistency in sample pairs, degrading retrieval performance. Prior efforts either relied heavily on the quality of explicitly dividing clean and noisy subsets, or solely leveraged carefully selected single anchor information, neglecting relationships among diverse neighbors. In this paper, we propose a novel Graph-based Label Propagation (GLP) framework that learns pseudo-labels via label propagation on a sparse graph, enabling self-correction of noisy labels. Specifically, each modality's instances are treated as nodes, connected via k-nearest neighbor (kNN) search to form a sparse graph. Pseudo-label vectors are generated for all nodes within one modality to capture the matching degree of inter-modal nodes. Through iterative label propagation, the stabilized pseudo-labels implicitly exploit both intra- and inter-modal relationships to derive a reliable matching degree. A dynamic queue further enhances graph quality by updating high-quality nodes. Experiments on Flickr30K, MSCOCO, and CC120K show that our method outperforms state-of-the-art approaches, especially under high noise. Code is available at https://github.com/njustkmg/MM25-GLP.
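
To make the graph-based idea in the abstract above concrete, the following is a minimal sketch of generic label propagation on a sparse kNN graph. It is not the authors' GLP implementation: the function names `knn_graph` and `propagate`, the damping factor `alpha`, the cosine-similarity affinity, and the toy data are all assumptions chosen for illustration, and the update rule is the standard normalized-graph propagation scheme rather than anything confirmed by the paper.

```python
import numpy as np

def knn_graph(features, k):
    """Build a symmetric kNN affinity matrix from cosine similarity."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)              # a node is never its own neighbor
    n = sim.shape[0]
    idx = np.argsort(-sim, axis=1)[:, :k]       # k most similar nodes per row
    rows = np.repeat(np.arange(n), k)
    w = np.zeros((n, n))
    w[rows, idx.ravel()] = np.clip(sim[rows, idx.ravel()], 0.0, None)
    return np.maximum(w, w.T)                   # symmetrize the sparse graph

def propagate(w, y0, alpha=0.8, iters=50):
    """Iterate Y <- alpha * S @ Y + (1 - alpha) * Y0, with S = D^-1/2 W D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(w.sum(axis=1), 1e-12))
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    y = y0.copy()
    for _ in range(iters):
        y = alpha * (s @ y) + (1.0 - alpha) * y0
    return y

# Toy data (hypothetical): two clusters of four points, one label deliberately wrong.
features = np.array([[1, 0.0], [1, 0.1], [1, 0.2], [1, 0.05],
                     [0, 1.0], [0.1, 1], [0.2, 1], [0.05, 1]], dtype=float)
noisy = np.array([0, 0, 0, 1, 1, 1, 1, 1])      # node 3 is mislabeled
y0 = np.eye(2)[noisy]                           # one-hot pseudo-label vectors
refined = propagate(knn_graph(features, k=2), y0).argmax(axis=1)
```

In this toy setting the mislabeled node is outvoted by its graph neighbors, which is the self-correction effect that label propagation is meant to provide; the paper's cross-modal matching degrees and dynamic queue are not modeled here.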

Abstract:
Formula spotting aims to simultaneously detect and recognize formulas in documents, with broad applications in intelligent document parsing, mathematical reasoning, and more. Although existing methods that first detect and then recognize have achieved prominent results, they still suffer from semantic confusion from similar character structures, semantic loss from bounding box perturbation, and visual interference from non-formula regions. To address these issues, we propose a Synergy Perception and Representation Mining Network. This network facilitates explicit interaction between the RoI features of the detection module and the semantic features of the recognition module, leveraging additional visual priors to distinguish subtle differences in similar characters. Moreover, to better perceive the boundary character structure of formulas and filter out irrelevant visual interference, a Formula Representation Mining module is proposed. This module employs progressive attention mining to achieve a complementarity between semantic information and visual context without disrupting the linguistic priors of the formulas. Additionally, to enhance the efficiency of formula decoding, we propose a parallel mask, allowing the network to output multiple LaTeX tokens simultaneously in a single prediction step. To evaluate the effectiveness of our method in formula spotting, two novel datasets: Formula-7K and Exam-1K are established. To the best of our knowledge, they are the first formula spotting datasets. Experimental results on Formula-7K and Exam-1K validate the generality and effectiveness of the proposed method. Code is available at https://github.com/hongen123/SynRMFormer.

Abstract:
Detecting hazardous activities is essential for ensuring safety. However, existing datasets often lack coverage of the nuanced and diverse hazards present in indoor environments, which hinders the development of a specialized model. To address this, we introduce the Real-World Hazardous Activities Dataset (RHAD), a novel and diverse video dataset specifically curated for recognizing hazardous activities in real-world indoor settings. Leveraging RHAD, we introduce HazardNet, a hybrid deep-learning architecture designed for hazardous activity recognition. HazardNet integrates local and global spatial-temporal representation modules to effectively capture complex patterns, enabling a robust understanding of the activity. We perform comprehensive evaluations by benchmarking against a range of state-of-the-art activity recognition models. Experimental results show that our proposed model performs significantly better, surpassing the latest model, VideoMamba, with a 9.2% accuracy gain. Moreover, by providing the dataset and an effective recognition model, our work lays the foundation for further research, paving the way for enhanced safety measures and preventive interventions. The dataset and code are available at https://github.com/ShehzadCS18/RHAD.

Abstract:
There exists an affective gap between the video content and the emotions that the video creator hopes to evoke in viewers. Existing methods for video emotional content analysis attempt to learn emotion-related features directly or enhance the discrimination of models, but lack emotional cause descriptions, limiting their interpretability and the model's reasoning capabilities. In this work, we introduce EmoCause, the first large-scale video emotional dataset with multi-attribute, multi-split emotional cause descriptions. EmoCause builds upon existing datasets and is divided into 14K video splits, with over 294K emotional cause descriptions. Inspired by psychology and video prior knowledge, each video split is linked to four primary emotional cause attributes: audio, visuals, content, and shot, further divided into 12 sub-attributes, with each cause including a fact and analysis. Then, we merge all the causes into an emotional chain-of-thought for the entire video to enhance the reasoning process. Furthermore, we develop a multimodal large language model (MLLM) for video emotional content analysis, EmoDETective. EmoDETective performs training on EmoCause using progressive learning, which includes Detecting cause fact, Exploring cause analysis, and Thinking with complete reasoning. Experimental results show that our approach surpasses the existing MLLMs baseline and outperforms state-of-the-art methods, demonstrating superior emotional analysis capabilities. Ablation experiments indicate improvements from both the proposed dataset and training strategy. Code and datasets: https://github.com/Listever/EmoDETective/

Abstract:
While text-based person search (TBPS) has achieved notable progress in recent years, existing methods heavily rely on laboriously annotated and well-aligned pedestrian image-text pairs, incurring prohibitive annotation costs. To overcome this limitation, we propose to train a TBPS model using pure images without any annotations. To tackle this challenging problem, we propose an unsupervised cross-modal person search framework via Progressive Diverse Text Generation (PSPD), leveraging large pre-trained models as assistants. Particularly, PSPD features three modules: Progressive Diverse Text Generation (PDTG), Fine-grained Saliency Region Alignment (FSRA) and Cross-Modal pseudo Label Correction (CMLC), allowing training with only unannotated images. The PDTG generates and dynamically adjusts prompts to produce accurate, diverse textual descriptions in multiple styles. The FSRA then uses large language models to generate fine-grained attributes and achieves cross-modal fine-grained semantic alignment. Additionally, the CMLC is applied to eliminate pseudo label noise through dual mutual nearest-neighbor matching, combined with distance-based judgment and a voting mechanism. Experimental results demonstrate the effectiveness of our method in unsupervised settings across various text-based person search datasets. Source code is at https://github.com/flychen321/PSPD.

Abstract:
Multimodal Sequential Recommendation (MMSR) leverages rich item features but often suffers from noisy representations derived from pre-trained models (PTMs). Existing methods neglect critical challenges: (1) domain shift between PTM training data and recommendation scenarios, (2) interest-agnostic noise within modalities (e.g., irrelevant background details), and (3) interaction uncertainty complicating modality fusion. To address these intertwined challenges, we propose DMMD4SR, a novel Diffusion Model-based Multi-level Multimodal Denoising framework for Sequential Recommendation. Inspired by the denoising power of diffusion models, DMMD4SR employs a progressive, multi-level strategy. It includes layers specifically designed to mitigate domain shift noise and context-aware interest-agnostic noise within modalities. Furthermore, an Uncertainty-Guided Modality Denoising Fusion Layer adaptively integrates the purified representations while accounting for interaction uncertainty. Extensive experiments on benchmark datasets demonstrate that DMMD4SR significantly outperforms state-of-the-art baselines, validating the effectiveness of our multi-level denoising approach. The code is available at https://github.com/luweihai/DMMD4SR.

Abstract:
Few-shot class-incremental learning (FSCIL) grapples with the dual challenge of learning new classes from minimal labeled training data while alleviating catastrophic forgetting of previously learned classes. Compared with previous methods employing static adaptation on specific parameters, current works verify that dynamic weights and sequence modeling in Selective State Space Models (SSMs) can capture distinctive feature drifts in FSCIL. However, the flattening operation in SSMs fragments the latent semantic relationship, and the resulting task isolation and representation degeneration are detrimental to FSCIL. To address this issue, this paper presents a novel framework named Probabilistic Mixture of Hyperbolic State Space Experts (PmH-SSE) for FSCIL. First, since SSMs rely on scanning as an alternative to self-attention, the Hyperbolic state space model with multi-scale hybrid scan is built to facilitate few-shot learning by providing an extra Hyperbolic geometry that encodes hierarchical relationships. Moreover, we propose the probabilistic mixture of Mamba to increase the model's flexibility in handling non-stationary data streams in FSCIL and enhance the stability of high-parameter models in few-shot conditions. Finally, under the same experimental conditions, the proposed PmH-SSE demonstrates superior performance in comprehensive experiments. The codes are available at https://github.com/yawencui/PmH-SSE.

Abstract:
Anchor graphs have gained considerable attention in the field of multi-view clustering due to their capability to handle large-scale datasets and their superior clustering performance. However, current approaches assume that all views select the same number of anchors and share an anchor graph of only a single magnitude. This limits their representability and ignores the varying data distributions across multiple views. In addition, most anchor strategies obtain the clustering labels by post-processing the anchor graph, which consumes additional time and incurs information loss. To address the above problems, this paper presents a Dual-Constraint Multi-view Fuzzy Clustering with Scalable Anchor Graph Learning (DMFC-SAGL) that explores anchor graphs of different magnitudes and directly derives the clustering labels via fuzzy clustering. In particular, DMFC-SAGL first learns multi-scale anchor graphs based on different numbers of anchors to accommodate the distinct data distributions and extract more comprehensive information. Then, by carrying out fuzzy clustering on the anchor graphs, the soft labels can be generated directly without additional post-processing. Moreover, DMFC-SAGL imposes the dual constraint of a low-rank tensor and orthogonality during label learning to ensure consistency and diversity among the multi-scale anchor graphs. Experiments against advanced baselines on seven multi-view datasets indicate the superiority of the proposed method. The code is available at https://github.com/whbdmu/DMFC-SAGL.

Abstract:
Rheumatoid arthritis (RA) is a chronic autoimmune disease characterized by joint inflammation and progressive structural damage. Joint space width (JSW) is a critical indicator in conventional radiography (CR) for evaluating disease progression, and its analysis has become a prominent research topic in computer-aided diagnostic (CAD) systems. However, deep learning-based radiological CAD systems for JSW analysis face significant challenges in data quality, including data imbalance, limited variety, and annotation difficulties. This work introduces a challenging image synthesis scenario and proposes Layer Separation Networks (LSN) to accurately separate the soft tissue layer, the upper bone layer, and the lower bone layer in conventional radiographs of finger joints. Using these layers, adjustable JSW images can be synthesized to address data quality challenges and achieve ground truth (GT) generation. Experimental results demonstrate that LSN-based synthetic images closely resemble real radiographs and significantly enhance performance in downstream tasks. The code and dataset are available at: https://github.com/pokeblow/LSN.

Abstract:
Temporal forgery in multimedia, where audio or video streams are subtly manipulated, poses critical challenges for content authenticity verification. While video-level detection has advanced, Temporal Forgery Localization (TFL) remains underexplored, often limited by weak audio-visual modeling and reliance on non-learnable post-processing. To address these challenges, we propose RegQAV, a Register-enhanced Query-based Audio-Visual framework for TFL. RegQAV exploits pretrained foundation models to capture fine-grained audio-visual correspondences and introduces learnable registers to mitigate the model's tendency to overly focus on a limited set of temporal features. A query-based localization strategy enables end-to-end optimization without post-processing. We also introduce a Modality Fusion Adapter (MFA) for effective multi-scale integration of audio-visual data, a Deepfake Queries Generation (DQG) module for efficient query initialization, and a Poisson Count-Based Approach to dynamically predict the number of forgeries. Experiments on LAV-DF and AV-Deepfake1M show that RegQAV achieves state-of-the-art performance with fewer parameters, faster inference, and stronger generalization. This work offers significant potential for real-time deepfake detection and other multimedia verification applications. The code is available at https://github.com/zxd3099/RegQAV.

Abstract:
Precipitation nowcasting plays a pivotal role in urban planning and disaster mitigation, where extending forecast horizons offers critical advantages for proactive decision-making. Most data-driven methods focus on modeling radar echo sequences through end-to-end spatiotemporal predictive learning, yielding precise short-term predictions; however, they fundamentally neglect the inherent physical mechanisms governing precipitation systems. Moreover, approaches relying solely on single-modality radar observations suffer from persistent information bottlenecks, severely limiting their temporal generalizability for extended forecasting. To address these challenges, we propose PiMMNet, a Physics-informed Multi-Modal Network. It is constructed based on the advection-diffusion principle from fluid dynamics, explicitly modeling precipitation evolution as a spatiotemporal transport process characterized by deterministic advection and a stochastic source. We carefully design a multi-model motion estimation network and a motion-guided diffusion model to describe the deterministic and stochastic terms, respectively. The core innovation of our method lies in jointly estimating a physics-constrained velocity field from multi-modal inputs (radar and satellite data). In this way, we naturally align the motion evolution among modalities into a unified representation, inherently mitigating cross-modal distribution biases. Experimental evaluations on two real-world multi-modal meteorological datasets demonstrate the efficacy of our approach, showcasing significant improvements in accuracy and robustness for longer-range precipitation nowcasting. Our code is available at https://github.com/DeminYu98/PiMMNet.

Abstract:
Neural oil painting synthesis sequentially predicts brushstroke color and position, forming an oil painting step by step, and could serve as a painting teacher for education and entertainment. Existing methods usually suffer from degraded generalization on real-world photo inputs due to the training-test distribution gap, often manifesting as stroke-induced artifacts (e.g., over-smoothed textures or inconsistent granularity). To mitigate this gap, we introduce a domain-agnostic neural painting (DANP) framework that aligns the model to the test domain. In particular, we focus on efficiently updating the affine parameters of normalization layers while keeping other parameters frozen. To stabilize adaptation, our framework introduces (1) an Asymmetric Dual-Branch with mirror augmentation for robust feature alignment via geometric transformations and (2) a Dual-Branch Interaction Loss combining intra-branch reconstruction and inter-branch consistency; we also adopt an empirical optimization strategy to mitigate gradient oscillations in practice. Experiments on real-world images from diverse domains (e.g., faces, landscapes, and artworks) validate the effectiveness of DANP in resolution-invariant adaptation, decreasing reconstruction error by ~11.3% at 512px and ~20.3% at 1024px compared to the baseline model. It is worth noting that our method is compatible with existing methods, e.g., Paint Transformer, and further improves perceptual quality by ~10.3%. Dataset and code will be publicly released at: https://domain-agnostic-neural-oil-painting.github.io/DANP.

Abstract:
Accurate and robust 3D hand pose estimation (HPE) plays a crucial role in human-computer interaction. Existing 3D HPE solutions predominantly rely on vision-based or inertial measurement unit (IMU)-based methods. Vision-based methods benefit from rich appearance information for high-accuracy HPE but are sensitive to field of view (FoV), occlusion, motion blur, and lighting. IMU-based methods are immune to optical sensitivity and FoV constraints but remain vulnerable to cumulative integration errors and drift. Given their complementary strengths, combining the two modalities offers a promising direction for HPE in complex environments. However, the lack of large-scale visual-inertial datasets has limited progress in this area. In this paper, we construct VIHand, the first large-scale glove-worn dataset for visual-inertial HPE, comprising over 1.4 million synchronized RGB-D and IMU frames from 15 subjects. It enables comprehensive research in HPE tasks, such as multimodal fusion and cross-modal knowledge transfer. Building on VIHand, we propose a visual-inertial fusion network (VIFNet) for dual-modality estimation, and its distilled student model (VIFNet-S) for IMU-only evaluation. Experimental results reveal that integrating visual and inertial modalities significantly improves the accuracy and robustness of 3D HPE, particularly under occlusion and motion blur. In IMU-only inference, even with sparse IMU configurations, models distilled from visual-inertial supervision achieve substantial performance gains, enabling robust HPE in challenging, optically sensitive scenarios. Our dataset and supplementary materials are available on the project website: https://shirley0118.github.io/VIHand.

Abstract:
Recently, numerous benchmarks have been constructed to evaluate various general capabilities (e.g., perception and reasoning) of Vision-Language Large Models (VLLMs). However, few studies have focused on the robustness of VLLMs when dealing with altered prompts and images. To fill this gap, this paper first constructs a real-world, high-quality, and challenging benchmark, namely RBench (i.e., Robust Bench). Specifically, RBench is human-annotated, with both prompts and images being modified to enrich the difficulty, and cross-validation to ensure data quality. Then, we propose a new method, called Robustness Booster (RBoost in short), to effectively enhance the robustness of existing VLLMs by automatically generating high-value instruction-tuning training data. Extensive experiments demonstrate the vulnerability of existing VLLMs when handling altered inputs, and the superiority of our RBoost method in improving model robustness. RBench is available at https://github.com/zhaominyiz/RBench.

Abstract:
We introduce the MIRAGE Challenge, a comprehensive benchmark for multimodal interleaved reasoning and generation, to ACM MM 2025. The challenge aims to evaluate models' abilities to both understand and generate content from complex, multimodal contexts consisting of interlinked images and text. The challenge is accompanied by the MIRAGE Dataset, comprising 263.7K high-quality instruction-response pairs across 35 tasks in two tracks: reasoning and generation. These pairs span 20 diverse scenarios, from surveillance to artistic creation, ensuring broad coverage. The challenge includes seven major categories: Multi-Image Reasoning, Document and Knowledge-Based Understanding, Interactive Multi-Modal Communication, Multi-Image Discrimination, Sequential Visual Generation, Material-based Image Coloring, and Visual Reference Customization. Hosting the MIRAGE Challenge at MM 2025 will drive significant progress in unified multimodal learning and inspire broad involvement in developing more versatile AI systems capable of both understanding and generating multimodal content. Challenge details and participation information are available at https://mm25mirage.github.io/mirage/.

Abstract:
Multimodal Domain Generalization (MMDG) aims to enhance the robustness of multimodal models against distribution shifts in unseen target domains. Unlike unimodal domain generalization methods, which primarily focus on mitigating domain bias within individual modalities, MMDG faces unique challenges, notably modality heterogeneity (divergent feature spaces) and stability discrepancy (varying sensitivity to domain shifts). To tackle these challenges, we propose Modality-Domain Joint Adversarial Training, a unified framework that addresses these challenges through two key innovations: (1) a tri-discriminator adversarial module that mitigates domain biases in both modality-specific and multimodal representations, while suppressing modality-heterogeneous patterns in the representation space; and (2) a stability-aware dynamic weighting mechanism that adaptively balances modality contributions based on cross-domain stability, reducing reliance on unstable modalities. Additionally, we provide the first theoretical error bound for MMDG, offering a theoretical foundation that supports the effectiveness of our approach. Our approach achieves state-of-the-art performance on the EPIC-Kitchens and HAC datasets while using 75.2% fewer parameters than previous MMDG methods. The source code is available at https://github.com/lihongzhao99/MMDG-Joint-Adversarial-Training.

Abstract:
Exposure correction aims to restore underexposed and overexposed images to normally exposed images in a single network. However, conventional methods primarily focus on correcting non-extreme exposure cases and struggle to accurately restore lightness and structure information in extreme exposure scenarios. Through a thorough investigation, we observe that the extreme exposure correction task is limited by the lack of high-quality benchmark datasets. To address the above challenges, in this paper, we construct the first Extreme Exposure Dataset, named EED, by manually collecting a large number of diverse scenes. By introducing a probabilistic blur kernel, EED not only ensures the rich diversity and brightness distribution of scenes but also approximates real-world degradation. To achieve exposure correction in extreme conditions, we propose a novel Extreme Exposure Correction Network that leverages a mask-aware Fourier transform prior, which decouples lightness and structure components precisely. To restore severely abnormal lightness and lost structure information in extreme exposure scenes, we introduce a well-exposed reference image to guide the coarse restoration and employ a Timestep-guided Frequency Diffusion Module for further refinement. Extensive experiments demonstrate the superiority of our dataset and method. The dataset will be available at https://github.com/juvenoia/EED.

Abstract:
The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize all audio-visual events within untrimmed videos. Existing methods typically operate under a closed-set assumption, which limits their ability to generalize to test videos containing previously unseen event categories, an essential capability for open-world scenarios. To this end, we propose a novel task setting, Open-Vocabulary Dense Audio-Visual Event Localization (OV-DAVEL), along with a one-stage method, Open-DAVTR, designed to detect events that were not observed during training. Open-DAVTR consists of two core components: a class-agnostic foreground-aware generator and a multi-modal semantic-aware classifier. Specifically, the generator is a Detection Transformer-based module that produces event proposals while adaptively attending to discriminative foreground snippets for downstream classification. The classifier leverages rich temporal representations and context-aware textual semantics to effectively recognize events, regardless of whether they were seen during training. In addition, we establish comprehensive OV-DAVEL benchmarks across various settings. Experimental results show that our model significantly outperforms existing baselines in detecting both seen and unseen events, highlighting its effectiveness in open-vocabulary event localization. Code and data are available at: https://github.com/yujialele/OV-DEAVEL.

Abstract:
Auditory attention detection (AAD) aims to identify the attended speaker in multi-talker environments by analyzing brain activity recorded through neural monitoring techniques. Recent AAD approaches have achieved great progress in improving detection accuracy. However, they still face challenges in capturing complex spatio-temporal dependencies and high-order nonlinear relationships across brain regions. To address these challenges, this paper proposes DHGCN, a dual hypergraph convolutional network that integrates a hypergraph modeling module, a dual-branch hypergraph learning (DHGL) module, and a feature fusion module. Specifically, the hypergraph modeling module constructs spatial and temporal hypergraphs from EEG signals, enabling the representation of high-order relationships among channels and time points. The DHGL module comprises two parallel branches: a spatial branch that learns high-order spatial dependencies across EEG channels, and a temporal branch that captures complex temporal dependencies. Each branch uses its corresponding hypergraph structure, which is established during the modeling phase. The feature fusion module then aggregates spatial and temporal representations from both branches to support robust auditory attention classification. Extensive experiments on multiple benchmark datasets demonstrate that DHGCN consistently outperforms state-of-the-art AAD models. It achieves superior classification performance while reducing the trainable parameter count by over 50% compared to the state-of-the-art models. Code is available at: https://github.com/nobody1219/DHGCN.git.

Abstract:
Traditional multi-view clustering methods rely on the cross-view sample alignment presumption to explore consistent and complementary information from multiple views. However, in real-world scenarios, sensor heterogeneity and decentralized data storage and processing frequently violate this presumption, leading to the Unaligned Multi-view Clustering (UMC) problem. Although existing works have promoted the development of UMC, they have at least one of the following limitations: high computational complexity, inadequate use of high-order correlation, and two-stage clustering. To address these limitations, we propose a Joint High-order Correlation Learning (JHCL) framework for scalable one-step unaligned multi-view clustering. Specifically, multi-order bipartite graphs are utilized to make full use of intra-view high-order correlations. Then, based on a tensorial bipartite graph alignment and fusion model, inter-view high-order correlations are exploited simultaneously. In this manner, the learned consistent bipartite graph retains adequate structural information for accurate and fast clustering in one step. Extensive experiments on real-world datasets validate the superiority of JHCL in both clustering performance and computational efficiency. Code available: https://github.com/revolution6575/JHCL.git.

Abstract:
Hyperspectral image super-resolution (HSI-SR) has attracted significant attention in high-resolution HSI reconstruction. Most existing fusion-based HSI-SR methods assume that multi-source images are perfectly registered, which is impractical due to varying imaging conditions. Furthermore, methods that consider the registration issue typically treat registration and fusion as two separate steps, resulting in the accumulation of registration errors during the fusion process. To address these issues, we propose a Cycle-Consistent Mamba-Based Registration-Fusion Joint Network (CCM-RFJN), which optimizes the Registration-Fusion Unified Module (RFUM) step by step through multiple cyclic iterative SR processes. Specifically, in each SR iteration, we map the super-resolved HR-HSI obtained through the RFUM back to the unregistered LR-HSI for the next SR, with cycle-consistency constraints imposed on both LR-HSI and HR-HSI to adaptively optimize the RFUM based on the reciprocal training strategy. In RFUM, we integrate the proposed Interactive Mamba Registration Module (IMR) and Dual-attention Mamba Fusion Module (DAMF), thereby achieving registration-fusion joint optimization. IMR incorporates the interactive Mamba encoder into a pyramid architecture to facilitate multi-level information interactions, generating the deformation field to correct non-rigid misalignments. DAMF utilizes the dual-attention Mamba mechanism to highlight and aggregate key features, thereby enhancing fusion performance. Experiments on three public datasets demonstrate that CCM-RFJN achieves state-of-the-art performance. The code is available at https://github.com/Jiahuiqu/CCM-RFJN.

Abstract:
Class Incremental Learning (CIL) aims to continually learn new classes from a stream of data without forgetting previously learned ones. Recent approaches have leveraged pre-trained models (PTMs) to improve performance, especially vision-language models, which offer better generalization than models trained solely on visual data. Many of these methods rely on simple language templates to generate class representations, which then serve as classifiers. However, due to differences between the pre-training data and downstream tasks, these textual features can become too similar for certain classes, leading to prediction errors. To address this issue, we propose a method that optimizes the geometric structure of both visual and textual features across different classes. Inspired by neural collapse theory, we introduce a multi-modal alignment strategy: for each class, a reference vector is chosen from a simplex Equiangular Tight Frame, and both the visual and textual features of the class are aligned with this vector. To better capture intra-class variations, we also construct multiple visual prototypes for each class. A multi-prototype supervised contrastive loss is then employed to pull an image feature toward the closest matching prototype of its true class and push it away from prototypes of other classes. We evaluate our approach on five widely used CIL benchmarks. The results show that our method achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of class incremental learning. Our code is available at https://github.com/qcNPU/NCSCMP.
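
The reference vectors mentioned in the abstract above come from a simplex Equiangular Tight Frame (ETF), whose standard construction is short enough to sketch. The code below is an illustration under stated assumptions, not the authors' code: `simplex_etf` and `alignment_loss` are hypothetical names, the random orthonormal basis is one arbitrary choice among many, and the cosine-based loss is only a plausible stand-in for whatever alignment objective the paper uses.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF.

    Standard construction: M = sqrt(K / (K - 1)) * U @ (I - 11^T / K),
    where U has orthonormal columns. Requires dim >= num_classes.
    """
    if dim < num_classes:
        raise ValueError("dim must be at least num_classes")
    k = num_classes
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((dim, k)))   # orthonormal columns
    center = np.eye(k) - np.ones((k, k)) / k             # removes the mean direction
    return np.sqrt(k / (k - 1)) * (u @ center)

def alignment_loss(feature, reference):
    """Hypothetical alignment objective: 1 - cosine similarity to the class vector."""
    cos = feature @ reference / (np.linalg.norm(feature) * np.linalg.norm(reference))
    return 1.0 - cos

etf = simplex_etf(num_classes=5, dim=16)
gram = etf.T @ etf   # unit diagonal, off-diagonal entries equal to -1 / (K - 1)
```

Every pair of columns has the same cosine similarity, -1/(K-1), which is the maximal equal separation possible for K unit vectors. That property is what makes ETF vertices attractive as fixed class targets when textual features of different classes would otherwise crowd together.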

Abstract:
Multimodal link prediction on multimodal knowledge graphs is an inference task aimed at finding missing triples, which seeks to improve prediction accuracy by leveraging a wide range of information. However, current multimodal knowledge graph link prediction methods are primarily designed in the spatial domain, necessitating ever-growing complexity in fusion strategies. In addition, most of them focus on only three modalities (text, image, and structure). Forward-looking approaches, however, should accommodate a broader array of modalities. Motivated by the operational simplification enabled by transforming features into the frequency (or time-frequency) domain, we propose a wavelet-transform-based multimodal link prediction method, WFF, which offers high modal extensibility and low fusion complexity. Specifically, for unimodal information, we designed a Unimodal Time-Frequency Knowledge Enhancement module, UTFKE, which extracts time-frequency features via discrete wavelet transform and enhances information quality through adaptive filtering. To address the challenge of multimodal fusion, we devised a Multimodal Time-Frequency Knowledge Fusion module MTFKF that supports high modal extensibility and enables effective, efficient integration. Extensive experiments on multiple well-known datasets demonstrate that WFF outperforms strong baselines and achieves state-of-the-art performance. In addition, WFF extends modality to audio and video, further validating the model's effectiveness. Our code is available at https://github.com/xxd12315/WFF.
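
As a rough illustration of the discrete wavelet transform step, the sketch below applies a single-level Haar transform to a feature vector. The abstract does not state which wavelet family WFF uses, so the Haar choice and the function names `haar_dwt` and `haar_idwt` are assumptions; the adaptive filtering and fusion modules are not modeled.

```python
import numpy as np


def haar_dwt(x):
    """Single-level orthonormal Haar DWT along the last axis."""
    x = np.asarray(x, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last axis must have even length")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency (coarse) band
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency (detail) band
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, interleaving the reconstructed even/odd samples."""
    approx = np.asarray(approx, dtype=float)
    detail = np.asarray(detail, dtype=float)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2] = (approx + detail) / np.sqrt(2.0)
    out[..., 1::2] = (approx - detail) / np.sqrt(2.0)
    return out


feature = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_dwt(feature)
restored = haar_idwt(approx, detail)
```

Because the transform is orthonormal, it is exactly invertible and preserves the energy of the input, so filtering can be applied to the two bands separately before reconstruction.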

Abstract:
Currently, fusion-based hyperspectral image super-resolution (fusion-based HSI-SR) has become an efficient technology to improve the spatial resolution of hyperspectral images. However, in real scenarios, it may not be possible to obtain high-resolution multispectral images (HR-MSI) of the same time and region as the low-resolution hyperspectral images (LR-HSI) due to the limitations of imaging conditions and environmental changes. In view of this spatial-temporal constraint, it becomes a feasible solution to regard HR-MSI, which has similar spatial structure and semantics to LR-HSI, as a reference to assist in reconstruction. Therefore, this paper proposes a Cross-Correlation & Self-Similarity Guided Texture Transfer Network (C2S2TNet), which utilizes the texture details of HR-MSI and the self-similarity information of LR-HSI to achieve reference-based hyperspectral image super-resolution. Specifically, we design a Cross-Correlation & Self-Similarity Guided Cluster-Aware Matching (C2S2CAM) strategy, which realizes multi-correspondence texture matching and feature aggregation in non-local regions based on dynamic clustering and cluster-aware graph structure, effectively alleviating the misuse and underuse of information. In addition, we propose a Spectral-Spatial State-Space Fusion Module (S2-SSFM) based on the state-space model to perform feature fusion and enhancement in both spatial and spectral domains to ensure that the target HR-HSI maintains the spatial-spectral structural consistency with the LR-HSI. Experimental verification shows that C2S2TNet can achieve excellent performance in cross-temporal and cross-regional scenarios, confirming the effectiveness of this method. Code can be accessed at https://github.com/Jiahuiqu/C2S2TNet.

Abstract:
Clustering, a fundamental task in machine learning and data mining, is essential for uncovering patterns by grouping data points with similar characteristics. Traditional methods struggle with nonlinear data structures, but kernel-based approaches alleviate this issue by mapping data to high-dimensional spaces. Multiple Kernel Clustering (MKC) further improves clustering by automating kernel selection and integration. However, MKC faces challenges related to kernel graph quality, information loss during relax-and-discretization, neglect of balanced clustering constraints, and the trade-off between high clustering quality and balance. To address these challenges, we introduce Balanced Multiple Kernel Clustering (BMKC). BMKC utilizes local kernel reconstruction and advanced high-order diffusion techniques for comprehensive kernel graph learning. It directly learns a discrete partition matrix using a robust L1-induced local reconstruction criterion, eliminating the two-step process. BMKC incorporates an automatic mechanism for trade-off control between clustering and balance, supported by a versatile optimization algorithm accommodating various balance regularization choices. Experimental validation demonstrates the superior performance of BMKC on benchmark datasets, showcasing its effectiveness. The code for our method is publicly available at https://github.com/ChenYan01TYUT/BMKC-ACM-MM-2025.

Abstract:
Drug-drug interaction (DDI) prediction is a pivotal task in biomedical research. Emerging multimodal approaches that integrate graph neural networks (GNNs) and large language models (LLMs) have gained traction, as GNNs capture molecular structures while LLMs provide a rich biomedical context. However, real-world DDI data often exhibit distribution shifts across structural and textual dimensions, stemming from variations in molecular scaffolds, drug sizes, and assay conditions. Existing methods assume an independent and identically distributed (I.I.D.) setting, failing to handle such shifts primarily due to three key limitations: (i) the entanglement of core interaction motifs with incidental structural features; (ii) inflexible message-passing GNN architectures ill-suited for diverse drug pairs; and (iii) underutilized biomedical knowledge in LLMs for capturing pairwise interaction semantics. These limitations highlight the need for a disentangled, dynamic, and pairwise-aware modeling strategy to achieve out-of-distribution generalized DDI prediction. To solve this problem, we propose DyNamic Pairwise Architecture Search for Generalizable Drug-Drug Interaction LLM (DyNAS-DDI), a novel framework that dynamically adapts network architectures for each molecular pair and integrates biomedical knowledge from LLMs to improve generalization under distribution shifts. Specifically, we propose three modules: (i) Motif-driven disentangled molecule encoding, which disentangles molecular representations into distinct motif-based features while preserving key structural signals through a self-supervised graph encoder; (ii) Attention-based pairwise neural architecture search, where multi-head attention enriches molecular features to guide a dynamic search mechanism that adaptively optimizes message passing for diverse interaction types; and (iii) retrieval-augmented molecular instruction tuning, where external biomedical knowledge is incorporated to improve interpretability and enable reasoning for unseen drug interactions. Extensive experiments on four datasets for DDI with out-of-distribution (OOD) splits demonstrate our method's superior generalization abilities under distribution shifts. Our code is available at https://github.com/EkkoXiao/DyNAS-DDI.

Abstract:
In recent years, numerous studies have proposed uncertainty-guided multimodal learning to adapt to dynamic relationships between different modalities. Dynamic multimodal fusion enables more dominant modalities to receive greater weight during the fusion process, thereby preventing the influence of spurious features from less reliable modalities on decision-making. However, there is no free lunch. We observe that the introduction of dynamic fusion during training exacerbates the model's tendency toward Greedy (a phenomenon known to induce decision shortcuts in multimodal learning). This results in a model that does not fully take advantage of the lower-quality modalities. In this paper, we provide a theoretical analysis showing that dynamic fusion intensifies Greedy, and we present experimental results that support this observation. In summary, this paper explains the Greedy risk in dynamic multimodal learning from both theoretical and experimental perspectives, serving as a cautionary reminder for researchers when employing dynamic multimodal learning. Our code is available at https://github.com/d-xr/GreedyDynFusion.

Abstract:
Nuclei instance segmentation in histopathological images is pivotal for cancer diagnosis, yet heavily reliant on costly pixel-level annotations. While point-supervised methods reduce annotation burdens, existing approaches struggle to reconstruct accurate nuclear boundaries and fail to leverage spatial distribution cues, particularly in dense or irregularly structured tissues. This paper presents DeNSe, a novel density-guided weakly-supervised framework that integrates nuclei counting with instance segmentation to address these limitations. Unlike conventional methods, DeNSe introduces a multi-task learning architecture, combining density regression and segmentation through two key innovations: (1) a distribution-aware alignment module based on optimal transport theory, which harmonizes feature representations between density maps and segmentation masks to enhance instance differentiation; and (2) a morphology-aware refinement strategy that dynamically adjusts pseudo-labels using density-guided confidence scores, mitigating errors from coarse annotations. By formulating counting as a density regression task, DeNSe captures global nuclei distribution, enabling robust segmentation in densely packed regions. Extensive experiments on three benchmarks demonstrate state-of-the-art performance, achieving significant improvements across various metrics over existing methods. Notably, DeNSe exhibits strong robustness to annotation offsets and generalizes across diverse tissue types, offering a cost-effective and scalable tool for clinical applications. The code is available at https://github.com/lingboboo/DENSE.

Abstract:
Recently, the video editing task has gained widespread attention due to its practical applications and rapid advancements. However, current automatic evaluation metrics for video editing are mostly poorly aligned with human judgments. Thus, researchers heavily rely on human evaluation, which is not only labor-intensive but also difficult to keep consistent and objective. To address these issues, we propose EditEval, the largest-ever video editing benchmark to comprehensively evaluate the performance of video editing models in three aspects: Textual Faithfulness, Frame Consistency, and Video Fidelity. It includes 200 video clips and 1,010 text prompts, from which 160 instances are sampled to generate 1,280 edited videos using eight open-source video editing models, accompanied by human annotations. Furthermore, we propose EditScore, leveraging the advanced reasoning and comprehension capabilities of Multi-modal Large Language Models (MLLMs) as evaluators to assess edited videos across the aforementioned aspects. Experiments show that the best-performing video editing model only reaches an average score of 3.16 (out of a perfect 5), highlighting the challenge of EditEval. Besides, results from more than 10 MLLMs demonstrate the great potential of utilizing EditScore for automatic evaluation. Notably, for textual faithfulness, EditScore equipped with LLaVA-OneVision-7B achieves a significantly higher Pearson Correlation score compared to previous methods based on CLIP (0.50 vs 0.22). The code and dataset are available at: https://github.com/XMUDeepLIT/EditEval

Abstract:
While pre-trained models in the general domain have proliferated, existing methods for transferring these models to specific domains often depend on source domain data for distribution alignment and are typically tailored for single tasks. We propose a source data-free approach, Fourier Self-Adaptation (FSA), which effectively adapts general models to a wide range of specific domains. Our method leverages the distinct properties of Fourier phase and amplitude: phase contains high-level structural and positional information, which is less affected by domain shifts, while amplitude contains details and brightness information, which is more affected by domain shifts. FSA adjusts the image distribution by initializing a trainable adaptive image from a normal distribution. It then interpolates the amplitude of the target domain image with that of the adaptive image, where the interpolation ratio is dynamically controlled by a learnable weight and bias. During training, the model captures advanced phase information of the target image and refines the data distribution through amplitude interpolation. Additionally, a dual regularization loss constrains the model representation, encouraging it to focus on the intrinsic relationships of the target domain data while discarding irrelevant knowledge. We evaluate FSA using general pre-trained models on 11 unimodal image classification datasets and 6 multimodal visual question answering datasets, covering specific domains such as radiology, pathology, remote sensing, and art. Our method consistently achieves state-of-the-art performance across multiple datasets, with performance improvements ranging from 1% to 8% compared to basic pre-trained models. Source code is available at https://github.com/Alivelei/FSA.
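
The amplitude interpolation described above can be sketched with NumPy's FFT on a single-channel image. This is a minimal illustration, not the FSA implementation: the function name `mix_amplitude` is hypothetical, and the interpolation ratio is a fixed scalar here, whereas the abstract describes it as controlled by a learnable weight and bias.

```python
import numpy as np


def mix_amplitude(target, adaptive, ratio):
    """Blend Fourier amplitudes of two 2-D images, keeping the target's phase."""
    f_target = np.fft.fft2(target)
    f_adaptive = np.fft.fft2(adaptive)
    # Linear interpolation between the two amplitude spectra.
    amplitude = (1.0 - ratio) * np.abs(f_target) + ratio * np.abs(f_adaptive)
    # Recombine the blended amplitude with the untouched target phase.
    mixed = amplitude * np.exp(1j * np.angle(f_target))
    # The inputs are real, so the imaginary part is only rounding noise.
    return np.real(np.fft.ifft2(mixed))


rng = np.random.default_rng(0)
target = rng.random((8, 8))              # stand-in for a target-domain image
adaptive = rng.standard_normal((8, 8))   # adaptive image drawn from a normal distribution
unchanged = mix_amplitude(target, adaptive, ratio=0.0)
swapped = mix_amplitude(target, adaptive, ratio=1.0)
```

With `ratio=0.0` the target image is returned unchanged, and with `ratio=1.0` the output carries the adaptive image's amplitude spectrum on the target's phase, which is the sense in which structure is preserved while appearance statistics shift.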

Abstract:
The rapid evolution of vision-language models (VLMs), such as CLIP, has demonstrated remarkable zero-shot capabilities in downstream tasks. Prompt learning paradigms like Context Optimization (CoOp) refine learnable prompts for efficient adaptation. However, their application in the biomedical domain remains limited due to the insufficient utilization of specialized biomedical knowledge and cross-modality structural relationships. To address these limitations, we introduce MeDKCoOp, a Medical Dual Knowledge-guided graph adaptation method that systematically integrates knowledge in three ways: exploiting domain-specific knowledge from both textual and visual branches, formalizing it into graph-structured representations, and leveraging knowledge-guided relation transfer for learning cross-modality fusion. By dynamically optimizing learnable prompts through a relation learning process, our method achieves disentangled visual representation and enhances transferability to downstream tasks. Evaluations across 8 biomedical datasets spanning 7 imaging modalities demonstrate state-of-the-art cross-domain generalization, with an average 15.12% accuracy improvement over baselines. Our work establishes a new paradigm through graph prompt learning in medical vision-language models, advancing robust diagnostic AI in data-scarce clinical scenarios. Our code is available at: https://github.com/WangYijun-OUC/MeDKCoOp.

Abstract:
In audio-visual joint analysis, generalized zero-shot learning (GZSL) aims to recognize unseen categories by aligning semantic information across modalities. However, significant challenges arise from temporal and semantic discrepancies between audio and video modalities. Traditional methods typically depend on static semantic embeddings, thereby overlooking the dynamic nature of these modalities and often resulting in modality imbalance. We propose the Dual-Factor Compensatory Clustering Network (DFCNet), an end-to-end framework for dynamic fusion and optimized alignment of heterogeneous modal information to address these limitations. DFCNet employs a multi-branch architecture, integrating a parallel multi-layer perceptron (MLP) for semantic modeling and a Bidirectional Long Short-Term Memory (BiLSTM) network for capturing temporal consistency. The Compensatory Fusion Block (CFB) employs tensor decomposition to facilitate cross-modal coupling, where the low-rank representation decomposition aligns intra-modal feature distributions. Additionally, we introduce the Dual-Factor Clustering Multi-objective Optimization Framework (DCMOF), which ensures gradient equilibrium and adaptively adjusts modality contribution weights to strengthen robust cross-modal alignment. Designed as a pluggable module, DFCNet can be seamlessly integrated into existing base models. Experimental results demonstrate that our framework significantly improves the performance of the Audio-Visual Cross-Modal Alignment (AVCA) model across multiple GZSL datasets. Ablation studies further validate the critical role of CFB in cross-modal alignment and highlight the significance of DCMOF in optimizing modality coordination. The code is available at https://github.com/ATKEROM/DFCNet.

Abstract:
While non-autoregressive sign language translation (NASLT) has the advantage in inference speed, the translation quality of NASLT models lags significantly behind that of the state-of-the-art (SOTA) autoregressive sign language translation (ASLT) models. To bridge the quality gap, we exploit glosses to unlock the potential of NASLT models. Concretely, we propose Gloss-enhanced Levenshtein Transformer (GLevT) for sign language translation (SLT), which takes glosses as initial sequences for editing into texts. In particular, to alleviate the inconsistency between training and inference of GLevT, which is introduced by glosses, we propose a dual-centric learning policy and a keyframe-based gloss replacement method for training, further improving the translation quality of GLevT. Experiments on CSL-Daily demonstrate that GLevT outperforms other NASLT models by approximately 4 points in BLEU and ROUGE scores, while achieving performance comparable to the SOTA ASLT models with a 3.46~5.26× inference speed-up. Furthermore, we extend GLevT to gloss-free SLT, achieving performance comparable to SOTA large models, despite having only 49M parameters. We release code at https://github.com/XMUDeepLIT/GLevT.

Abstract:
Medical image segmentation plays an important role in clinical decision making and auxiliary diagnosis. However, it still faces three major challenges: (1) segmentation accuracy drops sharply because lesion types vary and lesion areas differ greatly in size; (2) models built purely for segmentation performance have too many parameters to be deployed in real medical environments; and (3) training relies too heavily on manually labeled images. To meet these challenges, we propose a lightweight segmentation network that extracts local and global information and fuses multi-level and multi-source features to maximize segmentation accuracy for lesions of different shapes, especially those with fuzzy boundaries or small segmentation targets. An intermediate-mask self-monitoring method generates additional labeled images to assist training. Finally, efficient downsampling and upsampling operations keep the parameter count at only 1.37M while still extracting information effectively. On the BUSI and ISIC2018 datasets, mIoU and DSC scores reach 75.57%, 83.57% and 83.85%, 90.38%, respectively, indicating that our method achieves the best balance between parameter count and performance. The code is available at https://github.com/Jay217219/EMIFS.
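
For readers unfamiliar with the two reported metrics, the sketch below computes the Dice similarity coefficient (DSC) and intersection over union (IoU) for one pair of binary masks. The function name `dice_and_iou` and the convention of scoring two empty masks as 1.0 are assumptions of this sketch; the paper's evaluation protocol (for example, how scores are averaged into mIoU) is not given in the abstract.

```python
import numpy as np


def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2.0 * intersection / total if total > 0 else 1.0
    iou = intersection / union if union > 0 else 1.0
    return float(dice), float(iou)


pred_mask = np.array([[1, 1], [0, 0]])
true_mask = np.array([[1, 0], [1, 0]])
dice, iou = dice_and_iou(pred_mask, true_mask)  # dice = 0.5, iou = 1/3
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.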

Abstract:
Medical image segmentation is crucial for clinical decision-making, treatment planning, and disease tracking. Nonetheless, it confronts two significant challenges: the presence of "soft boundaries" between the foreground and background exacerbated by poor illumination and low contrast, and the misleading co-occurrence of salient and non-salient objects during the training phase, which complicates the model's accuracy in distinguishing relevant features. To overcome these challenges, we introduce RoDeCon-Net, a novel framework engineered to enhance medical image segmentation. RoDeCon-Net incorporates a Feature Decoupling Unit (FDU) that dynamically separates encoded features into foreground, background, and uncertain regions, using advanced attention mechanisms to refine feature distinction and reduce uncertainty. Additionally, our Contrast-driven Feature Alignment Unit (CFAU) and Cross-layer Feature Cascade Unit (CFCU) synergize to reinforce feature contrasts and promote effective multi-level feature fusion, thus improving the detection of salient objects amidst complex backgrounds and handling various object scales within images. Comprehensive evaluations of RoDeCon-Net on five diverse medical image datasets validate its superior performance and versatility, showcasing its potential to set new benchmarks in medical image segmentation. Our code is available on https://github.com/ILoveACM-MM/RoDeCon-Net.

Abstract:
Neuropsychology-inspired models have been utilized in recent advances in EEG emotion recognition, such as convolutional networks for spatial features and Transformers for temporal dependencies. While these methods benefit from domain knowledge like frequency-band features and spatial correlations, most overlook the fundamental fact that EEG signals are complex mixtures of neural source activities recorded at the scalp. EEG signals present challenges for emotion recognition, particularly in cross-subject scenarios, due to significant inter-subject variance. Inspired by neurophysiological principles, we propose a novel framework, named Sera, for EEG-based emotion recognition that explicitly separates source activities and aligns representations across subjects. Sera introduces two key components: (1) a variational autoencoder (VAE) with multiple multi-stage decoders (M2VAE) designed to disentangle EEG signals into independent sources, mimicking the neural generation process, and (2) a coarse-to-fine representation alignment block (CFRA) to mitigate subject-to-subject variability. The coarse alignment employs adversarial training with a domain discriminator, while the fine-grained alignment matches covariance matrices to capture temporal correlations within EEG segments. Extensive experiments demonstrate that Sera outperforms the state-of-the-art methods with improvements ranging from 1% to 5%, averaging 3.14% and 3.05% on the DEAP and DREAMER datasets, respectively, confirming its effectiveness and neurophysiological grounding. The code is available at: https://github.com/JZH98/Sera-code.
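
One common way to match covariance matrices between two sets of representations is a CORAL-style loss, sketched below. The abstract does not give Sera's exact fine-grained alignment objective, so this formulation, the function name `covariance_alignment_loss`, and the choice to compute covariance across feature dimensions are all assumptions made for illustration.

```python
import numpy as np


def covariance_alignment_loss(source, target):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    dim = source.shape[1]
    # Rows are samples and columns are features, hence rowvar=False.
    cov_source = np.cov(source, rowvar=False)
    cov_target = np.cov(target, rowvar=False)
    return float(np.sum((cov_source - cov_target) ** 2) / (4.0 * dim * dim))


rng = np.random.default_rng(0)
subject_a = rng.standard_normal((64, 6))   # hypothetical features from one subject
loss_same = covariance_alignment_loss(subject_a, subject_a)
loss_shifted = covariance_alignment_loss(subject_a, subject_a + 5.0)
loss_scaled = covariance_alignment_loss(subject_a, 2.0 * subject_a)
```

The loss ignores differences in the mean (a constant offset leaves it at zero) and only penalizes differences in second-order statistics, such as the rescaled features in the last call.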

Abstract:
Real-time emotion recognition provides promising applications for mental healthcare monitoring and human-computer interaction design. Electroencephalography (EEG) emotion recognition has become a hot topic in the field of affective computing and intelligent brain-computer interface (BCI), and it is a feasible solution for achieving real-time emotion recognition. However, due to the uncertainty and individual specificity of emotional cognition, there are still some challenges in achieving efficient online emotion decoding applications. To address this, in this work, we propose an online emotion decoding method named DMSGL (Real-Time EEG Emotion Recognition from Dynamic Mixed Spatiotemporal Graph Learning). Specifically, in the DMSGL, we propose to explore the latent emotion-related graph features from EEG with cognition-inspired and data-driven learning strategies, and the temporal analysis with attention learning is utilized to further extract the robust spatiotemporal graph patterns for efficient EEG emotion decoding. Both simulated online emotion decoding and real-time emotion monitoring experimental results have consistently indicated that the proposed DMSGL can effectively satisfy the application requirements of real-time emotion decoding and achieves an accuracy of 68.35% in real-world online scenarios. Compared with other baseline methods, the proposed DMSGL has improved by 2-5% in the scenario of real-time emotion recognition. In conclusion, the proposed DMSGL provides a promising solution for realizing real-time emotion recognition and further exploring related applications. Our code is released on https://github.com/UESTC-BAC/DMSGL.

Abstract:
Multimodal recommender systems enhance recommendation performance by integrating information from different modalities (e.g., text and images). A common approach is to link items with high modality similarity in modality graphs, helping users explore their interests more broadly. However, existing methods often introduce noise when enhancing modality graphs, making it challenging to effectively balance performance and accuracy. To address this issue, we propose an Interest Tree Augmented Modality Graph RecommendER for Multimodal Recommendation (TAMER). In this framework, we first redistribute item modality features using various component analysis methods to ensure more reliable item similarity within modality graphs. Next, we construct interest graphs based on reliable semantic relationships and prune the interest graphs into multiple interest trees. These interest trees are then applied to the multimodal item-item homogeneous graph to extend potential links within the modality homogeneous graph. The interest tree-based enhancement method effectively captures high-order relationships in the modality graph while avoiding noisy links. The effectiveness of the proposed method is demonstrated through comprehensive experiments on three real-world datasets. Compared with the strongest baseline methods, our method achieves an average improvement of 9.98% across four evaluation metrics. The source code is available at https://github.com/Z-last-ONE/TAMER.

Abstract:
Transformer architecture has driven significant advancements in deep hashing, establishing itself as a dominant framework for large-scale retrieval and storage applications. However, existing Transformer-based deep hashing methods typically employ unvarying feature transformations across all images, limiting their adaptability to diverse visual patterns. This rigidity restricts the model's capacity to learn both highly distinctive and generalizable discrete representations, posing challenges for retrieval in open-world scenarios. To overcome this challenge, we propose a novel Factorized Transformer Hashing (FTH) framework, which introduces a factorized transformer to enhance the generalization and discriminative power of hash codes. Specifically, we decompose the Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) blocks into multiple sub-blocks, forming a transformer factorization scheme that captures diverse feature characteristics through independent sub-blocks. Furthermore, we develop an adaptive selection strategy, leveraging a set of learnable selectors with the Softmax function, to dynamically route each image to the most appropriate sub-block for processing. Extensive experiments on three benchmark datasets demonstrate that the proposed FTH framework significantly outperforms state-of-the-art baselines in both image hashing and zero-shot hashing tasks. Source code is available at https://github.com/QinLab-WFU/FTH.
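
The adaptive selection idea can be illustrated with a small Softmax router over sub-blocks. This is a sketch under stated assumptions rather than the FTH architecture: the sub-blocks are plain linear maps, routing is a hard argmax over the Softmax scores, and the names `softmax`, `route`, `selector_w`, and `sub_block_ws` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def route(features, selector_w, sub_block_ws):
    """Send each sample through the sub-block with the highest selector score."""
    probs = softmax(features @ selector_w)   # (batch, num_sub_blocks)
    choice = probs.argmax(axis=-1)           # hard routing decision per sample
    out = np.stack([features[i] @ sub_block_ws[c] for i, c in enumerate(choice)])
    return out, choice, probs


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))        # 4 samples, 8-dim features
selector_w = rng.standard_normal((8, 3))      # selector over 3 sub-blocks
sub_block_ws = rng.standard_normal((3, 8, 8)) # one linear sub-block per slot
out, choice, probs = route(features, selector_w, sub_block_ws)
```

In a trainable model the hard argmax would need a differentiable surrogate; the sketch only shows how per-sample selector scores translate into a per-sample choice of sub-block.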

Abstract:
Aesthetic Image Cropping (AIC) aims to improve the visual appeal of images by removing redundant content while preserving attractive elements. Despite the encouraging progress achieved by data-driven approaches, most existing models struggle to understand user intentions, particularly for diversified scenes with multiple subjects. Moreover, they can only provide cropping results without explanations, which further restricts their usability in real-world applications. Motivated by the above facts, we introduce InstructCrop, a multimodal large language model (MLLM)-based AIC framework, which can understand user instructions and provide explanatory reasons for cropping results. Specifically, we first build a multimodal Image Cropping Instruction Tuning (ICIT) dataset through a cost-effective paradigm by generating high-quality instruction tuning data based on the existing cropping datasets. Then, we embed dynamic domain knowledge into the cropping model by integrating cropping-aware experts of aesthetic assessment and composition classification. Finally, we adapt MLLMs to generate the cropping results and corresponding explanations. Quantitative and qualitative experiments on three benchmark datasets demonstrate that InstructCrop enables effective and interpretable image cropping, which aligns better with user intentions. Data and code are available at https://github.com/sxfly99/InstructCrop.

Abstract:
A significant challenge in audio deepfake detection (ADD) is to improve model generalization against unseen vocoders and other unknown factors, as existing methods often overfit to specific vocoder patterns or synthetic-irrelevant factors. To overcome this challenge, by focusing on vocoder-agnostic features and synthetic traces for generalizable ADD, we propose a novel dual-level disentanglement with meta-learning (ALDEN) framework. Specifically, we first introduce an adversarial-training-based disentanglement learning (ADL) module to explicitly learn vocoder-agnostic and vocoder-specific features, effectively disentangling audio signals in terms of low-level characteristics. To suppress synthetic-irrelevant information, such as semantics and speaker identities, we simultaneously employ a reconstruction-based disentanglement learning (RDL) module, which further disentangles both synthetic-relevant and synthetic-irrelevant features from vocoder-agnostic features at a high level of semantics. Additionally, as low-level non-semantic features are more critical in ADD, a vocoder-agnostic meta-learning (VAML) module is proposed to simulate cross-vocoder scenarios so as to further boost generalization performance. Extensive experiments demonstrate that ALDEN outperforms state-of-the-art methods in cross-vocoder and in-the-wild scenarios. The code, model, and supplementary materials will be released on the project page: https://beyond0814.github.io/ALDEN/.

Abstract:
Diffusion-based super-resolution methods have achieved impressive results under normal lighting conditions. However, their performance in low-light scenarios faces fundamental limitations due to two inherent challenges. First, the characteristic noise patterns and complex degradation features in severely underexposed images create significant obstacles for diffusion models to establish reliable noise prediction mechanisms. Second, these methods often fail to establish effective coupling between the degradation priors of low-light observations and the reconstruction process, resulting in compromised detail recovery and unrealistic texture synthesis. To address these limitations, we propose the Degradation-aware Adaptation with Representation Embedding (DARE) method, a novel one-step diffusion framework specifically designed for super-resolution in dark environments. DARE employs a degradation-aware low-rank adaptation strategy that dynamically adjusts model parameters conditioned on degradation-specific features, effectively addressing compound degradations such as low-light, blur, and noise. Furthermore, we introduce a content-sensitive representation embedding mechanism, integrating complementary spatial and frequency domain priors through a bilinear cross-attention module. This module explicitly captures second-order statistical correlations, enriching semantic understanding and detail recovery during the denoising process. Extensive experiments across diverse low-light scenarios demonstrate that DARE outperforms state-of-the-art methods in terms of both visual quality and perceptual accuracy. The code is available at https://github.com/csmty/DARE.

Abstract:
Face cartoonization remains a challenging task due to significant geometric deformations between facial photos and cartoons, as well as the absence of paired training data for supervised learning. Existing methods struggle to generate high-quality cartoonized avatars with attribute consistency. To address this challenge, this paper proposes an unsupervised facial cartoonization method based on cross-domain aligned and deformable vector quantization (CADQ). First, we construct textual descriptions with facial attributes for both photo datasets and cartoon collections. Attribute consistency during transformation is enforced through individual contrastive learning between image-text cross-modal features and global distribution alignment across photo-cartoon domains. Second, a deformable Transformer with dual attention is introduced during the transformation process, which queries corresponding cartoon codebook entries based on image features to simulate cross-domain geometric deformations. Experimental results demonstrate that the proposed method can convert facial photos into high-quality cartoons with attribute consistency, outperforming existing state-of-the-art approaches. Furthermore, the method can be effectively extended to unsupervised cross-domain generation of other artistic portrait styles, achieving superior or highly competitive performance. Our code has been released at: https://github.com/IIP-Lab-XDU/CADQ.

Abstract:
The integration of multimodal information, particularly visual content, into dialogue systems has primarily focused on interpreting user-provided inputs, while comparatively little attention has been given to the proactive use of such content to enhance responses. In this paper, we explore a new research direction that addresses this gap by enabling dialogue systems to autonomously determine when and how to supplement textual responses with relevant images, based on conversational context and user intent. To support this goal, we propose AMI (Automated Multimodal Insertion), a novel framework for dynamic, context-aware multimodal supplementation in dialogue. We also introduce RID (Response with Appropriate Image Dataset), a bilingual (Chinese-English) multimodal multi-turn dialogue dataset designed to train and evaluate systems on this capability. RID features fine-grained annotations on image insertion timing and rationale, along with carefully aligned image-text pairs to ensure semantic coherence. Our experiments demonstrate that models trained with RID not only generate more informative and engaging responses, but also exhibit a stronger ability to leverage visual content when it is truly beneficial. These findings highlight the potential of proactive multimodal supplementation and offer new insights for advancing the development of intelligent, human-like dialogue systems. Code and data are available at: https://github.com/Tanthen/SCIR-AMI.

Abstract:
The push toward data-driven video processing, combined with recent advances in video coding and streaming technologies, has fueled the need for diverse, large-scale, and high-quality video datasets. However, the limited availability of such datasets remains a key barrier to the development of next-generation video processing solutions. In this paper, we introduce Nature-1k, a large-scale video dataset consisting of 1000 professionally captured 4K Ultra High Definition (UHD) videos, each recorded at 60fps. The dataset covers a wide range of environments, lighting conditions, texture complexities, and motion patterns. To maintain temporal consistency, which is crucial for spatio-temporal learning applications, the dataset avoids scene cuts within the sequences. We further characterize the dataset using established metrics, including spatial and temporal video complexity metrics, as well as colorfulness, brightness, and contrast distribution. Moreover, Nature-1k includes a compressed version to support rapid prototyping and lightweight testing. The quality of the compressed videos is evaluated using four commonly used video quality metrics: PSNR, SSIM, MS-SSIM, and VMAF. Finally, we compare Nature-1k with existing datasets to demonstrate its superior quality and content diversity. The dataset is suitable for a wide range of applications, including Generative Artificial Intelligence (AI), video super-resolution and enhancement, video interpolation, as well as video coding, and adaptive video streaming optimization. Dataset URL: https://cd-athena.github.io/Nature-1k.
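
Of the four quality metrics named above, PSNR is the simplest to state. The following is a minimal sketch of the standard PSNR definition, not the dataset authors' evaluation code; the function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the peak sample value; 255 assumes 8-bit content (an assumption,
    the abstract does not state the bit depth).
    """
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same number of samples")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every sample is off by 10 grey levels.
print(round(psnr([100, 100], [110, 90]), 2))
```

In practice PSNR is usually computed per frame and then averaged over a sequence; the sketch covers only the per-frame formula.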

Abstract:
Despite growing interest in Ultra-High-Definition (UHD) content, research and development on 8K videos remains constrained by the scarcity and limited quantity of publicly available footage, the lack of accessible, high-quality datasets, and the substantial computational cost of processing such data. To address this gap, we present the VIdeo Dataset for Exploration and Analysis (VIDEA-8K-60FPS), an open 8K 60FPS HDR video dataset comprising 190 video sequences (120x10s, 40x30s, 30x60s) containing more than an hour of unique footage and covering a diverse range of use cases in terms of motion, lighting, colors, and camera stability. Each clip was minimally processed and encoded with lossless-tuned HEVC, achieving approximately 70% file size reduction while preserving visual integrity, allowing the dataset to be more accessible. The dataset is designed to be diverse to support various video research fields. By making VIDEA-8K-60FPS publicly available, we aim to lower the barrier to 8K video research and facilitate the development of more efficient and scalable video processing methods. GitHub: https://github.com/talshoura/VIDEA-8K-60FPS-Dataset
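
The sequence counts and the "more than an hour" figure above can be checked with a few lines of arithmetic. This sketch assumes that "120x10s" means 120 clips of 10 seconds each (and likewise for the other two groups); the variable names are illustrative only.

```python
# (number of clips, seconds per clip), read from "120x10s, 40x30s, 30x60s"
groups = [(120, 10), (40, 30), (30, 60)]

total_clips = sum(count for count, _ in groups)
total_seconds = sum(count * seconds for count, seconds in groups)
total_minutes = total_seconds / 60

print(total_clips, total_seconds, total_minutes)
```

Under that reading the three groups add up to the stated 190 sequences and to 70 minutes of footage, which agrees with the abstract.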

Abstract:
Short video platforms are popular for sharing information, but they also spread harmful content quickly. Research on detecting harmful videos is limited due to unclear categories and a lack of good datasets. To solve this, we created CH-SV, the first Chinese Harmful Short Video dataset with 6,728 videos labeled into six categories: danger, offense, vulgarity, fakeness, violence, and normal content. We analyzed CH-SV in detail and proposed HAVE, a new detection framework that improves video understanding using AI-generated semantics. Experiments on CH-SV show that our dataset can help advance harmful video detection research, and HAVE outperforms existing methods. Resources including core code, dataset samples, supplementary material, and licensing details are available at https://github.com/DLUTSSL/CH-SV.

Abstract:
Recent rapid progress in Text-to-Audio (T2A) models contrasts sharply with the stagnation observed in the evolution of corresponding evaluation benchmarks. Existing benchmarks, such as AudioCaps, suffer from limited diversity and quality, as well as biased category distributions, leading to increasingly questionable reliability in assessing advanced T2A models. This paper introduces AudioAtlas, a comprehensive and balanced evaluation benchmark specifically designed for evaluating T2A models aimed at movie production. Based on an object-centric audio category system, AudioAtlas provides high-quality reference samples characterized by categorical balance and diversity. It includes detailed overall and event-level captions with rich descriptors, plus fine-grained temporal annotations from human experts, enabling thorough evaluation of temporal alignment and semantic accuracy. To enable precise evaluation of temporally-aligned generation across universal categories, two novel metrics are proposed leveraging recent advancements in large-scale Audio Language Models (AudioLLMs) and contrastive learning models. By re-benchmarking six currently influential T2A models, AudioAtlas provides evaluations better aligned with aesthetic considerations, offering clearer optimization directions for movie-production-oriented T2A systems. Additionally, we conduct a comprehensive comparative analysis of temporally-controllable T2A methods, covering training-based approaches and promising training-free approaches inspired by region-controllable image generation, clarifying current limitations and pointing out directions for future research. Here, audio refers specifically to sound events, excluding speech and music. Further details are available on the project page: https://audioatlas.github.io/AudioAtlas/

Abstract:
With the increasing demand for outfit planning in real-world travel scenarios, the need for constructing a travel fashion wardrobe, a series of outfits tailored to a user's personalization and destination-specific context over a short travel period, has grown significantly. However, existing systems or works often focus on isolated factors and rely on retrieval-based methods, with insufficient utilization of generative models, limiting their adaptability to real-world travel scenarios. To address this issue, this study introduces GenWardrobe, a fully generative system for travel fashion wardrobe construction. GenWardrobe consists of three key modules: user query analysis, fashion knowledge retrieval via retrieval-augmented generation, and wardrobe image generation. For ease of use, we encapsulate the solution into an interactive web application. Expert-level evaluation shows that GenWardrobe significantly outperforms traditional systems in both personalization and visual appeal. A PowerPoint file and further materials for GenWardrobe can be found in our GitHub repository: https://github.com/ShanFengShanFeng/GenWardrobe.

Abstract:
The growth of online educational content, particularly slide-based video lectures, has created a need for tools that enhance navigation, comprehension, and accessibility. Many existing systems for video analysis are closed-source, hindering reproducibility and extension. To address this, we present the Lecture Video Analysis Toolkit, an open-source, proof-of-concept application designed for the multimodal analysis of slide video lectures. The toolkit integrates a processing pipeline that includes scene detection, visual entity extraction, transcription, optical character recognition (OCR), and semantic linking between spoken and visual content using embeddings. A key contribution is its interactive interface, motivated by early user feedback, that allows for customization of the viewing experience to suit individual preferences. The entire system is openly available and serves as a research prototype for validating the potential of multimodal analysis in creating more inclusive and improved learning experiences. A live demo is accessible at https://travis-seng.fr/svla, and the source code is openly available at https://github.com/travisseng/svla-toolkit.

Abstract:
A critical challenge in advancing human-like conversational AI systems is enabling models to understand and respond to user emotions contextually, a task known as Multimodal Empathetic Response Generation (MERG). While prevailing multimodal models attempt to resolve cross-modal emotional discrepancies via concatenation or cross-attention, their simplistic fusion mechanisms often fail to account for the nuanced and contradictory nature of human emotions. Consequently, the resulting feature representations suffer from these unresolved internal conflicts, limiting their effectiveness. In this paper, we propose a novel Multimodal Empathetic Reasoning and Inconsistency-Aware (MERIA) framework. MERIA introduces a multimodal disentanglement encoder based on β-VAE and extends the AvaMERG dataset with multimodal chains of empathy (M-CoE). Our framework outperforms existing methods, achieving the best human evaluation scores in the empathetic text response generation task of the MERG 2025 Challenge. Our code is available at https://github.com/DANG-ai/MERIA-MERG.

Abstract:
Point cloud regression localization technology has a wide range of applications in the multimedia field. For example, in virtual reality and augmented reality, accurate point cloud localization can significantly enhance the user experience. Recently, point cloud pose regression algorithms based on APR (Absolute Pose Regression) and SCR (Scene Coordinate Regression) have achieved near sub-meter accuracy, but they require multiple repetitive trajectories for training. The key to their success lies in the diversity of viewpoints, temporal changes, and trajectories, which is resource-consuming to obtain. Moreover, due to the errors in GPS/INS, the coupling between trajectories is not ideal, and the stability of re-localization is insufficient. Since LiDAR has covered most of the scene, single-shot localization has the potential to approach or even surpass multi-trajectory localization methods through pose enhancement. Specifically, we present Pose Enhancement Localization (PELoc), which takes a single trajectory as input and proposes SSDA (Single-shot Data Augmentation) and LTI (LiDAR Trajectories-coupled Interpolation) to simulate different driving poses. We also introduce KP-CL (Key Points Contrastive Learning) through feature perturbation to mitigate the differences in viewpoint/temporal phase transformations in similar scenes across different trajectories. Our algorithm has been tested on the Oxford, QE-Oxford, and NCLT datasets, where single-shot localization accuracy can approach near sub-meter level on QE-Oxford and NCLT. The code will be published at https://github.com/Eaton2022/PELoc.

Abstract:
Symbolic music understanding is a fundamental task in multimedia interpretation, which aims to decode musical attributes from symbolic music representations. Existing methods usually handle musical sequences as linguistic data, ignoring intrinsic musical properties. For instance, recurring melodies often appear in musical pieces with subtle variations, which requires methods that are aware of both local musical details and overall repetitive patterns, yet current methods cannot meet this requirement. To address this issue, we introduce FG-Midiformer, a modified transformer framework that incorporates a multi-scale-aware feature learning (MSAFL) module and a local feature enhanced classification (LFEC) module for fine-grained understanding of multi-attributes. Specifically, the MSAFL module is designed to capture multi-scale musical relationships by embedding efficient multi-scale attention for long-term dependency modeling. In order to improve the classification accuracy of musical attributes, we devise the LFEC module, in which an attention mechanism with full 3-D weights is first introduced to efficiently highlight and leverage important local musical features. The LFEC module strengthens local feature representation and improves the sensitivity of FG-Midiformer to subtle differences between musical attributes. Extensive experiments show that FG-Midiformer achieves state-of-the-art performance in multi-attribute understanding tasks such as melody identification, velocity prediction, composer categorization, and emotion classification. The code will be released at https://github.com/Viki66666/FG-Midiformer.

Abstract:
In cutting-edge domains such as unmanned aerial vehicles and autonomous driving, edge-based audio-visual systems struggle to strike an optimal balance between complexity and performance. Unlike prevailing approaches that typically rely on pruning and knowledge distillation to streamline unimodal or hybrid models, we propose the Bio-Inspired Multimodal Network (BIMNet), which achieves an efficient audio-visual shared architecture. BIMNet integrates bio-inspired audio-visual modules that emulate the hierarchical sensory integration observed in nocturnal birds, to replicate equivalent biological information flow for both multiscale night vision and noise-adaptive hearing. Experimental findings show that BIMNet achieves superior performance and efficiency in diverse image datasets (varying in spatial scales and lighting conditions), audio datasets (encompassing various types of human and environmental sound), and audio-visual joint event detection tasks. Project support is available at: https://github.com/Mental-Scholar/BIMNet.

Abstract:
Despite technological progress in underwater object detection, there are still problems such as the domain shift stemming from diverse environmental conditions and severe image degradation in underwater environments. To address these challenges, we propose a hypernetwork-powered domain generalization framework that synergizes physical priors with multi-frequency feature learning, called Hy-UOD. To combat domain shift challenges, we devise a meta-learning empowered hypernetwork architecture that synthesizes domain-generalization parameters across domains through environment-specific physical descriptor encoding. To further mitigate the impact of complex degradation on object detection performance, we design a Multi-frequency Feature Dynamic Adaptation (MFDA) module based on hypernetwork features and domain-specific information. This module implements a systematic compensation for degraded features through a multi-level dynamic adaptation mechanism: "low-frequency correction, high-frequency refinement, and mid-frequency reconstruction". Experiments on multiple underwater datasets demonstrate the robust detection performance and strong cross-domain generalization capability of our method. The source code will be available at https://github.com/White-cat-ed/HyUOD.

Abstract:
Recent advances in Weakly Supervised Semantic Segmentation (WSSS) focus on generating high-quality Class Activation Maps (CAMs) using image-level labels. However, the co-occurrence of foreground-background concepts in a single image often induces semantic confusion, which degrades the quality of conventional CAM-based approaches. In this paper, we propose VLHP, a novel framework that leverages vision-language hybrid prototypes to overcome semantic confusion. Specifically, VLHP constructs hybrid prototypes through cross-modal association between textual embeddings and visual features, generating discriminative semantic representations while effectively bridging the modality gap. To further improve discriminability, we introduce two dedicated strategies: Discriminative Explicit Alignment (DEA) to explore cross-modal consistent discrimination and Confounding Background Decoupling (CBD) to model co-occurring backgrounds and decouple them. Finally, a Prototype-driven Class-aware Decoder (PCD) employs these refined prototypes as category-specific priors to generate precise segmentation masks in a single-stage framework. Extensive experiments on PASCAL VOC and MS COCO benchmarks demonstrate that VLHP outperforms state-of-the-art alternatives. The code is available at https://github.com/fjy0105/VLHP.

Abstract:
Existing methods for video answer localization (VAL) in instructional video focus predominantly on coarse-grained themes, failing to address the detailed step content and inter-step relations crucial for effective comprehension. Current datasets, such as MedVidQA, primarily capture video content but lack annotations for step structure and inter-step relations. To address this gap, we introduce InstructStep, a newly proposed VAL task specifically designed for Instructional Video Step Content and Relation Localization. It extends the original VAL task to step-centric content and relations. Accordingly, we create an InstructStep dataset with fine-grained step content and relation QA pairs. To tackle the challenges of this task, we propose a Step-Centric Multi-Level Knowledge Distillation (SC-MLKD) approach comprising: (1) a two-stage training strategy that generates step-specific summaries in the first stage and introduces a step branch in the second stage to learn step relations; this is applied only during training, ensuring no additional inference time; and (2) multi-level knowledge distillation, including feature, relation, and response distillation, across visual, text, and step branches to capture fine-grained and step-centric features. Comprehensive experiments demonstrate the efficiency of SC-MLKD, with notable gains of up to 5.83% in step content and up to 5.9% in step relation. The dataset has been made publicly available at https://github.com/hewangsh/InstructStep

Abstract:
Despite recent progress in decoding static images from brain activity, reconstructing dynamic visual experiences from EEG signals remains challenging due to the complex temporal dynamics involved. Current approaches primarily rely on pre-trained video generation models while failing to fully leverage the rich temporal-spatial information embedded in EEG signals for video synthesis. This paper proposes MINDEV (Multi-modal Integrated Neural DEcoding and Visualization), a framework that places EEG signal processing at the core of video reconstruction. We introduce three key technical contributions: (1) a dual-branch feature extractor that captures both temporal dynamics and spatial relationships in EEG signals, (2) an EEG-driven semantic bridge that uses neural patterns to guide language model interpretation, and (3) a multi-modal video synthesis pipeline where EEG features lead the generation process while semantic guidance provides refinement. Our framework prioritizes the millisecond-level temporal resolution of EEG signals, using them to drive both visual content generation and semantic understanding. Evaluated on the SEED-DV dataset, MINDEV demonstrates superior performance with a semantic classification accuracy of 93.2% and a structural similarity index (SSIM) of 0.4777, establishing a new state-of-the-art for EEG-based video reconstruction. Our code is publicly available at https://github.com/HHarr1son/MINDEV.

Abstract:
Current medical vision-language pre-training models primarily follow two paradigms: report-supervised cross-modal alignment pre-training and reconstruction-based self-supervised pre-training. The former enhances the discriminative power of representations, while the latter facilitates fine-grained representation learning. However, naively combining these two paradigms inherits their inherent limitations: reconstruction-based methods treat all image patches equally during reconstruction, failing to effectively capture critical pathological details, since disease-related regions typically occupy only a small fraction of the image. Meanwhile, alignment-based methods suffer from suboptimal representations due to the presence of false negatives. To address these challenges, we propose a novel pre-training framework that integrates two key components: Pathology-Aware Reconstruction (PAR) and Discriminative Knowledge-Boosted Alignment (DKBA). Through a cascaded training strategy, our framework effectively combines the strengths of both paradigms while mitigating their inherent limitations. During the reconstruction pre-training stage, PAR incorporates pathology-aware priors to enhance the model's ability to capture fine-grained pathological details. In the alignment pre-training stage, DKBA leverages a medical knowledge graph as external supervision to improve cross-modal clustering alignment, thereby reducing the negative impact of false negatives. Extensive experiments on diverse downstream medical imaging tasks, including image classification, object detection, and semantic segmentation, demonstrate the superior generalization capabilities of our method. Our code is publicly available at https://github.com/Felix1118/PADKB.

Abstract:
Recent diffusion-based methods have shown strong ability in the depth estimation task, but they largely overlook the rich textual priors embedded in pretrained diffusion models that can enhance both performance and robustness in diverse scenes. In this paper, we propose TPDepth, a diffusion-based, affine-invariant monocular depth estimator that incorporates textual semantics via a Text-Prompted ControlNet. While directly injecting text into the diffusion U-Net can cause the network to over-attend to local semantic cues and compromise global structural modeling, TPDepth processes textual features through a separate ControlNet branch, allowing semantic information to be incorporated without disrupting the spatial reasoning pipeline. Prompt-conditioned features are modulated by an Adaptive Control Scale Module (ACSM) and injected into the decoder of the diffusion U-Net with skip connections. The model is fine-tuned with a fixed timestep for deterministic prediction. TPDepth achieves state-of-the-art results on NYUv2, KITTI, and ScanNet, and demonstrates competitive performance on two additional zero-shot benchmarks using only 61K training images. Code and models can be found on our project page: https://github.com/Lioely/TPDepth.

Abstract:
Social group detection aims to identify groups of individuals exhibiting social behavior from multi-individual trajectory data. Recent approaches often determine group correlations based on global trajectory similarity, while temporal dynamics can cause diverging member trajectories and undermine similarity-based measures. Other methods model pairwise interaction strengths to capture group relations, focusing only on explicit direct interactions while ignoring implicit indirect interactions. To address temporal variability of group structures, we decompose long trajectories into multiple semantic sub-trajectories, enabling the capture of dynamic characteristics. Furthermore, to explore implicit indirect interactions, we introduce a unified spatio-temporal graph structure that models both direct and indirect interactions among individuals. In addition, considering the contextual influence of the neighborhood of an individual, we incorporate neighborhood information into the trajectory representation process. Based on these insights, we propose a Segmented-Trajectory-Aware Spatio-Temporal Graph Convolutional Network (SegTraj). This framework uniformly models explicit and implicit interactions through a spatio-temporal graph, and fuses individual trajectories with contextual neighborhood information for fine-grained representation of group relationships. Extensive experiments on three datasets covering both synthetic and real-world scenarios demonstrate that SegTraj significantly outperforms baseline methods. The code is available at https://github.com/DC0827/SegTraj.

Abstract:
The efficiency and high transferability of transformation-based adversarial attacks (TAAs) make them a promising tool for robustness analysis. Despite the improvements in transferability brought by various image transformations, their underlying causes remain unclear, and there is still room for further improvement. We find that with attention-based models as surrogate models, adversarial examples generated by TAAs with relatively lower transferability tend to exhibit checkerboard artifacts, whereas those with higher transferability do not. This motivates us to explore the relationship between transferability and checkerboard artifacts. We confirm that checkerboard artifacts originate from the patching operation in attention-based surrogate models. Checkerboard artifacts vanish when spatial transformations are applied and gradients are calculated with respect to perturbations. Based on whether checkerboard artifacts are eliminated, we categorize model augmentations into cross-pixel augmentations and in-place augmentations. The former promote interactions between pixels, break patch isolation, and thereby improve transferability while removing artifacts. The latter augment the diversity of parameter features in place, enhancing transferability but failing to break isolation and remove artifacts. They constitute two distinct ways of enhancing transferability, and integrating them enables higher transferability. Therefore, we propose an attack design paradigm to fully leverage both augmentations. To verify this paradigm, we design a basic In-place and Cross-pixel Attack (I-C Attack) with simple transformations. Extensive experiments demonstrate that, despite its simplicity, I-C Attack can achieve much higher transferability while maintaining low computational cost. The code is available at https://github.com/chinaliangjiaming/I-C-Attack.git.

Abstract:
Cross-view geo-localization (CVGL) aims to determine the location of a ground-view image by referencing geo-tagged satellite-view images. Existing methods assume known ground-view image orientation, an unrealistic constraint in real-world scenarios where cameras have arbitrary orientations and limited fields of view (FOV). Unknown orientation cross-view geo-localization (UOCVGL) better reflects real-world applications but introduces severe feature alignment challenges due to misalignments in both viewpoints and orientations, significantly degrading localization accuracy. To address this challenge, we propose PLGeo, a novel method for UOCVGL, which includes two key components: (1) a Patch-wise Similarity Enhancement Component, which computes patch-level similarities between corresponding patches and refines alignments using learned attention weights, improving accuracy and mitigating issues caused by varying orientations and reduced FOV in ground-view images; and (2) an Attention-guided Patch Matching Component, which refines intra-domain feature matching within the same view by emphasizing stronger correspondences and suppressing weaker ones. We comprehensively evaluate PLGeo on several benchmark datasets under different settings, including unknown orientation, limited FOVs, robust datasets, the UAV dataset, north-aligned setting, and few-shot scenarios. Experimental results demonstrate that PLGeo consistently outperforms state-of-the-art methods, exhibiting remarkable robustness and generalization ability even in challenging real-world conditions. Code is available at https://github.com/1203ll/PLGeo.

Abstract:
Generative diffusion model approaches have achieved remarkable success in multimodal recommendation by generating latent user interest interaction graphs. However, current diffusion methods based on Gaussian noise introduce uncertain interest bias noise. This noise not only disrupts the original user-item interaction bipartite graph structure but also undermines the model's ability to accurately capture user interest preferences. To address these challenges, we propose Unbiased Interest Generation for Multimodal Recommendation (GenRec). Our approach aims to generate valid latent user interests while non-invasively preserving the original interest graph structure. We introduce the Multi-Modal Interest Generation Module. During the forward process, we simulate user interest state transitions using "forward flipping." In the reverse stage, we generate binary interaction graphs following a Bernoulli distribution. To further mitigate the random uncertainty during the generation process, we design a Multi-Modal Interest Debiase Module. By constructing a multimodal interest clustering space and using user interest hashing, we correct and enhance the generated interest graphs. Finally, the Multi-modal High-Order Graph Learning Optimization is employed to capture high-order interaction information between users and items. A contrastive learning loss function is used for model optimization. We conduct extensive experiments on different real-world open datasets from the industrial sector. Compared with the state-of-the-art DiffMM, our method significantly improves NDCG@20 by 7.3% on the TikTok dataset and boosts Recall@20 by 4.2% on the Sports dataset. The experimental results validate the effectiveness of GenRec. The code is publicly available at https://github.com/orangeheyue/GenRec-V1.
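
The contrast between Gaussian noise and binary flipping can be illustrated with a generic bit-flip corruption on a user-item interaction matrix. This is a minimal sketch of that general idea and not the GenRec implementation: the abstract does not give the noise schedule or the reverse model, so the function names, the single `flip_prob` parameter, and the toy matrix are all assumptions.

```python
import random

def forward_flip(interactions, flip_prob, rng):
    """Flip each binary entry independently with probability flip_prob.

    Unlike adding Gaussian noise, the corrupted matrix stays binary.
    """
    return [[1 - x if rng.random() < flip_prob else x for x in row]
            for row in interactions]

def sample_bernoulli(probs, rng):
    """Draw a binary matrix whose entries are 1 with the given probabilities.

    In a real model the probabilities would come from a learned denoiser;
    here they are supplied by hand (hypothetical values).
    """
    return [[1 if rng.random() < p else 0 for p in row] for row in probs]

rng = random.Random(0)
interactions = [[1, 0, 0], [0, 1, 1]]           # rows: users, columns: items
noisy = forward_flip(interactions, 0.2, rng)    # still a 0/1 matrix
restored = sample_bernoulli([[0.9, 0.1, 0.0], [0.0, 1.0, 0.8]], rng)
print(noisy, restored)
```

Because every intermediate state is itself a valid 0/1 interaction graph, a process of this kind never leaves the space of bipartite graphs, which is the property the abstract contrasts with Gaussian corruption.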

Abstract:
The field of cross-modal retrieval aims to construct a shared representation space for samples from multiple modalities, typically within the vision and language domains. Deep hashing, with its high computational efficiency and low storage costs, has emerged as a central focus in this field and has garnered significant attention in recent research. However, current hash retrieval, concentrating on deterministic methods, struggles to effectively capture semantically ambiguous correspondences between cross-modal samples, where heterogeneous data have complex-semantic many-to-many relationships in the latent space. To address this limitation, we propose a novel Deep Probabilistic Binary Embedding (DPBE) framework, designed to generate discriminative, modality-invariant hash codes that facilitate accurate and reliable cross-modal retrieval. In contrast to contemporary probabilistic methods, we focus on optimizing hash networks to learn more accurate binary embeddings by using the learning mode of probabilistic embeddings. We introduce the first Bayesian encoder for hash learning, which employs Laplace Approximation to model a distribution over network weights. Extensive experimental results demonstrate that our approach not only outperforms deterministic methods in retrieval performance but also provides uncertainty estimates, enhancing the interpretability of the embeddings. The corresponding code is available at https://github.com/QinLab-WFU/DPBE.
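
The efficiency argument for hashing rests on comparing short binary codes by Hamming distance. The sketch below shows only that generic, deterministic retrieval step and makes no attempt to model DPBE's probabilistic encoder; the sign-threshold binarization, the function names, and the toy database are assumptions made for illustration.

```python
def binarize(embedding):
    """Sign-threshold a real-valued embedding into a tuple of bits."""
    return tuple(1 if value > 0 else 0 for value in embedding)

def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, database, k=1):
    """Return the ids of the k database items closest in Hamming distance."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical image codes, queried with a binarized text embedding.
database = {"img_a": (1, 0, 1, 1), "img_b": (0, 1, 0, 0)}
query = binarize([0.3, -0.2, 0.9, 0.1])
print(retrieve(query, database))
```

A code of a few dozen bits per item and a distance that reduces to bit comparisons are what give hashing its low storage and computation cost.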

Abstract:
In the digital age, brand meaning is increasingly shaped through user participation and content sharing on social media platforms. However, significant perceptual gaps often exist between official brand narratives and consumer interpretations. These multimodal and cognitively nuanced gaps are challenging to detect and model using traditional analytical methods. To address this, we propose a multi-agent framework that metaphorically models perception as an optical process (propagation, interference, and measurement), termed OPIM. We construct a novel dual-perspective dataset from representative social media platforms, integrating text and image content from both user-generated and official brand communications. We evaluate brand perception along six psychological dimensions. Experiments across 15 brands demonstrate that our framework effectively captures key perception gaps, particularly in sincerity, professionalism, and attractiveness. In contrast, materialism and sophistication exhibit higher alignment between brand messaging and consumer perception. Our framework enhances the cognitive alignment and multimodal interpretability of large language models, offering actionable insights for brand strategy and bridging computational modeling with human-centric understanding. The dataset will be available at https://github.com/htgan-ai/OPIM.

Abstract:
Vision-Language-Action (VLA) systems are crucial for autonomous decision-making in embodied intelligence. While current systems have advanced instruction-following capabilities, their limited spatial perception often leads to suboptimal performance on mobile manipulation tasks in unstructured environments. To address this challenge, we propose Uni-Sight, an end-to-end VLA system for robust mobile manipulation. Uni-Sight unifies decision-making, perception, and control through joint training, enabling synchronized cross-component optimization. Within the system, we introduce a Latent Feature Aligner (LFA) that ensures accurate target localization by aligning multi-view data. We further develop a Domain Transfer Policy (DTP), a hierarchical policy constrained by LiDAR-guided spatial priors, which ensures 3D spatial understanding under limited visual coverage. Extensive experiments on 20 real-world mobile manipulation tasks demonstrate the high task success rate and robust execution performance of Uni-Sight. Uni-Sight achieves 3.04× the success rate of existing methods and exhibits superior generalization in both long-horizon and zero-shot scenes. Code and dataset are publicly available at https://github.com/trantor2nd/Uni-Sight.

Abstract:
Open-vocabulary object detection (OvOD) uses Vision-Language Models (VLMs) to detect arbitrary categories specified by natural language. However, existing methods often struggle with performance instability caused by granularity-induced semantic drift, which arises from misaligned label embeddings across varying levels of specificity. In this paper, we propose GraSecon, a Graph-guided Semantically Consistent representation framework that enhances zero-shot detection robustness without requiring additional training. We construct a hierarchical Fine-grained Semantic Graph enriched with visually grounded attributes from large language models (LLMs). This graph captures hierarchical, sibling and cross-level relations, enabling controlled Laplacian refinement to harmonize the embedding space and improve visual-semantic alignment. To strengthen fine-grained discriminability, we introduce a Key Semantic Node Mining module that identifies and anchors semantically sensitive nodes, ensuring robust feature representation. Furthermore, our Semantic Relevance-Driven Laplacian Propagation adaptively propagates information, promoting coherent and context-aware embedding alignment across granularities. Extensive experiments on the iNatLoc and FSOD datasets demonstrate that GraSecon outperforms prior SOTA methods, achieving average mAP50 improvements of 6.5% and 5.4%. Code is publicly available at: https://github.com/minoslab-csu/GraSecon.

Abstract:
Salient object detection (SOD) plays a crucial role in image understanding and visual guidance. However, due to the complexity of underwater environments, the accuracy of underwater salient object detection is often low. To improve the accuracy and robustness of underwater salient object detection, different from the existing spatial domain aware RGB-D methods that rely on pixel-level probabilities, we propose a novel Fourier-Spatial Entangled Conditional Diffusion model (FSCDiff) for underwater salient object detection. The FSCDiff aims to address the insufficient representation and boundary shift issues in underwater salient object detection by leveraging Fourier-domain information and the powerful multi-step iterative generation capability of diffusion models. The FSCDiff framework consists of two key components: the Dual-Domain Entanglement Enhancement Block (DTEB) and the Stable Time-step Mask Prediction Module (STMP). DTEB utilizes Fourier-spatial entanglement learning to fully exploit the Fourier and spatial domain information of RGB images and depth maps, thereby optimizing feature representation. STMP takes advantage of the excellent multi-step iterative mechanism of diffusion models to enhance the accuracy and robustness of the segmentation results. Comprehensive experimental results indicate that our FSCDiff method outperforms the state-of-the-art approaches on the USOD10K and USOD datasets. The source code is available at: https://github.com/lgwplay/FSCDiff.

Abstract:
The rapid advancement of multimedia technologies and their increasing integration in education have underscored the importance of multimedia learning. Knowledge Tracing (KT) plays a crucial role in enabling adaptive multimedia learning by continuously monitoring students' progress and forecasting their performance throughout the learning process. Questions lie at the heart of the KT process, making their representation crucial for building efficient KT models. However, the sparsity and complexity of question data make it difficult for existing methods to capture the underlying features of questions, which reduces the accuracy of knowledge state predictions. To address this issue, this paper introduces diffusion models to the KT field and proposes a novel knowledge tracing model, DiffuQKT. The model presents a diffusion-based generative approach for question representation and enhances the stability of knowledge states through contrastive learning. Specifically, DiffuQKT first constructs question representations based on their concepts, difficulty, and variations, and then, during the forward phase, progressively adds noise to the question representations, disrupting them into a Gaussian distribution. In the reverse phase, DiffuQKT gradually recovers the representations from noise, generating higher-quality question representations for knowledge tracing. Furthermore, to guide more meaningful question generation, we incorporate question concepts and difficulty as conditions during the denoising process. In addition, to improve the robustness of knowledge states against subtle variations in question representations, we employ contrastive learning to stabilize knowledge states across both original and denoised question representations. We conduct extensive experiments on four public datasets, comparing DiffuQKT with 15 baseline methods. The results demonstrate that DiffuQKT significantly outperforms existing models. Moreover, we find that the diffusion-based generative approach for question representation proposed in this paper can significantly improve the performance of baseline models. The code can be found at https://github.com/lilstrawberry/DiffuQKT.
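
The forward phase described above, in which noise is progressively added until the representation approaches a Gaussian, can be illustrated with the standard closed-form noising step used by denoising diffusion models. This is a minimal sketch and not the DiffuQKT implementation: the linear schedule, its endpoints, the number of steps, and the toy three-dimensional "question embedding" are all assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (values assumed, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a_bar = alpha_bars[t]
    signal = math.sqrt(a_bar)
    noise = math.sqrt(1.0 - a_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
question_embedding = [0.5, -1.0, 2.0]  # toy stand-in for a question representation
slightly_noised = forward_noise(question_embedding, 0, alpha_bars, rng)
mostly_noise = forward_noise(question_embedding, 999, alpha_bars, rng)
```

At the first step the sample stays close to the original vector, while at the last step almost no signal remains, which is the sense in which the representation is "disrupted into a Gaussian distribution".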

Abstract:
Recent advancements in concept customization via diffusion models have significantly enhanced controllability and quality. However, precise relation customization, which controls the position of interactions among multiple instances, remains challenging due to unpredictable initial latent noise. Existing methods primarily rely on conditional prompts and attention control, overlooking the structured potential of initial noise. This paper introduces Position-LoRA, a novel framework leveraging structural prior in initial noise to improve relation customization and layout control. Position-LoRA employs a differential fine-tuning scheme and a latent noise encoder. The guided fine-tuning enhances generation tendencies from structured initial noise, embedding explicit relationship-specific spatial information. The latent noise encoder dynamically manipulates latent noises, enabling precise spatial control and flexibility in relational image generation. Furthermore, a fine-grained guidance and control strategy is employed during generation to enhance the image-text alignment and layout alignment. Experiments demonstrate that Position-LoRA improves stability, controllability, and fidelity in relational image generation with layout control, surpassing existing concept customization and layout-to-image methods in qualitative and quantitative evaluations. Code is available at https://github.com/liyiming09/Position-LoRA.

Abstract:
Recent advances in text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality visual content with controllable style and features. A fundamental challenge remains in simultaneously maintaining three critical properties of generated image sequences: (1) fine-grained style control, (2) strict image-prompt alignment, and (3) cross-image content coherence. To overcome this challenge, we introduce AnyStyleDiffusion. Specifically, we interpret any artistic style that users require of a generated image as a feature in the model's weight space. Interpolating between points in weight space yields models that express intermediate styles with a linear transition. Hyper-receptive Motion Layers (HRMLs) are proposed to align the outputs of diverse weight spaces, operating as adaptive style modulators. These HRMLs are decoupled from the interpolated diffusion models and leverage zero-shot compatibility with existing model checkpoints. Employing Homogeneous Stable Diffusion avoids direct interpolation in weight space and improves synthesis efficiency. Comprehensive evaluations across personalized models demonstrate our method's superiority in generating content-coherent sequences with dynamic style transformations. Code will be released at https://github.com/shermandozer/AnyStyleDiffusion.git.
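
The idea of obtaining intermediate styles by interpolating in weight space can be sketched as a plain linear blend of two checkpoints. This illustrates only the underlying concept: the abstract states that the final method avoids direct weight-space interpolation for efficiency, and the parameter names and toy values below are hypothetical.

```python
def interpolate_weights(weights_a, weights_b, t):
    """Return (1 - t) * weights_a + t * weights_b, parameter by parameter.

    Each checkpoint is modelled as a dict mapping a parameter name to a flat
    list of floats (a simplification chosen for this sketch).
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    blended = {}
    for name, values_a in weights_a.items():
        values_b = weights_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - t) * a + t * b for a, b in zip(values_a, values_b)]
    return blended


# Two toy "style" checkpoints with hypothetical parameter names.
style_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
style_b = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}
midpoint = interpolate_weights(style_a, style_b, 0.5)
```

Sweeping `t` from 0 to 1 produces a sequence of models that moves linearly from the first style to the second.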

Abstract:
Previous dynamic view synthesis works often struggle with limited available views, resulting in noticeable artifacts and blurriness in the outputs. In this paper, we present Sparse4DGS, a novel 4D Gaussian splatting framework that enables high-fidelity view synthesis from sparse inputs through three key innovations: 1) a global-local feature extraction encoder that integrates global Hexplane fields with local hash grids, effectively capturing both rough backgrounds and fine details; 2) a flow-guided feature aggregation module that stabilizes dynamic 3D Gaussians between adjacent frames, ensuring temporal continuity; and 3) a 4D geometry constraint scheme that utilizes monocular depth and pseudo-viewpoint depth supervision to enhance the structural consistency of dynamic scene details. Our approach achieves higher rendering quality while maintaining model compactness. Experimental results on various benchmarks demonstrate that our method performs favorably against state-of-the-art methods in terms of both rendering quality and training time. The source code and trained models are available at https://github.com/hu-dong-dong/Sparse4DGS.

Abstract:
Multi-model fitting is a fundamental challenge in computer vision, where real-world data often contains severe gross outliers and pseudo-outliers. Existing methods rely on inefficient sequential hypothesize-and-verify frameworks that require a predefined number of models and inlier thresholds, parameters that are difficult to determine in practical scenes. To overcome these limitations, we propose a novel Adaptive Graph Attention-guided parallel multi-model fitting method (AGASAC) that jointly learns local and global features, performs parallel hypothesis sampling, and executes confidence-embedded model selection. Specifically, we design a dual-confidence graph attention module that models data relationships using an adaptive graph attention network. This module computes minimal-set confidence and quality confidence to guide the multi-model fitting process, eliminating manual parameter tuning. Additionally, we propose a parallel discriminative sampling module that leverages minimal-set confidence to concurrently sample hypotheses. By enforcing a quantized consensus constraint, this module maximizes inter-model variance while minimizing intra-model discrepancy. It enables computationally efficient hypothesis generation and pseudo-outlier suppression. To obtain high-quality models, we present a quality-embedded selection module that integrates quality confidence into the joint optimization of model selection and data clustering. Extensive experiments show that the proposed method achieves a lower transfer error of 0.39 pixels and a 36.92% runtime reduction, surpassing state-of-the-art methods. The code is available at https://github.com/YWY-Vivian/AGASAC.

Abstract:
As image-generative AI models become increasingly accessible to the public, the demand for content safety has surged. Although model developers have introduced alignment mechanisms to prevent the creation of threatening images, and extensive research has been conducted on verifying the authenticity of AI-generated images, a significant number of ex-regulatory images have been discovered that fall into regulatory gaps. These images are neither covered by existing alignment mechanisms nor included in the scope of current detection methods. To address this, we introduce ExDA, a detection and attribution framework specifically designed for such ex-regulatory images. ExDA utilizes a frozen CLIP:ViT-L/14 as a visual feature extractor to extract rich and unbiased visual features, complemented by a text feature reduction layer to unify semantic styles. For obtaining highly discriminative features, ExDA introduces an SFS-ResNet network, where each basic layer is replaced with a meticulously designed Multi-Channel Margin Convolution (MMConv). Additionally, a plug-and-play multi-generation model attributor is integrated behind the detector. Given the lack of ex-regulatory images in existing public datasets, we constructed ExImage, a dataset containing 72,000 ex-regulatory images, to validate ExDA's effectiveness. Experiments show that ExDA achieves an average detection accuracy of 99.07% on ExImage and demonstrates significant performance improvements of +5.73% and +10.36% on the GenImage and highly challenging Chameleon datasets, respectively, in cross-dataset evaluation. Notably, ExDA also achieves excellent performance in attribution tasks, demonstrating its superior ability to identify the intrinsic fingerprints of generative models. Our code is available at https://github.com/mwp-create-wonders/ExDA.

Abstract:
We introduce a novel approach for concept blending in pretrained text-to-image diffusion models, aiming to generate images at the intersection of multiple text prompts. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. Central to our method is the unique analogy between diffusion models, which are rooted in non-equilibrium thermodynamics, and the Black-Scholes model for financial option pricing. By drawing parallels between key variables in both domains, we derive a robust algorithm for concept blending that capitalizes on the Markovian dynamics of the Black-Scholes framework. Our text-based concept blending algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it qualitatively and quantitatively to other text based concept blending techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection across various scenarios such as single object per text prompt, multiple objects per text prompt and backgrounds. Our work shows that financially inspired techniques can enhance text-to-image concept blending in generative AI, paving the way for broader innovation. Code is available at https://github.com/divyakraman/BlackScholesDiffusion2024.

Abstract:
Gaussian splatting video has recently emerged as a promising representation for immersive 6-degree-of-freedom (6DoF) content due to its low-latency rendering, compact data structure, and high visual fidelity. In particular, 4D Gaussian splatting video, which models dynamic scenes as temporally evolving Gaussian splats in 3D space, offers an efficient solution for rendering photorealistic, interactive experiences. However, a systematic understanding of user behavior in such environments, especially head movement, remains largely unexplored due to the absence of dedicated datasets tailored to this format. This lack of data severely limits progress in viewpoint prediction, attention modeling, and video streaming optimization. To address this critical gap, we introduce ViewGauss, the first publicly available dataset that captures full 6DoF head movement during the viewing of 4D Gaussian splatting videos. Our dataset is collected from 35 participants using a high-precision Vive Focus Vision headset in a controlled environment, while they freely watched four reconstructed Gaussian splatting video sequences derived from the HiFi4G dataset. The data are recorded with high temporal resolution using position coordinates and unit quaternions, and organized into structured CSV files with precise timestamps for downstream synchronization and behavioral analysis. To demonstrate the practical value of ViewGauss, we conduct a preliminary viewpoint prediction experiment using the iTransformer model. The results show that head orientation patterns in 4D Gaussian splatting video scenes are not only temporally coherent but also learnable, highlighting the potential of ViewGauss as a benchmark for future behavioral modeling and predictive rendering systems. The dataset is publicly available at: https://github.com/Cedarleigh/ViewGauss-DataSet.
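
Because head orientation is recorded as unit quaternions, a common first step in behavioral analysis is to measure how far the head has turned between two samples. The sketch below computes that rotation angle; it is a generic quaternion identity and not code from the ViewGauss repository. The component order used in the example (w, x, y, z) is an assumption about the data layout, although the function itself only requires that both inputs use the same order.

```python
import math


def normalize(q):
    """Scale a quaternion to unit length; rejects the zero quaternion."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("zero quaternion has no orientation")
    return tuple(c / norm for c in q)


def angular_distance_deg(q1, q2):
    """Smallest rotation angle, in degrees, that takes orientation q1 to q2."""
    q1, q2 = normalize(q1), normalize(q2)
    # q and -q encode the same rotation, hence the absolute value.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard acos against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


identity = (1.0, 0.0, 0.0, 0.0)
# A quarter turn about the second axis, written in the assumed (w, x, y, z) order.
quarter_turn = (math.cos(math.pi / 4), 0.0, math.sin(math.pi / 4), 0.0)
angle = angular_distance_deg(identity, quarter_turn)
```

Dividing this angle by the timestamp difference between two consecutive rows gives an angular speed, a typical input feature for viewpoint prediction.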

Abstract:
Referring multi-object tracking (RMOT), which aims to track one or more objects in a video based on a natural language query, is increasingly crucial for a wide range of real-world applications. However, the study of RMOT in satellite video (RMOT-SV) scenarios remains limited, largely due to the high cost of data acquisition and the difficulty of annotation. To address this gap, we introduce RefSat, the first dataset for benchmarking RMOT-SV. RefSat comprises 212 top-down viewpoint video clips, totaling 31,129 frames, collected from a variety of publicly available satellite video datasets. By combining manual annotations of object appearance and position with automatic motion estimation, we build a semi-automatic pipeline that generates high-quality natural language descriptions covering object attributes and motion trajectories, resulting in over 4,000 objects paired with carefully designed textual queries. RefSat features satellite-specific challenges such as small object sizes, cloud occlusions, and motion-referenced semantics. To address these, we introduce RSRefTrack, a tailored baseline designed for small object perception and motion-aware grounding, which outperforms existing state-of-the-art RMOT methods on the RefSat benchmark. Project page: https://github.com/Zhang-Peirong/RefSat

Abstract:
We present MISP-QEKS, the first large-scale tri-modal benchmark for query-by-example keyword spotting (QEKS). Specifically, MISP-QEKS comprises 610,000 enrollment-query pairs with real-world noise, and covers 9,830 keywords. We also introduce a cross-modal enrollment-query matcher (XEQ-Matcher) as the baseline, which computes wake-word likelihoods by pairwise similarity between pre-trained enrollment and query embeddings. We further propose two plug-and-play modules: a visual gating module (VGM) that filters noise using lip movements, and a multimodal alignment module (MAM) that enforces phone-level consistency across modalities. Experiments show that XEQ-Matcher delivers peak performance with visual enrollment and query, achieving 82.82% AUC/24.23% EER and 79.79% AUC/26.20% EER on in-vocabulary and out-of-vocabulary splits, respectively. Incorporating the VGM further improves performance by +1.24% AUC/-0.89% EER and +2.65% AUC/-1.43% EER, respectively. Adding the MAM on top of the VGM yields an additional +1.88% AUC/-1.74% EER and +3.00% AUC/-3.28% EER, respectively. These results confirm that tri-modal fusion significantly enhances robustness and generalisation of QEKS. MISP-QEKS with code is available at https://github.com/coalboss/MISP-QEKS.
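
Scoring a query against an enrollment by pairwise similarity of their embeddings can be sketched as follows. This is an illustrative toy and not the XEQ-Matcher: the use of cosine similarity, the pooling rule (average of each enrollment frame's best match), and the two-dimensional embeddings are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(enrollment, query):
    """Matrix of similarities: one row per enrollment frame, one column per query frame."""
    return [[cosine_similarity(e, q) for q in query] for e in enrollment]


def match_score(enrollment, query):
    """Assumed pooling: mean over enrollment frames of the best-matching query frame."""
    sims = pairwise_similarity(enrollment, query)
    best_per_frame = [max(row) for row in sims]
    return sum(best_per_frame) / len(best_per_frame)


enrollment = [[1.0, 0.0], [0.0, 1.0]]
same_keyword = [[2.0, 0.0], [0.0, 3.0]]      # same directions, different magnitudes
other_keyword = [[-1.0, 0.0], [0.0, -1.0]]   # opposite directions
score_same = match_score(enrollment, same_keyword)
score_other = match_score(enrollment, other_keyword)
```

A threshold on such a score turns it into a detection decision; sweeping that threshold is what produces AUC and EER figures like those reported above.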

Abstract:
Errors often occur during human-robot interactions, such as failing to respond, interrupting users, or providing answers that do not meet user expectations. Detecting these issues promptly is crucial for making human-robot communication more natural and user-friendly. In this technical report, we present the approach proposed by our team, CFM-HRI, for the ERR@HRI 2.0 Challenge 2025, targeting the task of interaction error detection. Specifically, we propose a lightweight and efficient time-series classification approach, empowered by cross-modal alignment, to detect interaction errors more accurately and promptly. To mitigate temporal misalignment across modalities, we adopt an upsampling alignment strategy, followed by feature fusion to obtain unified representations. A sliding-window voting mechanism is then introduced to construct training samples along with corresponding ground truth labels. Several machine learning models are employed to detect errors based on the fused features. Experimental results demonstrate the effectiveness of our approach in capturing cross-modal inconsistencies and improving detection accuracy. Our approach won first place in Sub-Challenge 2 and second place in Sub-Challenge 1 of the ERR@HRI 2.0 Challenge, held in conjunction with ACM MM 2025. We provide detailed descriptions of our data processing and experimental setup, along with an analysis of the limitations of our approach and potential directions for future work. Our code is available at https://github.com/setsaile/CFM-HRI-ERR-HRI2.0.
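
The pipeline of upsampling alignment, feature fusion, and sliding-window voting can be sketched in a few lines. This is a simplified illustration and not the team's released code: repetition-based upsampling, concatenation as the fusion step, binary labels with ties resolved to 0, and the modality names are all assumptions.

```python
def upsample_repeat(frames, target_len):
    """Nearest-neighbour upsampling: stretch `frames` to `target_len` by repetition."""
    n = len(frames)
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]


def fuse(stream_a, stream_b):
    """Concatenate per-frame feature vectors from two time-aligned streams."""
    return [a + b for a, b in zip(stream_a, stream_b)]


def sliding_windows(features, labels, size, stride):
    """Build (window_features, majority_label) samples; ties go to label 0."""
    samples = []
    for start in range(0, len(features) - size + 1, stride):
        window_labels = labels[start:start + size]
        majority = 1 if 2 * sum(window_labels) > size else 0
        samples.append((features[start:start + size], majority))
    return samples


# Hypothetical modalities sampled at different rates over the same time span.
audio = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]  # 6 frames
face = [[1.0], [2.0], [3.0]]                        # 3 frames
labels = [0, 0, 1, 1, 1, 0]                         # per-frame error labels

face_up = upsample_repeat(face, len(audio))
fused = fuse(audio, face_up)
samples = sliding_windows(fused, labels, size=4, stride=2)
```

Each resulting sample is a fixed-length window of fused features with a single voted label, which is the form a standard time-series classifier expects.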

Abstract:
Medical image retrieval is essential for clinical diagnosis and medical education, yet remains highly challenging in endoscopic imaging due to limited annotated data, the lack of domain-specific pretrained models, and subtle visual similarities across anatomical regions. In this work, we utilize self-supervised contrastive learning to pretrain a strong image encoder tailored for endoscopic data, which serves as the backbone for downstream retrieval tasks. For text-to-image retrieval, we adopt a multi-modal contrastive learning approach that aligns textual and visual representations based on this pretrained backbone. To further enhance retrieval performance, we propose a novel re-ranking module that leverages the reasoning capabilities of large vision-language models (LVLMs), such as GPT-4o and Gemini. We also provide a comparative analysis of various retrieval strategies, offering insights into their effectiveness in clinical scenarios. Our method achieves top-2 in text-to-image and top-5 in image-to-image retrieval at the ENTRep Challenge 2025, demonstrating its potential value for endoscopic image retrieval. Source code is available at https://github.com/ELO-Lab/ENTRep-LDSF.

Abstract:
The 1st Workshop on Cognition-oriented Multimodal Affective and Empathetic Computing (CogMAEC) was held at ACM Multimedia 2025. It focused on moving emotional AI beyond basic recognition toward deeper cognitive understanding. While traditional multimodal affective computing has emphasized simple emotion detection, the rise of multimodal large language models (MLLMs) has spurred interest in modeling how emotions emerge and evolve in context. The workshop gathered researchers on emotion reasoning, multimodal understanding, and human-computer empathy, exploring how machines can not only recognize emotions but also explain their causes and simulate human-like affective reasoning. The program featured invited talks, oral presentations, and posters spanning perception, interaction, causal modeling, and cognitive grounding. CogMAEC provided a platform to connect researchers across disciplines and foster future work on cognitively aware affective computing. Materials are available at https://CogMAEC.github.io/MM2025.

Abstract:
Most few-shot learning methods aim to train models to learn parameters that can generalize to new categories using training sets, after which the model parameters are typically fixed. However, due to limited data, models often fail to learn generalizable parameters, as they tend to overfit source domain-specific inductive biases. This can lead to catastrophic forgetting or poor adaptation to new domains. Unlike previous methods, we propose a Text Feature guided dynamic Parameter Adjustment (TFPA) method for few-shot action recognition. Inspired by basis decomposition in vector spaces, TFPA reformulates the traditional linear layer into a set of basis mapping matrices in the parameter space. Each matrix functions analogously to a basis vector in linear algebra, and their linear combinations collectively span the parameter space. To construct a domain-adaptive parameter matrix from these combinations, we propose a Coordinate Vector Computation (CVC) module, which leverages text features as semantic guidance to adaptively estimate optimal linear combination coefficients for the basis mapping matrices. Furthermore, we propose a Centroid Exclusion Loss (CEL) and a Contrastive Clustering Loss (CCL) to enhance the distinctiveness among the basis mapping matrices. These regularization terms promote functional specialization and reduce redundancy across the basis mapping matrices, thereby enhancing performance. Experimental results on five benchmark datasets demonstrate the effectiveness and strong generalization ability of our method in few-shot action recognition. The code will be released soon at https://github.com/ReverseSuzhou/TFPA.
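
The reformulation of a linear layer as a text-guided combination of basis mapping matrices can be sketched as below. It is a toy illustration and not the TFPA implementation: the softmax normalisation of the coefficients, the projection matrix that maps a text feature to coefficient scores, and all numeric values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coordinate_vector(text_feature, projection):
    """One coefficient per basis matrix, computed from the text feature.

    `projection` is a hypothetical learnable matrix with one row per basis.
    """
    scores = [sum(w * x for w, x in zip(row, text_feature)) for row in projection]
    return softmax(scores)


def combine_bases(bases, coeffs):
    """Weighted sum of the basis matrices: W = sum_k coeffs[k] * bases[k]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [sum(c * basis[i][j] for c, basis in zip(coeffs, bases)) for j in range(cols)]
        for i in range(rows)
    ]


def linear(weight, x):
    """Apply the assembled parameter matrix to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


bases = [
    [[1.0, 0.0], [0.0, 1.0]],  # basis 1: identity mapping
    [[0.0, 1.0], [1.0, 0.0]],  # basis 2: swaps the two input channels
]
projection = [[1.0, 0.0], [0.0, 1.0]]
text_feature = [0.0, 0.0]      # neutral text feature gives equal coefficients

coeffs = coordinate_vector(text_feature, projection)
weight = combine_bases(bases, coeffs)
output = linear(weight, [2.0, 4.0])
```

Changing the text feature shifts the coefficients and therefore the effective weight matrix, without modifying the basis matrices themselves.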

Abstract:
In multi-view clustering (MVC), the anchor technique is widely regarded as an effective means of filtering noise and improving computational efficiency. However, existing methods usually construct anchors via heuristic strategies, random sampling, or orthogonal learning, which overlook the distribution differences between anchors and the original data, leading to anchors that lack structural characteristics. To generate anchors whose distribution is similar to that of the original data, we devise the LASD algorithm from the perspective of optimal transport (OT). Concretely, we first design a Multi-View OT (MVOT) framework through complementary and consensus representation learning. Then, we theoretically demonstrate the convexity of MVOT using the positive semidefiniteness of its Hessian matrix, so the global optimal solution of each transport plan can be reached. Further, we establish the strong duality condition for MVOT via the relative interior. Based on dual programming, we then obtain the transport plan between anchors and original data for each view with linear computational complexity. Afterwards, spectral clustering is applied to the consensus plan to produce the discrete cluster labels. Extensive experiments show that our learned anchors reflect the distributions of the original data well, and that the generated clustering results outperform multiple strong MVC competitors, even in large-scale scenarios. The source code is available at https://github.com/junpuzhang/LASD.

Abstract:
Infrared and visible image fusion (IVIF) aims to extract fine details from visible images and complementary information from infrared images. Most existing methods directly extract relevant and complementary features from each modality using neural networks, often overlooking the guidance process and the distinct frequency-domain characteristics of these features. To address this, we propose HRFusion, a novel frequency-domain framework that extracts complementary features from hybrid features using prior-constrained relevant features, effectively enhancing complementary information and reducing redundancy. In HRFusion, hybrid and relevant features are robustly extracted to guide the subsequent fusion stage. By leveraging frequency differences between complementary and relevant features, we introduce the Enhanced Complementary Frequency Network (ECFNet), which uses optimized Variational Mode Decomposition (VMD) to effectively separate and process these signals for fusion. The overall architecture is built with the proposed DTBlock, which captures both global and local features. Extensive experiments show that our method achieves state-of-the-art performance on the TNO, MSRS, M3FD, and Harvard Brain datasets, outperforming recent approaches. Code is available at https://github.com/liuuuuu777/HRFusion.

Abstract:
Tensorial Multi-view Clustering (TMC) methods have received significant attention due to their ability to capture the high-order correlation among different views. Despite their notable progress, two issues persist: 1) most TMC methods rely heavily on prior structure information in sorted datasets, which makes the shuffled case difficult for them to handle; 2) tensor-related operations incur extremely high computational complexity. To address these limitations, we propose a novel framework termed Anchors Bring Stability and Efficiency: Fast Tensorial Multi-view Clustering on Shuffled Case (SE-FTMC). SE-FTMC first learns a set of anchors and a unified corresponding matrix between anchors and samples; low-rank tensor learning is then performed in the anchor space instead of the sample space, which avoids reliance on the prior information hidden in sorted samples and reduces the computational complexity. Finally, SE-FTMC passes the high-order correlations learned in the anchor space into the sample space through the corresponding matrix to improve the efficiency and stability of clustering. Furthermore, SE-FTMC is solved by an efficient algorithm with linear complexity. Extensive experiments on various datasets demonstrate the effectiveness and superiority of our SE-FTMC compared with state-of-the-art methods. The code is publicly available at: https://github.com/jijintian/SE-FTMC.

Abstract:
Data absence and privacy preservation are critical concerns in exploring the data clustering structure. Anchor-based incomplete multi-view clustering methods can efficiently reveal the intrinsic structure of heterogeneous data and have attracted considerable attention in recent years. However, current research faces two problems: first, incomplete samples lead to structural representation discrepancies, and a single structure cannot support effective data mining; second, data is typically stored in a distributed manner, and the consequent privacy requirement makes it difficult to optimize a model with a consistent representation. In this study, we propose a unified anchor-based incomplete multi-view clustering method in a federated learning framework for distributed data, revolving around individual structure preservation and server-central tensor regularization. The framework designs an adaptive embedding graph learning strategy to dynamically capture global and local structures within clients. A tensor regularization term is developed, which explores higher-order correlations across clients with adaptive weights, and further guides the complementarity of client-specific information. Moreover, we construct the optimization framework combined with the augmented Lagrangian method to both reduce the complexity and prevent the leakage of private data. Experimental comparisons with state-of-the-art algorithms demonstrate that the proposed method explores incomplete data effectively, strengthening the applicability of IMVC research in distributed environments. The code for this article is released at https://github.com/LiYannnnnudt/FIMC.

Abstract:
This paper studies the challenging task of Referring Expression Comprehension (REC), which aims at detecting the text-referred target object in an input image. To achieve this, most recent works attempt to adapt powerful pretrained models by integrating additional structures (e.g., low-rank adaptation (LoRA) or adapter modules) to enable efficient parameter tuning. However, all these methods process pretrained features in a position-agnostic manner. This limits their effectiveness in REC tasks, where positional information is essential to correctly localize the target object. To this end, we propose a novel parameter-efficient tuning approach, named Multi-Modal Adaptive Positional Encoding (MAP), which addresses the above problem from a new perspective of positional encoding. More specifically, MAP first generates initial positional embeddings for different visual encoder layers from a set of learnable vectors, and then adjusts them adaptively based on spatial-wise visual-linguistic correlations of input data. In this way, the positional information of different image tokens can be appropriately modeled and utilized by MAP, thus making it more applicable to REC tasks. Extensive experiments on five widely-used datasets demonstrate that MAP achieves comparable results to full fine-tuning methods with far fewer extra parameters and outperforms other parameter-efficient tuning approaches. Our source code is available at: https://github.com/Mr-Bigworth/MAP.

Abstract:
Automatic segmentation of echocardiography videos is crucial for computer-aided cardiovascular function assessment in clinical practice. However, it is a challenging task owing to the existence of massive speckle noise, the large shape variations of heart structures between frames, and limited annotations. In this paper, we propose a novel semi-supervised video segmentation model to comprehensively meet these challenges. The proposed approach has two key techniques. First, we propose a dual-stream architecture that processes spatial and temporal features through separate pathways to capture structural details and motion patterns, then enhances spatiotemporal representations by interacting these decomposed features with query features generated from the original input. Second, as speckle noise primarily concentrates in high-frequency regions, we extend the traditional dilated convolution from a frequency perspective, enabling it to adaptively adjust the dilation rate and convolution kernel weights based on high frequency speckle noise information. This enables the network to focus on specific frequency bands, thereby enhancing its ability to capture both low-frequency context and high-frequency local details. Extensive experiments on the CAMUS and EchoNet-Dynamic datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of both accuracy and inference speed. Codes are available at https://github.com/guojx2255/HSCA-SDC.

Abstract:
Accurate 3D human mesh recovery from point clouds remains challenging. Most existing methods depend on full 3D supervision or complete input data, both of which are difficult to obtain in practice. This calls for robust solutions to handle partial point clouds in a self-supervised manner. To tackle the dual challenges of point cloud incompleteness and the absence of supervision signals, we propose a novel method named SS-HMR, which offers three key insights. First, we estimate point-wise semantics in a self-supervised manner to match partial inputs with a canonical template. The resulting correspondences serve as supervision signals for the regression network in human mesh recovery. Second, we incorporate regression-based and optimization-based paradigms into a self-improving loop: the regression network provides strong initialization for optimization, while the optimization routine generates pseudo-labels that, in turn, enhance the regression network. This mutual feedback enables more accurate and stable mesh recovery over time. Third, to mitigate sensitivity to initialization in the optimization routine, we generate diverse initialization candidates and transform the challenge of escaping local optima into a controllable selection task, improving robustness against sparse and noisy data. Extensive experiments are conducted on three public datasets and results demonstrate that SS-HMR outperforms existing methods. Notably, SS-HMR performs excellently on different test data, whether from original point clouds captured by depth cameras or LiDAR devices, or from noise-added ones. This shows that SS-HMR has strong generalization ability and robustness across different data sources. Codes are available at https://github.com/suchang-99/SS-HMR.

Abstract:
With the advancement of Artificial Intelligence Generated Content (AIGC) technology, digital human representations are increasingly appearing in multimedia interactions. This trend is particularly prominent in news broadcasting. The uniformity of news anchors' appearances and broadcasting environments has facilitated the widespread adoption of AI-powered news anchors. With high accuracy in news reporting and advancements in technology, AI anchors have been increasingly implemented in various news programs. However, there is currently a lack of objective analysis regarding the cognitive impact of digital human news broadcasting on audiences and its corresponding effects on brain signals. In this work, we investigate the differences in electroencephalography (EEG) responses when subjects watch news broadcasts delivered by digital humans versus real human anchors under various conditions. Our contributions are threefold: 1) We develop a dataset recording EEG signals from 32 subjects while they were watching news broadcasts. According to the presentation format (human/AI anchor) and the level of attention (high/normal), we categorize the dataset into four groups. 2) We utilize EEG signals to analyze the perceptual differences of subjects when watching news presented in different formats and in varying attention states. In addition, we investigate the cognitive differences of the subjects in perceived authenticity and importance of the news under these different conditions. 3) We propose an asymmetric multi-representation learning framework to better utilize and analyze the data. The code and data are available at https://github.com/Arcee-LYK/EEG-News.

Abstract:
Multimodal emotion recognition based on physiological signals faces the challenge of missing modalities due to issues such as inaccurate signal synchronization and inadequate device contact. Existing methods either require additional generative modules to handle missing modalities, leading to extra computational overhead, or fail to effectively capture both modality-specific and joint representations. In contrast, we propose a Unified Multi-task Pre-training (UMAP) framework based on the mixture of experts structure. Our approach offers two key advantages: (1) It retains a joint structure while flexibly handling both unimodal and multimodal inputs by selecting lightweight modality experts and shared experts, thus preserving both modality-specific features and joint multimodal information. (2) Three pre-training tasks (contrastive learning, modality matching, and modality generation) are integrated into UMAP through different attention masks, enhancing the model's ability to adapt to both complete and incomplete modalities during fine-tuning. Comprehensive experiments conducted on three benchmark datasets demonstrate that UMAP achieves state-of-the-art (SOTA) performance, both in multimodal scenarios and in cases where any modality is missing. The code is available at https://github.com/iiieeeve/UMAP.

Abstract:
Chinese calligraphy offers rich visual structure not found in abstract or figurative paintings or in photographic images. This makes it well suited for studying how personality shapes aesthetic preference, yet few works have explored this link. This paper introduces the first computational framework that models the link between viewer personality and Kai Shu calligraphic preference. It collects a dataset of Kai Shu calligraphy images, user preference scores, and Big Five personality traits. It extracts 160 structural feature descriptors from eight categories, such as stroke curvature, layout, and whitespace. Regression and attribution methods reveal five patterns, for example that visual structure predicts perceived style and that traits such as Openness and Neuroticism influence preference. High-Openness users prefer balanced, clean layouts. Low-Neuroticism users favor lighter, irregular forms. Some personality-feature pairs follow inverted-U trends, where moderate structural complexity leads to higher preference. These results connect cognitive traits with visual structure and support interpretable, personality-aware modeling of aesthetic response. Our findings support personalized style discovery and open up new directions for interest-driven aesthetic education and digital preservation. Code is available at https://github.com/tianchengliu18/kai2trait.

Abstract:
While recent advancements in supervised gait recognition have yielded promising results, these approaches rely heavily on annotated walking data, limiting their generalizability to complex environments. This paper presents a self-supervised gait recognition framework using human poses as input to address this challenge, focusing on high-quality pretraining data and self-supervised learning strategies. We first introduce StreamGait, a large-scale, unlabelled dataset that captures in-the-wild distributions of walking sequences. This dataset is curated from Internet livestreams across diverse geographic and environmental scenarios, reflecting variations in real-world camera angles, weather, and pedestrian behavior. Our framework, MirrorGait, conducts self-supervised learning by integrating 2D-to-3D pose reconstruction to synthesize multi-view perspectives for effective 3D-aware contrastive learning. With specific designs of a temporal position embedding and a gait partition head on a Transformer backbone, the encoder can readily adapt to the periodic and fine-grained nature of gait. Extensive experiments on three widely used gait datasets, Gait3D, GREW, and OUMVLP-Pose, demonstrate that our method, with minimal fine-tuning of the pretrained model, achieves state-of-the-art performance among pose-based gait recognition approaches. The dataset, code, and models are available at https://github.com/BNU-IVC/StreamGait.

Abstract:
Video Individual Counting (VIC), which seeks to count unique individuals across video sequences without duplication, has broader applications than traditional Video Crowd Counting (VCC), including urban planning, event management, and safety monitoring. However, although current VIC approaches have demonstrated strong capabilities, their reliance on identity-level or group-level annotations necessitates substantial labeling effort and expense. To reduce the high costs of manual annotation, we introduce VIC-SSL, a novel self-supervised learning approach that utilizes unlabeled data along with the innovative feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting in the feature space rather than the image space, F-ShiftMix generates realistic crowd motion without explicit annotations, while preserving global semantic coherence. Furthermore, VIC-SSL integrates the Cost-guided Flow Prompt (CFP) and the Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization and inter-frame correspondence learning. Our extensive experiments across three datasets, including SenseCrowd, CroHD, and CARLA, demonstrate that VIC-SSL substantially outperforms existing methods, achieving state-of-the-art results with significantly reduced data requirements. These results showcase VIC-SSL's potential to dramatically lower annotation costs and improve the deployment feasibility of VIC systems in complex scenarios. The project website is available at https://leohuang0511.github.io/vic-ssl.

Abstract:
Point cloud completion is crucial for downstream tasks in 3D visual perception. However, existing methods often struggle to generalize to real-world scans due to their heavy reliance on abundant paired point clouds for training and their neglect of the distribution shift between training and testing datasets. To address these limitations, this paper explores a practical and challenging setting: "source-free domain adaptive point cloud completion", where a well-trained source model must adapt to the target data distribution without access to source data, aiming to improve completion performance. To tackle this problem, we propose a novel method called "Dual-Stage Preservation and Fusion" (DSPF), which comprises two key training stages tailored to this new setting. In the source preservation stage, we introduce graph structural alignment and marginal feature alignment to preserve and transfer essential knowledge from the source domain. In the target fusion stage, we design a self-supervised loss to capture the geometric structure of target instances and establish a bidirectional interaction mechanism to transfer partial source knowledge to the target distribution. Extensive experiments on various cross-domain point cloud completion benchmarks demonstrate that our proposed DSPF significantly outperforms existing methods, validating its effectiveness and robustness in source-free domain adaptation scenarios. Our code is available at https://github.com/ZhiXia-SEU/DSPF.

Abstract:
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby enhancing accessibility and productivity. However, existing works mainly concentrate on improving AI agents' capabilities to automate procedures and free humans from tedious complexities. They lack evaluation of potentially erroneous agent-generated actions that could lead to task failure and irreversible system damage, a risk that is especially serious in fields where automated systems control critical operations. To address this issue, we introduce TrustScorer, which evaluates the trustworthiness of actions generated by AI agents, enabling a new human-AI collaboration paradigm in GUI task automation, i.e., actions with low predicted trust scores are redirected for human intervention, thereby combining human precision with AI efficiency. We further construct TrustBench, which provides diverse real-world GUI automation tasks with per-step ground-truth action sequences and trust score annotations, benchmarking trustworthy GUI task automation. Experimental results show that the three types of exploratory TrustScorer methods we design can help identify erroneous actions and enhance the task success rate (up to 38.1% improvement). Moreover, comprehensive analysis highlights the critical role of action trustworthiness assessment in agent automation and provides valuable insights for future explorations into more advanced trustworthiness scoring techniques to further support reliable human-AI collaboration. Code and data are available at https://github.com/showlab/TrustScorer.
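
The routing idea in this abstract (actions with low predicted trust scores go to a human) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `ScoredAction` type, the `route_actions` helper, the [0, 1] score range, and the 0.5 threshold are hypothetical and are not taken from the TrustScorer code.

```python
from dataclasses import dataclass


@dataclass
class ScoredAction:
    description: str
    trust_score: float  # hypothetical: assumed to lie in [0, 1]


def route_actions(actions, threshold=0.5):
    """Split actions into an auto-execute list and a human-review list.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    auto, review = [], []
    for action in actions:
        if action.trust_score >= threshold:
            auto.append(action)      # trusted enough to run automatically
        else:
            review.append(action)    # redirected for human intervention
    return auto, review


steps = [
    ScoredAction("click 'Save'", 0.92),
    ScoredAction("delete all files", 0.18),
    ScoredAction("type filename", 0.75),
]
auto, review = route_actions(steps, threshold=0.5)
```

In this toy run only the low-scoring "delete all files" step lands in `review`; the other two steps are executed without interruption.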

Abstract:
End-to-end automated fact-checking (AFC) aims to assess the truthfulness of claims using retrieved evidence. Some researchers use crawlers or search APIs to retrieve evidence from the web for veracity classification. However, existing methods indiscriminately rely on the retrieved evidence and overlook that the retrieved results are not always reliable. This unilateral reliance on evidence significantly hampers the performance of fact-checking. In this paper, we account for the diverse reliability levels of retrieved evidence and eliminate its negative impact from a causal perspective. To achieve our goal, we propose a novel Causal intervention and Counterfactual reasoning based Multi-Checker framework (CCMC), which introduces two additional counterfactual fact-checkers to verify claims from the counterfactual perspective. Specifically, we construct two distinct types of counterfactual instances via causal intervention to imitate the situations where the evidence is partially reliable or totally unreliable. Correspondingly, two counterfactual fact-checkers are trained with tailored counterfactual instances by counterfactual reasoning. During inference, the two counterfactual fact-checkers are employed to estimate and eliminate the potential impact of unreliable evidence. Extensive experiments on two real-world datasets demonstrate the superiority of our approach for improving end-to-end AFC. In particular, we surpass existing methods by 3.70% and 5.55% under gold and system evidence on the MOCHEG benchmark, respectively. Our code is available at https://github.com/BeiyuXuboL/CCMC.

Abstract:
Text-to-image generation models can create diverse, high-quality images, but they frequently encounter challenges in accurately rendering text within those images due to the insufficient representation of the desired text. In this study, we introduce RealText, a method for generating scene text images that excels in producing precise and realistic scene text images in any language. We disentangle scene text image generation into three stages: background and glyph image generation, text deformation, and whole image generation. Initially, we utilize prompts to guide the creation of well-organized background images. By identifying optimal text placements on these backgrounds, we render the glyph images of the target text using a user-specified font, effectively eliminating incorrect characters. In the next stage, we propose scene sensing to perceive text carrier surfaces and viewpoints through 3D scene reconstruction using depth and normal maps to apply text deformation, thereby enhancing the realism of generated images. The final stage involves generating the complete image with the aid of background and glyph guidance. Thanks to glyph disentangling, scene sensing, and text inpainting, we can exert more precise control over the scene text image generation process. We have developed a unified framework that supports major generation models. Extensive experiments illustrate the exceptional performance of our method in generating images with multilingual text. The code will soon be available at https://github.com/cccvl/RealText.

Abstract:
Transformer-based models have demonstrated remarkable performance in computer vision tasks. However, their increasing model size leads to substantial memory demands and higher latency, hindering practical deployment. This paper presents an adaptive dynamic layer-skipping framework based on a Markov Decision Process, which determines optimal computational paths based on the current state of input samples. We introduce a Temporal Importance Difference Reward mechanism to address the credit assignment problem in layer-skipping decisions, and develop a knowledge distillation strategy using learnable cognitive tokens to compensate for information loss. Experiments on various models demonstrate that our method significantly reduces computational costs while maintaining accuracy, offering a practical solution for deploying high-performance Transformer models in resource-constrained environments. The code is available at https://github.com/wjjkhl/ASTER.

Abstract:
Long-term multi-animal tracking in densely group-housed agricultural settings is critical for automated behavior monitoring and early anomaly detection in precision livestock farming. However, it poses significant challenges due to persistent occlusions from feeders and water dispensers, high inter-individual appearance similarity, and drastic visual changes across day and night cycles. Existing multi-object tracking datasets rarely capture the combined difficulty of these real-world conditions. To address this, we introduce OinkTrack, a large-scale benchmark for continuous multi-pig tracking in commercial farm environments. The dataset comprises over five hours of annotated video across sixteen sequences, covering day, night, night-to-day, and day-to-night transitions. Each sequence ranges from one minute to one hour, featuring an average of thirty-six pigs per frame. In total, OinkTrack provides 573,700 bounding boxes linked to 574 consistent pig identities. It enables detailed behavior analysis under varying lighting and crowding conditions. We describe the data collection and annotation process, present statistical insights into tracking difficulty, and benchmark 11 state-of-the-art tracking methods. OinkTrack provides a robust foundation for developing long-term tracking models and supports downstream applications such as individual activity profiling and early detection of abnormal behavior in real-world, high-density animal populations. The complete dataset and supplementary materials are publicly accessible at https://leohuang0511.github.io/oinktrack-page.

Abstract:
Over recent years, EEG-based brain decoding of perceived multimedia has emerged as an important multidisciplinary research area. A lack of datasets with multimedia stimuli, however, presents a significant challenge to its further advancement. In this paper, we establish a facial-image-stimulated EEG dataset, named EEG-Face, to address this challenge and provide crucial support for relevant research, such as brain-computer interfaces (BCI), face recognition via brain-perceived EEGs, and multimedia content analysis via brain perception activities. As facial images not only distinguish between genders but also reflect deeper individual differences, the proposed EEG-Face provides larger scope, more focus, and greater potential for dedicated research on brain perception of human faces. As shown in Figure 1, EEG-Face consists of 20,000 EEG trials recorded in response to 40 individual faces, all of Chinese film stars. Following the establishment of the dataset, a range of experiments on EEG-Face is carried out to demonstrate its usability and feasibility, including: (i) neural correlation of gender perceptions; (ii) EEG-Stimulus pairing verification; and (iii) face recognition via classification of randomized EEG trials. The dataset and the code for all reported experiments are available at https://github.com/eeg-wx2024/EEG-Face.

Abstract:
Osteoporosis has emerged as a significant global public health challenge, affecting more than 200 million people worldwide. Precise screening through bone mineral density (BMD) assessment is essential for timely intervention. Although dual-energy X-ray absorptiometry (DXA) serves as the gold standard for BMD evaluation, its widespread application is hampered by high costs and limited accessibility. In recent years, deep learning has demonstrated great potential in osteoporosis classification and predicting BMD from X-ray and CT scans. However, existing public datasets have notable limitations, particularly the absence of paired multimodal imaging datasets for cross-modal modeling and the lack of accurate BMD annotations required for osteoporosis-specific research and clinical model optimization. To address these issues, we introduce the Lumbar Multimodal Osteoporosis Screening dataset (LUMOS), the first multimodal dataset specifically designed for lumbar osteoporosis screening. LUMOS integrates clinical data from 803 patients, including 1,620 anteroposterior/lateral lumbar X-rays with BMD values and T-scores, comprehensive demographic information, and 280 lumbar CT scans. The advent of LUMOS is expected to propel forward research on automated osteoporosis classification, BMD prediction, and other related tasks. Its standardized and multimodal nature fills critical gaps in lumbar osteoporosis data, providing high-quality data support for the development and validation of medical AI algorithms in the early detection of osteoporosis. The dataset is available at https://keyueshi.github.io/LUMOS/.

Abstract:
Low-light Image Enhancement (LIE) technology adaptively improves brightness while preserving texture details and suppressing noise artifacts, thereby reducing visual degradation caused by insufficient illumination. While deep learning-based image enhancement algorithms have made significant progress, a key gap remains in establishing standardized methods for fairly evaluating and comparing their performance. To bridge this gap, this paper systematically investigates enhanced low-light image quality assessment from both subjective and objective dimensions. First, we introduce a Real-world Low-light Image Enhancement quality assessment dataset (RLIE), which contains 1540 images from 154 scenarios, each with a subjective score assigned by human subjects. Based on this, we propose a low-light enhanced image quality assessment method based on Multi-level Illumination Injection and Hierarchical Discrepancy Perception (MIIHDP). The core idea of this method is to hierarchically inject separated illumination information into the feature extraction process, then tailor the processing of difference information at different scales to obtain a more comprehensive representation. Finally, extensive statistical analyses demonstrate the rationality of the proposed RLIE dataset, and experimental results show the superior performance of the proposed MIIHDP compared with state-of-the-art methods. Our dataset and code are released at: https://github.com/BoHu90/RLIE.

Abstract:
Reproducibility is indispensable for transferring explainable-AI algorithms from academic prototypes to production systems. This companion paper documents the artefacts, procedures, and outcomes that reproduce the empirical claims of "Enhancing Model Interpretability with Local Attribution over Global Exploration" (ACM MM 2024). We release a containerised archive containing source code, data-serialisation scripts, one-click executables, and a detailed README, all conforming to the ACM Multimedia reproducibility guidelines. The regenerated Insertion and Deletion scores deviate by only 2.2% on average. In addition, an exhaustive {10, 20, 30}^3 grid search over key hyper-parameters reveals a new configuration, (30, 20, 30), that improves the Insertion score of three convolutional backbones by 7.51% without additional code changes. These artefacts provide a rigorous, extensible foundation for future research on local attribution methods. Our code is available at: https://github.com/LMBTough/LA/
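
As a rough illustration of the grid search mentioned in this abstract, the sketch below assumes the grid covers three hyper-parameters that each take a value in {10, 20, 30}, which gives 27 configurations. The `toy_score` function is a hypothetical placeholder with an arbitrary peak; it does not reproduce the Insertion score or any result from the paper.

```python
from itertools import product

# Assumed search space: three hyper-parameters, each drawn from {10, 20, 30}.
VALUES = (10, 20, 30)
grid = list(product(VALUES, repeat=3))  # all 3**3 = 27 configurations


def grid_search(configs, score_fn):
    """Return the configuration with the highest score under score_fn."""
    return max(configs, key=score_fn)


def toy_score(config):
    """Hypothetical placeholder score with an arbitrary peak at (20, 10, 30)."""
    peak = (20, 10, 30)
    return -sum((c - p) ** 2 for c, p in zip(config, peak))


best = grid_search(grid, toy_score)
```

In practice `score_fn` would run the attribution method with the given configuration and return the measured metric; the exhaustive loop stays the same.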

Abstract:
Generating multiple appropriate facial reactions (MAFR) is essential for effective human-agent interaction. However, existing methods typically do not jointly model both local and global emotional cues, and neglect the temporal dynamics of facial expressions, leading to emotionally inconsistent and less natural reactions. In this work, we combine local and global emotional features to form a more comprehensive emotional representation. Our method further introduces motion-aware visual features that capture the dynamic evolution of facial expressions beyond static frames. By integrating both appearance and motion information within a structured generative framework, our approach enables more context-aware and temporally natural listener reactions. Experimental results demonstrate that our method outperforms existing approaches in both reaction diversity and appropriateness, and it ranked first in the React 2025 challenge offline track. The implementation code can be accessed at: https://github.com/mtv-2025-react/mtv-2025-2025.git.

Abstract:
Understanding fine-grained sentiment dynamics in human conversations is a central goal for next-generation artificial intelligence, especially in scenarios where interactions are rich in both modalities and context. To advance research in this area, we organize the Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) challenge to the community of aspect-based sentiment analysis. The MCABSA challenge introduces two novel subtasks: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, and rationale from multi-turn, multi-party multimodal dialogue; and 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation along with the causal reasons. To support these tasks, we present the PanoSent dataset, a high-quality, large-scale benchmark featuring multi-turn, multi-party dialogues annotated with both explicit and implicit sentiment elements across text, image, audio, and video modalities. PanoSent offers extensive real-world scenario coverage, providing a comprehensive testbed for multimodal conversational sentiment analysis. The challenge has attracted widespread participation from both academia and industry, with over 30 teams registered and more than 100 successful submissions. In this paper, we introduce the task, dataset, and evaluation settings, summarize the systems of the top teams, and discuss the findings of the participants. Further details of the challenge can be found at https://panosent.github.io/MM25-challenge.

Abstract:
Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated designs. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently lacks fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory design relying solely on high-level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture to incorporate both shallow- and high-level features for memory, which leverages the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of the Pixel-guided Local Alignment Module (PLAM) and the Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net.

Abstract:
Small-scale object detection remains a major challenge in semi-supervised object detection (SSOD), particularly in medical image analysis. Conventional teacher models often struggle to accurately capture the features of low-contrast small lesions, leading to noisy pseudo-labels in both localization and classification, which introduces severe uncertainty and degrades detection performance. To address this issue, we propose Dual Teacher, a novel multimodal semi-supervised detection framework designed to enhance pseudo-label reliability and improve small-scale lesion detection. Specifically, we introduce two complementary teacher models: Hybrid-Scale Teacher, which exploits downsampled views to strengthen multi-scale feature learning, and Entropy-Based Multi-Modal Teacher, which leverages entropy maps to refine the quality of small-scale pseudo-labels. To effectively fuse predictions from both teachers and resolve conflicts, we propose a Dempster-Shafer-based Dual-Teacher pseudo-label fusion strategy that explicitly models uncertainty and optimizes classification confidence. Additionally, we introduce a class-adaptive threshold mechanism that dynamically adjusts pseudo-label selection based on dual-teacher predictions, further boosting the recall of small-scale lesions. Extensive experiments on the Dental Disease Dataset, ChestX-Det and M3FD demonstrate that our method consistently surpasses state-of-the-art SSOD approaches. Code is available at: https://github.com/z316910/Dual-Teacher.git.
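
For readers unfamiliar with Dempster-Shafer theory, the sketch below shows the classical Dempster rule of combination applied to two teachers' beliefs about one candidate box. It is a generic textbook illustration, not the paper's fusion strategy: the `dempster_combine` helper, the two-class frame, and the mass values are all assumptions made for this example.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Normalise by 1 - K so the fused masses sum to one.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


# Hypothetical frame of discernment and teacher outputs for one candidate box.
LESION = frozenset({"lesion"})
BACKGROUND = frozenset({"background"})
UNKNOWN = LESION | BACKGROUND  # mass on the whole frame models uncertainty

teacher_a = {LESION: 0.6, BACKGROUND: 0.1, UNKNOWN: 0.3}
teacher_b = {LESION: 0.5, BACKGROUND: 0.2, UNKNOWN: 0.3}
fused = dempster_combine(teacher_a, teacher_b)
```

With these made-up masses the fused belief in "lesion" rises to roughly 0.76, higher than either teacher alone, while the mass left on the whole frame shrinks, which is the usual effect of combining two sources that mostly agree.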

Abstract:
RGB-Thermal images leverage complementary optical and thermal modalities to identify objects. While achieving superior performance, the reliance on multimodal fusion inherently limits inference efficiency and adaptability to harsh RGB-failure environments. In this work, we propose a multimodal decomposed distillation framework to develop robust thermal-only detectors by transferring knowledge from multimodal teachers. Unlike conventional one-to-one distillation, we decouple the tasks of simultaneously mimicking RGB-T teacher representations and preserving thermal-specific student feature integrity into dual branches to avoid intrinsic semantic conflicts. Specifically, we present channel-adaptive prompt learning for cross-modal decomposition and a frequency-guided dynamic module for decomposed knowledge integration. The dual-branch architecture employs asymmetric training objectives to ensure effective cross-modal knowledge transfer while preserving the integrity of thermal information. Furthermore, to exploit finer-grained instance knowledge across both feature and prediction levels, we introduce a customized instance alignment distillation to enhance the local discriminability in feature pyramids, and propose an uncertainty-aware logit distillation to compensate for ambiguous predictions in detection heads. Experiments on three datasets validate the effectiveness of our framework in boosting thermal-based detectors. Code is released at https://github.com/lyf0801/DecomKD.

Abstract:
The open-vocabulary paradigm enhances 6D object pose estimation by leveraging language cues to transfer learned representations from seen to unseen objects, yet its performance suffers from a misalignment between vision-language representations and the 6D pose space. This challenge is compounded by the intrinsic lack of pose awareness in models like CLIP (Contrastive Language-Image Pre-Training). To address this limitation, we introduce CLIP-6D, which equips CLIP with the ability to estimate poses through the learning of generalizable representations specific to objects. CLIP-6D consists of three key components: (1) an innovative self-alignment strategy that enables CLIP to derive geometric representations from RGB images by leveraging its inherent feature extractor; (2) a multiplex representation interactive learning method that efficiently bridges the heterogeneous representations of object category priors, geometry, and spatial correlations; and (3) a lightweight adapter that uses knowledge distillation to improve CLIP's capture of detailed semantic representations. Experiments show that CLIP-6D achieves improvements of 10.8% and 16.4% on the 5° 2 cm and 10° 5 cm metrics for zero-shot generalization and a speedup of 5.4 FPS in inference over current state-of-the-art methods. The source code and models are available at https://github.com/whoawong/CLIP-6D.git.

Abstract:
Existing multimodal entity and relation extraction tasks primarily focus on text-to-text or text-to-visual entity relations, overlooking real-world complexities involving visual-to-text and visual-to-visual cases, thus failing to capture the richer semantic structures in complex cross-modal interactions. To address these limitations, we propose a new task, Unconstrained Multimodal Entity and Relation Extraction (U-MERE), which jointly extracts arbitrary visual and textual entities and their relations from image-text pairs. To accomplish U-MERE, we construct UMERE-Bench, a benchmark with over 9,000 samples that comprehensively covers four cross-modal entity relation directions and three task settings. Given the difficulty of jointly modeling diverse directions of cross-modal entity relations, we introduce Collaborative Modeling and Order-Sensitive (CMOS), which collaboratively guides large vision-language models (LVLMs) to decompose task complexity and mitigates generation order bias from fixed target relation sequences. CMOS employs small models to generate candidate entities, guiding LVLMs to capture key information, and jointly optimizes multiple feasible relation orderings to reduce order dependency. Additionally, we design a Multimodal Order-aware Matching (MOM) evaluation method to align predictions with ground truth for precise assessment. Experimental results reveal that current LVLMs show limited performance on U-MERE, underscoring its inherent challenges, while CMOS consistently achieves superior performance across multiple advanced LVLMs, demonstrating its effectiveness and generalization capability. The dataset and code will be available at https://github.com/jiaweidoris/U-MERE.

Abstract:
Existing Unsupervised Video Object Segmentation (UVOS) solutions primarily focus on frame-to-frame propagation and often struggle with extended sequences where objects undergo complex transformations. In this work, we observe that many natural and artificial motions exhibit inherent periodicity, where objects return to similar states across time, particularly in complex scenarios. Leveraging this insight, we present PeriodVOS, a novel framework that exploits recurring motion patterns to enhance segmentation quality across diverse video contexts. Specifically, we propose to establish intra-period consistency to enforce stable segmentation within short time windows, while mitigating the effects of temporary disturbances. Furthermore, to capture global dependencies, we present inter-period correlation to build associations between similar object states across different time periods. Additionally, an adaptive temporal contextual coupling is designed to dynamically adjust how temporal context is integrated based on video content. Through extensive evaluation on six standard benchmarks, including the DAVIS-2016, FBMS, Youtube-Objects, DAVSOD, ViSal, and MCL datasets, our PeriodVOS outperforms state-of-the-art methods, demonstrating the potential of video periodic mining, particularly in challenging scenarios. We have released the source code at https://github.com/smdshzyjbr-qhw/PeriodVOS.

Abstract:
Digital images, serving as the primary carrier of information, have been widely spread on the Internet. Image steganography is a technology that employs images as the carrier for information hiding. While current deep image steganography has demonstrated impressive encoding abilities across various media, two serious problems have been overlooked in deep image-to-image steganography and hinder its application in real-world scenarios, which we define as the problems of Pixel Value Overflow and Gap of Precision. In this paper, we explore the cause of these problems and introduce a plug-and-play Universal Suppressor to solve the application problems of deep image-to-image steganography in real-world scenarios, which can be flexibly applied to various models with different structures. Experiments demonstrate that our Universal Suppressor performs well in existing state-of-the-art (SOTA) models and confers them with intrinsic robustness for real-world deployment. The code will be released at https://github.com/aoli-gei/USP.

Abstract:
Near-infrared transmission through the finger can capture the vein structure for identity recognition. However, in outdoor applications, finger vein imaging is significantly affected by environmental illumination, resulting in low recognition performance. Existing methods typically address this issue by constructing multi-illumination models, but collecting multi-illumination images from individuals is challenging, and overexposure can cause venous structure distortion. This paper proposes MDA-Net, a Multi-illumination Domain Adaptive Network for finger vein recognition, which is engineered to excel in the dynamic outdoor lighting landscape with various conditions including overexposure, using only data collected under a single illumination for training. First, an Illumination Feature Separation Network (IFSNet) is used to remove the illumination components and obtain illumination-invariant features. Then, an Absorption Difference Feature Extraction Network (ADFENet) is used to reduce the impact of venous structure distortion under illumination conditions, especially overexposure. To replicate the entire range from low-light to overexposure in outdoor scenarios, a novel Multi-Illumination Finger Vein Dataset (MIFVD) is constructed with significant illumination variations. Experimental results show that MDA-Net significantly improves recognition performance under complex illumination conditions, achieving a state-of-the-art (SOTA) average recognition rate of 91.67% and an average equal error rate (EER) of 0.96%. Further validation on the public datasets SDU and USM demonstrates SOTA EERs of 0.16% and 0.10%, respectively. The License for MIFVD can be accessed at: https://github.com/AHU-MedImagingIJR/MIFVD.

Abstract:
Recently, Learned Image Compression (LIC) models have garnered significant attention due to their superior performance in comparison to traditional image codecs. However, the growing complexity of these deep learning-based models results in high memory consumption and computational load, which limits their practical deployment. Quantization has emerged as a promising technique to reduce both the storage requirements and computational overhead. Despite its success in high-level vision tasks like image recognition and object detection, quantization applied to LIC models remains underexplored. In this work, we identify the unique challenges of quantizing LIC models, specifically focusing on the impact of latent distribution ranges in high-bitrate models. We observe that the activation layers of high-bitrate models exhibit a wider distribution range, which causes significant performance degradation after quantization. Furthermore, we explore the limitations of existing LIC quantization schemes, such as per-channel quantization for activation layers, which result in poor hardware acceleration performance and increased data storage overhead. To address these challenges, we propose the activation and weight distribution balancing post-training quantization (AWDB-PTQ) method for LIC models, which uses a coarse-to-fine strategy to optimize balancing coefficients. In addition, we employ per-tensor activation quantization and symmetric uniform quantization to better facilitate hardware acceleration. Experimental results demonstrate that our proposed method outperforms existing methods in terms of both compression performance and computational efficiency. Our code and data are available at: https://github.com/jie-yu16/AWDB-PTQ.
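
To make the terms "per-tensor" and "symmetric uniform" quantization concrete, here is a minimal sketch in plain Python. It is a generic textbook formulation, not the AWDB-PTQ method: the function names and the max-absolute-value calibration are assumptions made for illustration. A single scale is shared by the whole tensor, so a wider value range directly enlarges the quantization step, which is the degradation the abstract attributes to high-bitrate models.

```python
def quantize_symmetric_per_tensor(values, num_bits=8):
    """Quantize a flat list of floats with one shared scale (per-tensor).

    Symmetric: the integer grid is [-qmax, qmax] and zero maps to zero.
    Uniform: all quantization steps have the same width `scale`.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Assumed calibration: map the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in quantized]


# Hypothetical activations: the second tensor has a 10x wider range.
narrow = [-1.0, -0.3, 0.0, 0.25, 1.0]
wide = [10.0 * v for v in narrow]

q_narrow, s_narrow = quantize_symmetric_per_tensor(narrow)
q_wide, s_wide = quantize_symmetric_per_tensor(wide)
# s_wide is ten times s_narrow, so the worst-case rounding error grows with the range.
```

Per-channel schemes would instead keep one scale per channel, which lowers error but requires storing and applying many scales; the per-tensor form above needs only one multiplication factor per tensor.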

Abstract:
Dance is an important art form in human culture, but creating new dances can be both challenging and time-consuming. In this paper, we propose a novel dance choreography framework, EDMG, designed to efficiently generate creative and long-lasting dance sequences conditioned on music and dance descriptions. In the first stage, we propose a flexible dance diffusion method that combines dance genre descriptions with descriptions of fundamental movements to generate the dance sequences. To achieve high computational efficiency and inference speed, EDMG designs a lightweight denoising module using the selective parallel scanning algorithm from Mamba2. This Parallel Mamba Denoiser significantly reduces the number of parameters and remarkably accelerates both the learning and inference processes. In the second stage, by designing a smoothing module with a long receptive field, we mitigate joint error accumulation that causes jittering movements and foot sliding, thereby enhancing the fluency and visual appeal of the dance movements. Furthermore, we extend the AIST++ dataset by adding detailed descriptions of dance genres and fundamental movements, using a Large Language Model (LLM). These descriptions further improve the choreography generation. EDMG is validated through extensive experiments, demonstrating that our method can both effectively and efficiently generate long-term dances suitable for various dance genres. Project URL: https://github.com/neymar277/EDMG.

Abstract:
Low-light image enhancement (LLIE) aims to restore low-light images to normal lighting conditions by improving their illumination and fine details, thereby facilitating efficient execution of downstream visual tasks. Traditional LLIE methods improve image quality but often introduce high-frequency artifacts, which are difficult to eliminate, hindering detail recovery and quality enhancement in LLIE. To solve this problem, we introduce a novel perspective: instead of traditional artifact suppression, sparsification-induced artifacts are repurposed as constructive regularization signals to guide detail recovery. By analyzing the impact of sparsified frequency components and their role in reconstruction artifacts, a detailed mathematical framework is presented. Specifically, we propose a novel loss function, SASW-Loss, which combines Sparse Artifact Similarity Loss (SAS-Loss) and Walsh-Hadamard Coefficient Loss (WHC-Loss). SAS-Loss mitigates the over-compensation of missing frequencies, helping the network recover structural details, while WHC-Loss optimizes the frequency-domain representation, restoring luminance, suppressing noise, and enhancing both structure and details. Extensive experiments show that our approach outperforms existing state-of-the-art methods, achieving superior performance in structural detail preservation and noise suppression. These results validate the effectiveness of our new perspective, which leverages sparsification artifacts to guide detail recovery, demonstrating significant improvements and robust performance across multiple models, and opening new avenues for future research. The code is available at https://github.com/werringwu/SASW.git.
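
For readers unfamiliar with Walsh-Hadamard coefficients, the sketch below shows the standard fast Walsh-Hadamard transform on a 1-D signal, followed by a simple coefficient-domain distance. The transform is the textbook butterfly algorithm; the helper `whc_l1_distance` is a hypothetical stand-in and should not be read as the paper's WHC-Loss, whose exact form is not given in the abstract.

```python
def fwht(signal):
    """Unnormalized fast Walsh-Hadamard transform (Hadamard ordering).

    The input length must be a power of two. Runs in O(n log n)
    using only additions and subtractions.
    """
    a = list(signal)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                # Butterfly: sum and difference of paired entries.
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v
        h *= 2
    return a


def whc_l1_distance(pred, target):
    """Hypothetical coefficient-domain distance: mean absolute difference
    between the Walsh-Hadamard coefficients of two signals, scaled by 1/n."""
    n = len(pred)
    cp, ct = fwht(pred), fwht(target)
    return sum(abs(x - y) for x, y in zip(cp, ct)) / (n * n)


coeffs = fwht([1, 0, 1, 0, 0, 1, 1, 0])
# coeffs == [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying `fwht` twice returns the original signal scaled by its length, so the same routine serves as the inverse transform after dividing by `n`.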

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities but raise significant privacy concerns due to their ability to infer sensitive personal information from images with high precision. While current LVLMs are relatively well aligned to protect universal privacy, e.g., credit card data, we argue that privacy is inherently personalized and context-dependent. This work pivots towards a novel task: can LVLMs achieve Inference-Time Personalized Privacy Protection (ITP3), allowing users to dynamically specify privacy boundaries through language specifications? To this end, we present SPY-Bench, the first systematic assessment of ITP3 ability, which comprises (1) 32,700 unique samples with image-question pairs and personalized privacy instructions across 67 categories and 24 real-world scenarios, and (2) novel metrics grounded in user specifications and context awareness. Benchmarking the ITP3 ability of 21 SOTA LVLMs, we reveal that: (i) most models, even the top-performing o4-mini, perform poorly, with only ~24% compliance accuracy; (ii) they show quite limited contextual privacy understanding capability. Therefore, we implement initial ITP3 alignment methods, including a novel Noise Contrastive Alignment variant which achieves 96.88% accuracy while maintaining reasonable general performance. These results mark an initial step towards the ethical deployment of more controllable LVLMs. Code and data are at https://github.com/achernarwang/specify-privacy-yourself.

Abstract:
In natural language communication, emotions are often conveyed through non-verbal sounds (NVs), such as laughter, crying, cough and so on. However, most existing text-to-speech (TTS) corpora lack annotations for these non-verbal sounds, leading to a scarcity of systems capable of generating them. To address this gap, we introduce SMIIP-NV, a non-verbal speech synthesis corpus annotated with both emotions and non-verbal sounds, including laughter, crying, and cough. To the best of our knowledge, SMIIP-NV is the largest publicly available open-source expressive speech corpus that includes non-verbal speech and rich annotations. It comprises 33 hours of speech data, covering five distinct emotions and three types of non-verbal sounds, with detailed transcriptions and precise timestamps for each occurrence of non-verbal sounds. Additionally, the corpus provides annotations for speech segments that contain laughter or crying. To demonstrate the utility of this dataset, we establish a baseline for non-verbal speech synthesis by employing a lightweight large language model (LLM). The SMIIP-NV dataset and static audio demonstrations are publicly available at https://axunyii.github.io/SMIIP-NV. The interactive real-time demonstrations can be accessed at https://huggingface.co/spaces/xunyi/SMIIP-NV_Finetuned_CosyVoice2.

Abstract:
In an NBA game scenario, consider the challenge of locating and analyzing the 3D poses of players performing a user-specified action, such as attempting a shot. Traditional 3D human pose estimation (3DHPE) methods often fall short in such complex, multi-person scenes due to their lack of semantic integration and reliance on isolated pose data. To address these limitations, we introduce Language-Driven 3D Human Pose Estimation (L3DHPE), a novel approach that extends 3DHPE to general multi-person contexts by incorporating detailed language descriptions. We present Panoptic-L3D, the first dataset designed for L3DHPE, featuring 3,838 linguistic annotations for 1,476 individuals across 588 videos, with 6,035 masks and 91k frame-level 3D skeleton annotations. Additionally, we propose Cascaded Pose Perception (CPP), a benchmarking method that simultaneously performs language-driven mask segmentation and 3D pose estimation within a unified model. CPP first learns 2D pose information, utilizes a body fusion module to aid in mask segmentation, and employs a mask fusion module to mitigate mask noise before outputting 3D poses. Extensive evaluation of CPP and existing benchmarks on Panoptic-L3D demonstrates the necessity of this novel task and dataset for advancing 3DHPE. Our dataset is available at https://languagedriven3dposeestimation.github.io/.

Abstract:
This paper presents our solution for the Micro-Expression Visual Question Answering (ME-VQA) task in the 2025 Facial Micro-Expression Grand Challenge (MEGC). To address the limitations of traditional micro-expression recognition (MER) methods in dynamic modeling, semantic interpretation, and natural language interaction, we propose Emotion-Qwen-VL, a fully fine-tuned multimodal large language model tailored for micro-expression understanding. Specifically, we construct a structured, instruction-based QA dataset that reformulates emotion categories and action unit (AU) annotations into natural language QA pairs, covering classification, AU detection, and causal reasoning. We then adopt a full-parameter fine-tuning strategy to guide Qwen2.5-VL in learning fine-grained temporal facial dynamics and their emotional semantics. Experimental results on the MEGC 2025 test set demonstrate that Emotion-Qwen-VL outperforms strong baselines such as Qwen2.5-VL and QVQ across multiple dimensions, including coarse-grained macro-expression classification, fine-grained micro-expression classification, and language generation. Our results highlight the effectiveness, interpretability, and adaptation potential of large models in micro-expression understanding. The code is available at: https://github.com/2308623956/MEGC2025.

Abstract:
Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) is a challenging task for multimodal dialogue understanding. Existing works often treat the entire dialogue as a flat sequence and feed it into Large Language Models (LLMs) for pipeline-style generation. However, these methods sometimes accumulate errors and overlook critical discourse structure and fine-grained inter-word relations that are essential for accurate sentiment reasoning. To address these limitations, we propose SDG-MLLM, a unified generative framework that integrates Structured Dialogue Graphs into a Multimodal LLM (MLLM) for end-to-end MCABSA. Specifically, we construct heterogeneous dialogue graphs that capture diverse structural relations, including syntactic dependencies, coreference links, speaker turns, reply flow, semantic role labeling, and sentiment propagation paths. These graphs are encoded using a heterogeneous dialogue graph encoder, and the resulting structure-aware graph features are injected into the embedding layer of the LLM. Furthermore, SDG-MLLM incorporates aligned multimodal features such as image, audio, and video cues at the utterance level to enable unified and context-aware multimodal reasoning. Experiments on the MCABSA dataset show that SDG-MLLM significantly outperforms strong baselines across multiple tasks. In addition, our method achieved top performance in the ACM MM 2025 Grand Challenge of MCABSA. Our code is available at https://github.com/Liuxj-Anya/SDG-MLLM.

Abstract:
Understanding plant growth dynamics is a critical component of modern agricultural research, with applications in yield prediction, phenotyping, and sustainable crop management. Despite recent advances in computer vision and deep learning, progress in plant growth modeling has been constrained by the lack of publicly available, high-resolution, multiview, and temporally rich datasets. To address this gap, we introduce Growth Modelling GroMo25, the first international challenge on plant growth modeling using multiview imagery. In this challenge, we propose a dataset that comprises high-resolution images of four crops: wheat, mustard, radish, and okra, captured at consistent time intervals from multiple camera viewpoints under controlled environmental conditions. The challenge focuses on two key tasks: (1) plant age prediction and (2) leaf count estimation, both requiring models to use spatial and temporal plant features. GroMo25 attracted participation from multiple teams worldwide, encouraging benchmarking and innovation in vision-based plant phenotyping. The GitHub repository is publicly available at https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images.

Abstract:
In many real-world applications, labeling an image "a man riding a horse" fails to satisfy demands for the who, when, where, and why. Although LVLMs excel at describing visual content, isolated images often lack the event context; users thus rely on related news articles or social posts to enrich them, but cropping or resizing complicates tracking back to their source. In this paper, we propose ENRIC, an innovative end-to-end system for the EVENTA Challenge Track 1, leveraging the OpenEvents-V1 dataset, comprising over 200,000 news articles paired with more than 400,000 images. Our system includes three components: (1) semantic retrieval filters candidate article images via vision-language embeddings, (2) uncertainty-guided re-ranking flags ambiguous queries using three confidence heuristics and re-ranks candidates by combining visual similarity with texture similarity, and (3) event-aware caption generation employs chain-of-thought prompting that aggregates five inputs from article, image, and CIDEr-derived contexts to guide the LLM in incorporating all necessary elements. ENRIC achieved the highest combined evaluation score of 0.5501, ranking first and outperforming other solutions across nearly all metrics. By combining semantic retrieval, uncertainty-guided re-ranking, and event-aware caption generation, ENRIC demonstrates the efficiency of its approach for event-enriched image analysis. GitHub repository: https://github.com/NamQuanProject/EVENTA25-ENRIC

Abstract:
Infrared and Visible Image Fusion (IVIF) under unregistered conditions has been of great interest in various visual tasks under challenging environments. While existing approaches often demonstrate promising results on specific benchmarks, they tend to exhibit performance drops in unseen scenarios and incur high computational overhead when retrained on new datasets. To address these challenges, we propose TRACE, a Training-free Reinforcement-based Alignment method for Cross-modality Enhancement, which incorporates Evaluator, a rewarding network, into an evaluation-driven Reinforcement Learning (RL) framework, enabling efficient and plug-and-play refinement of any existing registration approach. Specifically, TRACE constructs the Evaluator network to assess the alignment quality of the given registration model, generating confidence scores and adjustment masks via spatial and channel attention. Leveraging these cues as RL rewards, TRACE iteratively refines the registration network to mitigate misalignments until the accumulated improvement is satisfied. Due to its training-free and plug-and-play nature, TRACE notably enhances fusion results across diverse and unseen scenarios. TRACE achieves impressive improvements in different methods across diverse datasets with minimal computational cost. The project page is available at https://github.com/pubyLu/TRACE.

Abstract:
Embodied artificial intelligence has rapidly developed under the impetus of multimodal learning, robotics, and cognitive science, demonstrating great potential in fields such as navigation and manipulation. However, building embodied agents that can robustly operate in diverse and dynamic environments still faces challenges, such as handling partial observability and environmental adaptability. Multimodal large language models (MLLMs) are vital for embodied intelligence due to their ability to process multimodal information, but they encounter difficulties in understanding spatial environments and performing dynamic decisions and evolution. Inspired by the functional specialization of the left and right hemispheres of the human brain, this paper proposes a brain-inspired learning and evolution paradigm for embodied agents. The method designs an embodied context-augmented MLLM to simulate the language processing and logical analysis capabilities of the left hemisphere, responsible for understanding instructions and visual scenes. At the same time, it constructs a perceptual context-guided world model based on the recurrent state space model to simulate the spatial perception and holistic thinking functions of the right hemisphere, capturing environmental dynamics and predicting future states. By simulating the communication function of the corpus callosum, we propose dynamic communication slots for efficient information exchange between MLLMs and the world model, which also allows the agent to quickly adapt to dynamic environments without requiring extensive computational resources. Experiments show that the proposed paradigm significantly improves the performance of embodied agents in a series of tasks and enhances their generalization ability in zero-shot tasks through embodied exploration experience and online evolution. Our project page is available at https://feliciaxyao.github.io/EvoAgent/.

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual-language understanding for downstream multimodal tasks. However, these models often generate descriptions containing objects or details not present in the input image, a phenomenon commonly referred to as "hallucination". Existing methods focus solely on single-side hallucination mitigation: intra-modal-only reinforcement (e.g., visual attention enhancement) ignores prompt-based guidance, while inter-modal-only correlation correction may introduce low-information visual tokens that mislead reasoning. To tackle this challenge, we propose Dual-Modal Collaborative Attention Reinforcement (DuCAR). Specifically, DuCAR is equipped with intra-visual CLS-driven sampling and cross-modal dynamic sampling, extracting important visual tokens guided by intra- and inter-modal joint information. During the multimodal fusion stage, DuCAR adaptively enhances the attention weights of these visual tokens. Our sampling and enhancement strategies in DuCAR simultaneously reinforce informative visual tokens and suppress attention dispersion towards question-irrelevant visual information. We conduct extensive experiments on the POPE and CHAIR hallucination benchmarks, demonstrating that our method outperforms existing state-of-the-art mitigation baselines and effectively reduces hallucinations in text generated by LVLMs. The code is available at https://github.com/xjy2020/DuCAR.

Abstract:
Recently, LiDAR-based fully sparse 3D object detection, which utilizes point clouds to boost efficiency, has gained great attention. Nevertheless, the relationship between well-studied dense representation and fully sparse representation is under-explored in existing studies, which focus solely on building sparse representation by feature diffusion to solve the notorious center point missing problem. To this end, we propose a dense-guided fully sparse detection scheme, named DGFSD, to bridge the gap between dense and sparse features by dense-guided diffusion. Different from prior studies, we propose DgD (Dense-guided Diffusion) to overcome the center feature missing problem by dense knowledge transfer. Specifically, DgD transfers high-quality central point features from dense representations to endow sparse representations with dense knowledge. Moreover, we customize DFW (Dense Feature Weighting) to suppress uninformative representation and lift foreground representation. This makes high-quality dense features contribute more to accurate regression. To the best of our knowledge, we are the first to explore the impact of dense knowledge on fully sparse frameworks. Extensive experiments conducted on the nuScenes and Argoverse2 benchmarks demonstrate the effectiveness of the proposed method. Specifically, DGFSD achieves 71.6% NDS and 67.3% mAP on the nuScenes test benchmark. On Argoverse2, DGFSD achieves 40.6% mAP, outperforming previous best hybrid and fully sparse methods. The code is available at https://github.com/Raiden-cn/DGFSD.

Abstract:
Single-view face relighting aims to adjust the portrait lighting while preserving the original background. Although recent diffusion-based methods achieve great relit results by using reference lighting and facial features as conditions for the diffusion relighting process, they are limited by the incompleteness of these conditions, such as the absence of explicit constraints on skin tone and the lack of spatial coverage for hard shadows, which results in facial and lighting inconsistencies. To address these challenges, we propose IFS-Light, an interactive framework that leverages a spatial-nonspatial conditioning mechanism to localize facial features and reference lighting, then optimizes their interplay in the relighting process. To ensure facial consistency, we first combine skin-tone-scaled conditions with shape information for tone adjustment, enhanced by a detail mask that identifies modifiable facial regions. Skin and shape parameters are then optimized to preserve both skin tone and fine details. To maintain lighting consistency, we propose a ray-tracing-based formulation that decomposes reference lighting into diffuse and non-diffuse components. These, integrated with shape information, assist in positioning light and shadow regions. Both components are then encoded for precise control over color and intensity. In addition, we propose an innovative and user-friendly solution for adjusting light conditions, which enables the user to precisely adjust the position of the light source to flexibly control both light intensity and direction, thereby making it easier to achieve the desired relighting results. Extensive experiments show that IFS-Light achieves superior relighting results compared to state-of-the-art methods. The code and appendix are available at https://github.com/mRobotit/IFS-Light

Abstract:
As short videos become a dominant medium for news dissemination, fake news videos pose increasing threats to public trust and information integrity. Existing methods primarily focus on learning multimodal representations to predict binary veracity labels, yet they overlook the use of external evidence, which is important for identifying more sophisticated fake news that subtly exploits psychological cues and cognitive biases. Moreover, these approaches do not provide fine-grained attribution labels, which are essential for interpretable misinformation governance. To address these limitations, we introduce EvidSV, the first comprehensive benchmark supporting evidence- and attribution-aware fake news video detection. Drawing inspiration from the human cognitive process of interpreting news-related content, we propose MUKE, a multi-view knowledge progressive enhancement learning framework. By jointly analyzing both the news content and supporting evidence, MUKE (1) facilitates the understanding of news semantics to (2) progressively refine shared domain knowledge, and (3) adaptively summarizes multi-view knowledge to assess news veracity. Extensive experiments demonstrate that MUKE consistently outperforms existing methods in both fake news detection and attribution, and generalizes effectively to previously unseen domains. Our code is available at https://github.com/zzeng1998/EvidSV.

Abstract:
Vision-language models (VLMs) have achieved remarkable success in various vision-language tasks, such as image captioning and visual question answering. However, these models often lack physical common sense, frequently failing to identify visually evident violations of common physical principles. Therefore, evaluating the VLMs' understanding of physical common sense is essential, which has not yet been systematically explored in existing research. To fill this gap, we introduce PhyVIB (Physical Common Sense Violation Image Benchmark). This novel benchmark consists of 16,000 images across eight categories, aiming to systematically assess the VLMs' capability to detect violations of physical common sense in images. Our evaluations show that even the state-of-the-art VLMs perform poorly on PhyVIB, highlighting a significant area for improvement. In response, we propose PhyDetector, a two-stage fine-tuning framework to enhance the VLMs' capability to detect violations of physical common sense. The first stage involves supervised fine-tuning, which equips the VLM with essential concepts related to visual physical anomalies. The second stage utilizes group relative policy optimization to enhance the VLM's multimodal reasoning capability on physical plausibility. Experimental results show that the model fine-tuned with PhyDetector can significantly outperform the state-of-the-art VLMs in physical common sense understanding. Our artifacts are available at https://github.com/ZitongWang018/PhyVIB.
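
The second stage above relies on group relative policy optimization. A minimal sketch of its central quantity, the group-relative advantage, is given below: several responses are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the reward values are assumptions for illustration and are not taken from PhyDetector.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate value network is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for four sampled answers to one image-question pair:
# 1.0 if the physical violation was identified correctly, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive advantages close to +1, incorrect ones close to -1.
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is why reward designs with some within-group variation matter in practice.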

Abstract:
In robotics and autonomous driving, accurate depth estimation is vital yet challenging under dynamic scenes and extreme lighting. Conventional frame-based cameras offer rich context but suffer from motion blur and limited dynamic range, while event cameras provide high temporal resolution and dynamic range but lack global scene structure. Therefore, recent studies explore frame-event fusion depth estimation methods to leverage these two complementary modalities to achieve robust performance. However, due to the mismatch in temporal and spatial resolution, there is an inherent contradiction between high spatial resolution frames captured at sparse temporal intervals and event streams characterized by spatial sparsity but high temporal resolution, rendering cross-modal feature fusion ineffective. Moreover, the limited availability of frame-event depth datasets further undermines the model's generalization capability across different scenes. To address the above challenges, we propose HAFUNet, a Hierarchical Attention Fusion Network for depth estimation via frame-event fusion. Our method contains: (1) a pre-trained Dual-Stream Encoder (DSEer) to extract complementary features from frame and event inputs; (2) a Cross-modal Feature Interaction Module (CFIM) that aligns and fuses spatial-channel features across modalities; and (3) a Hierarchical Attention Decoder (HADer) that progressively refines depth predictions via attention-guided convolution. Experiments on synthetic and real-world datasets show that HAFUNet surpasses existing methods in depth accuracy and robustness. These results demonstrate the strength of our fusion strategy in diverse environments. Code is available at https://github.com/SiYZhangwh/HAFUNet.

Abstract:
Video-guided machine translation (VMT) involves taking text and video modalities as inputs, leveraging visual context to resolve the semantic ambiguities for improving the translation quality. This task remains challenging due to the difficulty of effective cross-modal integration and visual grounding. To address the issues, we propose a novel VMT model that combines temporal video and spatial keyframe streams by providing complementary visual cues. We develop a chaotic fusion mechanism to integrate modality-specific representations from various modalities that help capture semantic interactions between visual and textual cues. To improve visual grounding, a causally aligned spatio-temporal attention mechanism is also designed to enhance semantic alignment by refining decoder-side attention over the video and keyframe streams, respectively. We further propose PolyVTE, an evaluation dataset targeting polysemous ambiguities in VMT. Results on VATEX and PolyVTE datasets show that our model outperforms state-of-the-art models. The results also prove that using keyframe and video modalities significantly improves disambiguation capabilities. The PolyVTE dataset is available at https://github.com/zheng5d/PolyVTE.

Abstract:
Recent advancements in Audio Language Models (ALMs) have led to significant improvements in speech-related tasks. However, their capacity for profound metaphorical reasoning, especially when derived from audio-specific cues, has yet to be thoroughly investigated. To address this gap, we introduce Unspoken, a bilingual (Chinese-English) question answering benchmark designed to assess ALMs' comprehension of non-literal, metaphor-rich audio. Unlike prior text-centric evaluations, Unspoken emphasizes prosody, phonetic ambiguity, emotional inflection, and other nuanced acoustic features critical to metaphor understanding but often lost in transcription. We construct a high-quality dataset of 2,764 manually curated and validated QA pairs, spanning three reasoning dimensions: semantic, acoustic, and contextual, and covering six common types of metaphors. Evaluation across 23 mainstream ALMs reveals a substantial performance gap: the best model achieves only 69.5% accuracy, significantly below the human average of 81.1%. By analyzing the error patterns, we identify five key failure modes that reveal fundamental limitations in current models' reasoning capabilities. Unspoken not only sets a new standard for evaluating metaphorical reasoning in audio but also pioneers a novel research direction that moves beyond transcription-based assessments. Grounding metaphor understanding in authentic human communication scenarios offers deep insight for developing more cognitively capable ALMs. The data and codes are available at https://github.com/Hongru0306/UNSPOKEN.

Abstract:
Automatic echocardiography video segmentation is a powerful tool for improving the accuracy of cardiovascular function assessment. However, it remains a challenging task owing to (1) extensive speckle noise and blurred boundaries, (2) dramatic shape variations of targeting structures across frames, and (3) limited labeled data due to the high cost of annotation. In this paper, we present a novel semi-supervised segmentation model based on Vision Mamba (Vim) to comprehensively tackle these challenges; we call it EchoVim. Our framework introduces three technical innovations: First, a bidirectional inference mechanism (BIM) which can propagate label information bidirectionally from end-diastolic (ED) and end-systolic (ES) frames to generate pseudo-labels, coupled with confidence-aware dynamic updating to progressively refine supervision signals. Second, a dynamic interaction temporal alignment (DITA) module that establishes anatomical correspondence across frames by adaptively enhancing features near temporally stable regions while suppressing motion-irrelevant artifacts, effectively addressing variations in cardiac shape. Third, a semantic token-attentive refinement (STR) module that constructs low-rank semantic tokens to encode cardiac structure priors, utilizing attention-guided nonlinear transformations to disentangle speckle noise from true anatomical patterns. We conduct extensive experiments on two benchmarking echocardiography video datasets: CAMUS and EchoNet-Dynamic, and the results demonstrate that our method outperforms existing state-of-the-art approaches with real-time inference. Codes are available at https://github.com/guojx2255/EchoVim.

Abstract:
High-quality thermal facial data is essential for advancing biometric recognition, surveillance, in-cabin driver monitoring, and human-computer interaction, all of which are integral to modern multimedia and interactive AI systems. In this work, we optimize the FLUX text-to-image diffusion model on diverse real-world thermal facial datasets to generate hyper-realistic 2D thermal facial images for both males and females, and propose a new dataset, ThermVision. To enhance their multimedia applicability, these images are processed through a video retargeting pipeline, where driving videos animate realistic facial expressions and head pose variations from a single 2D thermal image, producing high-fidelity thermal facial video sequences. The overall rendered dataset incorporates smart transformations, ensuring diversity across gender balance, extreme head pose variations, expressive facial dynamics, and facial accessories, making it a valuable resource for real-world applications. Additionally, we provide facial detection annotations to facilitate precise feature extraction and thermal-face analysis. To validate our synthetic dataset, we evaluate its effectiveness in thermal gender classification as a downstream machine learning task, along with thermal face localization and facial landmark detection, demonstrating its applicability in real-world scenarios. This approach significantly improves the availability, realism, and integration of thermal facial data, paving the way for more robust and immersive AI-powered thermal imaging applications. The dataset, code, and associated models are available at https://mali-farooq.github.io/ThermVision/

Abstract:
Sand dust weather has adverse effects on image quality, making single-image sand dust removal a classic research topic in the field of image restoration. However, existing learning-based image restoration methods fail to account for uncertainties in both data and model dimensions, thus being unable to produce satisfactory results for sand dust image restoration. To address this challenge, we introduce a novel framework called the Uncertainty-aware SAM-aided Prompt-interaction Network (USPNet). USPNet comprises two key modules: the Uncertainty-aware SAM Priors Module (USPM), which addresses data-wise aleatoric uncertainties, and the Uncertainty-aware Prompt Learning Module (UPLM), which tackles model-wise epistemic uncertainties. By integrating data-wise and model-wise uncertainty learning, USPNet leverages uncertainty modeling through SAM semantic priors and distributionally representative prompts. Recognizing the unexplored uncertainties inherent in the learning process, we propose an Uncertainty-aware Perceptual Loss (UPL) to enhance the visual quality of restored images through perceptual learning. Through comprehensive perceptual studies and analysis of real sand-dust images, we propose a dataset named SanddustClearity. SanddustClearity includes daytime, nighttime synthetic, and real-world sand dust images. Our extensive experiments, conducted on both synthetic and real-world images exhibiting various levels of sand dust degradation, confirm the effectiveness and robustness of our proposed method. Our code will be available at https://github.com/WBC-ML/USPNet.

Abstract:
Reliable detection of conversational errors and user-initiated corrections is critical for effective human-robot interaction (HRI). In this study, we present a comprehensive multimodal approach leveraging temporal window processing, targeted feature engineering, and a MiniRocket + Ridge classification pipeline to address the challenges introduced by the ERR@HRI 2.0 dataset. Our methodology systematically integrates multimodal data streams, including facial expressions, acoustic features, and linguistic embeddings, to predict robot failures and user reactions. Experimental results demonstrate significant improvements over baseline models in event-level detection performance. Notably, linguistic features derived from transcript embeddings emerged as the most informative modality, substantially enhancing model performance. However, we observed challenges associated with managing false positives at the event level, suggesting avenues for future refinement in adaptive thresholding and sequential post-processing techniques. Our findings underscore the importance of careful feature selection and robust temporal modelling in developing effective real-time error detection systems for conversational robots. Our code is available at https://github.com/Ruddy202/err-hri-2.0-armas.git.

Abstract:
Significant advancements have been achieved in both fields of Natural Language Processing (NLP) and Computer Vision (CV) with the advent of Multimodal Large Language Models (MLLMs), sometimes referred to as large vision-language models (LVMs). MLLMs show promising ability in multimodal tasks, such as image captioning, visual question answering, etc. However, there is a concerning trend associated with the advancement in MLLMs. These models exhibit an inclination to generate hallucinations and misleading facts, resulting in seemingly plausible yet factually spurious content. To address these challenges, our team, DeepSIX, leverages recent advances in MLLMs to enhance the ability to detect hallucination and verify factual information within the scope of the ACM MM 2025 grand challenge 8: Truthful and Responsible Multimodal Learning (ResMM). We participated in both tasks: Multimodal Hallucination Detection (Task 1) and Multimodal Fact Checking (Task 2). Our approach leverages the interpretive power of the vision and language components of vision language models (VLMs) to analyze and summarize insights from text and images. It performs contextual reasoning by uncovering semantic relationships among entities in the text and objects in the images. By employing diverse prompting techniques, our method deconstructs critical entities in the text, effectively uncovers implicit relationships between text and images, and identifies hallucinations and false facts. Experimental results demonstrate the strength of our approach: it achieved second place in the Hallucination Detection task and third place in the Fact Verification task, confirming the potential of LLM-based methods in MLLMs. We open-source our code at https://github.com/JAIST-DeepSIX/ACMMM25

Abstract:
Remote photoplethysmography (rPPG) aims to extract non-contact physiological signals from facial videos and has shown great potential. However, existing rPPG approaches struggle to bridge the gap between source and target domains. Recent test-time adaptation (TTA) solutions typically optimize the rPPG model for incoming test videos using a self-training loss under the unrealistic assumption that the target domain remains stationary. However, time-varying factors like weather and lighting in dynamic environments often cause continual domain shifts. The accumulation of erroneous gradients from these shifts may corrupt the model's key parameters for physiological information, leading to catastrophic forgetting. Therefore, we propose a physiology-related parameter freezing strategy to retain such knowledge. It isolates physiology-related and domain-related parameters by assessing the model's uncertainty on the current domain and freezes the physiology-related parameters during adaptation to prevent catastrophic forgetting. Moreover, dynamic domain shifts with various non-physiological characteristics may lead to conflicting optimization objectives during TTA, which manifests as the over-adapted model losing its adaptability to future domains. To fix over-adaptation, we propose a preemptive gradient modification strategy. It preemptively adapts to future domains and uses the acquired gradients to modify the current adaptation, thereby preserving the model's adaptability. In summary, we propose a stable continual test-time adaptation (CTTA) framework for rPPG measurement, called PhysRAP, which Remembers the past, Adapts to the present, and Preempts the future. Extensive experiments show its state-of-the-art performance, especially under domain shifts. The code is available at https://github.com/xjtucsy/PhysRAP.

Abstract:
Catastrophic forgetting remains a central challenge in continual learning (CL) with pre-trained models. While existing approaches typically freeze the backbone and fine-tune a small number of parameters to mitigate forgetting, they still rely on iterative error backpropagation and gradient-based optimization, which can be computationally intensive and less suitable for resource-constrained environments. To address this, we propose FoRo, a forward-only, gradient-free continual learning method. FoRo consists of a lightweight prompt tuning strategy and a novel knowledge encoding mechanism, both designed without modifying the pre-trained model. Specifically, prompt embeddings are inserted at the input layer and optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which mitigates distribution shifts and extracts high-quality task representations. Subsequently, task-specific knowledge is encoded into a knowledge encoding matrix via nonlinear random projection and recursive least squares, enabling incremental updates to the classifier without revisiting prior data. Experiments show that FoRo significantly reduces average forgetting and improves accuracy. Thanks to forward-only learning, FoRo reduces memory usage and run time while maintaining high knowledge retention across long task sequences. These results suggest that FoRo could serve as a promising direction for exploring continual learning with pre-trained models, especially in real-world multimedia applications where both efficiency and effectiveness are critical.
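
The combination of nonlinear random projection and recursive least squares mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not FoRo's implementation: the `tanh` nonlinearity, the projection size, the ridge term, and the names `random_projection` and `RecursiveLeastSquares` are all hypothetical. It shows how a linear classifier can be updated one sample at a time, without gradients and without storing earlier data.

```python
import math
import random


def random_projection(x, weights):
    """Nonlinear random projection tanh(W x); W is fixed and never trained."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


class RecursiveLeastSquares:
    """Ridge-regularized linear classifier updated one sample at a time."""

    def __init__(self, dim, n_classes, ridge=1.0):
        # P tracks the inverse of (ridge * I + sum of h h^T) over samples seen so far
        self.P = [[1.0 / ridge if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]
        self.W = [[0.0] * n_classes for _ in range(dim)]

    def predict(self, h):
        n_classes = len(self.W[0])
        return [sum(h[i] * self.W[i][c] for i in range(len(h)))
                for c in range(n_classes)]

    def update(self, h, y):
        Ph = [sum(p * hj for p, hj in zip(row, h)) for row in self.P]
        denom = 1.0 + sum(hi * v for hi, v in zip(h, Ph))
        gain = [v / denom for v in Ph]
        err = [yc - pc for yc, pc in zip(y, self.predict(h))]
        # Correct the weights along the gain direction
        for i, g in enumerate(gain):
            for c, e in enumerate(err):
                self.W[i][c] += g * e
        # Sherman-Morrison update of P (P is symmetric, so h^T P equals Ph)
        for i in range(len(h)):
            for j in range(len(h)):
                self.P[i][j] -= gain[i] * Ph[j]


random.seed(0)
proj = [[random.gauss(0, 1) for _ in range(2)] for _ in range(4)]
rls = RecursiveLeastSquares(dim=4, n_classes=2)
# Hypothetical 2-D inputs with class labels, processed as a stream
for x, label in [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.9, 0.1], 0)]:
    one_hot = [1.0 if c == label else 0.0 for c in range(2)]
    rls.update(random_projection(x, proj), one_hot)
```

After each update the weights equal the ridge regression solution over all samples seen so far, which is why earlier data never has to be revisited.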

Abstract:
Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.

Abstract:
Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works mainly address coarse-grained object detection, where the classes are visually quite different and thus relatively easy to distinguish. However, in real life we often face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished, for example, detecting different kinds of birds, fishes, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset, FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera, and 1,432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.

Abstract:
Generalized category discovery (GCD) aims to group unlabeled samples from known and unknown classes when only part of the labeled data in the known classes is given. It allows the model to adapt to dynamic environments by discovering novel categories. However, when GCD is applied to a decentralized open world, the following challenges remain: (1) no labeled data is easily obtained in the open world, (2) label spaces are heterogeneous across different environments, and (3) representations degrade when models are fine-tuned with limited data in specific environments. To address these challenges, we introduce a new and practical task, namely Cloud-edge GCD (CE-GCD). Different from semi-supervised GCD, CE-GCD assumes that we only have a base model trained on common public categories, and aims to perform personalized unsupervised novel category discovery in multiple environments with heterogeneous label spaces. Data from different environments or clients cannot be shared; only model parameters can be transferred. To tackle this problem, we propose a novel GCD framework based on energy-guided known class discrimination and multi-level contrastive learning. In each client, we first use the classifier of the base model to distinguish between known and unknown classes, and then perform unsupervised learning on the unknown classes. Each client transfers category information through prototypes to assist learning. Extensive experiments on multiple datasets demonstrate the effectiveness of our approach.
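
The idea of separating known from unknown classes with the base model's classifier can be sketched with an energy score. The abstract does not define its energy function, so this sketch assumes the common free-energy score computed from classifier logits; the logits, the threshold, and the names `energy_score` and `is_known` are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy E(x) = -T * log(sum_k exp(logit_k / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def is_known(logits, threshold):
    """Lower energy means the base classifier is more confident."""
    return energy_score(logits) < threshold


confident = [8.0, 0.5, 0.2]  # hypothetical logits with one dominant known class
uncertain = [0.4, 0.5, 0.3]  # hypothetical logits with no dominant class
```

Samples whose energy falls below the chosen threshold would be treated as known classes, and the remainder would be passed to unsupervised learning on the unknown classes.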

Abstract:
Attribute-missing graph clustering has emerged as a significant unsupervised task, where only attribute vectors of partial nodes are available and the graph structure is intact. The related models generally follow the two-step paradigm of imputation and refinement. However, most imputation approaches fail to capture class-relevant semantic information, leading to sub-optimal imputation for clustering. Moreover, existing refinement strategies optimize the learned embedding through graph reconstruction, while neglecting the fact that some attributes are uncorrelated with the graph. To remedy these problems, we establish the Clustering-oriented Generative Imputation with reliable Refinement (CGIR) model. Concretely, the subcluster distributions are estimated to reveal the class-specific characteristics precisely and to constrain the sampling space of the generative adversarial module, such that the imputed nodes are impelled to align with the correct clusters. Afterwards, multiple subclusters are merged to guide the proposed edge attention network, which identifies the edge-wise attributes for each class, so as to prevent redundant attributes in graph reconstruction from disturbing the refinement of the overall embedding. To sum up, CGIR splits attribute-missing graph clustering into the search and mergence of subclusters, which guides node imputation and refinement within a unified framework. Extensive experiments prove the advantages of CGIR over state-of-the-art competitors.

Abstract:
Animated movies are captivating for their unique character designs and imaginative storytelling, yet they pose significant challenges for existing recognition systems. Unlike the consistent visual patterns detected by conventional face recognition methods, animated characters exhibit extreme diversity in their appearance, motion, and deformation. In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. Central to our approach is the automatic construction of an audio-visual character bank from online sources. This bank contains both visual exemplars and voice (audio) samples for each character, enabling subsequent multi-modal character recognition despite long-tailed appearance distributions. Building on accurate character recognition, we explore two downstream applications: Audio Description (AD) generation for visually impaired audiences, and character-aware subtitling for the hearing impaired. To support research in this domain, we introduce CMD-AM, a new dataset of 75 animated movies with comprehensive annotations. Our character-centric pipeline demonstrates significant improvements in both accessibility and narrative comprehension for animated content over prior face-detection-based approaches. For the code and dataset, visit https://www.robots.ox.ac.uk/~vgg/research/animated_ad/.

Abstract:
Recent advances in face forgery detection have shown strong in-domain performance but often fail to generalize to out-of-distribution data, especially when confronted with unseen manipulation techniques or domain shifts (e.g., lighting conditions, camera noise). We propose a novel Mixture-of-Experts framework, termed GM-DF, that decouples domain-specific and domain-invariant features to tackle cross-domain face forgery detection. Our method builds upon a foundation model (CLIP) and incorporates three key modules: (1) Dataset-Embedding Generator that leverages lightweight expert layers and database-aware feature normalization to adaptively modulate features at a per-domain level, capturing idiosyncratic cues without overfitting; (2) Multi-Dataset Representation mechanism that fuses these expert embeddings using scaled dot-product attention and integrates a masked image modeling (MIM) task to amplify local forgery artifacts; (3) Meta-Domain-Embedding Optimizer, inspired by MAML, which alternates between domain-specific (inner-loop) and domain-invariant (outer-loop) updates to facilitate rapid adaptation on new domains. Additionally, inspired by [13] (Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. 2024. Interpreting the second-order effects of neurons in clip. arXiv preprint arXiv:2406.04341 (2024)), we introduce second-order feature propagation in the intermediate layers of CLIP to enhance fine-grained artifact cues and propose domain-class disentangled prompts to flexibly encode multi-domain text representations. Together, these strategies enable GM-DF to learn robust, shared forgery cues while preserving essential domain nuances. Our extensive experiments on multiple cross-domain benchmarks demonstrate that GM-DF significantly outperforms state-of-the-art approaches in both detection accuracy and domain transferability, reducing reliance on superficial artifacts and improving generalization to unseen forgeries. Importantly, our design requires minimal overhead beyond standard CLIP, making GM-DF both effective and computationally efficient for real-world face forgery detection.
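
The fusion of expert embeddings with scaled dot-product attention, named in module (2), can be illustrated with a minimal sketch. It implements only the standard attention operation; how GM-DF forms its queries, keys, and values is not stated in the abstract, so using an image feature as the query and the expert embeddings as both keys and values is an assumption, as are the toy vectors.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Hypothetical inputs: one image feature attends over three expert embeddings
image_feature = [[1.0, 0.5]]
expert_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = scaled_dot_product_attention(image_feature, expert_embeddings,
                                     expert_embeddings)
```

Each output row is a convex combination of the expert embeddings, weighted by how strongly the query matches each expert.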

Abstract:
Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

Abstract:
Unlike Object Detection, the Visual Grounding task requires detecting an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities, further introducing various modules that modulate visual features to align with the language expressions. However, in most cases, the supervision signal only adopts common Object Detection losses, solely governing the bounding box regression output, which fails to fully optimize for the above objectives. To tackle this problem, in this paper, we first analyze the attention mechanisms of transformer-based models. Building upon this, we propose a novel framework named Attention-Driven Constraint Balancing (AttBalance) to optimize the behavior of visual features within language-relevant regions. Extensive experimental results show that our method brings impressive improvements. Specifically, we achieve consistent improvements over five different models evaluated on four different benchmarks. Moreover, we attain a new state-of-the-art performance by integrating our method into QRNet.

Abstract:
Multimodal learning benefits from information in multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.

Abstract:
In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S3A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S3A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets.

Abstract:
Multi-modal recommender systems (MMRS) have gained significant attention due to their ability to leverage information from various modalities to enhance recommendation quality. However, existing negative sampling techniques often struggle to effectively utilize the multi-modal data, leading to suboptimal performance. In this paper, we identify two key challenges in negative sampling for MMRS: (1) producing cohesive negative samples contrasting with positive samples and (2) maintaining a balanced influence across different modalities. To address these challenges, we propose NegGen, a novel framework that utilizes multi-modal large language models (MLLMs) to generate balanced and contrastive negative samples. We design three different prompt templates to enable NegGen to analyze and manipulate item attributes across multiple modalities, and then generate negative samples that introduce better supervision signals and ensure modality balance. Furthermore, NegGen employs a causal learning module to disentangle the effect of intervened key features and irrelevant item attributes, enabling fine-grained learning of user preferences. Extensive experiments on real-world datasets demonstrate the superior performance of NegGen compared to state-of-the-art methods in both negative sampling and multi-modal recommendation.

Abstract:
Electrocardiogram (ECG), a key bioelectrical time-series signal, is crucial for assessing cardiac health and diagnosing various diseases. Given its time-series format, ECG data is often incorporated into pre-training datasets for large-scale time-series model training. However, existing studies often overlook its unique characteristics and specialized downstream applications, which differ significantly from other time-series data, leading to an incomplete understanding of its properties. In this paper, we present an in-depth investigation of ECG signals and establish a comprehensive benchmark, which includes (1) categorizing its downstream applications into four distinct evaluation tasks, (2) identifying limitations in traditional evaluation metrics for ECG analysis and introducing a novel metric, and (3) benchmarking state-of-the-art time-series models and proposing a new architecture. Extensive experiments demonstrate that our proposed benchmark is comprehensive and robust. The results validate the effectiveness of the proposed metric and model architecture, which establish a solid foundation for advancing research in ECG signal analysis.

Abstract:
Humans can infer the missing parts of an occluded object by leveraging prior knowledge and visible cues. However, enabling deep learning models to accurately predict such occluded regions remains a challenging task. De-occlusion addresses this problem by reconstructing both the mask and RGB appearance. In this work, we focus on human de-occlusion, specifically targeting the recovery of occluded body structures and appearances. Our approach decomposes the task into two stages: mask completion and RGB completion. The first stage leverages a diffusion-based human body prior to provide a comprehensive representation of body structure, combined with occluded joint heatmaps that offer explicit spatial cues about missing regions. The reconstructed amodal mask then serves as a conditioning input for the second stage, guiding the model on which areas require RGB reconstruction. To further enhance RGB generation, we incorporate human-specific textual features derived using a visual question answering (VQA) model and encoded via a CLIP encoder. RGB completion is performed using Stable Diffusion, with decoder fine-tuning applied to mitigate pixel-level degradation in visible regions, a known limitation of prior diffusion-based de-occlusion methods caused by latent space transformations. Our method effectively reconstructs human appearances even under severe occlusions and consistently outperforms existing methods in both mask and RGB completion. Moreover, the de-occluded images generated by our approach can improve the performance of downstream human-centric tasks, such as 2D pose estimation and 3D human reconstruction. The code will be made publicly available.

Abstract:
Scalable Vector Graphics (SVG) is a code structure used to represent visual information, and with the powerful capabilities of large language models, it holds significant research potential. Current text-to-SVG generation methods lack generalization capabilities and struggle with accurately adhering to input generation instructions. In this paper, we propose a novel approach for generating SVG using large language models, named SVGThinker, which incorporates a reasoning process to align the generation of SVG code with the visualization process, while supporting all SVG primitives. Through sequential rendering of SVG primitives, we first use a multimodal model to annotate the SVG, followed by sequential updates corresponding to the incremental additions of primitives. We then employ a supervised training framework based on Chain-of-Thought reasoning, which enhances the model's robustness and reduces the risk of errors or hallucinations. Through comparisons with state-of-the-art baseline models, our experiments show that our model generates more stable, high-quality, and editable SVG code. In contrast to image-based methods, our approach preserves the structural advantages of SVG and supports precise, hierarchical editing. We believe our work opens new directions for SVG generation, with potential applications in design, content creation, and automated SVG-based graphic generation.

Abstract:
Graphs play a pivotal role in multimedia applications by integrating information to model complex relationships. Recently, graph class-incremental learning (GCIL) has garnered attention, allowing graph neural networks (GNNs) to adapt to evolving graph analytical tasks by incrementally learning new class knowledge while retaining knowledge of old classes. Existing GCIL methods primarily focus on a closed-set assumption, where all test samples are presumed to belong to previously known classes. Such an assumption restricts their applicability in real-world scenarios, where unknown classes naturally emerge during inference and are absent during training. In this paper, we explore a more challenging open-set graph class-incremental learning scenario with two intertwined challenges: catastrophic forgetting of old classes, which impairs the detection of unknown classes, and inadequate open-set recognition, which destabilizes the retention of learned knowledge. To address these problems, we propose a novel OGCIL framework, which utilizes pseudo-sample embedding generation to effectively mitigate catastrophic forgetting and enable robust detection of unknown classes. Specifically, a prototypical conditional variational autoencoder is designed to synthesize node embeddings for old classes, enabling knowledge replay without storing raw graph data. To handle unknown classes, we employ a mixing-based strategy to generate out-of-distribution (OOD) samples from pseudo in-distribution and current node embeddings. A novel prototypical hypersphere classification loss is further proposed, which anchors in-distribution embeddings to their respective class prototypes while repelling OOD embeddings. Instead of assigning all unknown samples to one cluster, our proposed objective function explicitly models them as outliers through prototype-aware rejection regions, ensuring robust open-set recognition. Extensive experiments on five benchmarks demonstrate the effectiveness of OGCIL over existing GCIL and open-set GNN methods.

Abstract:
Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.

Abstract:
As numerous edge devices start implementing intelligent components, the challenges of energy consumption, bandwidth efficiency, and privacy gain significance. One proposed solution relies on the paradigm of split inference, which optimizes the delegation of the computational load between edge and remote devices. We developed and implemented the standard-compliant split inference system with an encoder and decoder capable of real-time streaming and processing. Our system outperforms state-of-the-art video compression implementations by an average of 83% bitrate reduction, while preserving privacy. We demonstrate the system's real-time performance on consumer devices, with interactive visualizations of object detection and segmentation, incorporating real-time metrics. Demo video: https://youtu.be/bmCbUo_ZWWU

Abstract:
Spatio-temporal data mining (STDM) has become crucial in multimedia, driven by the surge of multimodal data from remote sensing, IoT sensors, social media, surveillance systems, mobile devices, and crowdsourced platforms. Traditional single-modal methods, though successful, struggle to capture real-world complexity. Integrating multiple modalities yields richer, more accurate insights, boosting spatio-temporal analysis. This half-day tutorial, MM4ST: Multimodal Learning for STDM, offers a comprehensive overview, covering STDM fundamentals, challenges in aligning and fusing heterogeneous data, advanced multimodal modeling techniques, and emerging research directions. Attendees will acquire practical knowledge to develop scalable and robust spatio-temporal mining solutions. All materials will be publicly available online.

Abstract:
Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP's embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.

Abstract:
Deep graph clustering, which aims to uncover the underlying structure within graphs and partition nodes into distinct groups, is a challenging research topic. However, the formation of clusters in real-world graphs is typically governed by the highly complex interaction of many underlying latent factors. Existing methods typically rely on the features and structure associated with the graph and neglect the entanglement of these factors, resulting in sub-optimal clustering performance. In this paper, we propose a novel deep graph clustering framework named DisenCluster, which learns disentangled representations to simultaneously consider node separation results from diverse perspectives. Specifically, we introduce a disentangled graph encoder that iteratively identifies the latent factors of the input graph by modeling the distribution over different factors for each edge. Subsequently, we utilize a factor-wise contrastive loss to encourage clustering-friendly disentangled representations, allowing us to derive different clustering results based on the corresponding factor. These results are then structured as anchor graphs and seamlessly integrated into a unified graph. Finally, we formulate the framework as a continuous relaxation of the high-order graph cut problem and optimize the objective to obtain effective cluster assignments. Experiments on a variety of publicly available datasets demonstrate the effectiveness and superiority of DisenCluster compared with baselines.

Abstract:
Multi-view clustering based on anchor graphs and regression is widely used to deal with high-dimensional and redundant data. However, most of these methods ignore the probabilistic characteristics of the anchor graph, and the effective information in different views is not fully mined. To solve these problems, we propose a multi-view clustering method based on probabilistic tensor regression (MVCPTR). Specifically, we reinterpret the regression process of the anchor graph from the perspective of probability. By modeling the anchor graph as the transition probability from samples to anchors, we construct the implicit relationship between the labels of samples and anchors. To further mine the complementary information of multi-view data, we extend the anchor graph matrix regression to tensor regression to achieve multi-level information fusion at the representation level and the decision level, and impose the Schatten p-norm constraint on the anchor label tensor and the sample label tensor to realize the bi-clustering of anchors and samples. Extensive experiments demonstrate the effectiveness of our proposed algorithm.
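
The probabilistic reading of the anchor graph described above can be illustrated with a minimal single-view sketch. It is not the MVCPTR algorithm: the softmax-over-distances construction, the function names `anchor_graph` and `propagate_labels`, and the toy points are all assumptions made for illustration. Each row of the anchor graph is treated as a distribution over anchors, so a sample's label distribution is the anchor labels averaged under that distribution.

```python
import math


def anchor_graph(samples, anchors, temperature=1.0):
    """Build a row-stochastic matrix Z with Z[i][j] read as P(anchor j | sample i).

    Assumption: probabilities come from a softmax over negative squared
    Euclidean distances; the paper may construct the graph differently.
    """
    graph = []
    for x in samples:
        logits = [
            -sum((a - b) ** 2 for a, b in zip(x, anchor)) / temperature
            for anchor in anchors
        ]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        graph.append([w / total for w in weights])
    return graph


def propagate_labels(graph, anchor_labels):
    """Sample label distribution = Z times the anchor label matrix."""
    n_classes = len(anchor_labels[0])
    return [
        [sum(z * anchor_labels[j][c] for j, z in enumerate(row)) for c in range(n_classes)]
        for row in graph
    ]


# Hypothetical toy data: three 2-D samples, two anchors with one-hot labels.
samples = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0]]
anchors = [[0.0, 0.0], [1.0, 1.0]]
anchor_labels = [[1.0, 0.0], [0.0, 1.0]]

Z = anchor_graph(samples, anchors)
sample_labels = propagate_labels(Z, anchor_labels)
predicted = [max(range(len(p)), key=p.__getitem__) for p in sample_labels]
```

Because every row of `Z` sums to one, the propagated sample labels remain valid probability distributions, which is the property the probabilistic interpretation relies on.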

Abstract:
Existing few-shot action recognition (FSAR) studies predominantly follow a metric learning framework, where prototypes are generated directly from features extracted by an encoder, and classification is performed via distance-based matching. However, due to the limited number of available samples, significant variations exist between different video features of the same class. As a result, the same query video may yield different classification results when matched against different sets of support videos. To address this issue, we propose a novel Hierarchical Meta-Prototypes Network (HMP-Net). The key innovation of our approach lies in the introduction of a category-agnostic and feature-agnostic meta-prototype module, which guides video feature mapping into a more suitable feature space. To optimize this meta-prototype, we design an alternating meta-prototype training strategy, where the model first learns to transform features under a fixed meta-prototype, and then the meta-prototype is refined to better guide feature mapping. Additionally, to adapt image-based metric learning models to video-based FSAR tasks, we introduce a series of lightweight adaptation modules. Specifically, we integrate an adapter into the encoder to improve video frame feature extraction, design a hierarchical prototype generation mechanism to enhance overall video understanding, and incorporate a task-specific perception module to extract unique features for each task. These adaptations make our model better suited for FSAR, significantly improving performance. We evaluate HMP-Net on five challenging benchmarks, and experimental results demonstrate that our model achieves new state-of-the-art performance on HMDB51, UCF101, Kinetics, and SthSthV2-Small. Extensive empirical evaluations further highlight the effectiveness and robustness of HMP-Net.
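
The metric learning baseline and the support-set sensitivity described above can be sketched as follows. This is only the generic prototype-matching setup that the abstract takes as its starting point, not HMP-Net; the helper names and the 2-D toy features are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def build_prototypes(support):
    """One prototype per class: the mean of that class's support features."""
    return {label: mean_vector(features) for label, features in support.items()}


def classify(query, prototypes):
    """Assign the query to the class with the nearest prototype (Euclidean)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D video features for two classes, drawn as two different
# support sets of the same classes.
support_a = {"run": [[0.0, 0.0], [0.2, 0.0]], "jump": [[1.0, 1.0], [1.0, 0.8]]}
support_b = {"run": [[-0.5, -0.5], [-0.3, -0.5]], "jump": [[0.8, 0.6], [0.6, 0.6]]}
query = [0.5, 0.4]

prediction_a = classify(query, build_prototypes(support_a))
prediction_b = classify(query, build_prototypes(support_b))
```

With these toy features the same query is labeled differently under the two support sets, which is the instability the abstract attributes to having few samples per class.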

Abstract:
Despite progress in pixel-level medical image perception, existing methods remain task-specific or depend on precise prompts like bounding boxes or text. However, the need for medical knowledge limits accessibility for the general public, who are more likely to use logically reasoned oral queries than domain-specific inputs. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for MedSD. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperform traditional medical referring segmentation methods. The MediSee project can be found here.

Abstract:
Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.

Abstract:
Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. Many researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these limitations, we present in this paper an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM converts the text image into a document reconstruction sequence in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and the DocRec1K dataset for assessing performance on the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.

Abstract:
Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K-1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distilled counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrates these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, addressing a critical barrier to understanding vision model scaling behaviors and optimizing computational resource allocation.
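
The sign change in the error difference curve mentioned above can be located with a few lines of code. The sketch below is an illustration under stated assumptions: the difference is taken as non-distilled error minus distilled error (so a positive value means distillation helps), the function name `critical_threshold` is hypothetical, and the numbers are invented placeholders rather than results from the paper.

```python
def critical_threshold(data_sizes, err_non_distilled, err_distilled):
    """Return the first data size where the error difference stops being positive.

    The difference is defined here as (non-distilled error) - (distilled error),
    an assumed sign convention. Returns None if no positive-to-non-positive
    transition is observed.
    """
    previous = None
    for size, e_plain, e_dist in zip(data_sizes, err_non_distilled, err_distilled):
        diff = e_plain - e_dist
        if previous is not None and previous > 0 and diff <= 0:
            return size
        previous = diff
    return None


# Purely illustrative numbers, not measurements from the paper.
data_sizes = [1_000, 10_000, 100_000, 1_000_000]
err_non_distilled = [0.40, 0.28, 0.18, 0.12]
err_distilled = [0.30, 0.24, 0.19, 0.15]

threshold = critical_threshold(data_sizes, err_non_distilled, err_distilled)
```

On this made-up curve the difference turns negative at 100,000 samples, so that size would be reported as the critical data threshold.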

Abstract:
The burgeoning growth of open-source vision-language models (VLMs) has catalyzed a plethora of applications across diverse domains. Ensuring the transparency and interpretability of these models is critical for fostering trustworthy and responsible AI systems. In this study, our objective is to delve into the internals of VLMs to interpret the functions of individual neurons. We observe the activations of neurons with respect to the input visual tokens and text tokens, and reveal some interesting findings. In particular, we find that some neurons respond only to visual information, some only to text information, and some to both; we refer to them as visual neurons, text neurons, and multi-modal neurons, respectively. We build a framework that automates the explanation of neurons with the assistance of GPT-4o. For visual neurons, we additionally propose an activation simulator to assess the reliability of their explanations. Systematic statistical analyses on one representative VLM, LLaVA, uncover the behaviors and characteristics of the different categories of neurons.

Abstract:
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.

Abstract:
Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two crucial factors: large object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify vintage design for a "car" or advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on salient observation cues. Therefore, we propose Progressive Language-based Observations (PLO), which can automatically determine the order of observation cues. These "observation cues" comprise a series of primitive concepts or graduated descriptions that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: a two-step method (PLO-VLM) with a pre-observing classifier dynamically selecting the order of primitive concept-based cues, and a multi-step approach (PLO-LLM) using large language models (LLMs) to craft graduated description-based cues. Extensive tests on three datasets show PLO's effectiveness in compositional recognition.

Abstract:
Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable foundation for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. This non-semantic information primarily manifests as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.
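
The idea of weighting each feature column by its probability of carrying semantic information can be sketched with a weighted cosine similarity. This is an illustrative assumption about what the "embedding interaction" could look like, not PICO's actual formulation; the function name `weighted_cosine`, the hand-set probabilities, and the toy embeddings are hypothetical.

```python
import math


def weighted_cosine(u, v, semantic_prob):
    """Cosine similarity in which column k contributes with weight semantic_prob[k].

    A weight near 1 marks a column as semantic; a weight near 0 suppresses a
    column assumed to carry mostly style information.
    """
    dot = sum(p * a * b for p, a, b in zip(semantic_prob, u, v))
    norm_u = math.sqrt(sum(p * a * a for p, a in zip(semantic_prob, u)))
    norm_v = math.sqrt(sum(p * b * b for p, b in zip(semantic_prob, v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no weighted signal left to compare
    return dot / (norm_u * norm_v)


# Hypothetical matched pair: the first two columns agree (semantics), while the
# third column differs strongly (style).
image_embedding = [1.0, 0.0, 3.0]
text_embedding = [1.0, 0.0, -3.0]

uniform = weighted_cosine(image_embedding, text_embedding, [1.0, 1.0, 1.0])
style_suppressed = weighted_cosine(image_embedding, text_embedding, [1.0, 1.0, 0.0])
```

With uniform weights the style column dominates and the matched pair looks dissimilar; once that column is down-weighted, the similarity reflects only the agreeing semantic columns.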

Abstract:
Referring Expression Comprehension (REC) aims to localize specified entities or regions from the source image according to the given natural language descriptions. While existing methods enable single-entity localization, they overlook modeling the complex inter-entity relationship in more practical multi-entity scenes, which limits their ability to produce accurate and reliable results. Moreover, the lack of high-quality multi-entity datasets incorporating fine-grained and paired image-text-relation annotations also hinders progress on this challenge. To address this problem, we first manually construct a relation-aware multi-entity REC dataset with fine-grained relation and text annotations, namely ReMeX. Additionally, we propose ReMeREC, a novel framework that effectively integrates textual and visual cues to localize multiple entities while capturing their inter-relationship. Specifically, to mitigate the semantic ambiguity arising from the absence of explicit entity boundaries in the source natural language description, we introduce a novel Text-adaptive Multi-entity Perceptron (TMP). TMP dynamically infers both the quantity and span of entities from corresponding fine-grained text cues, thus deriving representations that preserve the unique characteristics of each entity. Meanwhile, we design the Entity Inter-relationship Reasoner (EIR) to enhance semantic distinctiveness relationship modeling, leading to a more profound perception of the global scene. Furthermore, to better capture the fine-grained linguistic prompts for delineating multiple entity boundaries and inter-relationship, we leverage LLMs to generate a small-scale textual dataset, dubbed EntityText, which serves as an effective auxiliary resource and further improves the textual understanding. Extensive experiments conducted on four benchmark datasets demonstrate the superior performance of our framework. Remarkably, ReMeREC achieves outstanding results in multi-entity grounding and complex relationship prediction, outperforming other counterparts by a large margin.

Abstract:
The rapid evolution of deepfake generation technologies necessitates the development of robust face forgery detection algorithms. Recent studies have demonstrated that wavelet analysis can enhance the generalization abilities of forgery detectors. Wavelets effectively capture key facial contours, often slender, fine-grained, and globally distributed, that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear complexity. This efficiency allows for the extraction of fine-grained, globally distributed forgery artifacts from small image patches. Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness in face forgery detection.

Abstract:
The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; and (2) static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We design an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.

Abstract:
Prompt learning has garnered attention for its efficiency over traditional model training and fine-tuning. However, existing methods, constrained by inadequate theoretical foundations, encounter difficulties in achieving causally invariant prompts, ultimately falling short of capturing robust features that generalize effectively across categories. To address these challenges, we introduce the DiCap model, a theoretically grounded Diffusion-based Counterfactual prompt learning framework, which leverages a diffusion process to iteratively sample gradients from the marginal and conditional distributions of the causal model, guiding the generation of counterfactuals that satisfy the minimal sufficiency criterion. Grounded in rigorous theoretical derivations, this approach guarantees the identifiability of counterfactual outcomes while imposing strict bounds on estimation errors. We further employ a contrastive learning framework that leverages the generated counterfactuals, thereby enabling the refined extraction of prompts that are precisely aligned with the causal features of the data. Extensive experimental results demonstrate that our method performs excellently across tasks such as image classification, image-text retrieval, and visual question answering, with particularly strong advantages in unseen categories.

Abstract:
Online video recommendation systems often build binary labels based on play complete rate (i.e., the ratio of watch time to video duration), such as complete play and effective play, using them as implicit feedback for Click-Through Rate (CTR) prediction tasks to gauge user interest. Existing works tend to improve prediction accuracy by designing complex models, overlooking that a key cause of inaccurate predictions is the disorganization of instance representation space. To address this issue, we explore a novel approach using prototype learning to calibrate the instance representation space of deep recommendation models and propose a model-agnostic Contrastive Prototype Framework (CPF). Firstly, CPF partitions the instance space into different subspaces based on duration, then generates positive and negative prototype pairs for each subspace from a pre-trained recommendation model. Subsequently, we map the instance representations to the prototype space and calibrate them by reducing the distance to the corresponding prototypes. Ultimately, the prediction is derived from the linear combination of the estimated values associated with each prototype. To prevent disorganization in the prototype space during training, we design contrastive and orthogonality losses to constrain the learning of prototypes. Additionally, we show how CPF effectively addresses the duration bias from the perspective of causal intervention. Offline experiments on two datasets demonstrate that CPF improves recommendation accuracy over several baseline models in predicting five widely used implicit feedback labels. We have also deployed CPF on a short video platform, validating its effectiveness in real-world scenarios.
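
As a small illustration of the labels mentioned above, the following sketch computes the play complete rate (watch time divided by video duration, as defined in the abstract) and derives two binary labels from it. The threshold values and the function names are hypothetical; the abstract does not state how the platform defines "complete play" or "effective play".

```python
def play_complete_rate(watch_time_s, duration_s):
    """Ratio of watch time to video duration, as defined in the abstract."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return watch_time_s / duration_s


def binary_labels(watch_time_s, duration_s,
                  complete_threshold=1.0, effective_threshold=0.5):
    """Derive binary implicit-feedback labels from the play complete rate.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    rate = play_complete_rate(watch_time_s, duration_s)
    return {
        "complete_play": int(rate >= complete_threshold),
        "effective_play": int(rate >= effective_threshold),
    }


# The same 30 seconds of watch time yields different labels for a
# 60-second and a 300-second video, which hints at the duration bias.
short_video = binary_labels(30, 60)
long_video = binary_labels(30, 300)
print(short_video, long_video)
```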

Abstract:
Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short in aligning free-form language commands with specific scene instances, due to limitations in both instance-level semantic consistency and instruction interpretation. We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose an LLM-assisted Instruction-to-Instance Grounding module that enables fine-grained instance selection by incorporating spatial context and expressive target descriptions. We evaluate OpenMap on ScanNet200 and Matterport3D, covering both semantic mapping and instruction-to-target retrieval tasks. Experimental results show that OpenMap outperforms state-of-the-art baselines in zero-shot settings, demonstrating the effectiveness of our method in bridging free-form language and 3D perception for embodied navigation.

Abstract:
3D Gaussian Splatting (3D-GS) achieves real-time photorealistic novel view synthesis, yet struggles with complex scenes due to over-reconstruction artifacts, manifesting as local blurring and needle-shape distortions. While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the primitive frozen phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. Our approach features: (1) an importance-aware densification criterion incorporating α-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revitalizes frozen primitives through adaptive parameter perturbations. Comprehensive experiments across diverse real-world datasets demonstrate that ReAct-GS effectively eliminates over-reconstruction artifacts and achieves state-of-the-art performance on standard novel view synthesis metrics while preserving intricate geometric details. Additionally, our re-activation mechanism yields consistent improvements when integrated with other 3D-GS variants such as Pixel-GS, demonstrating its broad applicability.

Abstract:
3D Gaussian Splatting (3DGS) has revolutionized 3D scene reconstruction, which effectively balances rendering quality, efficiency, and speed. However, existing 3DGS approaches usually generate plausible outputs and face significant challenges in complex scene reconstruction, manifesting as incomplete holistic structural outlines and unclear local lighting effects. To address these issues simultaneously, we propose a novel decoupled optimization framework, which integrates wavelet decomposition into 3D Gaussian Splatting and 2D sampling. Technically, through 3D wavelet decomposition, our approach divides point clouds into high-frequency and low-frequency components, enabling targeted optimization for each. The low-frequency component captures global structural outlines and manages the distribution of Gaussians through voxelization. In contrast, the high-frequency component restores intricate geometric and textural details while incorporating a relight module to mitigate lighting artifacts and enhance photorealistic rendering. Additionally, a 2D wavelet decomposition is applied to the training images, simulating radiance variations. This provides critical guidance for high-frequency detail reconstruction, ensuring seamless integration of details with the global structure. Extensive experiments on challenging datasets demonstrate our method achieves state-of-the-art performance across various metrics, surpassing existing approaches and advancing the field of 3D scene reconstruction.

Abstract:
Computed Tomography (CT) is a widely utilized imaging modality in clinical settings. Using densely acquired rotational X-ray arrays, CT can capture 3D spatial features. However, it is confronted with challenges such as significant time consumption and high radiation exposure. CT reconstruction methods based on sparse-view X-ray images have garnered substantial attention from researchers as they present a means to mitigate costs and risks. In recent years, diffusion models, particularly the Latent Diffusion Model (LDM), have demonstrated promising potential in the domain of 3D CT reconstruction. Nonetheless, due to the substantial differences between the 2D latent representation of X-ray modalities and the 3D latent representation of CT modalities, the vanilla LDM is incapable of achieving effective alignment within the latent space. To address this issue, we propose the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to efficiently extract latent 3D information from 2D X-ray images and achieve latent space alignment between modalities. Experimental results indicate that CLS-DM outperforms classical and state-of-the-art generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing the effectiveness and economic viability of sparse X-ray reconstructed CT but can also be generalized to other cross-modal transformation tasks, such as text-to-image synthesis. We have made our code publicly available at https://anonymous.4open.science/r/CLS-DM-50D6/ to facilitate further research and applications in other domains.
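
For readers unfamiliar with the voxel-level metrics named above, here is a minimal sketch of PSNR using its standard definition, 10 * log10(MAX^2 / MSE). It is not the authors' evaluation code: flattening the volume into a plain list and normalizing intensities to [0, 1] (`max_value=1.0`) are assumptions for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened volumes.

    Assumes both inputs are equal-length sequences of voxel intensities
    in the range [0, max_value].
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical volumes
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example with four voxels.
reference = [0.0, 0.5, 1.0, 0.5]
reconstruction = [0.1, 0.5, 0.9, 0.5]
score = psnr(reference, reconstruction)
print(round(score, 2))
```

Higher values indicate a reconstruction closer to the reference; SSIM, the other metric named in the abstract, additionally compares local structure and is not sketched here.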

Abstract:
Recent large-scale video datasets have facilitated the generation of diverse videos by Video Diffusion Models (VDMs). Nonetheless, VDMs still struggle to generate some complex actions, which limits how well video generation generalizes. Some researchers attempt to use video editing methods for complex action generation. However, the actions in generated videos are often identical to the reference video, resulting in a lack of diversity. To this end, we first propose Action In-Context Learning (AICL), a novel approach to generate intricate actions by emulating motions from pre-existing videos in the inference stage using a plug-and-play method. Specifically, the Action Perceiver (AP) is introduced to distill action features from reference videos, which requires training on only a small dataset. Leveraging the knowledge from pre-trained VDMs, Action Integration is introduced for incorporating new action features extracted by AP into VDMs through the additional layers. Extensive experiments demonstrate that AICL is not merely replicating the motion from references, and it significantly improves the generation of realistic actions, even in situations where existing VDMs might directly fail.

Abstract:
We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.

Abstract:
Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for developing and training affective and controllable symbolic piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. By design, it supports both unconditional generation and infilling of music performance features. For training, we use a curated, cleaned dataset of 2,968 hours of aligned musical scores and expressive MIDI performances. For text and emotion control, we integrate a piano performance emotion classifier and tune PianoFlow with the emotion-weighted Flan-T5 text embeddings provided as conditional inputs. Objective and subjective evaluations against transformer-based baselines and existing models show that PianoFlow not only outperforms other approaches, but also achieves performance quality comparable to that of human-recorded and transcribed MIDI samples. For emotion control, we present and analyze samples generated under different text conditioning scenarios. The developed model can be integrated into interactive applications, contributing to the creation of more accessible and engaging music performance systems.

Abstract:
Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce Text-to-Unlearnable Example (T2UE), a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of "zero-contact data protection", where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure.

Abstract:
Managing personal health data is a challenge in today's fragmented and institution-centric healthcare ecosystem. Individuals often lack meaningful control over their medical records, which are scattered across incompatible systems and formats. This vision paper presents Health+, a user-centric, multimodal health data management system that empowers individuals (including those with limited technical expertise) to upload, query, and share their data across modalities (e.g., text, images, reports). Rather than aiming for institutional overhaul, Health+ emphasizes individual agency by providing intuitive interfaces and intelligent recommendations for accessing and sharing data. At the system level, it tackles the complexity of storing, integrating, and securing heterogeneous health records, ensuring both efficiency and privacy. By unifying multimodal data and prioritizing patients, Health+ enables a more connected, interpretable, and user-controlled health information ecosystem.

Abstract:
Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset's quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations (https://ftp.itec.aau.at/datasets/GynSurge/).

Abstract:
Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches while simultaneously aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.

Abstract:
Plant phenotyping involves analyzing observable characteristics of plants to better understand their growth, health, and development. In the context of deep learning, this analysis is often approached through single-view classification or regression models. However, these methods often fail to capture all information required for accurate estimation of target phenotypic traits, which can adversely affect plant health assessment and harvest readiness prediction. To address this, the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025 provides a multi-view dataset featuring multiple plants and two tasks: Plant Age Prediction and Leaf Count Estimation. Each plant is photographed from multiple heights and angles, leading to significant overlap and redundancy in the captured information. To learn view-invariant embeddings, we incorporate 24 views, referred to as the selection vector, in a random selection. Our ViewSparsifier approach won both tasks. For further improvement and as a direction for future research, we also experimented with randomized view selection across all five height levels (120 views total), referred to as selection matrices.

Abstract:
The rapid expansion of multimedia services, such as video streaming, video conferencing, virtual reality, and cloud gaming, makes maintaining and evaluating high perceptual visual quality essential for user experience and system competitiveness. However, visual content can degrade at multiple stages, including acquisition, compression, transmission, enhancement, and display, where suboptimal enhancement may also introduce artifacts and reduce perceived quality. The core challenge is to reliably measure and predict this perceived quality so that it can be maintained or improved. Perceptual Visual Quality Assessment (PVQA) addresses this by evaluating visual quality from the perspective of human subjects, through subjective studies and objective prediction models. Beyond humans, recent work also extends PVQA to machines and robots, where the goal is to preserve downstream task performance (e.g., segmentation accuracy and planning success) under distortions or bandwidth constraints. This tutorial provides a concise, practice-oriented overview of PVQA: fundamentals and human vision considerations; image and video quality assessment; methods for immersive/3D media; opportunities and challenges in the era of foundation models and GenAI; perceptual optimization loops that close the gap between assessment and decisions in coding, streaming, and embodied perception; and domain applications. Finally, we summarize the key concepts, toolchains, and future opportunities for PVQA to be used in modern multimedia communication.

Abstract:
Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.
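
The mean IoU figures quoted above can be read against the usual definition of intersection-over-union for a pair of masks. The sketch below is illustrative only: representing masks as sets of (row, column) pixel coordinates and averaging over mask pairs are assumptions, and the Ego-Exo4D benchmark's exact evaluation protocol may differ.

```python
def mask_iou(predicted, target):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = predicted | target
    if not union:
        return 1.0  # both masks empty: treat as a perfect match (a convention, assumed here)
    return len(predicted & target) / len(union)


def mean_iou(pairs):
    """Average IoU over a non-empty list of (predicted, target) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


# Two 2x2 squares overlapping in one column: 2 shared pixels, 6 in the union.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (1, 1), (0, 2), (1, 2)}
single = mask_iou(predicted, target)
average = mean_iou([(predicted, target), (target, target)])
print(single, average)
```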

Abstract:
Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance (e.g., +15.7% and +9.5% on TopoLogic for TOP_ll and TOP_lt on OpenLane-V2 subset_B, respectively).

Abstract:
Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex queries. To evaluate our method, we develop CrossVideoQA, a comprehensive benchmark specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest's superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning.

Abstract:
Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.

Abstract:
Learning from large-scale pre-trained models with strong generalization ability has recently shown remarkable success in a wide range of downstream tasks, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. FSCIL aims to continually learn new concepts from limited training samples without forgetting the old ones. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaptation; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes. Specifically, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-art methods without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
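
The "simple prototype classifier" mentioned above can be sketched as a nearest-class-mean classifier over embeddings. This is a generic illustration under stated assumptions, not the DSS-Prompt code: the embeddings here are toy 2-D vectors standing in for prompted visual embeddings, cosine similarity is an assumed choice, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_prototypes(embeddings, labels):
    """Compute one prototype per class as the mean of that class's embeddings."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine_similarity(embedding, prototypes[label]))


# Base session: prototypes for two classes.
prototypes = class_prototypes(
    [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]], ["cat", "cat", "dog"]
)
# Incremental session: a novel class is added from a single sample by
# computing its prototype; nothing else is retrained.
prototypes.update(class_prototypes([[-1.0, 0.1]], ["bird"]))

print(predict([0.7, 0.3], prototypes), predict([-0.9, 0.0], prototypes))
```

Because adding a class only adds a prototype, old prototypes are untouched, which is one reason this style of classifier is attractive in class-incremental settings.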

Abstract:
Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in videos by exploiting the knowledge of verb and object primitives learned during training. Despite the progress of compositional learning in ZS-CAR, two critical challenges persist: 1) a missing compositional structure constraint, leading to spurious correlations between primitives; 2) a neglected semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework LogicCAR that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantic dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.

Abstract:
Recent advances in vision-language models (VLMs) have demonstrated remarkable capabilities in image classification by leveraging predefined sets of labels to construct text prompts for zero-shot reasoning. However, these approaches face significant limitations in undefined domains, where the label space is vocabulary-unknown and composite. We thus introduce Generative Semantic Labels (GSLs), a novel task that aims to predict a comprehensive set of semantic labels for an image without being constrained by a predefined label set. Unlike traditional zero-shot classification, the GSLs task generates multiple semantic-level labels, encompassing objects, scenes, attributes, and relationships, thereby providing a richer and more accurate representation of image content. In this paper, we propose Chain-of-Action (CoA), an innovative method designed to tackle the GSLs task. CoA is motivated by the observation that enriched contextual information significantly improves generative performance during inference. Specifically, CoA decomposes the GSLs task into a sequence of detailed actions. Each action extracts and merges key information from the previous step, passing enriched context to the next, ultimately guiding the VLM to generate comprehensive and accurate semantic labels. We evaluate the effectiveness of CoA through extensive experiments on widely used benchmark datasets. The results demonstrate significant improvements across key performance metrics, validating the capability of CoA to generate accurate and contextually rich semantic labels. Our work not only advances the state of the art in generative semantic labels but also opens new avenues for applying VLMs in open-ended and dynamic real-world scenarios.

Abstract:
3D open-world classification is a challenging yet essential task in dynamic and unstructured real-world scenarios, requiring robust subsequent knowledge adaptation capabilities. While current approaches predominantly rely on 2D pre-trained models through 3D-to-2D projection, their performance degrades severely under arbitrary object orientations. Unlike these present efforts, this work makes a pioneering exploration of 3D generative models for 3D open-world classification, specifically leveraging the accumulated prior knowledge from these models to provide anchors for novel categories, while integrating a rotation-invariant feature extractor. This innovative synergy endows our pipeline with the advantages of being training-free and pose-invariant, thus well suited to adapt to novel categories in 3D open-world classification. Extensive experiments on benchmark datasets demonstrate the potential of this pipeline, achieving state-of-the-art performance on ModelNet10‡ and McGill‡ with 32.7% and 8.7% overall accuracy improvement, respectively. The code is available in the supplementary materials.

Abstract:
Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a "beacon") which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.

Abstract:
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements.
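
The idea of adjusting inference frequency to scene complexity can be sketched with a simple scheduling rule. This is not the CAI-Gate itself, whose design the abstract does not describe; the linear mapping from a complexity score to a call interval, the score range, and every name below are hypothetical.

```python
def vlm_interval(complexity, min_interval=1, max_interval=10):
    """Map a scene-complexity score in [0, 1] to the number of planner steps between VLM calls.

    Higher complexity gives a shorter interval (more frequent VLM inference).
    Scores outside [0, 1] are clamped. The linear mapping is an assumption.
    """
    c = min(max(complexity, 0.0), 1.0)
    return round(max_interval - c * (max_interval - min_interval))


def schedule_vlm_calls(complexities, min_interval=1, max_interval=10):
    """Return the planner steps at which the VLM would be invoked.

    The real-time planner is assumed to run at every step; the VLM runs on the
    first step and then whenever enough steps have passed for the current scene.
    """
    calls = []
    steps_since_call = None  # None means the VLM has not run yet
    for step, complexity in enumerate(complexities):
        interval = vlm_interval(complexity, min_interval, max_interval)
        if steps_since_call is None or steps_since_call >= interval:
            calls.append(step)
            steps_since_call = 0
        steps_since_call += 1
    return calls


simple_road = schedule_vlm_calls([0.0] * 12)   # low complexity: sparse VLM calls
busy_junction = schedule_vlm_calls([1.0] * 4)  # high complexity: VLM every step
```

The trade-off this exposes is the one the abstract names: fewer VLM calls on simple scenes save computation, while complex scenes get guidance at every step.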

Abstract:
Existing open-vocabulary 3D semantic segmentation methods typically supervise a 3D segmentation model by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, and thus limiting the model's effectiveness. To this end, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum to improve Open-Vocabulary 3D semantic segmentation. The key innovation of our work is a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. Partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modality large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, providing rich and aligned supervision. An auxiliary inter-frame consistency module is introduced during this stage to enforce feature consistency under viewpoint variations and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. To support this, we aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on the ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation. The code will be released.
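
Pixel-wise depth projection can be illustrated with the standard pinhole back-projection of a depth map into camera-frame 3D points. This is a minimal sketch under assumed conventions (a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and non-positive depth meaning "invalid"); the paper's actual pipeline and camera model are not given in the abstract.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map into a list of camera-frame (x, y, z) points.

    `depth` is a list of rows; depth[v][u] is the depth at pixel column u, row v.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # Pinhole model: pixel offset from the principal point, scaled by depth.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A tiny 2x2 depth map with one invalid pixel (toy values).
depth = [
    [1.0, 2.0],
    [0.0, 4.0],
]
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each RGB-D view yields one such partial point cloud; placing several views in a shared frame would additionally require the camera poses, which this sketch leaves out.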

Abstract:
Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradation via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
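
One way to picture a low-rank decomposition of task prompts is to factor each prompt into small task-specific coefficients times a basis shared by all tasks. The split below (shared basis as the task-general part, per-task coefficients as the task-specific part), the shapes, and all names are assumptions for illustration; the abstract does not state the exact factorization.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def build_prompt(task_coeffs, shared_basis):
    """Prompt (tokens x dim) = task_coeffs (tokens x rank) @ shared_basis (rank x dim)."""
    return matmul(task_coeffs, shared_basis)


def parameter_count(num_tasks, tokens, dim, rank):
    """Compare full per-task prompts against the low-rank factorization."""
    full = num_tasks * tokens * dim
    low_rank = num_tasks * tokens * rank + rank * dim
    return full, low_rank


# Toy example: rank 2, feature dimension 3, two prompt tokens for one task.
shared_basis = [[1, 0, 2], [0, 1, 3]]   # shared across tasks (hypothetical values)
derain_coeffs = [[1, 1], [2, 0]]        # specific to one task (hypothetical values)
derain_prompt = build_prompt(derain_coeffs, shared_basis)

full_params, low_rank_params = parameter_count(num_tasks=3, tokens=8, dim=64, rank=4)
```

With these toy sizes the factorized prompts need 352 parameters instead of 1,536, which shows why such a decomposition helps parameter efficiency.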

Abstract:
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.

Abstract:
Ad-hoc Video Search (AVS) involves using a textual query to search for multiple relevant videos in a large collection of unlabeled short videos. The main challenge of AVS is the visual diversity of relevant videos. A simple query such as "Find shots of a man and a woman dancing together indoors" can span a multitude of environments, from brightly lit halls and shadowy bars to dance scenes in black-and-white animations. It is therefore essential to retrieve relevant videos as comprehensively as possible. Current solutions for the AVS task primarily fuse multiple features into one or more common spaces, yet overlook the need for diverse spaces. To fully exploit the expressive capability of individual features, we propose LPD, short for Learning Partially Decorrelated common spaces. LPD incorporates two key innovations: feature-specific common space construction and a de-correlation loss. Specifically, LPD learns a separate common space for each video and text feature, and employs the de-correlation loss to diversify the ordering of negative samples across different spaces. To enhance the consistency of multi-space convergence, we design an entropy-based fair multi-space triplet ranking loss. Extensive experiments on the TRECVID AVS benchmarks (2016-2023) justify the effectiveness of LPD. Moreover, diversity visualizations of LPD's spaces highlight its ability to enhance result diversity.

Abstract:
Existing approaches for image-to-recipe retrieval implicitly assume that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased toward capturing the dominant visual elements, making it difficult to rank similar recipes that differ subtly in the use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is a mix of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.

Abstract:
Financial decision-making often relies on in-depth data analysis across various data sources, such as financial tables, news articles, and stock prices. In this work, we introduce FINTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation (RAG) systems in finance. Built from heterogeneous data of NASDAQ 100 companies, FINTMMBench offers three significant advantages. 1) Multi-modal Corpus: It encompasses a hybrid of financial tables, news articles, daily stock prices, and visual technical charts as the corpus. 2) Temporal-aware Questions: Each question requires the retrieval and interpretation of its relevant data over a specific time period, including daily, weekly, monthly, quarterly, and annual periods. 3) Diverse Financial Analysis Tasks: The questions involve 10 different financial analysis tasks designed by domain experts, such as information extraction, trend analysis, sentiment analysis, and event detection. We further propose a novel TMMHybridRAG method, which first leverages a multi-modal LLM to convert data from other modalities (e.g., tabular, visual and time-series data) into textual format and then incorporates temporal information in each node when constructing graphs and dense indexes. Its effectiveness has been validated in extensive experiments, but notable gaps remain, highlighting the challenges presented by our FINTMMBench. The benchmark and source code will be made publicly available.

Abstract:
Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.

Abstract:
The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA² Instruction Dataset, the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3 subsets and covers various video types, containing 157,755 instruction question-answer pairs. Then, leveraging this foundation, we present the VQA² series models. The VQA² series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and results demonstrate that the VQA² series models achieve excellent performance in both tasks. Notably, our final model, the VQA²-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while maintaining strong competitiveness in quality scoring tasks. Our work provides a foundation and feasible approach for integrating low-level video quality assessment and understanding with LMMs.

Abstract:
Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch-time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch-time estimation method. Additionally, a Deep Reinforcement Learning (DRL)-enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4%-87.4% gain). Furthermore, after deployment on a large-scale commercial short-video platform, DeLoad has increased overall user watch-time by 0.9‰ while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.

Abstract:
Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson's correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
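
A Pearson correlation loss can be sketched as one minus the Pearson coefficient between predicted scores and ground-truth annotations over a batch. The `1 - r` form and the function names are assumptions; the abstract names the loss but not its exact formulation, and a real training setup would use a differentiable tensor library rather than plain Python.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Returns 0.0 when either sequence has zero variance, where r is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        return 0.0
    return cov / denom


def pearson_loss(predicted, target):
    """Loss in [0, 2]: near 0 when predictions are perfectly positively correlated with targets."""
    return 1.0 - pearson_correlation(predicted, target)


# Toy batch: metric outputs versus physical-alignment annotations (made-up numbers).
aligned = pearson_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
reversed_order = pearson_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient is invariant to the scale and offset of the predictions, such a loss rewards getting the relative ordering and spacing of scores right rather than matching absolute annotation values.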

Abstract:
The continual learning (CL) of novel concepts from new environments is a popular and important topic whose central aim is to manage catastrophic forgetting. Prior studies have developed dynamic expansion models to deal with network forgetting in CL. Existing CL models usually explore the full capacity of activating parameters and representations while ignoring the previously learned representations when learning new tasks. In this paper, we propose a novel dynamic expansion model that incrementally accumulates and incorporates all previously learned representations into defining new experts to add to a mixture of experts in a recursive manner, aiming to reuse previously learned parameters and features to promote future task learning. We define a graph structure having each expert as a component node. We then propose a novel expandable expert graph attention mechanism that dynamically optimizes the graph when learning new tasks, maximizing the positive knowledge transfer. In addition, we propose a novel expert cooperation mechanism to promote cooperation between all previous experts and the currently updated expert. Furthermore, we propose a novel memory optimization approach, which encourages each expert to capture and learn completely different information, further improving performance. We provide the results of a series of experiments demonstrating that the proposed approach outperforms state-of-the-art methods in CL.

Abstract:
Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat, the first text-driven Generalizable Gaussian Splatting framework. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.

Abstract:
Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC's superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to 35% in integrity scores while maintaining consistent classification accuracy.

Abstract:
Video colorization is inherently challenging due to the need for accurate color inference and temporal consistency. In this paper, we present ColorDiffuser, an adaptation of a pre-trained text-to-image latent diffusion model for video colorization. By leveraging learned color priors from large-scale training, our method avoids costly retraining and enables controllable colorization via text prompts. To address the adaptation of an image model to video, we propose a novel Short- and Long-distance Cross-Frame Attention (SL-CFA) module combined with an amortized sampling strategy to unify the color latent over time. By incorporating information from nearby and distant frames, the model achieves better consistency for long video sequences, even with problematic disocclusion. To mitigate visual detail loss and color bleeding from compressed latent representations, we introduce a video colorization VAE model that incorporates semantic boundaries and grayscale inputs. Extensive experiments on benchmark datasets demonstrate that ColorDiffuser achieves state-of-the-art performance in color fidelity, temporal consistency, and visual quality, while offering diverse and controllable outputs. Our project page can be accessed at: https://colordiffuser.github.io.

Abstract:
Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench

Abstract:
Cooking process visualization is a promising task at the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods based on the given recipes, and face two challenges in visualizing the cooking process. First, because the appearance of ingredients changes considerably across cooking steps, it is difficult to generate food appearances that match the textual description, leading to semantic inconsistency. Second, because the current step may depend on the operations of the previous step, it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model, called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module that retrieves the previously generated image patches most related to the current textual content as references. Furthermore, to enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantic association between latent prompts and the current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features so that the current cooking step remains coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantically consistent cooking processes.

Abstract:
Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.

Abstract:
The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba's powerful modeling capabilities for streaming data and the memory mechanism's efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.

Abstract:
Artistic styles are defined by both their structural and appearance elements. Existing neural stylization techniques primarily focus on transferring appearance-level features such as color and texture, often neglecting the equally crucial aspect of structural stylization. To address this gap, we introduce DiffArtist, the first 2D stylization method to offer fine-grained, disentangled control over both structure and appearance style strength. This dual controllability is achieved by representing structure and appearance generation as separate diffusion processes, necessitating no further tuning or additional adapters. To properly evaluate this new capability of dual stylization, we further propose a Multimodal LLM-based stylization evaluator that aligns significantly better with human preferences than existing metrics. Extensive analysis shows that DiffArtist achieves superior style fidelity and dual-controllability compared to state-of-the-art methods. Its text-driven, training-free design and unprecedented dual controllability make it a powerful and interactive tool for various creative applications. Project homepage: https://diffusionartist.github.io.

Abstract:
Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with a category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.

Abstract:
We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors. Our project can be found at yifehuang97.github.io/DualMatProjPage/.

Abstract:
The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never degrades its performance. Moreover, its generalization approaches the optimal in the order O(√d ln (n)/n). Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by 7.9% on average over six natural image datasets and by 3.4% on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.

Abstract:
With the advent of diffusion models, Text-to-Image (T2I) generation has seen substantial advancements. Current T2I models allow users to specify object colors using linguistic color names, and some methods aim to personalize color-object association through prompt learning. However, existing models struggle to provide comprehensive control over the color schemes of an entire image, especially for background elements and less prominent objects not explicitly mentioned in prompts. This paper proposes a novel approach to enhance color scheme control by integrating color palettes as a separate guidance mechanism alongside prompt instructions. We investigate the effectiveness of palette guidance by exploring various palette representation methods within a diffusion-based image colorization framework. To facilitate this exploration, we construct specialized palette-text-image datasets and conduct extensive quantitative and qualitative analyses. Our results demonstrate that incorporating palette guidance significantly improves the model's ability to generate images with desired color schemes, enabling a more controlled and refined colorization process.

Abstract:
Emerging immersive display technologies efficiently utilize resources with perceptual graphics methods such as foveated rendering and denoising. Running multiple perceptual graphics methods challenges devices with limited power and computational resources. We propose a computationally-lightweight learned multitasking perceptual graphics model. Given RGB images and text-prompts, our model performs text-described perceptual tasks in a single inference step. Simply daisy-chaining multiple models or training dedicated models can lead to model management issues and exhaust computational resources. In contrast, our flexible method unlocks consistent high quality perceptual effects with reasonable compute, supporting various permutations at varied intensities using adjectives in text prompts (e.g., "mildly", "lightly"). Text-guidance provides ease of use for dynamic requirements such as creative processes. To train our model, we propose a dataset containing source and perceptually enhanced images with corresponding text prompts. We evaluate our model on desktop and embedded platforms and validate perceptual quality through a user study.

Abstract:
Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-Aware Imputation categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on 6 widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.

Abstract:
Recent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the shortcomings. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.

Abstract:
In recent years, foundation models for monocular depth estimation have received increasing attention. Current methods mainly address typical daylight conditions, but their effectiveness notably decreases in low-light environments. There is a lack of robust foundation models for monocular depth estimation specifically designed for low-light scenarios. This largely stems from the absence of large-scale, high-quality paired depth datasets for low-light conditions and of an effective parameter-efficient fine-tuning (PEFT) strategy. To address these challenges, we propose DepthDark, a robust foundation model for low-light monocular depth estimation. We first introduce a flare-simulation module and a noise-simulation module to accurately simulate the imaging process under nighttime conditions, producing high-quality paired depth datasets for low-light conditions. Additionally, we present an effective low-light PEFT strategy that utilizes illumination guidance and multiscale feature fusion to enhance the model's capability in low-light environments. Our method achieves state-of-the-art depth estimation performance on the challenging nuScenes-Night and RobotCar-Night datasets, validating its effectiveness using limited training data and computing resources.

Abstract:
The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.

Abstract:
Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS-R speech foundation model, which serves as a front-end feature extractor. For the downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
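
For readers unfamiliar with CKA, the standard linear form can be sketched in a few lines. The abstract does not state which CKA variant is used or how it enters the training loss, so this is an illustration of the metric only; the function names and the toy feature matrices are hypothetical. Rows are samples and columns are feature dimensions.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature dimension has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    Undefined (division by zero) if either input has constant columns only.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy features from two hypothetical layers for the same four inputs.
layer_a = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0]]
layer_b = [[0.5, 1.0], [1.5, 0.0], [0.0, 2.0], [2.0, 2.5]]
same = linear_cka(layer_a, layer_a)
cross = linear_cka(layer_a, layer_b)
```

A value near 1 means two layers encode nearly the same representation up to rotation and scale, so a diversity objective would push the score between layers down.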

Abstract:
Multi-modal learning has achieved remarkable success by integrating information from various modalities, achieving superior performance in tasks like recognition and retrieval compared to uni-modal approaches. However, real-world scenarios often present novel modalities that are unseen during training due to resource and privacy constraints, a challenge current methods struggle to address. This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We define two cases: Weak MG, where both seen and unseen modalities can be mapped into a joint embedding space via existing perceptors, and Strong MG, where no such mappings exist. To facilitate progress, we propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Extensive experiments highlight the complexity of MG, exposing the limitations of existing methods and identifying key directions for future research. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.

Abstract:
The future of digital marketing lies in the convergence of human creativity and generative AI, where insight, strategy, and storytelling are co-authored by intelligent systems. We present MindFuse, a brave new explainable generative AI framework designed to act as a strategic partner in the marketing process. Unlike conventional LLM applications that stop at content generation, MindFuse fuses CTR-based content AI-guided co-creation with large language models to extract, interpret, and iterate on communication narratives grounded in real advertising data. MindFuse operates across the full marketing lifecycle: from distilling content pillars and customer personas from competitor campaigns to recommending in-flight optimizations based on live performance telemetry. It uses attention-based explainability to diagnose ad effectiveness and guide content iteration, while aligning messaging with strategic goals through dynamic narrative construction and storytelling. We introduce a new paradigm in GenAI for marketing, where LLMs not only generate content but reason through it, adapt campaigns in real time, and learn from audience engagement patterns. Our results, validated in agency deployments, demonstrate up to 12× efficiency gains, setting the stage for future integration with empirical audience data (e.g., GWI, Nielsen) and full-funnel attribution modeling. MindFuse redefines AI not just as a tool, but as a collaborative agent in the creative and strategic fabric of modern marketing. At the end of the paper, we also provide a forward-looking forecast on how platforms like MindFuse and Google DeepMind's ACAI are likely to shape the future of marketing agencies. These systems will not only reset client expectations toward greater transparency, speed, and personalization, but will also redefine the skillsets demanded from the new generation of marketers. As hybrid agencies emerge, blending creative storytelling with data science and AI engineering, the competitive landscape will increasingly hinge on talent. In this new environment, professionals will be expected to pair human imagination with technical fluency, while agencies will need to reinvent themselves as AI facilitators and curators of brand authenticity amidst the ongoing talent war led by platforms such as Meta.

Abstract:
Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4× speedup and over 2× memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.
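
One way to read the query-aware page selection described above is sketched here. It assumes that the "bounding-box metadata" is a per-channel minimum and maximum over the keys in each KV page, which the abstract does not spell out; all function names are hypothetical, and the real system performs this inside a fused CUDA kernel rather than in Python. The score is an upper bound on the query-key dot product for any key inside the page's box.

```python
def page_bounds(keys):
    """Per-channel (min, max) over the key vectors stored in one page.

    In a serving system this metadata would be maintained as the page fills,
    not recomputed per query; it is recomputed here only for brevity.
    """
    mins = [min(channel) for channel in zip(*keys)]
    maxs = [max(channel) for channel in zip(*keys)]
    return mins, maxs


def page_score(query, bounds):
    """Upper bound on q . k for any key k inside the page's bounding box."""
    mins, maxs = bounds
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query, pages, top_k):
    """Return the indices of the top_k highest-scoring pages, in page order."""
    scores = [page_score(query, page_bounds(page)) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Three toy pages, each holding two 2-dimensional key vectors.
pages = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 0.3], [-0.8, -0.1]],
    [[0.1, 2.0], [0.0, 1.5]],
]
query = [1.0, 0.0]
selected = select_pages(query, pages, top_k=2)
```

Only the selected pages would then be loaded for attention. Because the score never underestimates the best key in a page, a page containing a highly relevant key cannot be scored below that key's true dot product.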

Abstract:
The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI data sets focus on details of affordance, often neglecting the influence of physical properties of objects on human long-term motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects' physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interacting strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.

Abstract:
Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To inspire insights into such a problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, it proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.

Abstract:
Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measuring tasks; and (3) an Instruction Fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management plan for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5 VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by a margin of 11.9% and 34.6%.

Abstract:
Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with GPT-4o-refined prompts, creating unprecedentedly realistic scenarios for rigorous robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset's unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, and broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available at https://huggingface.co/datasets/Clarifiedfish/AEGIS.

Abstract:
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.

Abstract:
Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insightful discussion.

Abstract:
Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.

Abstract:
Box-supervised instance segmentation methods aim to achieve instance segmentation with only box annotations. Recent methods have demonstrated the effectiveness of acquiring high-quality pseudo masks under the teacher-student framework. Building upon this foundation, we propose a BoxSeg framework involving two novel and general modules named the Quality-Aware Module (QAM) and the Peer-assisted Copy-paste (PC). The QAM obtains high-quality pseudo masks and better measures the mask quality to help reduce the effect of noisy masks, by leveraging the quality-aware multi-mask complementation mechanism. The PC imitates Peer-Assisted Learning to further improve the quality of the low-quality masks with the guidance of the obtained high-quality pseudo masks. Theoretical and experimental analyses demonstrate the proposed QAM and PC are effective. Extensive experimental results show the superiority of our BoxSeg over the state-of-the-art methods, and illustrate the QAM and PC can be applied to improve other models.

Abstract:
With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.

Abstract:
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMIF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts. We tested and evaluated our model on two common brain disease segmentation benchmarks, including cases of focal cortical dysplasia and gliomas. Our model outperformed existing state-of-the-art methods across four metrics.

Abstract:
The parcellation of Cranial Nerves (CNs) serves as a crucial quantitative methodology for evaluating the morphological characteristics and anatomical pathways of specific CNs. Multi-modal CNs parcellation networks, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI, have achieved promising segmentation performance. However, insufficient exploration of diffusion MRI information has led to low performance of existing multi-modal fusion. In this work, we propose a tractography-guided Dual-label Collaborative Learning Network (DCLNet) for multi-modal CNs parcellation. The key contribution of our DCLNet is the introduction of coarse labels of CNs obtained from fiber tractography through a CN atlas, and collaborative learning with precise labels annotated by experts. Meanwhile, we introduce a Modality-adaptive Encoder Module (MEM) to achieve soft information swapping between structural MRI and diffusion MRI. Extensive experiments conducted on the publicly available Human Connectome Project (HCP) dataset demonstrate performance improvements compared to single-label networks. This systematic validation underscores the effectiveness of dual-label strategies in addressing inherent ambiguities in CNs parcellation tasks.

Abstract:
Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency. Experimental results indicate that AeroDuo achieves an evident 9.71% improvement in success rates compared to existing single-UAV methods, demonstrating the effectiveness of dual-altitude collaboration in balancing environmental coverage, precision, and operational autonomy.

Abstract:
The objective of video frame interpolation (VFI) methods is to enhance video fluency and visual quality by generating intermediate frames between consecutive original frames based on the source video. Recently, diffusion-based VFI methods have made promising progress, with generated results performing well in perceptual quality. However, these methods have not fully explored how to effectively leverage external motion priors to enhance the model's ability to estimate motion information between adjacent frames, which is crucial for VFI models to avoid generating blurry results due to motion ambiguity. In this paper, we propose an Enhanced Motion-Aware latent Diffusion model (EMADiff) for video frame interpolation. Specifically, we integrate motion priors into the decoder of a vector-quantized enhanced motion-aware GAN to guide the information propagation during RGB interpolated frame reconstruction. Furthermore, we propose enhanced motion-aware noising and de-noising procedures. By reducing the discrepancy in attention to motion priors between the forward and reverse processes, our EMADiff effectively utilizes motion priors, alleviates motion ambiguity, and generates realistic content. Comprehensive experiments on benchmark datasets show that EMADiff achieves state-of-the-art performance, surpassing existing approaches and producing visually plausible and content-clear results.

Abstract:
Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.

Abstract:
Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query an instance-level description across a large gallery and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding, enhanced by a Mix of Relation Experts (MORE) module designed to better capture inter-instance relationships. The visual branch detects candidate objects and extracts their instance-level visual features. The most similar candidate to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language Instance Alignment (CLIA) for improved retrieval across images. Experimental results demonstrate that CLARE outperforms existing methods on the REIR benchmark and generalizes well to both TIR and REC tasks, showcasing its effectiveness and versatility.

Abstract:
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.

Abstract:
Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. First, we propose intra-task experts that partition along intermediate hidden dimensions of MLPs, enabling finer decomposition of task information while maintaining parameter efficiency. Second, we introduce shared experts that consolidate common information across different contexts of the same task, reducing redundancy and allowing routing experts to focus on unique aspects. Third, we design a global expert that facilitates adaptive knowledge transfer across tasks based on both input features and task requirements, promoting beneficial information sharing while preventing harmful interference. In addition, we use a fine-tuning approach that trains only the decoder parameters to improve parameter efficiency. Extensive experimental results show that the proposed FGMoE uses fewer parameters and significantly outperforms current competitive MoE-based MTL models on two dense prediction datasets (i.e., NYUD-v2, PASCAL-Context) in various metrics.

Abstract:
Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, where only one or all modalities reflect the true emotion. However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2) AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.

Abstract:
There is a growing need for social robots and intelligent agents that can effectively interact with and support users. For the interactions to be seamless, the agents need to analyse social scenes and behavioural cues from their own (the robot's) perspective. Works that model human-agent interactions in social situations are few, and those that exist are either too computationally intensive to be deployed in real time or perform poorly in real-world scenarios when only limited information is available. We propose a knowledge distillation framework that models social interactions through various multimodal cues, and yet is robust against incomplete and noisy information during inference. We train a teacher model with multimodal input (body, face and hand gestures, gaze, raw images) that transfers knowledge to a student model which relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over competitive baselines on multiple downstream social understanding tasks, even with up to 51% of its input being corrupted. The student model is also highly efficient: it has less than 1% of the teacher model's parameters, and its latency is 11.9% of the teacher model's. Our code and related data are available at github.com/biantongfei/SocialEgoMobile.
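
As an illustration of the teacher-to-student transfer described above, here is a minimal sketch of a standard temperature-scaled distillation loss. The abstract does not specify the objective, so this is an assumption and not the authors' method; the function names, the temperature, and the weighting `alpha` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical objective: soft-target KL term plus hard-label cross-entropy.

    alpha weights the teacher-matching term; (1 - alpha) weights the
    supervised term. Both values are illustrative assumptions.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the student against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradient scale comparable across T.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```

In such a setup the teacher logits would come from the multimodal model and the student logits from the pose-only model, evaluated on the same sample.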

Abstract:
Multimodal recommendation aims to enhance user preference modeling by leveraging rich item content such as images and text. Yet dominant systems fuse modalities in the spatial domain, obscuring the frequency structure of signals and amplifying misalignment and redundancy. We adopt a spectral information-theoretic view and show that, under an orthogonal transform that approximately block-diagonalizes bandwise covariances, the Gaussian Information Bottleneck objective decouples across frequency bands, providing a principled basis for a separate-then-fuse paradigm. Building on this foundation, we propose FITMM, a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition to obtain orthogonal bands, and forms lightweight within-band multimodal components. A residual, task-adaptive gate aggregates bands into the final representation. To control redundancy and improve generalization, we regularize training with a frequency-domain IB term that allocates capacity across bands (Wiener-like shrinkage with shut-off of weak bands). We further introduce a cross-modal spectral consistency loss that aligns modalities within each band. The model is jointly optimized with the standard recommendation loss. Extensive experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.

Abstract:
Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. Current LMMs' ability to generate fine-grained and insightful experiment commentary remains largely under-explored. In this paper, we make the following contributions: (i) We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.

Abstract:
In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of "video, summary and summary description" can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum) that employs a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against SOTA approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's needs about their content.

Abstract:
Emotion alignment between music and palettes is crucial for effective multimedia content, yet misalignment creates confusion that weakens the intended message. Existing methods often generate only a single dominant color, missing emotion variation. Others rely on indirect mappings through text or images, resulting in the loss of crucial emotion details. To address these challenges, we present Music2Palette, a novel method for emotion-aligned color palette generation via cross-modal representation learning. We first construct MuCED, a dataset of 2,634 expert-validated music-palette pairs aligned through Russell-based emotion vectors. To directly translate music into palettes, we propose a cross-modal representation learning framework with a music encoder and color decoder. We further propose a multi-objective optimization approach that jointly enhances emotion alignment, color diversity, and palette coherence. Extensive experiments demonstrate that our method outperforms current methods in interpreting music emotion and generating attractive and diverse color palettes. Our approach enables applications like music-driven image recoloring, video generation, and data visualization, bridging the gap between auditory and visual emotion experiences.

Abstract:
Proactive Deepfake detection via robust watermarks has seen interest ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and performs one-way encryption of the selected parameters. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Moreover, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.

Abstract:
Multimodal Industrial Anomaly Detection (MIAD), which fuses 3D point clouds and 2D RGB images for product defect detection, is critical to quality inspection. However, existing MIAD methods assume all modalities are available and paired, overlooking the missing modalities common in real scenarios and risking overfitting to incomplete data. To address these issues, we conduct the first comprehensive study on Modality-Incomplete Industrial Anomaly Detection (MIIAD) and establish MIIAD Bench, a benchmark covering diverse missing settings. Meanwhile, we propose RADAR, a two-stage Robust modAlity-instructive fusing & Detecting frAmewoRk. RADAR integrates i) a Modality-Incomplete Instruction mechanism that guides the multimodal Transformer to focus more on information from the available modalities, and ii) a Double-Pseudo Hybrid Module to highlight unique modality combinations and reduce overfitting. Our results show RADAR outperforms prior methods markedly on MIIAD Bench.

Abstract:
Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and low-light scenes. To this end, we propose E-4DGS, the first event-driven dynamic Gaussian Splatting approach, for novel view synthesis from multi-view event streams with fast-moving cameras. Specifically, we introduce an event-based initialization scheme to ensure stable training and propose event-adaptive slicing splatting for time-aware reconstruction. Additionally, we employ intensity importance pruning to eliminate floating artifacts and enhance 3D consistency, while incorporating an adaptive contrast threshold for more precise optimization. We design a synthetic multi-view camera setup with six moving event cameras surrounding the object in a 360-degree configuration and provide a benchmark multi-view event stream dataset that captures challenging motion scenarios. Our approach outperforms both event-only and event-RGB fusion baselines and paves the way for the exploration of multi-view event-based reconstruction as a novel approach for rapid scene capture.

Abstract:
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.

Abstract:
Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.

Abstract:
Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address these challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel Prior Pool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the Prior Fusion Module (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior to that of task-specific expert models, while retaining the task scalability needed for seamless adaptation to new tasks.

Abstract:
3D Gaussian Splatting (3DGS) has exhibited remarkable capabilities in 3D scene reconstruction. However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge. The performance of existing 3DGS-based deblurring methods is limited by their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and the inability to effectively control erroneous densification of Gaussian primitives caused by motion blur. To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting (BSGS), to accurately reconstruct 3D scenes from motion-blurred images. BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global Rigid Transformation further corrects motion-induced blur distortions. To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages. Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the art.

Abstract:
3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and Tank&Temples.
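
The gating and element-wise quantization ideas above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: each "expert" is a single linear layer standing in for a lightweight MLP, and `refine_steps` uses a hypothetical sigmoid mapping from the mixed feature to per-element step sizes. All names are illustrative.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_priors(hyperprior, experts, gate):
    """Blend per-expert prior features using softmax gate weights.

    experts: list of (weights, bias) pairs, one per expert.
    gate: a (weights, bias) pair producing one logit per expert.
    """
    gate_weights = softmax(linear(hyperprior, *gate))
    feats = [linear(hyperprior, w, b) for w, b in experts]
    dim = len(feats[0])
    mixed = [sum(g * f[i] for g, f in zip(gate_weights, feats))
             for i in range(dim)]
    return mixed, gate_weights

def refine_steps(base_step, mop_feature):
    """Assumed mapping: expand a scalar step into per-element steps
    in the range (0.5 * base_step, 1.5 * base_step)."""
    return [base_step * (0.5 + 1.0 / (1.0 + math.exp(-f)))
            for f in mop_feature]

def quantize_elementwise(values, steps):
    """Round each value to the nearest multiple of its own step."""
    return [round(v / s) * s for v, s in zip(values, steps)]
```

Here a coarse scalar step is expanded into a per-element step vector guided by the mixed feature, so each attribute value is quantized at its own granularity.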

Abstract:
Signed Graph Neural Networks (SGNNs) are widely adopted to analyze complex patterns in signed graphs with both positive and negative links. Given the noisy nature of real-world connections, the robustness of SGNNs has also emerged as a pivotal research area. Under the supervision of empirical properties, graph structure learning has shown its robustness on signed graph representation learning; however, there remains a paucity of research investigating a robust SGNN with theoretical guidance. Inspired by the success of the graph information bottleneck (GIB) in information extraction, we propose RIDGE, a novel framework for Robust sIgned graph learning through joint Denoising of Graph inputs and supervision targEts. Different from the basic GIB, we extend the GIB theory with the capability of target space denoising to account for the co-existence of noise in both input and target spaces. In instantiation, RIDGE effectively cleanses input data and supervision targets via a tractable objective function produced by a reparameterization mechanism and variational approximation. We extensively validate our method on four prevalent signed graph datasets, and the results show that RIDGE clearly improves the robustness of popular SGNN models under various levels of noise.

Abstract:
We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.

Abstract:
Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g. color schemes, character appearances, layout) and semantic intentions (e.g. emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and composite user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics.

Abstract:
3D digital garment generation and editing play a pivotal role in fashion design, virtual try-on, and gaming. Traditional methods struggle to meet the growing demand due to technical complexity and high resource costs. Learning-based approaches offer faster, more diverse garment synthesis based on specific requirements and reduce human effort and time costs. However, they still face challenges such as inconsistent multi-view geometry or textures and heavy reliance on detailed garment topology and manual rigging. We propose SemanticGarment, a 3D Gaussian-based method that realizes high-fidelity 3D garment generation from text or image prompts and supports semantic-based interactive editing for flexible user customization. To ensure multi-view consistency and garment fitting, we propose to leverage structural human priors for the generative model by introducing a 3D semantic clothing model, which initializes the geometry structure and lays the groundwork for view-consistent garment generation and editing. Without the need to regenerate or rely on existing mesh templates, our approach allows for rapid and diverse modifications to existing Gaussians, either globally or within a local region. To address the artifacts caused by self-occlusion in garment reconstruction from a single image, we develop a self-occlusion optimization strategy to mitigate holes and artifacts that arise when directly animating self-occluded garments. Extensive experiments demonstrate our superior performance in 3D garment generation and editing.

Abstract:
Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.

Abstract:
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable emotion banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.

Abstract:
Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework, CoProSketch, providing prominent controllability and details for sketch generation with diffusion models. A straightforward method is fine-tuning a pretrained image generation diffusion model with binarized sketch images. However, we find that the diffusion models fail to generate clear binary images, making the produced sketches chaotic. We thus propose to represent the sketches by an unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users can generate sketches progressively from rough to detailed, and make timely edits if unsatisfied. Additionally, we curate a large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a solution for integrating user edits into generative workflows.
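
The UDF representation mentioned above can be illustrated with a minimal sketch. It assumes a tiny binary sketch stored as nested lists, computes the field by brute force, and uses a simple threshold as a hypothetical stand-in for the lightweight decoding network; none of the function names come from the paper.

```python
import math

def unsigned_distance_field(sketch):
    """Brute-force UDF: distance from every pixel to the nearest stroke pixel."""
    strokes = [(r, c) for r, row in enumerate(sketch)
               for c, v in enumerate(row) if v]
    if not strokes:
        raise ValueError("sketch has no stroke pixels")
    return [
        [min(math.hypot(r - sr, c - sc) for sr, sc in strokes)
         for c in range(len(row))]
        for r, row in enumerate(sketch)
    ]

def decode_udf(udf, threshold=0.5):
    """Assumed stand-in for the learned decoder: pixels near zero become strokes."""
    return [[1 if d < threshold else 0 for d in row] for row in udf]

# A 3x4 sketch with a short horizontal stroke.
sketch = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
udf = unsigned_distance_field(sketch)
recovered = decode_udf(udf)
print(recovered == sketch)  # True: thresholding the field recovers the strokes
```

The field is continuous in the distance values, which is the property the abstract contrasts with hard binary images; the brute-force search is only practical for toy inputs.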

Abstract:
To advance continuous token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.

Abstract:
Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.

Abstract:
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that applies adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via a pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to the content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) a Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) a training-free adaptation mechanism that transforms pretrained VAE architectures into dynamic VAEs that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.
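
The chunk-wise scheduling idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's scheduler: frames are flat lists of pixel values, the complexity score is a mean absolute frame difference (a simple motion proxy rather than the information-theoretic measure the abstract refers to), and the chunk size, thresholds, and candidate rates are hypothetical.

```python
def chunk_complexity(frames):
    """Mean absolute difference between consecutive frames (assumed motion proxy)."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0

def schedule_frame_rates(frames, chunk_size=4, thresholds=(0.05, 0.2), rates=(1, 2, 4)):
    """Assign each temporal chunk a latent frame rate; more motion gets a higher rate."""
    plan = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        score = chunk_complexity(chunk)
        level = sum(score >= t for t in thresholds)  # index into rates
        plan.append((start, rates[level]))
    return plan

static = [[0.5, 0.5, 0.5]] * 4                      # no motion
moving = [[0.0] * 3, [1.0] * 3, [0.0] * 3, [1.0] * 3]  # strong flicker
plan = schedule_frame_rates(static + moving)
print(plan)  # [(0, 1), (4, 4)]
```

The static chunk is assigned the lowest rate and the high-motion chunk the highest, which mirrors the non-uniform compression the abstract describes.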

Abstract:
Fine-tuning large-scale music audio generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given case of low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music.

Abstract:
In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at https://mm.kaist.ac.kr/projects/AlignDiT.

Abstract:
Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with accuracy improvements of up to 9.71%.

Abstract:
While 3D Gaussian representations (3DGS) have proven effective for modeling the geometry and appearance of objects, their potential for capturing other physical attributes, such as sound, remains largely unexplored. In this paper, we present a novel framework dubbed SonicGauss for synthesizing impact sounds from 3DGS representations by leveraging their inherent geometric and material properties. Specifically, we integrate a diffusion-based sound synthesis model with a PointTransformer-based feature extractor to infer material characteristics and spatial-acoustic correlations directly from Gaussian ellipsoids. Our approach supports spatially varying sound responses conditioned on impact locations and generalizes across a wide range of object categories. Experiments on the ObjectFolder dataset and real-world recordings demonstrate that our method produces realistic, position-aware auditory feedback. The results highlight the framework's robustness and generalization ability, offering a promising step toward bridging 3D visual representations and interactive sound synthesis.

Abstract:
Time series analysis is crucial in diverse scenarios. Beyond forecasting, considerable real-world tasks are categorized into classification, imputation, and anomaly detection, underscoring different capabilities termed time series understanding in this paper. While GPT-style models have been positioned as foundation models for time series forecasting, the BERT-style architecture, which has made significant advances in natural language understanding, has not been fully unlocked for time series understanding, possibly attributed to the undesirable dropout of essential elements of BERT. In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. In addition to a natural adaptation of masked modeling, we propose a parallel task of functional token prediction to embody vital multi-granularity structures. Our model is pre-trained on 260 billion time points across diverse domains. Leveraging multi-granularity representations, TimesBERT achieves state-of-the-art performance across four typical downstream understanding tasks, outperforming task-specific models and language pre-trained backbones, positioning it as a versatile foundation model for time series understanding.

Abstract:
Long-context video understanding in Multimodal Large Language Models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.

Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, a versatile face perception MLLM that provides fine-grained information. Our approach introduces visual textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments show that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.

Abstract:
Tackling toxic behavior in digital communication continues to be a pressing concern for both academics and industry professionals. While significant research has explored toxicity on platforms like social networks and discussion boards, podcasts, despite their rapid rise in popularity, remain relatively understudied in this context. This work seeks to fill that gap by curating a dataset of political podcast transcripts and analyzing them with a focus on conversational structure. Specifically, we investigate how toxicity surfaces and intensifies through sequences of replies within these dialogues, shedding light on the organic patterns by which harmful language can escalate across conversational turns. Warning: Contains potentially abusive/toxic contents.

Abstract:
In large-scale short-video platforms, CDN resource selection plays a critical role in maintaining users' Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in-the-wild network measurements during video playback in a production short-video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video, underscoring a fundamental and dynamic trade-off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short-video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrates QoE and cost in a mathematical model and introduces an intra-video control-theoretic CDN resource selection approach that can balance QoE and cost under network dynamics. To reduce the computational overhead, PIRA employs state-space pruning and adaptive parameter adjustment to efficiently solve the high-dimensional optimization problem. In large-scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in start-up delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
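
The QoE-versus-cost trade-off within a single video can be illustrated with a toy example. The sketch below is a greedy per-segment rule, not PIRA's control-theoretic method: the utility form `qoe - cost_weight * cost`, the CDN names, and all numeric values are assumptions made for illustration.

```python
def select_cdn(candidates, cost_weight):
    """Return the name of the CDN maximizing the assumed utility qoe - cost_weight * cost."""
    best = max(candidates, key=lambda c: c["qoe"] - cost_weight * c["cost"])
    return best["name"]

# Hypothetical per-segment QoE predictions and unit costs for two CDNs.
segments = [
    [{"name": "premium", "qoe": 0.90, "cost": 1.0},
     {"name": "budget", "qoe": 0.70, "cost": 0.4}],
    # The budget CDN's connection quality drops in the second segment.
    [{"name": "premium", "qoe": 0.95, "cost": 1.0},
     {"name": "budget", "qoe": 0.30, "cost": 0.4}],
]

choices = [select_cdn(seg, cost_weight=0.5) for seg in segments]
print(choices)  # ['budget', 'premium']
```

Re-evaluating the choice per segment is what lets the selection follow quality fluctuations inside one video; a larger `cost_weight` shifts the choices toward the cheaper CDN.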

Abstract:
Skin Neglected Tropical Diseases (NTDs) impose severe health and socioeconomic burdens in impoverished tropical communities. Yet, advancements in AI-driven diagnostic support are hindered by data scarcity, particularly for underrepresented populations and rare manifestations of NTDs. Existing dermatological datasets often lack the demographic and disease spectrum crucial for developing reliable recognition models of NTDs. To address this, we introduce eSkinHealth, a novel dermatological dataset collected on-site in Côte d'Ivoire and Ghana. Specifically, eSkinHealth contains 5,623 images from 1,639 cases and encompasses 47 skin diseases, focusing uniquely on skin NTDs and rare conditions among West African populations. We further propose an AI-expert collaboration paradigm to implement foundation language and segmentation models for efficient generation of multimodal annotations, under dermatologists' guidance. In addition to patient metadata and diagnosis labels, eSkinHealth also includes semantic lesion masks, instance-specific visual captions, and clinical concepts. Overall, our work provides a valuable new resource and a scalable annotation framework, aiming to catalyze the development of more equitable, accurate, and interpretable AI tools for global dermatology.

Abstract:
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click," existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities.

Abstract:
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.

Abstract:
We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated, particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.

Abstract:
Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more realistic underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.

Abstract:
Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce an RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise features of the point cloud are progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37× compared to the state-of-the-art method PointSSC [51]. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

Abstract:
Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists' goals. To capture the nuances of radiologists' varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent's ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.

Abstract:
Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intention to enhance security monitoring. However, existing discrete classification methods fail to capture the continuous progression of suspicious intentions, limiting early intervention and explainability. In this paper, we reconceptualize hidden intention modeling by shifting from discrete classification to continuous regression and propose the Suspicion Progression Analysis Network (SPAN), which captures the fluctuations and progression of hidden intentions over time. Specifically, when analyzing the temporal progression of suspicion, we discover that suspicion exhibits long-term dependency and cumulative effects across extended sequences, characteristics significantly similar to the settings in Temporal Point Process (TPP) theory. Based on these insights, we formalize a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also propose Suspicion Coefficient Modulation to adjust suspicion coefficients using multimodal information, reflecting different effects of suspicious actions. Notably, we introduce a Concept-Anchored Mapping method to quantify associations between suspicious actions and predefined intention concepts, enabling understanding of not just actions occurring but also their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, indicating superior capability in capturing subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling method enables earlier detection and more proactive interventions, substantially enhancing both system explainability and practical utility in security applications.
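
The link between cumulative suspicion and temporal point processes can be made concrete with a small sketch. The abstract does not state SPAN's formula, so the code below assumes a Hawkes-style form in which each past action adds a contribution that decays exponentially over time; the function name, the decay rate, and the per-action weights (a loose stand-in for modulated suspicion coefficients) are all hypothetical.

```python
import math

def suspicion_score(t, events, base=0.0, decay=0.5):
    """Assumed Hawkes-style score: base level plus decaying contributions of past actions.

    events is a list of (time, weight) pairs; only actions at or before t count.
    """
    return base + sum(w * math.exp(-decay * (t - ti)) for ti, w in events if ti <= t)

# Two hypothetical suspicious actions with different weights.
events = [(1.0, 0.4), (2.0, 0.6)]

before = suspicion_score(0.5, events)   # nothing has happened yet
first = suspicion_score(1.0, events)    # right after the first action
second = suspicion_score(2.0, events)   # the second action accumulates on the first
later = suspicion_score(10.0, events)   # contributions have mostly decayed
print(before, first, round(second, 3), round(later, 3))
```

The score is a continuous function of time that rises with each action and fades afterwards, which is the kind of regression target the abstract contrasts with discrete class labels.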

Abstract:
Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.
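
The two principles named above, minimizing intra-class variation and maximizing inter-class separation, can be sketched as a toy objective. This is a generic centroid-based illustration on 2-D feature vectors, not MiraGe's loss: the function names, the margin, and the sample features are assumptions, and the text-embedding anchors described in the abstract are not modeled.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def discriminative_loss(real, fake, margin=1.0):
    """Intra-class spread plus a hinge penalty when class centroids are closer than margin."""
    c_real, c_fake = centroid(real), centroid(fake)
    intra = (sum(sq_dist(v, c_real) for v in real)
             + sum(sq_dist(v, c_fake) for v in fake)) / (len(real) + len(fake))
    inter = max(0.0, margin - sq_dist(c_real, c_fake))
    return intra + inter

# Tight, well-separated classes versus fully overlapping ones.
separated = discriminative_loss([[0.0, 0.0], [0.0, 0.2]], [[2.0, 2.0], [2.0, 2.2]])
overlapping = discriminative_loss([[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(separated < overlapping)  # True
```

Features that cluster tightly per class and sit far apart yield a low value, while overlapping embeddings are penalized by both terms.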

Abstract:
Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional Modality Adapter module (MoA) to enable cross-modal information transfer during encoding. By leveraging MoA to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MoA with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our anonymous code is https://anonymous.4open.science/r/StitchFusion_V2-E777

Abstract:
Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (DroneVehicle, LLVIP, VisDrone, and MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios. Attack demos are provided in the supplementary materials.

Abstract:
Graph fraud detection has garnered significant attention as Graph Neural Networks (GNNs) have proven effective in modeling complex relationships within multimodal data. However, existing graph fraud detection methods typically use preprocessed node embeddings and predefined graph structures to reveal fraudsters, which ignore the rich semantic cues contained in raw textual information. Although Large Language Models (LLMs) exhibit powerful capabilities in processing textual information, it remains a significant challenge to perform multimodal fusion of processed textual embeddings with graph structures. In this paper, we propose a Multi-level LLM Enhanced Graph Fraud Detection framework called MLED. In MLED, we utilize LLMs to extract external knowledge from textual information to enhance graph fraud detection methods. To integrate LLMs with graph structure information and enhance the ability to distinguish fraudsters, we design a multi-level LLM enhanced framework including type-level enhancer and relation-level enhancer. One is to enhance the difference between the fraudsters and the benign entities, the other is to enhance the importance of the fraudsters in different relations. The experiments on four real-world datasets show that MLED achieves state-of-the-art performance in graph fraud detection as a generalized framework that can be applied to existing methods.

Abstract:
Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base-category training and open-vocabulary inference poses challenges for discriminative modeling of latent unseen categories. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which has become a bottleneck for OVSS. In this work, we initiate a probing experiment to explore the distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework that employs a latent semantic-aware "agent" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing latent semantic saliency.

Abstract:
Medical image restoration tasks aim to recover high-quality images from degraded observations and are in urgent demand in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle to produce reconstruction results in a computationally efficient way. Moreover, they usually ignore the reliability of the restoration results, which is an even more pressing concern in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via the fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugate symmetry property of the FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results in various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
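
The saving that GFCA draws on comes from a general property of the Fourier transform: for a real-valued signal the spectrum is conjugate symmetric, so roughly half of the coefficients determine the rest. The minimal Python sketch below illustrates only that property, using a naive DFT. It is not the GFCA solver, and the function names (`dft`, `half_spectrum`, `restore_full_spectrum`) are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def half_spectrum(signal):
    """Keep only the first n // 2 + 1 coefficients (the non-redundant half)."""
    return dft(signal)[: len(signal) // 2 + 1]


def restore_full_spectrum(half, n):
    """Rebuild all n coefficients using X[n - k] = conj(X[k])."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
half = half_spectrum(signal)
rebuilt = restore_full_spectrum(half, len(signal))
print(len(half), "of", len(signal), "coefficients stored")
```

Any operation applied per frequency therefore only needs to touch the stored half, which is the intuition behind the "nearly half" figure in the abstract.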

Abstract:
The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.

Abstract:
Egocentric Video Question Answering (Egocentric VideoQA), which refers to answering questions based on first-person videos, plays an important role in egocentric video understanding. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC3) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization module. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for the textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply a contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, reaching state-of-the-art performance on both benchmarks.
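
The contrastive step in this abstract can be illustrated with an InfoNCE-style loss. This is a sketch under assumptions: the abstract does not give the exact loss form, so the cosine similarity, the temperature value, and the function names are illustrative choices rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is close to the positive
    and far from every negative (temperature is an assumed hyperparameter)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


original = [1.0, 0.0]            # feature of the original sample
paraphrased = [0.9, 0.1]         # counterfactual positive (e.g. paraphrased event)
distractors = [[0.0, 1.0], [-1.0, 0.0]]  # counterfactual negatives

well_aligned = contrastive_loss(original, paraphrased, distractors)
misaligned = contrastive_loss(original, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(well_aligned < misaligned)
```

Minimizing this quantity pulls original and positive features together while pushing negatives away, which matches the behavior the abstract describes.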

Abstract:
Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from a lack of multidomain outdoor perception data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named SVM-City, derived from multi-Scale scenarios with multi-View and multi-Modal instruction tuning data. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design an LVLM named City-VLM. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than by directly implementing explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs by 18.14% on average in question-answering tasks. Our method demonstrates practical and generalizable performance across multiple outdoor scenes.

Abstract:
Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation. We propose FingER, a novel entity-level reasoning evaluation framework that first automatically generates Fine-grained Entity-level questions and then answers those questions with a Reasoning model that outputs scores, which can subsequently be combined by weighted summation into an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier for MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using GRPO with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of 11.8% on GenAI-Bench and 5.5% on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our codes and datasets have been released.

Abstract:
Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve their task-solving and generalization capabilities. Previous work mostly collects a mixture of existing visual instruction datasets via heuristic ways for training (even more than a million instructions), which may introduce data redundancy and increase the training cost. To investigate this, we conduct a series of empirical studies, which show that greatly reducing the number of instructions from several tasks does not even affect the performance, indicating significant redundancy within the visual instruction datasets. Based on these findings, we propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, based on gradient-based influence functions, we estimate the influence score of each instance on its corresponding task and the task difficulty score. Then, we leverage these scores to determine the task proportion within the visual instruction subset and to select high-value instances for each task. Experiments on various LVLMs show that our approach, using only about 15% of the data, can achieve comparable performance to the full-data fine-tuned model across eight benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.
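
The two-level selection in TIVE (task proportions first, then instances within each task) can be sketched as follows. This is a simplified illustration under stated assumptions: the scores are taken as given rather than computed from gradients, the budget is split in direct proportion to task difficulty, and instances are picked by top-k influence. None of these specifics, nor the name `select_subset`, come from the abstract.

```python
def select_subset(task_difficulty, instance_influence, budget):
    """Pick a subset of instances per task.

    task_difficulty:    {task: difficulty score} (assumed precomputed)
    instance_influence: {task: {instance_id: influence score}} (assumed precomputed)
    budget:             total number of instances to keep
    """
    total_difficulty = sum(task_difficulty.values())
    subset = {}
    for task, difficulty in task_difficulty.items():
        # Assumption: harder tasks receive a proportionally larger share.
        quota = int(budget * difficulty / total_difficulty)
        ranked = sorted(
            instance_influence[task].items(), key=lambda kv: kv[1], reverse=True
        )
        subset[task] = [instance_id for instance_id, _ in ranked[:quota]]
    return subset


difficulty = {"caption": 1.0, "vqa": 3.0}
influence = {
    "caption": {"c1": 0.2, "c2": 0.9, "c3": 0.5},
    "vqa": {"v1": 0.4, "v2": 0.1, "v3": 0.8, "v4": 0.6},
}
chosen = select_subset(difficulty, influence, budget=4)
print(chosen)
```

In this toy setting the harder task receives three of the four slots, and within each task the highest-influence instances are kept.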

Abstract:
In the realms of computer vision and natural language processing, Multimodal Large Language Models (MLLMs) have become indispensable tools, proficient in generating textual responses based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. Empirical experiments underscore the persistence of this bias, as MLLMs often provide confident answers even in the absence of relevant images or given incongruent visual inputs. To rectify these biases and redirect the model's focus toward visual information, we propose two simple, training-free strategies. First, for tasks such as classification or multi-choice question answering, we introduce a "Post-Hoc Debias" method using an affine calibration step to adjust the output distribution. This approach ensures uniform answer scores when the image is absent, acting as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to "Visual Debias Decoding", which mitigates bias by contrasting token log-probabilities conditioned on a correct image versus a meaningless one. Additionally, our investigation sheds light on the instability of MLLMs across various decoding configurations. Through systematic exploration of different settings, we achieve significant performance improvements, surpassing previously reported results, and raise concerns about the fairness of current evaluation practices. Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.
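
Both strategies can be sketched in a few lines of Python. The forms used here are assumptions chosen to satisfy the properties stated in the abstract: the calibration divides each answer probability by its image-free prior (a diagonal affine map with zero bias) so that the image-free case becomes uniform, and the decoding score uses a contrastive weight `alpha` that the abstract does not specify. The function names are hypothetical.

```python
import math


def post_hoc_debias(p_with_image, p_without_image):
    """Rescale answer probabilities by the image-free prior and renormalize.

    Applying this to the image-free distribution itself yields a uniform
    distribution, which is the property the abstract describes.
    """
    scaled = [p / q for p, q in zip(p_with_image, p_without_image)]
    total = sum(scaled)
    return [s / total for s in scaled]


def visual_debias_scores(logp_image, logp_blank, alpha=1.0):
    """Contrast token log-probabilities given the real image against those
    given a meaningless image (alpha is an assumed contrast strength)."""
    return [(1 + alpha) * a - alpha * b for a, b in zip(logp_image, logp_blank)]


# Multi-choice example: the LLM prior favors option 0 even without an image.
prior = [0.7, 0.2, 0.1]
with_image = [0.6, 0.3, 0.1]
calibrated = post_hoc_debias(with_image, prior)
print(calibrated.index(max(calibrated)))

# Decoding example over a three-token vocabulary.
logp_image = [math.log(0.5), math.log(0.4), math.log(0.1)]
logp_blank = [math.log(0.8), math.log(0.1), math.log(0.1)]
scores = visual_debias_scores(logp_image, logp_blank)
print(scores.index(max(scores)))
```

In both toy cases the choice moves from the option favored by the language prior (index 0) to the one that the image actually supports (index 1).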

Abstract:
Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content and use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing SOTA methods.
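
The contrast between static top-k selection and the sliding-window strategy can be sketched as a simple loop. This is only an illustration: `is_sufficient` stands in for the model's self-reflective sufficiency check, whose actual form the abstract does not describe, and the window size is an assumed parameter.

```python
def gather_evidence(ranked_passages, is_sufficient, window=2):
    """Walk down the ranked list one window at a time, stopping as soon as
    the accumulated evidence is judged sufficient.

    ranked_passages: passages ordered from most to least relevant
    is_sufficient:   hypothetical callback taking the evidence collected so far
    """
    selected = []
    for start in range(0, len(ranked_passages), window):
        selected.extend(ranked_passages[start:start + window])
        if is_sufficient(selected):
            break
    return selected


ranked = ["a", "b", "c", "d", "e"]

# The answer-bearing passage "c" sits below a static top-2 cutoff.
adaptive = gather_evidence(ranked, lambda evidence: "c" in evidence)
exhaustive = gather_evidence(ranked, lambda evidence: False)
print(adaptive, exhaustive)
```

A static top-2 cutoff would miss passage "c" here, whereas the windowed loop extends the evidence set only as far as needed.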

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of seen attributes and objects. Current CLIP-based methods in CZSL, despite their advancements, often fail to effectively understand and link the attributes and objects due to inherent limitations in CLIP's pretraining mechanisms. To address these shortcomings, this paper introduces a novel framework, Understanding and Linking Attributes and Objects (ULAO) in CZSL, which comprises two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding by sequential primitive prediction and leveraging recognized objects as contextual hints for attribute classification. Concurrently, the Linking Attributes and Objects (LAO) module improves the attribute-object linkage understanding through a new contrastive learning strategy that incorporates tailored hard negative generation and adaptive loss adjustments. We demonstrate our model's superiority by showcasing its state-of-the-art performance across three benchmark datasets in both Closed-World (CW) and Open-World (OW) scenarios.

Abstract:
The versatility of diffusion models in generating customized images from few samples raises significant privacy concerns, particularly regarding unauthorized modifications of private content. This concerning issue has renewed efforts to develop protection mechanisms based on adversarial attacks, which generate effective perturbations to poison diffusion models. Our work is motivated by the observation that these models exhibit a high degree of abstraction within their semantic latent space (termed 'h-space'), which encodes critical high-level features for generating coherent and meaningful content. In this paper, we propose a novel anti-customization approach, called HAAD (h-space based Adversarial Attack for Diffusion models), that leverages adversarial attacks to craft perturbations based on the h-space that can efficiently degrade the image generation process. Building upon HAAD, we further introduce a more efficient variant, HAAD-KV, that constructs perturbations solely based on the KV parameters of the h-space. This strategy offers stronger protection at a lower computational cost. Despite their simplicity, our methods outperform state-of-the-art adversarial attacks, highlighting their effectiveness.

Abstract:
Underwater images typically suffer from severe colour distortions, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adaptively handle diverse degradation conditions and fail to leverage underwater-specific physical priors effectively. In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. Given a degraded underwater image as input, we first predict its degradation level using a lightweight dual-stream convolutional network, generating a continuous degradation score as semantic guidance. Based on this score, we introduce a novel conditional diffusion-based restoration network with a Swin UNet backbone, enabling adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss function that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with SOTA approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments.

Abstract:
Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively inspired Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into structurally valid and renderable representations and advancing machine understanding of scientific graphics.

Abstract:
In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset in text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as the anchor for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset at low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance for image classification tasks, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.

Abstract:
Depression poses significant challenges to patients and healthcare organizations, necessitating efficient assessment methods. Existing paradigms typically focus on patient-doctor interaction and overlook multi-role interactions, such as family involvement in the evaluation and caregiving process. Moreover, current automatic depression detection (ADD) methods usually model depression detection as a classification or regression task, lacking interpretability for the decision-making process. To address these issues, we developed InterMind, a doctor-patient-family interactive depression assessment system empowered by large language models (LLMs). Our system enables patients and families to contribute descriptions, generates assistive diagnostic reports for doctors, and provides actionable insights, improving diagnostic precision and efficiency. To enhance LLMs' performance in psychological counseling and diagnostic interpretability, we integrate retrieval-augmented generation (RAG) and chain-of-thought (CoT) techniques for data augmentation, which mitigates the hallucination issue of LLMs in specific scenarios after instruction fine-tuning. Quantitative experiments and professional assessments by clinicians validate the effectiveness of our system.

Abstract:
This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose the Large Brain Language Model (LBLM), pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose the Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in the temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.2% accuracy on semantic-level classification and 42.3% on word-level classification, outperforming baseline methods substantially. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.

Abstract:
Multimodal recommender systems (MRS) improve recommendation performance by integrating complementary semantic information from multiple modalities. However, the assumption of complete multimodality rarely holds in practice due to missing images and incomplete descriptions, hindering model robustness and generalization. To address these challenges, we introduce a novel method called I3-MRec, which uses Invariant learning with the Information bottleneck principle for Incomplete Modality Recommendation. To achieve robust performance in missing modality scenarios, I3-MRec enforces two pivotal properties: (i) cross-modal preference invariance, ensuring consistent user preference modeling across varying modality environments, and (ii) compact yet effective multimodal representation, since modality information becomes unreliable in such scenarios and reducing the dependence on modality-specific information is therefore particularly important. By treating each modality as a distinct semantic environment, I3-MRec employs invariant risk minimization (IRM) to learn preference-oriented representations. In parallel, a missing-aware fusion module is developed to explicitly simulate modality-missing scenarios. Built upon the Information Bottleneck (IB) principle, the module aims to preserve essential user preference signals across these scenarios while effectively compressing modality-specific information. Extensive experiments conducted on three real-world datasets demonstrate that I3-MRec consistently outperforms existing state-of-the-art MRS methods across various modality-missing scenarios, highlighting its effectiveness and robustness in practical applications.
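
The idea of treating each modality as an environment and applying IRM can be illustrated with the IRMv1 penalty, which is the squared gradient of each environment's risk with respect to a dummy scale fixed at 1. The sketch below assumes a squared-error risk so that this gradient has a closed form; the recommendation loss actually used by I3-MRec is not stated in the abstract, and the function names and `penalty_weight` are illustrative.

```python
def irm_penalty(predictions, targets):
    """IRMv1 penalty for one environment under squared-error risk.

    With risk R(w) = mean((w * f - y) ** 2), the gradient at w = 1 is
    mean(2 * (f - y) * f); the penalty is that gradient squared.
    """
    n = len(predictions)
    grad = sum(2.0 * (f - y) * f for f, y in zip(predictions, targets)) / n
    return grad ** 2


def irm_objective(environments, penalty_weight=1.0):
    """Sum of per-environment risk plus weighted invariance penalty.

    environments: list of (predictions, targets) pairs, one per modality
                  environment (an assumed data layout for this sketch).
    """
    total = 0.0
    for predictions, targets in environments:
        n = len(predictions)
        risk = sum((f - y) ** 2 for f, y in zip(predictions, targets)) / n
        total += risk + penalty_weight * irm_penalty(predictions, targets)
    return total


visual_env = ([1.0, 2.0], [1.0, 2.0])   # predictions match targets
textual_env = ([2.0], [1.0])            # predictions are off in this environment
print(irm_objective([visual_env, textual_env]))
```

The penalty vanishes only when the same predictor is simultaneously optimal in every environment, which is how IRM encourages representations that do not depend on any single modality.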

Abstract:
Visual art understanding requires joint modeling of multiple perspectives and contextual inference rooted in cultural, historical, and stylistic knowledge. Recent multimodal large language models (MLLMs) demonstrate strong performance in generic captioning, primarily based on object recognition and training on large-scale generic data. They struggle in providing captions incorporating the multiple perspectives that fine art demands. In this work, we introduce ArtRAG, a novel training-free framework that integrates structured knowledge into a retrieval-augmented generation (RAG) pipeline for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, themes, movements, and historical events into a rich, interpretable knowledge graph. At inference time, a multi-granular structured context retriever selects semantically and topologically relevant subgraphs to guide explanation generation. This approach enables MLLMs to produce contextually grounded, multi-perspective descriptions. Experiments on the SemArt and Artpedia datasets demonstrate that ArtRAG outperforms existing heavily trained baselines. Human evaluations further confirm ArtRAG's ability to generate coherent, informative, and culturally enriched interpretations of artworks.

Abstract:
Understanding cultural heritage through technology faces challenges in connecting with diverse audiences, especially when interpreting art across cultures. In this work, we present CultiVerse, a visual analytics system that leverages Large Language Models (LLMs) to support cross-cultural appreciation of Traditional Chinese Paintings (TCPs). CultiVerse operates within a mixed-initiative framework and guides users through three stages: extracting cultural context, aligning cross-cultural symbols, and extrapolating meaning in the viewer's cultural frame. By combining an interactive interface with LLM-powered analysis, the system enables deeper engagement with symbolic meanings and encourages serendipitous cross-cultural discoveries. Our approach bridges AI interpretation and human insight to foster mutual understanding in a multicultural setting. A curated TCP dataset supports exploration, while empirical evaluations confirm that CultiVerse enhances user understanding, interpretation accuracy, and cultural empathy.

Abstract:
Recent forward prediction-based learned video compression (LVC) methods have achieved impressive results, even surpassing the VVC reference software VTM under the Low Delay B (LDB) configuration. In contrast, learned bidirectional video compression (BVC) remains underexplored and still lags behind its forward-only counterparts. This performance gap is mainly due to the limited ability to extract diverse and accurate contexts: most existing BVCs primarily exploit temporal motion while neglecting non-local correlations across frames. Moreover, they lack the adaptability to dynamically suppress harmful contexts arising from fast motion or occlusion. To tackle these challenges, we propose BiECVC, a BVC framework that incorporates diversified local and non-local context modeling along with adaptive context gating. For local context enhancement, BiECVC reuses high-quality features from lower layers and aligns them using decoded motion vectors without introducing extra motion overhead. To model non-local dependencies efficiently, we adopt a linear attention mechanism that balances performance and complexity. To further mitigate the impact of inaccurate context prediction, we introduce Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, to dynamically filter contextual information based on conditional coding results. Extensive experiments demonstrate that BiECVC achieves state-of-the-art performance, reducing the bit-rate by 13.4% and 15.7% compared to VTM 13.2 under the Random Access (RA) configuration with intra periods of 32 and 64, respectively. To our knowledge, BiECVC is the first learned video codec to surpass VTM 13.2 RA across all standard test datasets.

Abstract:
In 3D point cloud object tracking, motion-centric methods have emerged as a promising avenue due to their superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temporal-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress Attention enhances foreground semantics via motion-salient feature gating and suppresses background noise based on the temporal-aware motion context from IMM, without explicit segmentation. Based on these two designs, FocusTrack enables end-to-end training with a compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that FocusTrack achieves new SOTA performance while running at a high speed of 105 FPS.

Abstract:
Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult because part of the local residual information in corrupted frames may mislead feature completion and subsequent content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.

Abstract:
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.

Abstract:
Advancements in Generative AI offer new opportunities for FashionAI, surpassing traditional recommendation systems that often lack transparency and struggle to integrate expert knowledge, leaving the potential for personalized fashion styling untapped. To address these challenges, we present PAFA (Principle-Aware Fashion), a multi-granular knowledge base that organizes professional styling expertise into three levels: metadata, domain principles, and semantic relationships. Using PAFA, we develop StePO-Rec, a knowledge-guided method for multi-step outfit recommendation. StePO-Rec provides structured suggestions using a scenario-dimension-attribute framework, employing recursive tree construction to align recommendations with both professional principles and individual preferences. A preference-trend re-ranking system further adapts to fashion trends while maintaining the consistency of the user's original style. Experiments on the widely used personalized outfit dataset IQON show a 28% increase in Recall@1 and a 32.8% increase in MAP. Furthermore, case studies highlight improved explainability, traceability, result reliability, and the seamless integration of expertise and personalization.

Abstract:
Human-Object Interaction (HOI) detection involves detecting human-object pairs and predicting their interactions. However, it faces significant challenges due to the complexity of human behavior and the diverse contexts in which interactions occur. Contextual cues, such as the participants involved, body language, and the surrounding environment, are crucial for accurately identifying interactions, particularly those that are ambiguous or previously unseen. In this paper, we propose ConCue, a novel approach that integrates contextual cue generation with feature extraction to enhance HOI detection. Specifically, we design specialized prompts tailored for Large Vision-Language Models (VLMs), enabling the generation of rich contextual cues from images. These cues are then seamlessly integrated into HOI detection through a feature extraction module with a multi-tower architecture we developed, which effectively incorporates contextual information into both instance and interaction detection processes. Extensive experimental results demonstrate the effectiveness of ConCue. Integrating ConCue with state-of-the-art HOI methods leads to significant performance improvements on two widely used benchmark datasets, highlighting the potential of our approach in advancing HOI detection.

Abstract:
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.

Abstract:
In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor-pose-guided methods are typically confined to synthesizing only simple motion patterns. To generate more controllable and precise human motions, we propose ProMoGen (Progressive Motion Generation), a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions deliver only precise action guidance, without displacement. This decoupling enables independent refinement of both aspects, resulting in more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, recognizing that direct learning from sparse motions is inherently unstable, we introduce SAP-CL (Sparse Anchor Posture Curriculum Learning), a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectories and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.

Abstract:
Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated region in the self-attention map despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. We first identify subject-specific regions from the self-attention map and term them attention zones. Our method then introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each attention zone is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these attention zones to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.

Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Abstract:
Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.

Abstract:
We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.
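The explicit rotation and translation of the eyeball structure can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `rotate_eyeball`, the yaw/pitch convention (pitch about the x-axis, then yaw about the y-axis), and the restriction to Gaussian centers are all assumptions made for illustration. A full 3DGS pipeline would also have to rotate each Gaussian's orientation.

```python
import math

def rotate_eyeball(centers, eye_center, yaw, pitch, translation=(0.0, 0.0, 0.0)):
    """Rigidly rotate 3D points about eye_center, then translate them.

    centers: iterable of (x, y, z) Gaussian centers (hypothetical layout).
    yaw, pitch: angles in radians (assumed convention, see lead-in).
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    ex, ey, ez = eye_center
    tx, ty, tz = translation
    moved = []
    for x, y, z in centers:
        # Express the point relative to the eyeball center.
        x, y, z = x - ex, y - ey, z - ez
        # Pitch: rotation about the x-axis.
        y, z = cp * y - sp * z, sp * y + cp * z
        # Yaw: rotation about the y-axis.
        x, z = cy * x + sy * z, -sy * x + cy * z
        # Return to world coordinates and apply the translation.
        moved.append((x + ex + tx, y + ey + ty, z + ez + tz))
    return moved
```

Because the transform is rigid, every point keeps its distance to the eyeball center, which is the property an explicit eyeball model relies on.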

Abstract:
Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates an Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.

Abstract:
Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.
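The semantic IoU idea can be sketched in a few lines. The sketch assumes both edits have already been segmented into per-pixel label maps (nested lists of integer class IDs). The function name `semantic_iou` and the averaging over classes are illustrative assumptions, not necessarily the exact definition used in the paper.

```python
def semantic_iou(mask_a, mask_b, ignore_label=None):
    """Mean per-class IoU between two equally sized label maps."""
    if len(mask_a) != len(mask_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    ):
        raise ValueError("label maps must have the same shape")
    inter, union = {}, {}
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            # A pixel belongs to the union of every class present in either map.
            for label in {a, b}:
                if label == ignore_label:
                    continue
                union[label] = union.get(label, 0) + 1
                # It belongs to the intersection only if both maps agree.
                if a == b:
                    inter[label] = inter.get(label, 0) + 1
    if not union:
        return 1.0  # nothing to compare
    return sum(inter.get(c, 0) / union[c] for c in union) / len(union)
```

Under this reading, a lower score between the edit of the original image and the edit of the protected image indicates stronger disruption of the spatial layout.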

Abstract:
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) while maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
Notably, when integrated with Qwen2.5VL-7B, DTD achieves a 5.7-point accuracy improvement on the challenging VideoMME subset containing videos of 30-60 minutes, while reducing video tokens by 84.6%. Project page: https://timechat-online.github.io.
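The intuition behind dropping static tokens can be shown with a toy sketch. It assumes each frame is a list of patch feature vectors, that change is measured as the Euclidean distance to the same patch position in the previous frame, and that a fixed threshold decides what counts as static. The function name `differential_token_drop` and all three choices are assumptions for illustration, not the DTD module itself.

```python
import math

def differential_token_drop(frames, threshold):
    """Return the (frame_index, patch_index) pairs of tokens that are kept.

    frames: list of frames, each a list of equal-length feature vectors.
    A token is kept if it belongs to the first frame or if it moved more
    than `threshold` away from the same patch in the previous frame.
    """
    kept = []
    previous = None
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if previous is None:
                kept.append((t, p))  # first frame: keep everything
                continue
            change = math.dist(token, previous[p])
            if change > threshold:
                kept.append((t, p))  # meaningful temporal change
        previous = frame
    return kept
```

For example, with three two-patch frames in which only one patch changes once, half of the six tokens are dropped:

```python
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 2.0]],
    [[0.0, 0.0], [1.0, 2.0]],
]
kept = differential_token_drop(frames, threshold=0.5)
drop_ratio = 1 - len(kept) / sum(len(f) for f in frames)
```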

Abstract:
Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is the allocation of sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search that can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the Layerwise Pruning Sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these discoveries, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning metric dependency, and uniform layerwise redundancy level in the pruned model. To this end, we proposed Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers (i.e., those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including LLaMA2 and OPT, on various benchmarks. The experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
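The iterative loop described above can be sketched on toy data. The abstract does not define the non-outlier ratio or the pruning metric, so this sketch makes explicit assumptions: a weight counts as a non-outlier if its magnitude is at most `k` times the mean magnitude of the layer's remaining weights, pruning removes the smallest-magnitude weights, and `step` weights are pruned per iteration. The names `non_outlier_ratio` and `max_redundancy_prune` are hypothetical.

```python
def non_outlier_ratio(weights, k=2.0):
    """Fraction of remaining (non-zero) weights that are not outliers (assumed definition)."""
    alive = [abs(w) for w in weights if w != 0.0]
    if not alive:
        return 0.0
    mean = sum(alive) / len(alive)
    return sum(1 for a in alive if a <= k * mean) / len(alive)

def max_redundancy_prune(layers, target_sparsity, step=1, k=2.0):
    """Repeatedly prune the layer with the highest non-outlier ratio."""
    layers = [list(layer) for layer in layers]  # do not modify the input
    total = sum(len(layer) for layer in layers)

    def sparsity():
        return sum(1 for layer in layers for w in layer if w == 0.0) / total

    while sparsity() < target_sparsity:
        # Pick the currently most redundant layer.
        idx = max(range(len(layers)), key=lambda i: non_outlier_ratio(layers[i], k))
        alive = sorted((abs(w), j) for j, w in enumerate(layers[idx]) if w != 0.0)
        if not alive:
            break  # nothing left to prune anywhere
        # Assumed pruning metric: remove the smallest-magnitude weights.
        for _, j in alive[:step]:
            layers[idx][j] = 0.0
    return layers
```

In this toy setting, a layer that contains a large outlier weight is pruned later than a layer of uniformly small weights, which reflects the non-uniform allocation the principles call for.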

Abstract:
Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, which is usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporally partial forged segments using only video-level annotations. MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which enables refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It can identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To further explore temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss is proposed, which aims to enlarge the deviation between adjacent segments of forged samples and reduce that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.

Abstract:
Real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona's effectiveness and generality.

Abstract:
Recently, large vision-language models (LVLMs) have unleashed powerful analysis capabilities for low Earth orbit (LEO) satellite Earth observation images in the data center. However, fast satellite motion, brief satellite-ground station (GS) contact windows, and the large size of the images pose a data download challenge. To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we explore how to deploy LVLMs in LEO satellite networks and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. To this end, we first deploy compact LVLMs on satellites for lightweight tasks, whereas regular LVLMs operate on GSs to handle computationally intensive tasks. Then, we propose a computing and communication co-design framework comprising a progressive confidence network and an attention-based multi-scale preprocessing module, which are used to identify data suitable for on-satellite inference and to reduce data redundancy before satellite-GS transmission, respectively. We implement and evaluate SpaceVerse on real-world LEO satellite constellations and datasets, achieving a 31.2% average gain in accuracy and a 51.2% reduction in latency compared to state-of-the-art baselines.

Abstract:
Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiments, showing that QAT can lead to a significant degradation in OOD generalization performance. Further, we find that this problem can be attributed to a contradiction: a flat loss landscape is believed to give rise to superior OOD generalization, whereas QAT leads to a sharp loss landscape. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict issue between dual optimization objectives (i.e., vanilla QAT and flatness). ii) FQAT proposes a disorder-guided adaptive freezing algorithm to dynamically determine which layers to freeze at each training step, effectively addressing the challenges caused by interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on influential OOD benchmarks demonstrate the superiority of our method over state-of-the-art baselines under both I.D and OOD image classification tasks.

Abstract:
Federated Learning (FL) enables collaborative model training while preserving data privacy, but it is highly vulnerable to backdoor attacks. Most existing defense methods in FL have limited effectiveness due to their neglect of the model's over-reliance on backdoor triggers, particularly as the proportion of malicious clients increases. In this paper, we propose FedBAP, a novel defense framework for mitigating backdoor attacks in FL by reducing the model's reliance on backdoor triggers. Specifically, first, we propose a perturbed trigger generation mechanism that creates perturbation triggers precisely matching backdoor triggers in location and size, ensuring strong influence on model outputs. Second, we utilize these perturbation triggers to generate benign adversarial perturbations that disrupt the model's dependence on backdoor triggers while forcing it to learn more robust decision boundaries. Finally, we design an adaptive scaling mechanism to dynamically adjust perturbation intensity, effectively balancing defense strength and model performance. The experimental results demonstrate that FedBAP reduces the attack success rates by 0.22%-5.34%, 0.48%-6.34%, and 97.22%-97.6% under three types of backdoor attacks, respectively. In particular, FedBAP demonstrates outstanding performance against novel backdoor attacks.

Abstract:
Real-time streaming of point cloud video, characterized by massive data volumes and high sensitivity to packet loss, remains a key challenge for immersive applications under dynamic network conditions. While connection-oriented protocols such as TCP and more modern alternatives like QUIC alleviate some transport-layer inefficiencies, including head-of-line blocking, they still retain a coarse-grained, segment-based delivery model and a centralized control loop that limit fine-grained adaptation and effective caching. We introduce INDS (Incremental Named Data Streaming), an adaptive streaming framework based on Information-Centric Networking (ICN) that rethinks delivery for hierarchical, layered media. INDS leverages the Octree structure of point cloud video and expressive content naming to support progressive, partial retrieval of enhancement layers based on consumer bandwidth and decoding capability. By combining time-windows with Group-of-Frames (GoF), INDS's naming scheme supports fine-grained in-network caching and facilitates efficient multi-user data reuse. INDS can be deployed as an overlay, remaining compatible with QUIC-based transport infrastructure as well as future Media-over-QUIC (MoQ) architectures, without requiring changes to underlying IP networks. Our prototype implementation shows up to 80% lower delay, 15-50% higher throughput, and 20-30% increased cache hit rates compared to state-of-the-art DASH-style systems. Together, these results establish INDS as a scalable, cache-friendly solution for real-time point cloud streaming under variable and lossy conditions, while its compatibility with MoQ overlays further positions it as a practical, forward-compatible architecture for emerging immersive media systems.
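The idea of progressive, partial retrieval through content naming can be sketched as follows. The name layout `/inds/<video>/w<window>/gof<gof>/L<depth>` and the function `layer_names` are hypothetical, since the abstract does not specify the naming scheme. The sketch only illustrates how a consumer might request the base Octree layer plus as many enhancement layers as a byte budget allows.

```python
def layer_names(video, window, gof, layer_sizes_bytes, budget_bytes):
    """Names to request for one Group-of-Frames under a byte budget.

    layer_sizes_bytes[0] is the base layer and is always requested;
    deeper Octree layers are added in order while they fit the budget.
    The name layout is a hypothetical example, not the INDS scheme.
    """
    names, used = [], 0
    for depth, size in enumerate(layer_sizes_bytes):
        if depth > 0 and used + size > budget_bytes:
            break  # the next enhancement layer does not fit
        used += size
        names.append(f"/inds/{video}/w{window}/gof{gof}/L{depth}")
    return names
```

Because every consumer that wants the same layer of the same GoF issues the same name, an in-network cache can serve repeated requests without contacting the producer.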

Abstract:
Automated singing assessment is crucial for education, entertainment, and talent discovery. However, existing systems are hindered by two fundamental limitations: first, their reliance on reference tracks (e.g., the original song), which stifles creative expression, and second, their simplification of complex vocal performances into a single, often non-diagnostic score based on pitch and rhythm. This paradigm fails to capture the nuanced, multifaceted attributes that define expert-level singing. Echoing the recent shift in other AI domains from discriminative to descriptive evaluation, we advocate for a new paradigm in singing assessment. This paper aims to build a complete ecosystem for reference-free, multi-dimensional, and descriptive singing assessment. First, we construct Sing-MD, a large-scale, multi-dimensional singing dataset annotated by experts across four core dimensions: breath control, timbre quality, emotional expression, and vocal technique. Analysis of this dataset reveals a key finding: significant annotation inconsistencies among experts, which challenges the validity of traditional accuracy-based evaluation metrics. Second, standard Multimodal Large Language Models (MLLMs) are unable to analyze full-length songs on resource-constrained, consumer-grade hardware due to memory limitations. This challenge leads to a "human label-audio input mismatch" problem and results in poor performance. To address this issue, we designed VocalVerse, an efficient hybrid architecture. It leverages a lightweight acoustic encoder and specialized modules to process the entire song, thereby learning global performance features, modeling long-term dependencies, and ultimately overcoming this limitation.
Third, to address the shortcomings of automated metrics, we establish a new evaluation benchmark-H-TPR (Human-in-the-loop Tiered Perceptual Ranking)-which evaluates a model's ability to generate perceptually valid performance rankings, rather than predicting a noisy ''ground-truth'' score. Our comprehensive experiments show that on the H-TPR benchmark, our VocalVerse framework can effectively learn and distinguish singing quality across different dimensions, thereby creating perceptually valid quality rankings and significantly outperforming existing baselines. Furthermore, our framework for multi-dimensional scoring and descriptive feedback generation has been successfully commercialized and deployed at scale, demonstrating its significant real-world impact and practical value.

Abstract:
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP builds on the multi-label Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.

Abstract:
Foundation models have transformed multimedia analysis by enabling robust and transferable representations across diverse modalities and tasks. However, their static deployment conflicts with growing societal and regulatory demands, particularly the need to unlearn specific data upon request, as mandated by privacy frameworks such as the GDPR. Traditional unlearning approaches, including retraining, activation editing, or distillation, are often computationally expensive, fragile, and ill-suited for real-time or continuously evolving systems. In this paper, we propose a paradigm shift: rethinking unlearning not as a retroactive intervention but as a built-in capability. We introduce a prompt-based learning framework that unifies knowledge acquisition and removal within a single training phase. Rather than encoding information in model weights, our approach binds class-level semantics to dedicated prompt tokens. This design enables instant unlearning simply by removing the corresponding prompt, without retraining, model modification, or access to original data. Experiments demonstrate that our framework preserves predictive performance on retained classes while effectively erasing forgotten ones. Beyond utility, our method exhibits strong privacy and security guarantees: it is resistant to membership inference attacks, and prompt removal prevents any residual knowledge extraction, even under adversarial conditions. This ensures compliance with data protection principles and safeguards against unauthorized access to forgotten information, making the framework suitable for deployment in sensitive and regulated environments. Overall, by embedding removability into the architecture itself, this work establishes a new foundation for designing modular, scalable and ethically responsive AI models.
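The idea of binding class-level semantics to removable prompts can be pictured with a toy example. The sketch below is not the authors' architecture: the `PromptBank` class, the use of plain vectors as stand-ins for learned prompt tokens, and dot-product scoring are all assumptions chosen to keep the illustration short.

```python
class PromptBank:
    """Toy store that maps each class label to its own prompt vector."""

    def __init__(self):
        # class label -> prompt vector (stand-in for learned prompt tokens)
        self.prompts = {}

    def learn(self, label, vector):
        """Bind a class to a dedicated prompt."""
        self.prompts[label] = list(vector)

    def unlearn(self, label):
        """Forget a class by deleting its prompt; nothing else is modified."""
        self.prompts.pop(label, None)

    def predict(self, feature):
        """Return the label whose prompt has the largest dot product with the feature."""
        if not self.prompts:
            return None

        def score(label):
            return sum(f * p for f, p in zip(feature, self.prompts[label]))

        return max(self.prompts, key=score)
```

After `unlearn("cat")`, the bank can no longer output `"cat"` for any input, while the prompts of the retained classes are untouched.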

Abstract:
Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems, from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses: adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations, and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
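Why norm-preserving operations limit sensitivity can be checked numerically: a stack of orthogonal linear maps cannot amplify an input perturbation. The sketch below uses 2x2 rotation matrices as the simplest orthogonal maps. It covers linear layers only and is an assumption-laden illustration, not the architecture described in the abstract.

```python
import math


def rotation(theta):
    """2x2 rotation matrix: a simple example of an orthogonal (norm-preserving) map."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean norm."""
    return math.sqrt(sum(a * a for a in vector))


def forward(layers, x):
    """Apply each linear layer in order."""
    for layer in layers:
        x = matvec(layer, x)
    return x


def output_shift(layers, x, delta):
    """Norm of the change in output caused by perturbing the input by delta."""
    clean = forward(layers, x)
    perturbed = forward(layers, [a + b for a, b in zip(x, delta)])
    return norm([p - c for p, c in zip(perturbed, clean)])
```

For any stack of such layers, `output_shift(layers, x, delta)` equals `norm(delta)` up to floating-point error, so a small adversarial perturbation stays small at the output.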

Abstract:
Singing accent research is underexplored compared to speech accent studies, primarily due to the scarcity of suitable datasets. Existing singing datasets often suffer from detail loss, frequently resulting from the vocal-instrumental separation process. Additionally, they often lack regional accent annotations. To address this, we introduce the Multi-Accent Mandarin Dry-Vocal Singing Dataset (MADVSD). MADVSD comprises over 670 hours of dry vocal recordings from 4,026 native Mandarin speakers across nine distinct Chinese regions. In addition to each participant recording audio of three popular songs in their native accent, they also recorded phonetic exercises covering all Mandarin vowels and a full octave range. We validated MADVSD through benchmark experiments in singing accent recognition, demonstrating its utility for evaluating state-of-the-art speech models in singing contexts. Furthermore, we explored dialectal influences on singing accent and analyzed the role of vowels in accentual variations, leveraging MADVSD's unique phonetic exercises.

Abstract:
With the rapid progress of Multimodal LLMs, evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM's ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark focused on assessing MLLMs' reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even SOTA models struggle with real-world math tasks, lagging behind human performance and highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee similar effectiveness on real-world tasks. This underscores the necessity of MathScape in the next stage of multimodal mathematical reasoning.

Abstract:
Deception detection has garnered increasing attention in recent years due to the significant growth of digital media and heightened ethical and security concerns. It has been extensively studied using multimodal methods, including video, audio, and text. In addition, individual differences in deception production and detection are believed to play a crucial role. Although some studies have utilized individual information such as personality traits to enhance the performance of deception detection, current systems remain limited, partly due to a lack of sufficient datasets for evaluating performance. To address this issue, we introduce MDPE, a multimodal deception dataset. Besides deception features, this dataset also includes individual-difference information on personality and emotional expression characteristics, which makes it possible to explore the impact of individual differences on deceptive behavior. It comprises over 104 hours of deception and emotional videos from 193 subjects. Furthermore, we conducted numerous experiments to provide valuable insights for future deception detection research. MDPE not only supports deception detection but also enables tasks such as personality recognition and emotion recognition, as well as the study of the relationships between them. We believe that MDPE will become a valuable resource for promoting research in the field of affective computing.

Abstract:
To meet the growing demand for systematic surgical training, wet-lab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wet-lab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wet-lab settings. To address these limitations, we introduce WetCat, the first dataset of wet-lab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse (https://www.synapse.org/Synapse:syn66401174/files/).

Abstract:
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.

Abstract:
Understanding the interaction between different drugs (drug-drug interaction or DDI) is critical for ensuring patient safety and optimizing therapeutic outcomes. Existing DDI datasets primarily focus on textual information, overlooking multimodal data that reflect complex drug mechanisms. In this paper, we (1) introduce MUDI, a large-scale Multimodal biomedical dataset for Understanding pharmacodynamic Drug-drug Interactions, and (2) benchmark learning methods to study it. In brief, MUDI provides a comprehensive multimodal representation of drugs by combining pharmacological text, chemical formulas, molecular structure graphs, and images across 310,532 annotated drug pairs labeled as Synergism, Antagonism, or New Effect. Crucially, to effectively evaluate machine-learning based generalization, MUDI consists of unseen drug pairs in the test set. We evaluate benchmark models using both late fusion voting and intermediate fusion strategies. All data, annotations, evaluation scripts, and baselines are released under an open research license.
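Late fusion voting can be illustrated in a few lines: each modality-specific model predicts a label on its own, and the final label is the most frequent one. The function name and the tie-breaking rule below (the earliest-listed label wins) are assumptions for illustration and are not taken from the MUDI baselines.

```python
from collections import Counter


def late_fusion_vote(predictions):
    """Return the most frequent label among per-modality predictions.

    Ties are broken in favour of the label that appears first in the input,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    return Counter(predictions).most_common(1)[0][0]
```

For example, if the text, formula, graph, and image models predict `["Synergism", "Antagonism", "Synergism", "New Effect"]`, the fused prediction is `"Synergism"`.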

Abstract:
Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a rather coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks often rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

Abstract:
This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.

Abstract:
Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply merge the pseudo-boxes generated from LiDAR point clouds and RGB images. Such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome these limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
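The local and global filtering step can be pictured with two generic point-cloud filters. The sketch below assumes that "local radius filtering" keeps points with enough neighbours inside a radius, and that "global statistical filtering" drops points whose distance to the centroid is far above the mean. Both definitions, and every parameter name, are assumptions; the abstract does not give the exact formulation.

```python
import math


def radius_filter(points, radius, min_neighbors):
    """Keep points that have at least min_neighbors other points within radius."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, n_std):
    """Drop points whose distance to the centroid exceeds mean + n_std * std."""
    if not points:
        return []
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = [math.dist(p, centroid) for p in points]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [p for p, d in zip(points, dists) if d <= mean + n_std * std]
```

The radius filter is quadratic in the number of points, which is fine for a sketch; a practical version would use a spatial index such as a k-d tree.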

Abstract:
Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. Pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), but are limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistic and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We use a unified mixed Gaussian representation to integrate the two modalities of 2D image and 3D mesh. Comprehensive experiments demonstrate the superiority of MixedGaussianAvatar. The code will be released.

Abstract:
Stereo image super-resolution (SSR) aims to recover high-resolution details by leveraging information from stereo image pairs. However, the upsampling methods used in existing SSR approaches (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolutions to independently process deep features of different views, lacking cross-view and non-local information perception, which makes it difficult to adaptively select beneficial information from multi-view scenes. In this work, we propose Stereo Implicit Neural Representation (StereoINR), which models stereo image pairs as continuous implicit representations. This continuous representation removes the scale limitation, providing a unified solution for arbitrary-scale stereo super-resolution reconstruction of left-right views. Furthermore, by incorporating spatial warping and cross-attention mechanisms, StereoINR enables effective cross-view information fusion and achieves significant improvements in pixel-level geometric consistency. Extensive experiments on multiple datasets demonstrate that StereoINR outperforms existing methods when upsampling at scales outside the training distribution and matches state-of-the-art SSR methods at scales within the training distribution.

Abstract:
The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP's original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.

Abstract:
In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.

Abstract:
Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis tasks in the real world. This is mainly because they lack sufficient reasoning depth, which causes information loss or logical jumps when processing a large amount of specialized medical data and ultimately leads to diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving the clinical reasoning ability of multi-agent systems in complex medical scenarios. Experimental results on real-world medical data show that our framework can achieve better performance than existing baseline methods.

Abstract:
In the field of HER2 expression level assessment for breast cancer, clinical evaluations often rely on the synergistic analysis of both H&E and IHC stained images. However, acquiring dual-modality images for the same patient is frequently hindered by complex clinical workflows and high costs, resulting in missing modalities. To address this challenge, we propose an adaptive bimodal input prediction framework that flexibly supports both single-modality and dual-modality inputs. This framework employs a dynamic branch selection mechanism to overcome the rigid dependency of existing models on complete inputs, enabling accurate predictions using either H&E or IHC images alone, while retaining the ability for joint inference when both modalities are available. The core technical innovations include: a missing modality branch selector that dynamically activates either a modality completion process or an end-to-end dual-modality inference pipeline based on the available input; and a cross-modal generative adversarial network (CM-GAN) that facilitates context-aware reconstruction of the missing modality in the feature space. This design improves the prediction accuracy from 71.44% to 94.25% when using single-modality H&E images, significantly mitigating performance degradation caused by incomplete information. Experimental results demonstrate that the proposed framework achieves a prediction accuracy of 95.09% with full dual-modality input and maintains a high reliability of 90.28% under single-modality conditions. By adopting this "dual-modality preferred, single-modality compatible" flexible architecture, healthcare institutions can achieve near dual-modality accuracy without mandating synchronized acquisition of both image types. This is particularly valuable for regions with limited IHC staining infrastructure, offering a cost-effective clinical solution and substantially enhancing the accessibility of HER2 expression level assessment.

Abstract:
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.

Abstract:
Despite growing interest in hallucination in Multimodal Large Language Models (MLLMs), existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks (Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination), targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: (1) a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; (2) a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and (3) the influence of same object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing (DAB) mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
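One simple way to adjust inter-image attention while keeping the overall visual attention proportion fixed is to interpolate each image's attention mass toward the uniform share. The sketch below is only an illustration of that constraint: the interpolation rule, the `alpha` parameter, and the function name are assumptions, not the DAB mechanism itself.

```python
def balance_attention(per_image_mass, alpha):
    """Move each image's attention mass toward the uniform share.

    alpha = 0 leaves the distribution unchanged; alpha = 1 makes it uniform.
    The total visual attention mass is preserved for any alpha, because
    (1 - alpha) * total + alpha * total == total.
    """
    total = sum(per_image_mass)
    uniform = total / len(per_image_mass)
    return [(1 - alpha) * mass + alpha * uniform for mass in per_image_mass]
```

With masses `[0.6, 0.1, 0.1]` and `alpha = 0.5`, the first image gives up part of its share to the other two, while the three values still sum to 0.8.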

Abstract:
As Vision-Language Models (VLMs) become increasingly integrated into user-facing applications, they are often deployed in split DNN configurations, where the visual encoder (e.g., ResNet or ViT) runs on user-side devices and only intermediate features are transmitted to the cloud for downstream processing. While this setup reduces communication overhead, the intermediate features contain sensitive information and can therefore expose users to privacy risks. Prior work has attempted to reconstruct images from these features to infer semantics, but such approaches often produce blurry images that obscure semantic details. In contrast, the potential to directly recover high-level semantic content, such as image labels or captions, via a cross-modality inversion attack remains largely unexplored. To address this gap, we propose CapRecover, a general cross-modality feature inversion framework that directly decodes semantic information from intermediate features without requiring image reconstruction. Additionally, CapRecover can be used to reverse engineer traditional neural networks for computer vision tasks, such as ViT, ResNet, and others.

Abstract:
Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.

Abstract:
Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves the performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
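To make the token-saving effect of pooling frame embeddings concrete, the sketch below keeps some frames at full spatial resolution and average-pools the rest. The schedule (every `keep_every`-th frame stays at full resolution), the use of scalars as stand-ins for embedding vectors, and all function names are assumptions; the abstract does not describe the actual selection rule.

```python
def avg_pool(grid, k):
    """Average-pool a 2D grid with a k x k window and stride k."""
    height, width = len(grid), len(grid[0])
    return [
        [
            sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, width - k + 1, k)
        ]
        for i in range(0, height - k + 1, k)
    ]


def progressive_pool(frames, keep_every, k):
    """Keep every keep_every-th frame at full resolution and pool the others."""
    return [
        grid if index % keep_every == 0 else avg_pool(grid, k)
        for index, grid in enumerate(frames)
    ]


def token_count(frames):
    """Number of spatial tokens across all frames."""
    return sum(len(grid) * len(grid[0]) for grid in frames)
```

For four 4x4 frames with `keep_every=2` and `k=2`, the token count drops from 64 to 40, since two frames shrink from 16 tokens to 4 each.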

Abstract:
Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in 'meters' to the equivalent value in 'decimeters' or 'centimeters' leads to severe performance degradation, even though the physical length is unchanged. This observation reveals the weak 3D comprehension of pre-trained language models, which generate misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception in both text embeddings and geometry features with two simple and effective methods. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the 'Far' scenario. Our code will be made publicly available.
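
The unit-equivalence augmentation described for text queries can be sketched as follows. This is only an illustration of the general idea, not the 3DTE implementation: the regular expression, the unit table `UNIT_FACTORS`, and the function names `rewrite_units` and `augment` are all assumptions, and the sketch only handles lengths written as "N meters".

```python
import re
from typing import List

# Conversion factors from meters; the unit list is an illustrative assumption.
UNIT_FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}

# Matches a number followed by the word "meters", e.g. "12.5 meters".
_LENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*meters\b")


def rewrite_units(query: str, unit: str) -> str:
    """Re-express every 'N meters' span in `query` in the target unit."""
    factor = UNIT_FACTORS[unit]

    def convert(match: re.Match) -> str:
        value = float(match.group(1)) * factor
        return f"{value:g} {unit}"

    return _LENGTH.sub(convert, query)


def augment(query: str) -> List[str]:
    """Return one physically equivalent variant of the query per unit."""
    return [rewrite_units(query, unit) for unit in UNIT_FACTORS]


variants = augment("the car 12.5 meters ahead, about 1.6 meters tall")
for text in variants:
    print(text)
```

Each variant describes the same physical scene, so a text encoder that understands units should map all of them to similar features.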

Abstract:
Test-time adaptation (TTA) has emerged as a promising solution to improve the robustness of deep learning models under domain shifts, particularly in real-world scenarios where source data or true labels are unavailable. In this paper, we propose a novel TTA framework tailored for medical vision-language models (VLMs) that leverages a Mixture-of-Experts (MoE) mechanism. Building upon a frozen, pre-trained BiomedCLIP backbone, our method integrates parallel MoE adapters of different medical imaging modalities in each vision MLP block, enabling expert-specific adaptation without disrupting the core model representation. At inference time, only the MoE router is optimized through an entropy-regularized objective, which is further augmented by pseudo-label guidance and adaptive scaling strategies. Additionally, we propose an entropy-aware MoE scaling policy that dynamically adjusts expert influence based on prediction uncertainty, improving model adaptability. Extensive experiments on multiple medical imaging benchmarks demonstrate that our approach achieves substantial performance improvements over existing TTA baselines, while maintaining high efficiency and parameter sparsity. Our results highlight the potential of MoE-enhanced TTA to achieve robust and generalizable medical VLMs in unseen domains without access to source data. Code is available at https://openi.pcl.ac.cn/OpenMedIA/MoME-TTA.git

Abstract:
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention-based video event prediction model, which can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.

Abstract:
The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment and Prompt-guided Semantic-aware Cropping, which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.

Abstract:
Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixture of basic emotions has gradually emerged as a trend. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.

Abstract:
Multimodal large language models (MLLMs) have been widely applied across various fields due to their powerful perceptual and reasoning capabilities. In the realm of psychology, these models hold promise for a deeper understanding of human emotions and behaviors. However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving largely unexplored the substantial potential of emotion reasoning, which is crucial for improving the naturalness and effectiveness of human-machine interactions. Therefore, in this paper, we introduce a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 videos from real-life scenarios, along with 5,101 progressive questions. These questions cover various aspects, including emotion recognition, potential causes of emotions, and future action prediction. Besides, we propose a multi-agent framework, where each agent specializes in a specific aspect, such as background context, character dynamics, and event details, to improve the system's reasoning capabilities. Furthermore, we conduct experiments with existing MLLMs and our agent-based method on the proposed benchmark, revealing that most models face significant challenges with this task.

Abstract:
With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
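
To make the "2 bits per channel" communication cost concrete, the sketch below packs one 2-bit code per channel into bytes. It is an illustration under stated assumptions, not CHORD's actual encoding: the code book `CODE_TO_BITS` (which bit-width each code selects) and the function names are hypothetical.

```python
from typing import List

# Hypothetical code book: each 2-bit code selects one per-channel bit-width.
CODE_TO_BITS = [2, 4, 8, 16]
BITS_TO_CODE = {bits: code for code, bits in enumerate(CODE_TO_BITS)}


def pack_strategy(bit_widths: List[int]) -> bytes:
    """Pack one 2-bit code per channel, four channels per byte (low bits first)."""
    packed = bytearray((len(bit_widths) + 3) // 4)
    for channel, bits in enumerate(bit_widths):
        packed[channel // 4] |= BITS_TO_CODE[bits] << (2 * (channel % 4))
    return bytes(packed)


def unpack_strategy(packed: bytes, num_channels: int) -> List[int]:
    """Inverse of pack_strategy: recover the per-channel bit-widths."""
    return [
        CODE_TO_BITS[(packed[channel // 4] >> (2 * (channel % 4))) & 0b11]
        for channel in range(num_channels)
    ]


strategy = [8, 4, 4, 16, 2, 8]  # toy per-channel precision choices
payload = pack_strategy(strategy)
assert unpack_strategy(payload, len(strategy)) == strategy
print(len(payload), "bytes for", len(strategy), "channels")
```

In this toy case six channels fit in two bytes, whereas sending even a single 32-bit value per channel would take 24 bytes.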

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose SElf-Evolving Distillation (SEED), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.

Abstract:
Large Language Model (LLM)-based agents have demonstrated strong capabilities across a wide range of tasks, and their application in the medical domain holds particular promise due to the demand for high generalizability and reliance on interdisciplinary knowledge. However, existing medical agent systems often rely on static, manually crafted workflows that lack the flexibility to accommodate diverse diagnostic requirements and adapt to emerging clinical scenarios. Motivated by the success of automated machine learning (AutoML), this paper introduces a novel framework for the automated design of medical agent architectures. Specifically, we define a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. Our framework conceptualizes medical agents as graph-based architectures composed of diverse, functional node types and supports iterative self-improvement guided by diagnostic feedback. Experimental results on skin disease diagnosis tasks demonstrate that the proposed method effectively evolves workflow structures and significantly enhances diagnostic accuracy over time. This work represents the first fully automated framework for medical agent architecture design and offers a scalable, adaptable foundation for deploying intelligent agents in real-world clinical environments.

Abstract:
Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.

Abstract:
Multi-view clustering (MVC) for remote sensing data has attracted increasing attention due to its ability to exploit complementary information from multiple modalities without requiring labels. Recent graph-based deep clustering methods have shown strong potential in modeling spatial structures inherent in remote sensing data. However, existing approaches often emphasize capturing rich node relations while overlooking the optimization of these relations, leading to noisy connections and weak inter-cluster discrimination. To address this issue, we propose a novel Multi-view Graph Clustering with dual Relation Optimization (MDRO) framework tailored for remote sensing data. Specifically, we first segment the remote sensing image into irregular superpixels to reduce computational complexity and use superpixels as graph nodes. Then, MDRO constructs high-order similarity matrices guided by clustering distribution matrices and performs dual relation optimization to suppress noise relations and strengthen similarity relations. Furthermore, an optimal transportation-based constraint is introduced to guide the formation of robust and balanced cluster assignments, mitigating over-smoothing and trivial solutions in graph learning. Comprehensive experiments on four benchmark remote sensing datasets demonstrate that MDRO consistently outperforms existing single-view and multi-view clustering methods, achieving superior accuracy and robustness.

Abstract:
Modern artistic productions increasingly demand automated choreography generation that adapts to diverse musical styles and individual dancer characteristics. Existing approaches often fail to produce high-quality dance videos that harmonize with both musical rhythm and user-defined choreography styles, limiting their applicability in real-world creative contexts. To address this gap, we introduce ChoreoMuse, a diffusion-based framework that uses SMPL format parameters and their variation version as intermediaries between music and video generation, thereby overcoming the usual constraints imposed by video resolution. Critically, ChoreoMuse supports style-controllable, high-fidelity dance video generation across diverse musical genres and individual dancer characteristics, including the flexibility to handle any reference individual at any resolution. Our method employs a novel music encoder MotionTune to capture motion cues from audio, ensuring that the generated choreography closely follows the beat and expressive qualities of the input music. To quantitatively evaluate how well the generated dances match both musical and choreographic styles, we introduce two new metrics that measure alignment with the intended stylistic cues. Extensive experiments confirm that ChoreoMuse achieves state-of-the-art performance across multiple dimensions, including video quality, beat alignment, dance diversity, and style adherence, demonstrating its potential as a robust solution for a wide range of creative applications. Video results can be found on our project page: https://choreomuse.github.io.

Abstract:
Volumetric video enables immersive experiences by capturing dynamic 3D scenes, enabling diverse applications for virtual reality, education, and telepresence. However, traditional methods struggle with fixed lighting conditions, while neural approaches face trade-offs in efficiency, quality, or adaptability for relightable scenarios. To address these limitations, we present BEAM, a novel pipeline that bridges 4D Gaussian representations with physically-based rendering (PBR) to produce high-quality, relightable volumetric videos from multi-view RGB footage. BEAM recovers detailed geometry and PBR properties via a series of available Gaussian-based techniques. It first combines Gaussian-based human performance tracking with geometry-aware rasterization in a coarse-to-fine optimization framework to recover spatially and temporally consistent geometries. We further enhance Gaussian attributes by incorporating PBR properties step by step. We generate roughness via a multi-view-conditioned diffusion model, and then derive AO and base color using a 2D-to-3D strategy, incorporating a tailored Gaussian-based ray tracer for efficient visibility computation. Once recovered, these dynamic, relightable assets integrate seamlessly into traditional CG pipelines, supporting real-time rendering with deferred shading and offline rendering with ray tracing. By offering realistic, lifelike visualizations under diverse lighting conditions, BEAM opens new possibilities for interactive entertainment, storytelling, and creative visualization.

Abstract:
Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have greatly improved 3D reconstruction. However, its substantial data size poses a significant challenge for transmission and storage. While many compression techniques have been proposed, they fail to efficiently adapt to fluctuating network bandwidth, leading to resource wastage. We address this issue from the perspective of size-aware compression, where we aim to compress 3DGS to a desired size by quickly searching for suitable hyperparameters. Through a measurement study, we identify key hyperparameters that affect the size - namely, the reserve ratio of Gaussians and bit-width settings for Gaussian attributes. Then, we formulate this hyperparameter optimization problem as a mixed-integer nonlinear programming (MINLP) problem, with the goal of maximizing visual quality while respecting the size budget constraint. To solve the MINLP, we decouple this problem into two parts: discretely sampling the reserve ratio and determining the bit-width settings using integer linear programming (ILP). To solve the ILP more quickly and accurately, we design a quality loss estimator and a calibrated size estimator, as well as implement a CUDA kernel. Extensive experiments on multiple 3DGS variants demonstrate that our method achieves state-of-the-art performance in post-training compression. Furthermore, our method can achieve comparable quality to leading training-required methods after fine-tuning.
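
The two-part decomposition (sample the reserve ratio, then choose bit-widths under the size budget) can be sketched on a toy problem. This is not the authors' solver: exhaustive enumeration stands in for the ILP, and the attribute names, the loss table `QUALITY_LOSS`, the per-Gaussian value counts, and the linear `PRUNE_PENALTY` are all made-up assumptions.

```python
from itertools import product
from typing import Dict, Optional, Tuple

# Hypothetical per-attribute quality loss for each candidate bit-width.
# The attribute names and all numbers are made up for illustration.
QUALITY_LOSS: Dict[str, Dict[int, float]] = {
    "position": {8: 0.90, 12: 0.30, 16: 0.05},
    "color": {4: 0.60, 8: 0.15, 16: 0.02},
    "opacity": {4: 0.20, 8: 0.04, 16: 0.01},
}
# Hypothetical number of scalar values stored per Gaussian for each attribute.
VALUES_PER_GAUSSIAN = {"position": 3, "color": 3, "opacity": 1}
# Hypothetical quality penalty for discarding Gaussians (linear in the dropped share).
PRUNE_PENALTY = 1.0


def size_bits(num_gaussians: int, widths: Dict[str, int]) -> int:
    """Total size in bits of the kept Gaussians at the given bit-widths."""
    return num_gaussians * sum(VALUES_PER_GAUSSIAN[a] * w for a, w in widths.items())


def best_bit_widths(num_gaussians: int, budget_bits: int) -> Optional[Tuple[float, Dict[str, int]]]:
    """Enumerate all bit-width combinations (a stand-in for the ILP) and
    return the lowest-loss one that fits the budget, or None if none fits."""
    names = list(QUALITY_LOSS)
    best = None
    for combo in product(*(QUALITY_LOSS[n] for n in names)):
        widths = dict(zip(names, combo))
        if size_bits(num_gaussians, widths) > budget_bits:
            continue
        loss = sum(QUALITY_LOSS[n][w] for n, w in widths.items())
        if best is None or loss < best[0]:
            best = (loss, widths)
    return best


def search(total_gaussians: int, budget_bits: int, ratios=(0.4, 0.6, 0.8, 1.0)):
    """Outer loop: discretely sample the reserve ratio, solve the inner problem."""
    best = None
    for ratio in ratios:
        kept = round(total_gaussians * ratio)
        inner = best_bit_widths(kept, budget_bits)
        if inner is None:
            continue
        loss = inner[0] + PRUNE_PENALTY * (1.0 - ratio)
        if best is None or loss < best[0]:
            best = (loss, ratio, inner[1])
    return best


result = search(total_gaussians=1000, budget_bits=60_000)
print(result)
```

The sketch shows why the decomposition helps: once the reserve ratio is fixed, the remaining choice is a small discrete problem over bit-widths with a linear size constraint.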

Abstract:
The rapid advancement of generative artificial intelligence has enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and LMME3DHF will be released upon publication.

Abstract:
Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird's Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0× speedup by stereo methods on the RSRD dataset. The source code will be released.

Abstract:
Mindfulness meditation has seen increasing applications in diverse domains as an effective practice to improve mental health. However, the standardized frameworks adopted by most applications often fail to cater to users with various psychological states and health conditions. This limitation arises primarily from the lack of personalization and adaptive content design. To address this, we propose MindfulVerse, an AI-Generated Content (AIGC)-driven application to create personalized and immersive mindfulness experiences. By developing a novel agent, the system can dynamically adjust the meditation content based on the ideas of individual users. Furthermore, we conducted exploratory user studies and comparative evaluations to assess the application scenarios and performance of our novel generative meditation tool in VR environments. The results of this user study indicate that generative meditation improves neural activation in self-regulation and shows a positive impact on emotional regulation and participation. Our approach offers a generative meditation procedure that provides users with an application that better suits their preferences and states.

Abstract:
Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. To realize this goal, the adapted DM should be able to model the specified motion concept without compromising its ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance in the adaptation process of the DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pre-trained text-to-video diffusion models, e.g., learning a motion LoRA or using latent noise residuals. While those methods can encode the motion concept, they also inevitably encode the appearance in reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as the basic components needed to produce a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.

Abstract:
The high computational cost and slow inference time are major obstacles to deploying Video Diffusion Models (VDMs). To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics (e.g., coherence of the entire video), while shallower layers are more focused on individual content (e.g., individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Moreover, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss so that VDMini attains generation performance comparable to that of the larger VDM. In particular, we first use the Individual Content Distillation (ICD) Loss to preserve the consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5×, 1.4×, and 1.25× speedup for the I2V method SF-V, the T2V method T2V-Turbo-v2, and the T2V method HunyuanVideo, while maintaining the quality of the generated videos on several benchmarks including UCF101, VBench-T2V, and VBench-I2V.

Abstract:
Tumor spatial heterogeneity analysis requires precise correlation between Hematoxylin and Eosin (H&E) morphology and immunohistochemical (IHC) biomarker expression, yet current methods suffer from spatial misalignment in consecutive sections, severely compromising in situ pathological interpretation. To obtain a more accurate virtual staining pattern, we propose PRINTER, a weakly-supervised framework that integrates PRototype-drIven content and staiNing patTERn decoupling and deformation-aware adversarial learning strategies designed to accurately learn IHC staining patterns while preserving H&E staining details. Our approach introduces three key innovations: (1) a prototype-driven staining pattern transfer with explicit content-style decoupling; (2) a cyclic registration-synthesis framework, GapBridge, that bridges H&E and IHC domains through deformable structural alignment, where registered features guide cross-modal style transfer while synthesized outputs iteratively refine the registration; and (3) deformation-aware adversarial learning, a training framework in which a generator and a deformation-aware registration network are trained jointly and adversarially with a style-focused discriminator. Extensive experiments demonstrate that PRINTER achieves superior performance in preserving H&E staining details and virtual staining fidelity, outperforming state-of-the-art methods. Our work provides a robust and scalable solution for virtual staining, advancing the field of computational pathology.

Abstract:
Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack a theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, by extending the Adams-Bashforth method to higher orders, we propose a novel caching-based acceleration approach for diffusion models that does not directly reuse cached results and has a truncation error bound of only O(h^k), where h is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method's effectiveness in achieving nearly 3× speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.
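
For readers unfamiliar with the second-order Adams-Bashforth method named in this abstract, the following is a minimal sketch of the textbook scheme on a toy ODE. It is not the paper's caching algorithm: the function names (`ab2_step`, `integrate_ab2`, `integrate_euler`) and the test problem are assumptions made for illustration. The point it shows is that each update is a fixed linear combination of the current and the previous derivative evaluation, so a previously computed (cached) evaluation is reused instead of being recomputed.

```python
import math


def ab2_step(x_curr, f_curr, f_prev, h):
    """One second-order Adams-Bashforth update.

    x_{n+1} = x_n + h * (3/2 * f_n - 1/2 * f_{n-1})
    """
    return x_curr + h * (1.5 * f_curr - 0.5 * f_prev)


def integrate_ab2(f, x0, h, steps):
    """Integrate dx/dt = f(x) for `steps` steps of size h with AB2."""
    # AB2 needs two past evaluations, so bootstrap with one Euler step.
    f_prev = f(x0)
    x = x0 + h * f_prev
    for _ in range(steps - 1):
        f_curr = f(x)  # the only new evaluation per step
        x = ab2_step(x, f_curr, f_prev, h)
        f_prev = f_curr  # keep the evaluation for the next step
    return x


def integrate_euler(f, x0, h, steps):
    """First-order baseline for comparison."""
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x


if __name__ == "__main__":
    decay = lambda x: -x  # toy problem with exact solution exp(-t)
    exact = math.exp(-1.0)
    print("AB2 error:  ", abs(integrate_ab2(decay, 1.0, 0.01, 100) - exact))
    print("Euler error:", abs(integrate_euler(decay, 1.0, 0.01, 100) - exact))
```

On this toy problem the second-order scheme is noticeably more accurate than Euler at the same step size, which is the general motivation for moving to higher-order multistep formulas.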

Abstract:
Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. Our method consists of three key components. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.

Abstract:
Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular, end-to-end framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, NEXUS-O. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings: (1) In the visual understanding task, NEXUS-O exhibits superior performance compared with its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e., MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR test set, NEXUS-O achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), NEXUS-O is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.

Abstract:
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both prompt filter and safety checker, our U3-Attack achieves approximately 4× higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion. Content Warning: This paper includes examples of NSFW content.

Abstract:
Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on the judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. In addition, our approach can be enhanced by exploring non-infringing noise vectors within the diffusion latent space via reinforcement learning, even without modifying the original prompts. Experimental results show that our identification method achieves performance comparable to the state of the art, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method more effectively mitigates memorization and IP infringement without losing non-infringing expressions.

Abstract:
The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at https://real-hd.github.io.

Abstract:
The great success of the diffusion model in image synthesis led to the release of gigantic commercial models, raising the issue of copyright protection and inappropriate content generation. Training-free diffusion watermarking provides a low-cost solution for these issues. However, the prior works remain vulnerable to rotation, scaling, and translation (RST) attacks. Although some methods employ meticulously designed patterns to mitigate this issue, they often reduce watermark capacity, which can result in identity (ID) collusion. To address these problems, we propose MaXsive, a training-free diffusion model generative watermarking technique that has high capacity and robustness. MaXsive best utilizes the initial noise to watermark the diffusion model. Moreover, instead of using a meticulously repetitive ring pattern, we propose injecting the X-shape template to recover the RST distortions. This design significantly increases robustness without losing any capacity, making ID collusion less likely to happen. The effectiveness of MaXsive has been verified on two well-known watermarking benchmarks under the scenarios of verification and identification.

Abstract:
Recently, Deep Learning (DL) models have been increasingly deployed on end-user devices as On-Device AI, offering improved efficiency and privacy. However, this deployment trend poses more serious Intellectual Property (IP) risks, as models are distributed on numerous local devices, making them vulnerable to theft and redistribution. Most existing ownership protection solutions (e.g., backdoor-based watermarking) are designed for cloud-based AI-as-a-Service (AIaaS) and are not directly applicable to large-scale distribution scenarios, where each user-specific model instance must carry a unique watermark. These methods typically embed a fixed watermark, and modifying the embedded watermark requires retraining the model. To address these challenges, we propose Hot-Swap MarkBoard, an efficient watermarking method. It encodes user-specific n-bit binary signatures by independently embedding multiple watermarks into a multi-branch Low-Rank Adaptation (LoRA) module, enabling efficient watermark customization without retraining through branch swapping. A parameter obfuscation mechanism further entangles the watermark weights with those of the base model, preventing removal without degrading model performance. The method supports black-box verification and is compatible with various model architectures and DL tasks, including classification, image generation, and text generation. Extensive experiments across three types of tasks and six backbone models demonstrate our method's superior efficiency and adaptability compared to existing approaches, achieving 100% verification accuracy.

Abstract:
Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.

Abstract:
As researchers continue to optimize AI agents for more effective task execution within operating systems, they often overlook a critical security concern: the ability of these agents to detect "impostors" within their environment. Through an analysis of the agents' operational context, we identify a significant threat: attackers can disguise malicious attacks as environmental elements, injecting active disturbances into the agents' execution processes to manipulate their decision-making. We define this novel threat as the Active Environment Injection Attack (AEIA). Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision-making; and (2) Reasoning gap vulnerabilities in the agent's task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM-based agents. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, which achieves a maximum attack success rate of 93% on the AndroidWorld benchmark when the two vulnerabilities are combined.

Abstract:
Text-to-Image (T2I) models have recently gained significant attention due to their ability to generate high-quality images and are consequently used in a wide range of applications. However, there are concerns about the gender bias of these models. Previous studies have shown that T2I models can perpetuate or even amplify gender stereotypes when provided with neutral text prompts (e.g., 'a photo of a CEO' is often associated with male images, while 'a photo of a nurse' is often associated with female images). Researchers have proposed automated detectors for uncovering gender bias in T2I models, but a crucial gap exists: no existing work comprehensively compares the various detectors and examines how the gender bias they detect deviates from the actual situation. This study addresses this gap by validating previous gender bias detectors using a manually labeled dataset and comparing how the bias identified by various detectors deviates from the actual bias in T2I models, as verified by manual confirmation. We create a dataset consisting of 6,000 images generated from three cutting-edge T2I models: Stable Diffusion XL, Stable Diffusion 3, and Dreamlike Photoreal 2.0. During the human-labeling process, we find that all three T2I models generate a portion (12.48% on average) of low-quality images (e.g., images with no face present), for which human annotators cannot determine the gender of the person. Our analysis reveals that all three T2I models show a preference for generating male images, with SDXL being the most biased. Additionally, images generated using prompts containing professional descriptions (e.g., lawyer or doctor) show the most bias. We evaluate seven gender bias detectors and find that none fully capture the actual level of bias in T2I models, with some detectors overestimating bias by up to 26.95%. We further investigate the causes of inaccurate estimations, highlighting the limitations of detectors in dealing with low-quality images. Based on our findings, we propose an enhanced detector called CLIP-Enhance, which most accurately measures the gender bias in T2I models, with a difference of only 0.47%-1.23%, and most effectively filters out low-quality images, removing 82.91% of them. We have made our dataset and code publicly available.

Abstract:
This paper explores the application of enhancement filtering techniques in neural video compression. Specifically, we categorize these techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement based on whether the enhanced representation affects the subsequent coding loop. In-loop contextual filtering refines the temporal context by mitigating error propagation during frame-by-frame encoding. However, its influence on both the current and subsequent frames poses challenges in adaptively applying filtering throughout the sequence. To address this, we introduce an adaptive coding decision strategy that dynamically determines filtering application during encoding. Additionally, out-of-loop reconstruction enhancement is employed to refine the quality of reconstructed frames, providing a simple yet effective improvement in coding efficiency. To the best of our knowledge, this work presents the first systematic study of enhancement filtering in the context of conditional-based neural video compression. Extensive experiments demonstrate a 7.71% reduction in bit rate compared to state-of-the-art neural video codecs, validating the effectiveness of the proposed approach.

Abstract:
Generative Artificial Intelligence (GAI) has experienced exponential growth in recent years, partly facilitated by the abundance of large-scale open-source datasets. These datasets are often built using unrestricted and opaque data collection practices. While most literature focuses on the development and applications of GAI models, the ethical and legal considerations surrounding the creation of these datasets are often neglected. In addition, as datasets are shared, edited, and further reproduced online, information about their origin, legitimacy, and safety often gets lost. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework designed to evaluate dataset compliance with critical transparency, accountability, and security principles. We also release an open-source Python library built around data provenance technology to implement this framework, allowing for seamless integration into existing dataset-processing and AI training pipelines across multiple data modalities, including images, video, audio, and 3D assets. The library is simultaneously reactive and proactive, as in addition to evaluating the CRS of existing datasets, it equally informs responsible scraping and construction of new datasets.

Abstract:
Commonaiverse is an interactive installation exploring human emotions through full-body motion tracking and real-time AI feedback. Participants engage in three phases: Teaching, Exploration and the Cosmos Phase, collaboratively expressing and interpreting emotions with the system. The installation integrates MoveNet for precise motion tracking and a multi-recommender AI system to analyze emotional states dynamically, responding with adaptive audiovisual outputs. By shifting from top-down emotion classification to participant-driven, culturally diverse definitions, we highlight new pathways for inclusive, ethical affective computing. We discuss how this collaborative, out-of-the-box approach pushes multimedia research beyond single-user facial analysis toward a more embodied, co-created paradigm of emotional AI. Furthermore, we reflect on how this reimagined framework fosters user agency, reduces bias, and opens avenues for advanced interactive applications.

Abstract:
Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned with various videos scraped from online platforms. Existing datasets, such as ActivityNet, require considerable human labor for structuring and annotation before they can be effectively utilized for tuning LVLMs. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. DreamFrame generates datasets in a movie-like manner. First, we utilize an LLM to generate structured movie plots including movie prior information (like overview and style), frame descriptions, and plot-related QA pairs, with a story expansion strategy to mitigate context length limitations. Then, to ensure visual consistency across generated frames, we design a Style Immobilization Process which maintains consistent style through an embedding learning strategy. Finally, frame descriptions and style embeddings are integrated to produce coherent keyframes. Using DreamFrame, we construct a dataset comprising approximately 1k stylized keyframe-like videos and 100k diverse QA pairs. Extensive fine-tuning experiments on various LVLM architectures demonstrate the effectiveness of the proposed dataset. Furthermore, based on the proposed dataset, we fine-tune a new LVLM named DreamFrame-7B, which significantly surpasses previous similar-sized LVLMs (+2.2 compared with VideoLLaVA-7B on MvBench) across different benchmarks.

Abstract:
Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets, where a predefined class set guides web-based image collection, our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.

Abstract:
Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks and algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD. Our demo can be found at https://sites.google.com/view/metadv-demo-video.

Abstract:
Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, comprising holder, target, aspect, opinion, sentiment, and rationale, from multi-speaker dialogues, and (2) detecting sentiment flipping, that is, identifying dynamic sentiment shifts and their underlying triggers. For Subtask-I, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

Abstract:
We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM'25 ENTRep Grand Challenge, achieving 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval, respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
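
The retrieval figures in this abstract are reported as Recall@1 and MRR. As a reminder of how these metrics are usually defined, here is a minimal sketch using the standard definitions. It assumes exactly one relevant item per query, which may differ from the challenge's official protocol, and the function names and toy data are hypothetical.

```python
def recall_at_k(ranked_lists, relevant_items, k=1):
    """Fraction of queries whose relevant item appears in the top-k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_items)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Mean of 1/rank of the relevant item (a query contributes 0 if it is missing)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_items):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)  # ranks are 1-based
    return total / len(ranked_lists)


if __name__ == "__main__":
    # Three toy queries; the relevant item "a" is ranked 1st, 2nd, and 3rd.
    ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    relevant = ["a", "a", "a"]
    print(recall_at_k(ranked, relevant, k=1))
    print(mean_reciprocal_rank(ranked, relevant))
```

Recall@1 only rewards a correct top result, whereas MRR also gives partial credit when the relevant item appears lower in the ranking, which is why the two numbers can differ for the same system.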

Abstract:
Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.
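
The idea of counting objects by finding their centers in a heatmap can be illustrated with a simple local-maximum search. The sketch below is not CounterNet itself: the function name, the 8-neighbour rule, and the 0.5 threshold are assumptions chosen for illustration, and the heatmap is a hand-written toy grid rather than network output.

```python
def count_object_centers(heatmap, threshold=0.5):
    """Count cells that are at least `threshold` and strictly greater than all 8 neighbours.

    Cells on a flat plateau (ties with a neighbour) are not counted.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            is_peak = True
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if inside and heatmap[rr][cc] >= value:
                        is_peak = False
            if is_peak:
                count += 1
    return count


if __name__ == "__main__":
    toy_heatmap = [
        [0.1, 0.2, 0.1, 0.0, 0.0],
        [0.2, 0.9, 0.2, 0.0, 0.0],
        [0.1, 0.2, 0.1, 0.3, 0.6],
        [0.0, 0.0, 0.3, 0.6, 0.8],
    ]
    # Two isolated peaks (0.9 and 0.8) exceed the default threshold.
    print(count_object_centers(toy_heatmap))
```

A COUNT-style query over a frame could then compare this number with a user-specified bound, without requiring precise bounding boxes for each object.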

Abstract:
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during MLLM reasoning. Then, a lightweight anomaly scorer and temporal localization module efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.

Abstract:
In recent years, simultaneous learning of multiple dense prediction tasks with partially annotated label data has emerged as an important research area. Previous works primarily focus on leveraging cross-task relations or conducting adversarial training for extra regularization. These approaches achieve promising performance improvements but still suffer from the lack of direct pixel-wise supervision and require extra training of heavy mapping networks. To effectively tackle this challenge, we propose a novel approach to optimize a set of compact learnable hierarchical task tokens, including global and fine-grained ones, to discover consistent pixel-wise supervision signals at both the feature and prediction levels. Specifically, the global task tokens are designed for effective cross-task feature interactions in a global context. Then, a group of fine-grained task-specific spatial tokens for each task is learned from the corresponding global task tokens. These tokens are embedded to have dense interactions with each task-specific feature map. The learned global and local fine-grained task tokens are further used to discover pseudo task-specific dense labels at different levels of granularity, and they can be utilized to directly supervise the learning of the multi-task dense prediction framework. Extensive experimental results on the challenging NYUD-v2, Cityscapes, and PASCAL Context datasets demonstrate significant improvements over existing state-of-the-art methods for partially annotated multi-task dense prediction.

Abstract:
Polymers, composed of repeating structural units called monomers, are fundamental materials with a wide range of applications in daily life and industry. Accurate property prediction for polymers is essential for their design, development, and application. However, existing modeling approaches, which typically represent polymers by their constituent monomers, struggle to capture the properties of the whole polymer, since the properties change during the polymerization process. In this study, we propose a Multimodal Infinite Polymer Sequence (MIPS) pre-training framework, which represents polymers as infinite sequences of monomers and integrates both topological and spatial information for comprehensive modeling. From the topological perspective, we generalize the message passing mechanism (MPM) and graph attention mechanism (GAM) to infinite polymer sequences. For MPM, we demonstrate that applying MPM to infinite polymer sequences is equivalent to applying MPM on the induced star-linking graph of monomers. For GAM, we propose to further replace global graph attention with localized graph attention (LGA). Moreover, we show the robustness of the "star linking" strategy through an adversarial evaluation method named Repeat and Shift Invariance Test (RSIT). Despite its robustness, the "star linking" strategy exhibits limitations when monomer side chains contain ring structures, a common characteristic of polymers, as it fails the Weisfeiler-Lehman (WL) test. To overcome this issue, we propose backbone embedding to enhance the capability of MPM and LGA on infinite polymer sequences. From the spatial perspective, we extract 3D descriptors of repeating monomers to capture spatial information. Finally, we design a cross-modal fusion mechanism to unify the topological and spatial information. Experimental validation across eight diverse polymer property prediction tasks reveals that MIPS achieves state-of-the-art performance. Ablation studies further confirm the efficacy of our infinite polymer sequence modeling approach and multimodal pre-training framework.

Abstract:
Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning-based methods heavily depends on mathematically defined loss functions. Because it is hard to define the fused image mathematically without ground truth, the performance of existing fusion methods is limited. In this paper, we propose to use natural language to express the objective of IVIF, which avoids the explicit mathematical modeling of the fusion output in current losses and makes full use of the advantages of language expression to improve fusion performance. For this purpose, we present a comprehensive language-expressed fusion objective, and encode relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space, by establishing the relationship among the embedded vectors representing the fusion objective and input image modalities. Finally, a language-driven loss is derived to make the actual IVIF aligned with the embedded language-driven fusion model via supervised training. Experiments show that our method can obtain much better fusion results than existing techniques.

Abstract:
In recent years, the rampant spread of misinformation on social media has made accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood the semantics of images, and models struggle to discern news authenticity with limited textual information. Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage LVLM's powerful semantic understanding and extensive world knowledge. For images, the generated captions provide a comprehensive understanding of image content and scenes, while for text, the retrieved evidence helps break the information silos caused by the closed and limited text and context. On the other hand, we consider inter-class differences between different emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional types and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.

Abstract:
The multimodal knowledge graph reasoning (MKGR) task aims to predict the missing facts in the incomplete MKGs by leveraging auxiliary images and descriptions of entities. Existing approaches are trained with single-target objectives, which neglect the probabilistic correlations of entity labels, especially in non-target entities. Moreover, previous studies incorporate all modalities statically or adaptively, overlooking the negative impacts of irrelevant or misleading information in the incompetent modalities. To address these issues, we introduce a novel Reinforced Multimodal Distillation framework, exploiting the Dark Side of Modalities (DSoM) from two perspectives: (1) Dark knowledge from non-target entities: We propose to train a unimodal KGR model through logit distillation to mimic the multimodal soft labels provided by pre-trained multimodal teacher models. The multimodal soft labels could provide rich supervision signals with subtle correlations among both target and non-target entities from multiple perspectives. We further decouple the logits into those of neighbor entities and non-neighbor entities, separating two types of correlations. (2) Dark side in unhelpful modalities: To exclude the adverse effects of unhelpful modalities, we introduce a reinforced teacher combination mechanism that dynamically selects the optimal set of multimodal teachers for each triple. The agent is trained to maximize the rewards, which are only assigned to the beneficial multimodal combination strategies for the student model. Comprehensive experiments demonstrate the effectiveness of the DSoM framework on 5 MKGR datasets. Codes are available at github.com/OreOZhao/DSoM.
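
As a rough illustration of the logit distillation mentioned above, the following sketch computes the standard temperature-scaled KL divergence between a teacher's soft labels and a student's predictions over candidate entities. It is generic knowledge distillation, not the DSoM objective: the decoupling into neighbor and non-neighbor entities and the reinforced teacher selection are omitted, and the function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures; the temperature value is hypothetical.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return kl * temperature ** 2
```

The soft labels carry probability mass on non-target entities, which is the "dark knowledge" the student is asked to mimic.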

Abstract:
Recent advances in molecular science have been propelled significantly by large language models (LLMs). However, their effectiveness is limited when relying solely on molecular sequences, which fail to capture the complex structures of molecules. Beyond sequence representation, molecules exhibit two complementary structural views: the first focuses on the topological relationships between atoms, as exemplified by the graph view; and the second emphasizes the spatial configuration of molecules, as represented by the image view. The two types of views provide unique insights into molecular structures. To leverage these views collaboratively, we propose the CROss-view Prefixes (CROP) to enhance LLMs' molecular understanding through efficient multi-view integration. CROP possesses two advantages: (i) efficiency: by jointly resampling multiple structural views into fixed-length prefixes, it avoids excessive consumption of the LLM's limited context length and allows easy expansion to more views; (ii) effectiveness: by utilizing the LLM's self-encoded molecular sequences to guide the resampling process, it boosts the quality of the generated prefixes. Specifically, our framework features a carefully designed SMILES Guided Resampler for view resampling, and a Structural Embedding Gate for converting the resulting embeddings into LLM's prefixes. Extensive experiments demonstrate the superiority of CROP in tasks including molecule captioning, IUPAC name prediction and molecule property prediction.

Abstract:
Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. Interestingly, we find that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question, whereas such presuppositions appear naive to human reasoning. Besides, we also propose a simple yet effective method, Active Deduction (AD), a novel reinforcement learning paradigm to encourage the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning abilities without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and private MLLMs on MARS-Bench, along with experimental analyses of the AD method.

Abstract:
We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. Traditionally, solving geometry problems and generating diagrams have been treated as separate tasks in machine learning, with no models successfully integrating both to support problem creation. However, we believe that mastery in geometry requires frictionless integration of all of these skills, from solving problems to visualizing geometric relationships, and finally crafting tailored problems. Our extensive experiments demonstrate that GeoUni, with only 1.5B parameters, achieves performance comparable to larger models such as DeepSeek-R1 with 671B parameters in geometric reasoning tasks. GeoUni also excels in generating precise geometric diagrams, surpassing both text-to-image models and unified models, including the GPT-4o image generation. Most importantly, GeoUni is the only model capable of successfully generating textual problems with matching diagrams based on specific knowledge points, thus offering a wider range of capabilities that extend beyond current models.

Abstract:
Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, that is, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in video understanding. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos. The benchmark is available at MESH-Benchmark.

Abstract:
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.

Abstract:
Personalized product search (PPS) aims to retrieve products relevant to the given query considering user preferences within their purchase histories. Since large language models (LLMs) exhibit impressive potential in content understanding and reasoning, current methods leverage LLMs to comprehend the complicated relationships among user, query and product to improve the search performance of PPS. Despite the progress, LLM-based PPS solutions take only textual contents into consideration, neglecting multimodal contents, which play a critical role in product search. Motivated by this, we propose a novel framework, HMPPS, for Harnessing Multimodal large language models (MLLMs) to deal with Personalized Product Search based on multimodal contents. Nevertheless, the redundancy and noise in PPS input pose a great challenge to applying MLLMs to PPS, since they not only mislead the MLLM into generating inaccurate search results but also increase its computation expense. To deal with this problem, we additionally design two query-aware refinement modules for HMPPS: 1) a perspective-guided summarization module that generates refined product descriptions around core perspectives relevant to the search query, reducing noise and redundancy within textual contents; and 2) a two-stage training paradigm that introduces the search query for user history filtering based on multimodal representations, capturing precise user preferences and decreasing the inference cost. Extensive experiments are conducted on four public datasets to demonstrate the effectiveness of HMPPS. Furthermore, HMPPS is deployed on an online search system with billion-level daily active users and achieves an evident gain in A/B testing.

Abstract:
Sequential Recommendation (SR) focuses on personalizing user experiences by predicting future preferences based on historical interactions. Transformer models, with their attention mechanisms, have become the dominant architecture in SR tasks due to their ability to capture dependencies in user behavior sequences. However, traditional attention mechanisms, where attention weights are computed through query-key transformations, are inherently linear and deterministic. This fixed approach limits their ability to account for the dynamic and non-linear nature of user preferences, leading to challenges in capturing evolving interests and subtle behavioral patterns. Given that generative models excel at capturing non-linearity and probabilistic variability, we argue that generating attention distributions offers a more flexible and expressive alternative compared to traditional attention mechanisms. To support this claim, we present a theoretical proof demonstrating that generative attention mechanisms offer greater expressiveness and stochasticity than traditional deterministic approaches. Building upon this theoretical foundation, we introduce two generative attention models for SR, grounded in the principles of Variational Autoencoders (VAEs) and Diffusion Models (DMs), respectively. These models are designed specifically to generate adaptive attention distributions that better align with variable user preferences. Extensive experiments on real-world datasets show that our models significantly outperform state-of-the-art methods in both accuracy and diversity.

Abstract:
Anomaly detection in graph-structured data is an inherently challenging problem, as it requires the identification of rare nodes that deviate from the majority in both their structural and behavioral characteristics. Existing methods, such as those based on graph convolutional networks (GCNs), often suffer from over-smoothing, which causes the learned node representations to become indistinguishable. Furthermore, graph reconstruction-based approaches are vulnerable to anomalous node interference during the reconstruction process, leading to inaccurate anomaly detection. In this work, we propose a novel and holistic anomaly evaluation framework that integrates three key components: a local-global Transformer encoder, a memory-guided reconstruction mechanism and a multi-scale representation matching strategy. These components work synergistically to enhance the model's ability to capture both local and global structural dependencies, suppress the influence of anomalous nodes, and assess anomalies from multiple levels of granularity. Anomaly scores are computed by combining reconstruction errors and memory matching signals, resulting in a more robust evaluation. Extensive experiments on seven benchmark datasets demonstrate that our method outperforms existing state-of-the-art approaches, offering a comprehensive and generalizable solution for anomaly detection across various graph domains.

Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities in novel view synthesis. However, rendering reflective objects remains a significant challenge, particularly in inverse rendering and relighting. We introduce RTR-GS, a novel inverse rendering framework capable of robustly rendering objects with arbitrary reflectance properties, decomposing BRDF and lighting, and delivering credible relighting results. Given a collection of multi-view images, our method effectively recovers geometric structure through a hybrid rendering model that combines forward rendering for radiance transfer with deferred rendering for reflections. This approach successfully separates high-frequency and low-frequency appearances, mitigating floating artifacts caused by spherical harmonic overfitting when handling high-frequency details. We further refine BRDF and lighting decomposition using an additional physically-based deferred rendering branch. Experimental results show that our method enhances novel view synthesis, normal estimation, decomposition, and relighting while maintaining efficient training and inference.

Abstract:
As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, the large number of parameters leads to high training resource demands and low inference efficiency. To address this issue, we propose PUMA, a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning, an efficient approach to enhancing unified retrieval capabilities from both structure and learning perspectives: 1) From the perspective of model structure, we analyze and propose a Layer-Pruned Self-Distillation approach. It structurally prunes the model by preserving only the shallow layers, substantially reducing the parameters of the MLLM. To mitigate representational degradation, we apply self-distillation, reusing the feature from the original model as the teacher signal to guide the retrieval embedding token from the pruned model, enabling effective capacity retention in a compact model. 2) From the perspective of model learning, to mitigate representation degradation caused by rapid convergence during multimodal contrastive learning, we propose a Modality-Adaptive Contrastive Learning Loss (MAC-Loss). It adaptively separates in-batch negative candidates into harder intra-modality and simpler inter-modality groups based on each query's target modality. Assigning each group a temperature coefficient with different strategies allows queries to adaptively focus on harder in-batch negatives, reducing the resource demands of multimodal contrastive learning during UMR training. Experiments demonstrate that our approach improves efficiency from both perspectives, significantly reducing resource consumption while maintaining most of the performance.
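
The grouping of in-batch negatives by modality can be sketched as a contrastive loss in which each negative's logit is scaled by a group-specific temperature. This is only an illustration of the idea, not the MAC-Loss definition: the abstract does not give the temperature strategies, so the fixed values below, the function name `modality_adaptive_loss`, and the string modality tags are all assumptions.

```python
import math


def modality_adaptive_loss(
    pos_sim,
    neg_sims,
    neg_modalities,
    target_modality,
    tau_pos=0.05,
    tau_intra=0.03,   # hypothetical: sharper temperature for harder intra-modality negatives
    tau_inter=0.07,   # hypothetical: softer temperature for simpler inter-modality negatives
):
    """InfoNCE-style loss for one query with per-group temperatures on the negatives."""
    logits = [pos_sim / tau_pos]
    for sim, modality in zip(neg_sims, neg_modalities):
        tau = tau_intra if modality == target_modality else tau_inter
        logits.append(sim / tau)
    # log-sum-exp with max subtraction for numerical stability
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]
```

With all three temperatures equal, the function reduces to ordinary InfoNCE; lowering `tau_intra` increases the weight of negatives that share the target modality, which is the behavior the abstract describes in words.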

Abstract:
Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers state-of-the-art performance with strong generalization and efficiency, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task while maintaining top or second-best scores across all evaluated tasks.

Abstract:
The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online learning, social, and work activities, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This poses challenges for improving content consumption using existing visual-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly enhance the consumption and engagement of information in an increasingly video-driven digital landscape.

Abstract:
Text logo design heavily relies on the creativity and expertise of professional designers, in which arranging element layouts is one of the most important procedures. However, this specific task has received limited attention, often overshadowed by broader layout generation tasks such as document or poster design. In this paper, we propose a Vision-Language Model (VLM)-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user-defined constraints, enabling more flexible and robust layout generation for real-world applications. We introduce two model techniques that reduce the computational cost for processing multiple glyph images simultaneously, without compromising performance. To support instruction tuning of our model, we construct two extensive text logo datasets that are five times larger than existing public datasets. In addition to geometric annotations (e.g., text masks and character recognition), our datasets include detailed layout descriptions in natural language, enabling the model to reason more effectively in handling complex designs and custom user inputs. Experimental results demonstrate the effectiveness of our proposed framework and datasets, outperforming existing methods on various benchmarks that assess geometric aesthetics and human preferences.

Abstract:
Learning natural and diverse behaviors from human motion datasets remains a significant challenge in physics-based character control. Existing conditional adversarial models often suffer from tight and biased embedding distributions where embeddings from the same motion are closely grouped in a small area, and shorter motions occupy even less space. Our empirical observations indicate this limits the representational capacity and diversity under each skill. An ideal latent space should be maximally packed by all motions' embedding clusters. Although methods that employ separate embedding spaces for each motion mitigate this limitation to some extent, introducing a hybrid discrete-continuous embedding space imposes a huge exploration burden on the high-level policy. To address the above limitations, we propose a versatile skill-conditioned controller that learns diverse skills with expressive variations. Our approach leverages the Neural Collapse phenomenon, a natural outcome of the classification-based encoder, to uniformly distribute cluster centers. We additionally propose a novel Embedding Expansion technique to form stylistic embedding clusters for diverse skills that are uniformly distributed on a hypersphere, maximizing the representational area occupied by each skill and minimizing unmapped regions. This maximally packed and uniformly distributed embedding space ensures that embeddings within the same cluster generate behaviors conforming to the characteristics of the corresponding motion clips, yet exhibiting noticeable variations within each cluster. Compared to existing methods, experimental results demonstrate that our controller not only generates high-quality, diverse motions covering the entire dataset but also achieves superior controllability, motion coverage, and diversity under each skill. Both qualitative and quantitative results confirm these traits, enabling our controller to be applied to a wide range of downstream tasks and serving as a cornerstone for diverse applications.

Abstract:
Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, the decoupling of head pose and facial expression allows for independent control of each, offering precise manipulation of both the avatar's pose and facial expressions. We also demonstrate its flexibility in integrating with existing 3D head generation methods, bridging the gap between 3D model-based approaches and end-to-end diffusion techniques. Extensive experiments show that our method outperforms recent audio-driven portrait video models in both naturalness and controllability.

Abstract:
As the third generation of neural networks, spiking neural networks (SNNs) have recently gained widespread attention for their biological plausibility, energy efficiency, and effectiveness in processing neuromorphic datasets. To better emulate biological neurons, various models such as Integrate-and-Fire (IF) and Leaky Integrate-and-Fire (LIF) have been widely adopted in SNNs. However, these neuron models overlook the refractory period, a fundamental characteristic of biological neurons. Research on excitable neurons reveals that after firing, neurons enter a refractory period during which they are temporarily unresponsive to subsequent stimuli. This mechanism is critical for preventing over-excitation and mitigating interference from aberrant signals. Therefore, we propose a simple yet effective method to incorporate the refractory period into spiking LIF neurons through spike-triggered threshold dynamics, termed RPLIF. Our method ensures that each spike accurately encodes neural information, effectively preventing neuron over-excitation under continuous inputs and interference from anomalous inputs. Incorporating the refractory period into LIF neurons is seamless and computationally efficient, enhancing robustness and efficiency while yielding better performance with negligible overhead. To the best of our knowledge, RPLIF achieves state-of-the-art performance on Cifar10-DVS (82.40%) and N-Caltech101 (83.35%) with fewer timesteps and demonstrates superior performance on DVS128 Gesture (97.22%) at low latency.
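
A toy simulation may help make the notion of spike-triggered threshold dynamics concrete. The sketch below is an assumption-laden illustration rather than the RPLIF formulation: the update rule, the hard reset to zero, and the parameters `th_jump` and `th_decay` are hypothetical choices, picked so that the threshold rises after each spike and relaxes back toward its baseline.

```python
def simulate_lif(inputs, tau=2.0, v_th=1.0, th_jump=0.0, th_decay=0.5):
    """Simulate one discrete-time LIF neuron and return its 0/1 spike train.

    th_jump > 0 raises the threshold after every spike (a soft refractory
    period); th_jump = 0 recovers a plain LIF neuron with a fixed threshold.
    """
    v = 0.0          # membrane potential
    th = v_th        # current (dynamic) firing threshold
    spikes = []
    for x in inputs:
        v = v + (x - v) / tau            # leaky integration of the input current
        if v >= th:
            spikes.append(1)
            v = 0.0                      # hard reset after firing
            th = th + th_jump            # spike-triggered threshold increase
        else:
            spikes.append(0)
        th = v_th + (th - v_th) * th_decay   # threshold relaxes toward baseline
    return spikes
```

Under a constant strong input, calling `simulate_lif(inputs)` fires at every step, whereas `simulate_lif(inputs, th_jump=2.0)` skips the step right after each spike, which mirrors the over-excitation argument made in the abstract.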

Abstract:
Large language models have been extended to the speech domain, leading to the development of speech large language models (SLLMs). While existing SLLMs demonstrate strong performance in speech instruction-following for core languages (e.g., English), they often struggle with non-core languages due to the scarcity of paired speech-text data and limited multilingual semantic reasoning capabilities. To address this, we propose the semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework, which integrates speech-to-text translation into the reasoning process of SLLMs. The XS-CoT generates four types of tokens: instruction and response tokens in both core and non-core languages, enabling cross-lingual transfer of reasoning capabilities. To mitigate inference latency in generating target non-core response tokens, we incorporate a semi-implicit CoT scheme into XS-CoT, which progressively compresses the first three types of intermediate reasoning tokens while retaining global reasoning logic during training. By leveraging the robust reasoning capabilities of the core language, XS-CoT improves responses for non-core languages by up to 45% in GPT-4 score when compared to direct supervised fine-tuning on two representative SLLMs, Qwen2-Audio and SALMONN. Moreover, the semi-implicit XS-CoT reduces token delay by more than 50% with a slight drop in GPT-4 scores. Importantly, XS-CoT requires only a small amount of high-quality training data for non-core languages by leveraging the reasoning capabilities of core languages. To support training, we also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.

Abstract:
Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: 1) The Temporal Attention-guided Adaptive Fusion (TAAF) module that dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; 2) The temporal adaptive balanced fusion loss that modulates learning rates per modality based on the above attention scores, preventing dominant modalities from monopolizing optimization. The proposed framework implements adaptive fusion, especially in the temporal dimension, and alleviates the modality imbalance during multimodal learning, mimicking cortical multisensory integration principles. Evaluations on CREMA-D, AVE, and EAD datasets demonstrate state-of-the-art performance (77.55%, 70.65% and 98.65% accuracy, respectively) with energy efficiency. The system resolves temporal misalignment through learnable time-warping operations and coordinates modality convergence faster than baseline SNNs. This work establishes a new paradigm for temporally coherent multimodal learning in neuromorphic systems, bridging the gap between biological sensory processing and efficient machine intelligence.

Abstract:
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.

Abstract:
With the development of generative artificial intelligence, new forgery methods are rapidly emerging. Social platforms are flooded with vast amounts of unlabeled synthetic data and authentic data, making it increasingly challenging to distinguish real from fake. Due to the lack of labels, existing supervised detection methods struggle to effectively address the detection of unknown deepfake methods. Moreover, in open-world scenarios, the amount of unlabeled data greatly exceeds that of labeled data. Therefore, we define a new deepfake detection generalization task, which simulates an open-world scenario by focusing on how to efficiently detect large amounts of unlabeled data based on limited labeled data. To solve this task, we propose a novel Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) to improve the generalization ability of existing methods. Our approach aims to transfer deepfake detection knowledge from a small amount of labeled source domain data to large-scale unlabeled target domain data. Specifically, we introduce the Domain Distance Optimization (DDO) module to align different domain features by optimizing both inter-domain and intra-domain distances. Additionally, the Similarity-based Class Boundary Separation (SCBS) module is used to enhance the aggregation of similar samples to ensure clearer class boundaries, while an adversarial training mechanism is adopted to learn the domain-invariant features. Extensive experiments show that the proposed deepfake detection generalization enhancement training strategy excels in cross-method and cross-dataset scenarios, improving the model's generalization.

Abstract:
The rise of advanced voice deepfake technologies has raised serious concerns over user audio privacy, as malicious actors increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud and misinformation campaigns. While existing defense methods offer partial protection, they suffer from critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge and high computational and temporal costs of the encryption process. Therefore, to defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small amount of user samples. These high-malleability frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance against voice deepfake attacks, all while preserving high perceptual and intelligible audio quality. Notably, Enkidu achieves over 50-200× processing memory efficiency (requiring only 0.004 GB) and over 3-7000× runtime efficiency (real-time coefficient as low as 0.004) compared to six SOTA countermeasures. Extensive experiments across six mainstream Text-to-Speech (TTS) models and five cutting-edge Automated Speaker Verification (ASV) models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against voice deepfakes and adaptive attacks.

Abstract:
Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys hazard when combined. Multimodal implicit toxicity appears not only as formal statements on social platforms but also as prompts that can lead to toxic dialogs from Large Vision-Language Models (LVLMs). Despite the success in unimodal text or image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we comprehensively build a taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model which identifies implicit toxicity in multimodal statements, prompts and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.

Abstract:
Accurate accident anticipation remains challenging when driver cognition and dynamic road conditions are underrepresented in predictive models. In this paper, we propose CAMERA (Context-Aware Multi-modal Enhanced Risk Anticipation), a multi-modal framework integrating dashcam video, textual annotations, and driver attention maps for robust accident anticipation. Unlike existing methods that rely on static or environment-centric thresholds, CAMERA employs an adaptive mechanism guided by scene complexity and gaze entropy, reducing false alarms while maintaining high recall in dynamic, multi-agent traffic scenarios. A hierarchical fusion pipeline with Bi-GRU (Bidirectional GRU) captures spatio-temporal dependencies, while a Geo-Context Vision-Language module translates 3D spatial relationships into interpretable, human-centric alerts. Evaluations on the DADA-2000 and benchmarks show that CAMERA achieves state-of-the-art performance, improving accuracy and lead time. These results demonstrate the effectiveness of modeling driver attention, contextual description, and adaptive risk thresholds to enable more reliable accident anticipation.

Abstract:
Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/.

Abstract:
As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.

Abstract:
Advances in Generative AI have made video-level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training in multimodal video-level deepfake detection, which leverages our self-built dataset of 1.81M samples, thereby leading to a unified two-stage framework. To be specific, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under the local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. Moreover, we propose the pseudo supervised signal injection strategy to further boost model performance. Extensive experiments across expert models and MLLMs demonstrate the effectiveness of our proposed HOLA. We also conduct a series of ablation studies to explore the crucial design factors of our introduced components. Remarkably, our HOLA ranks 1st, outperforming the second by 0.0476 AUC on the TestA set.

Abstract:
Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.

Abstract:
Social Media Popularity Prediction is a complex multimodal task that requires effective integration of images, text, and structured information. However, current approaches suffer from inadequate visual-textual alignment and fail to capture the inherent cross-content correlations and hierarchical patterns in social media data. To overcome these limitations, we establish a multi-class framework, introducing hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. Furthermore, we propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms, achieving precise multimodal representation through fine-grained category modeling. Experimental results demonstrate state-of-the-art performance on benchmark metrics, establishing new reference standards for multimodal social media analysis.

Abstract:
The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.

Abstract:
This paper introduces OmniGSE, a novel general speech enhancement (GSE) framework designed to mitigate the diverse distortions that speech signals encounter in real-world scenarios. These distortions include background noise, reverberation, bandwidth limitations, signal clipping, and network packet loss. Existing methods typically focus on optimizing for a single type of distortion, often struggling to effectively handle the simultaneous presence of multiple distortions in complex scenarios. OmniGSE bridges this gap by integrating the strengths of discriminative and generative approaches through a two-stage architecture that enables cross-domain collaborative optimization. In the first stage, continuous features are enhanced using a lightweight channel-split NAC-RoFormer. In the second stage, discrete tokens are generated to reconstruct high-quality speech through language models. Specifically, we designed a hierarchical language model structure consisting of a RootLM and multiple BranchLMs. The RootLM models general acoustic features across codebook layers, while the BranchLMs explicitly capture the progressive relationships between different codebook levels. Experimental results demonstrate that OmniGSE surpasses existing models across multiple benchmarks, particularly excelling in scenarios involving compound distortions. These findings underscore the framework's potential for robust and versatile speech enhancement in real-world applications.

Abstract:
Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.

Abstract:
Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model's understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs. The dataset is publicly available at: https://huggingface.co/datasets/mashi6n/nhkrecipe-100-anno-1.

Abstract:
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Notably, SpatialReasoner is not limited to a specific 3D neural representation; it serves as a framework adaptable to various representations, such as Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability. Project Homepage: ZhenyangLiu.github.io/SpatialReasoner.

Abstract:
Dynamic scene reconstruction from monocular video is essential for real-world applications. We introduce DGNS, a hybrid framework integrating Deformable Gaussian Splatting and Dynamic Neural Surfaces, effectively addressing dynamic novel-view synthesis and 3D geometry reconstruction simultaneously. During training, depth maps generated by the deformable Gaussian splatting module guide the ray sampling for faster processing and provide depth supervision within the dynamic neural surface module to improve geometry reconstruction. Conversely, the dynamic neural surface directs the distribution of Gaussian primitives around the surface, enhancing rendering quality. In addition, we propose a depth-filtering approach to further refine depth supervision. Extensive experiments conducted on public datasets demonstrate that DGNS achieves state-of-the-art performance in 3D reconstruction, along with competitive results in novel-view synthesis.

Abstract:
Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding local topology of patches. In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. Our main objective is to capture topological information in local patches of an input image into a vectorized form. Specifically, we first compute persistence diagrams (PDs) of the patches, and then vectorize and arrange these PDs into long vectors for pixels of the patches. The resulting multi-channel image-form representation is called a TopoImage. TopoImages offers a new perspective for data analysis. To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement. Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.

Abstract:
Time series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure differences in time series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and a time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
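
The first two SATL components described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the loss weights, and the optional `keep` parameter (which truncates the spectrum to its lowest-frequency bins as one possible way to limit noise) are assumptions made for illustration, and the perceptual feature loss is omitted because it depends on pre-trained networks not specified here.

```python
import numpy as np


def first_order_difference_loss(pred, target):
    """MSE between the first-order differences of two series (time is the last axis)."""
    d_pred = np.diff(pred, axis=-1)
    d_target = np.diff(target, axis=-1)
    return float(np.mean((d_pred - d_target) ** 2))


def frequency_domain_loss(pred, target, keep=None):
    """Mean squared magnitude of the difference between the real-FFT spectra.

    `keep` is a hypothetical knob: if given, only the `keep` lowest-frequency
    bins are compared, which is an assumed stand-in for noise reduction.
    """
    spec_pred = np.fft.rfft(pred, axis=-1)
    spec_target = np.fft.rfft(target, axis=-1)
    if keep is not None:
        spec_pred = spec_pred[..., :keep]
        spec_target = spec_target[..., :keep]
    return float(np.mean(np.abs(spec_pred - spec_target) ** 2))


def shape_aware_loss(pred, target, w_diff=1.0, w_freq=1.0, keep=None):
    """Weighted sum of the two terms; the weights are illustrative, not from the paper."""
    return (w_diff * first_order_difference_loss(pred, target)
            + w_freq * frequency_domain_loss(pred, target, keep=keep))
```

In this sketch, a forecast that differs from the target only by a constant offset has zero first-order difference loss, because differencing removes the offset, while the frequency term still penalizes it through the zero-frequency bin.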

Abstract:
The natural combination of intricate topological structures and rich textual information in text-attributed graphs (TAGs) opens up a novel perspective for graph anomaly detection (GAD). However, existing GAD methods primarily focus on designing complex optimization objectives within the graph domain, overlooking the complementary value of the textual modality, whose features are often encoded by shallow embedding techniques, such as bag-of-words or skip-gram, so that semantic context related to anomalies may be missed. To unleash the enormous potential of textual modality, large language models (LLMs) have emerged as promising alternatives due to their strong semantic understanding and reasoning capabilities. Nevertheless, their application to TAG anomaly detection remains nascent, and they struggle to encode high-order structural information inherent in graphs due to input length constraints. For high-quality anomaly detection in TAGs, we propose CoLL, a novel framework that combines LLMs and graph neural networks (GNNs) to leverage their complementary strengths. CoLL employs multi-LLM collaboration for evidence-augmented generation to capture anomaly-relevant contexts while delivering human-readable rationales for detected anomalies. Moreover, CoLL integrates a GNN equipped with a gating mechanism to adaptively fuse textual features with evidence while preserving high-order topological information. Extensive experiments demonstrate the superiority of CoLL, achieving an average improvement of 13.37% in AP. This study opens a new avenue for incorporating LLMs in advancing GAD.

Abstract:
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
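
The reassignment idea above (retrieve several candidate images per caption, then keep the candidate that best retrieves the original caption) can be sketched as follows. This is a simplified illustration, not SynC's actual scorer: it assumes precomputed caption and image embeddings in a shared space, uses cosine similarity for both retrieval directions, and scores each candidate by the rank of the original caption in image-to-text retrieval. The function name `reassign_captions` and the parameter `k` are hypothetical.

```python
import numpy as np


def reassign_captions(caption_emb, image_emb, k=3):
    """Return, for each caption, the index of the image selected from the pool.

    caption_emb: array of shape (num_captions, dim)
    image_emb:   array of shape (num_images, dim)
    """
    # L2-normalize so that dot products are cosine similarities.
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = c @ v.T  # sim[i, j] = similarity of caption i and image j

    assignment = []
    for ci in range(sim.shape[0]):
        # Text-to-image retrieval: top-k candidate images, most similar first.
        candidates = np.argsort(-sim[ci])[:k]
        # Image-to-text check: how many captions does each candidate image
        # prefer over the original caption? Zero means it retrieves it first.
        ranks = [int(np.sum(sim[:, img] > sim[ci, img])) for img in candidates]
        # Lowest rank wins; ties fall to the candidate with higher similarity,
        # because np.argmin returns the first minimum in candidate order.
        assignment.append(int(candidates[int(np.argmin(ranks))]))
    return assignment
```

With `k=1` the sketch reduces to plain nearest-image retrieval; larger `k` lets the image-to-text check override the top text-to-image match when another candidate retrieves the caption better.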

Abstract:
Most existing weakly supervised group activity recognition (GAR) methods are bottom-up, automatically mining key areas through an attention mechanism. Due to the lack of a semantic connection to individual actions, some regions associated with these actions may be omitted, potentially impacting performance. In fact, a group activity is a combination of multiple individual actions, and the prototype of a specific action can be obtained from visual representations of individuals performing it, denoted as visual conceptual knowledge. In this paper, we propose a Visual Conceptual Knowledge Guided Action Map framework. It uses prototypes to produce individual action maps that indicate the likelihood of actions occurring at different locations. In some scenarios, the spatial distribution of actions shows strong regularity, which we compile as A-A Maps to enhance individual action maps. The action maps are integrated with action semantic representations for group activity recognition. Extensive experiments on two public benchmarks, the Volleyball and the NBA datasets, demonstrate the effectiveness of our proposed method, even in cases of limited training data.

Abstract:
Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.

Abstract:
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.

Abstract:
Since a building's floorplans are easily accessible, consistent over time, and inherently robust to changes in visual appearance, self-localization within the floorplan has attracted researchers' interest. However, since floorplans are minimalist representations of a building's structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. While existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance, they fail to address the localization errors caused by frequent visual changes and view occlusions due to variously shaped 3D objects. To tackle these issues, this paper views the 2D Floorplan Localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm. For the 3D geometric prior modeling, we first model geometrically aware view invariance using multi-view constraints, i.e., leveraging imaging geometric principles to provide matching constraints between multiple images that see the same points. Then, we further model the view-scene aligned geometric priors, enhancing the cross-modal geometry-color correspondences by associating the scene's surface reconstruction with the RGB frames of the sequence. Both 3D priors are modeled through self-supervised contrastive learning, thus no additional geometric or semantic annotations are required. These 3D priors summarized in extensive realistic scenes bridge the modal gap while improving localization success without increasing the computational burden on the FLoc algorithm. Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy.

Abstract:
With the continuous progress of autonomous vehicle (AV) technologies, the resulting accidents have generated public concern, underscoring the need for comprehensive, reliable, and authoritative testing to enhance safety. X-in-the-loop (XiL) testing has emerged as a promising paradigm to bridge the gap between simulation and real-world deployment. This study addresses the gap in supporting solid-state LiDAR sensors, thereby bolstering the authority and credibility of XiL testing. Based on the operating principles of actual LiDAR, the proposed high-fidelity model simulates unique mechanisms of solid-state LiDAR to produce point clouds that closely correspond to real-world data. Additionally, the model accounts for the impact of various weather conditions on laser propagation, enhancing the coverage and reliability of testing scenarios. Extensive experiments have been conducted to validate the proposed LiDAR model from multiple levels, including scanning pattern, point cloud acquisition, and XiL testing, with real world data as the benchmark. Results confirm that the model replicates real LiDAR behavior and supports AV functions (mapping, localization, perception), offering a novel sensor model extension for controlled and repeatable XiL testing.

Abstract:
Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments conducted on cancer subtyping, cancer recognition, and mutation prediction tasks demonstrate the effectiveness of the proposed DRE-SLCL method.

Abstract:
Early screening for Alzheimer's Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. MoTAS leverages Text-to-Speech (TTS) augmentation to increase data volume and employs a Mixture of Experts (MoE) mechanism to improve multimodal feature selection, jointly enhancing model generalization. The process begins with automatic speech recognition (ASR) to obtain accurate transcriptions. TTS is then used to synthesize speech that enriches the dataset. After extracting acoustic and text embeddings, the MoE mechanism dynamically selects the most informative features, optimizing feature fusion for improved classification. Evaluated on the ADReSSo dataset, MoTAS achieves a leading accuracy of 85.71%, outperforming existing baselines. Ablation studies further validate the individual contributions of TTS augmentation and MoE in boosting classification performance. These findings highlight the practical value of MoTAS in real-world AD screening scenarios, particularly in data-limited settings.
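
The feature-selection step described above can be illustrated with a generic Mixture-of-Experts fusion. This is a minimal sketch, not the MoTAS implementation: the top-k routing rule, the function names (`softmax`, `moe_fuse`), and the toy expert outputs are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(expert_outputs, gate_scores, top_k=2):
    """Fuse expert outputs using the top_k gate weights, renormalized to sum to 1.

    expert_outputs: one feature vector per expert (hypothetical acoustic/text experts).
    gate_scores: one unnormalized gate score per expert.
    """
    weights = softmax(gate_scores)
    chosen = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in chosen)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for i in chosen:
        for d in range(dim):
            fused[d] += (weights[i] / norm) * expert_outputs[i][d]
    return fused, chosen


# Toy example: the third expert receives a very low gate score and is dropped.
fused, chosen = moe_fuse(
    expert_outputs=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    gate_scores=[2.0, 2.0, -5.0],
    top_k=2,
)
```

Renormalizing over the selected experts keeps the fused vector on the same scale regardless of how many experts are dropped.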

Abstract:
The Federated Domain Generalization for Person re-identification (FedDG-ReID) aims to learn a global server model that can be effectively generalized to source and target domains through distributed source domain data. Existing methods mainly improve the diversity of samples through style transformation, which to some extent enhances the generalization performance of the model. However, we discover that not all styles contribute to the generalization performance. Therefore, we define styles that are beneficial/harmful to the model's generalization performance as positive/negative styles. Based on this, a new issue arises: how to effectively screen and continuously utilize the positive styles. To solve this problem, we propose a Style Screening and Continuous Utilization (SSCU) framework. Firstly, we design a Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client model to screen and accumulate generated positive styles. Specifically, the memory maintains a prototype initialized from raw data for each category, then screens positive styles that enhance the global model during training, and updates these positive styles into the memory using a momentum-based approach. Meanwhile, we propose a style memory recognition loss to fully leverage the positive styles memorized by GGDSM. Furthermore, we propose a Collaborative Style Training (CST) strategy to make full use of positive styles. Unlike traditional learning strategies, our approach leverages both newly generated styles and the accumulated positive styles stored in memory to train client models on two distinct branches. This training strategy is designed to effectively promote the rapid acquisition of new styles by the client models, ensuring that they can quickly adapt to and integrate novel stylistic variations. Simultaneously, this strategy guarantees the continuous and thorough utilization of positive styles, which is highly beneficial for the model's generalization performance. Extensive experimental results demonstrate that our method outperforms existing methods in both the source domain and the target domain.
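
The momentum-based memory update mentioned above can be sketched as an exponential moving average that is applied only to styles passing a screening test. This is a hedged illustration, not the GGDSM code: the screening rule (`gain > 0`), the momentum value, and the names `momentum_update` and `screen_and_update` are assumptions.

```python
def momentum_update(prototype, style, momentum=0.9):
    """Blend a newly screened style vector into a stored prototype (EMA)."""
    return [momentum * p + (1.0 - momentum) * s for p, s in zip(prototype, style)]


def screen_and_update(memory, label, style, gain, momentum=0.9):
    """Update memory[label] only if the style's generalization gain is positive.

    `gain` stands in for whatever measure of improvement on the global model
    is used for screening; a positive value is treated as a "positive style".
    """
    if gain > 0:  # assumed screening rule
        memory[label] = momentum_update(memory[label], style, momentum)
    return memory


# One prototype per category, initialized from raw data (toy values here).
memory = {"id_0": [1.0, 0.0]}
screen_and_update(memory, "id_0", [0.0, 1.0], gain=0.02)   # accepted
screen_and_update(memory, "id_0", [5.0, 5.0], gain=-0.01)  # rejected, memory unchanged
```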

Abstract:
Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.

Abstract:
Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. However, the geometric ambiguity inherent in a single 2D image and the scarcity of 3D human training data are the main obstacles limiting progress in this field. To address these issues, current methods employ prior geometric estimation networks to derive various human geometric forms, such as the SMPL model and normal maps. However, they struggle to integrate these modalities effectively, leading to view inconsistencies, such as facial distortions. To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. To further facilitate geometry learning, we introduce a Supervisor Feature Regularization module. By employing a multi-view network with the same structure to provide intermediate features as training supervision, these varied geometric priors can be better fused. To tackle data scarcity and further improve reconstruction quality, we also propose an Online Animation Augmentation module. By building a one-feed-forward animation network, we augment a massive number of samples from the original 3D human data online for model training. Extensive experiments on two benchmarks show the superiority of our approach compared to state-of-the-art methods.

Abstract:
Large vision-language models (LVLMs) are more vulnerable to harmful input than their language-only backbones. We investigate this vulnerability by exploring LVLMs' internal dynamics, framing their inherent safety understanding in terms of three key capabilities. Specifically, we define these capabilities as safety perception, semantic understanding, and alignment for linguistic expression, and experimentally pinpoint their primary locations within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, leading to a reduction in safety. Motivated by these findings, we propose Self-Aware Safety Augmentation (SASA), a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model's inherent semantic understanding to enhance safety recognition without fine-tuning. Then, we employ linear probing to articulate the model's internal semantic comprehension and detect risk before the generation process. Extensive experiments on various datasets and tasks demonstrate that SASA significantly improves the safety of LVLMs, with minimal impact on utility.

Abstract:
The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs, i.e., jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). During the iterative process, the attacker identifies security vulnerabilities in the defense model and expands jailbreak data. The expanded data are then used to train the defender, enabling it to address identified security vulnerabilities. We also design reward mechanisms used for GRPO to simplify the use of response labels, reducing dependence on complex generative labels and enabling the efficient use of synthetic data. Additionally, a quality monitoring mechanism is used to mitigate the defender's over-refusal of harmless inputs and ensure the diversity of the jailbreak data generated by the attacker. Experimental results on safety-specific and general benchmarks demonstrate that SecTOW significantly improves security while preserving general performance. Warning: This paper contains offensive and unsafe content.

Abstract:
With the rapid advancement of AIGC technologies, audio deepfakes have become increasingly realistic, posing serious threats to information security and biometric authentication. Therefore, audio deepfake detection (ADD) has emerged as a critical and fast-evolving research area, particularly requiring superior generalization in out-of-domain scenarios. However, existing ADD methods suffer from constrained generalization and limited access to target data. To address these challenges, we propose Risk-Aware Style Alignment (RASA), a novel generalizable ADD framework that projects the style of any input feature into a shared style space through similarity-based projection. This alignment reduces both inter-domain and intra-source discrepancies without requiring target data during training. In addition, we adopt Structural Empirical Risk Minimization (SERM) in the Poincaré ball model to capture the hierarchical structure of the data and further minimize source risk. By jointly optimizing RASA and SERM, the proposed method effectively tightens the theoretical upper bound of target risk across three key dimensions: source risk, inter-domain divergence, and intra-source discrepancy. Extensive experiments demonstrate that our approach achieves superior generalization and outperforms existing state-of-the-art methods.
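
For readers unfamiliar with the Poincaré ball model mentioned above, its standard geodesic distance is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below only evaluates this textbook formula; how SERM uses it inside the training objective is not specified in the abstract, and the function name `poincare_distance` is an assumption.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap (0.1) is a much longer hyperbolic distance near the boundary,
# which is what lets the ball embed tree-like hierarchies with low distortion.
near_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```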

Abstract:
The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face-specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

Abstract:
With the development of lightweight deep learning algorithms, various deep neural network (DNN) models have been proposed for the remote sensing scene classification (RSSC) application. However, it is still challenging for these RSSC models to achieve an optimal balance among model accuracy, inference latency, and energy consumption on resource-constrained edge devices. In this paper, we propose a lightweight RSSC framework, which includes a distilled global filter network (GFNet) model and an early-exit mechanism designed for edge devices to achieve state-of-the-art performance. Specifically, we first apply frequency domain distillation on the GFNet model to reduce model size. Then we design a dynamic early-exit model tailored for DNN models on edge devices to further improve model inference efficiency. We evaluate our E3C model on three edge devices across four datasets. Extensive experimental results show that it achieves an average of 1.3x speedup on model inference and over 40% improvement on energy efficiency, while maintaining high classification accuracy.
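
An early-exit mechanism of the kind described above is commonly implemented as a confidence threshold on intermediate classifier heads. The following is a minimal sketch under that assumption; the threshold rule, the function names, and the precomputed logits are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(logits_per_exit, threshold=0.9):
    """Return (exit_index, predicted_class) for the first sufficiently confident exit.

    logits_per_exit holds one logit vector per exit head, shallowest first.
    They are precomputed here for simplicity; a real model would stop running
    deeper layers as soon as an exit fires, which is where the savings come from.
    """
    if not logits_per_exit:
        raise ValueError("at least one exit is required")
    last = len(logits_per_exit) - 1
    for idx, logits in enumerate(logits_per_exit):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold or idx == last:
            return idx, best


easy = early_exit_predict([[0.2, 0.1, 0.0], [4.0, 0.0, 0.0], [9.0, 0.0, 0.0]])
hard = early_exit_predict([[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]])
```

The threshold trades accuracy for latency and energy: a lower value lets more inputs leave at shallow exits.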

Abstract:
Recent breakthroughs in generative artificial intelligence (AI) are transforming multimedia communication. This paper systematically reviews key recent advancements across generative AI for multimedia communication, emphasizing transformative models like diffusion and transformers. However, conventional information-theoretic frameworks fail to address semantic fidelity, critical to human perception. We propose an innovative semantic information-theoretic framework, introducing semantic entropy, mutual information, channel capacity, and rate-distortion concepts specifically adapted to multimedia applications. This framework redefines multimedia communication from purely syntactic data transmission to semantic information conveyance. We further highlight future opportunities and critical research directions. We chart a path toward robust, efficient, and semantically meaningful multimedia communication systems by bridging generative AI innovations with information theory. This exploratory paper aims to inspire a semantic-first paradigm shift, offering a fresh perspective with significant implications for future multimedia research.

Abstract:
Concept Bottleneck Models (CBMs) enhance the interpretability of AI systems, particularly by bridging visual input with human-understandable concepts, effectively acting as a form of multimodal interpretability model. However, existing CBMs typically assume static datasets, which fundamentally limits their adaptability to real-world, continuously evolving multimodal data streams. To address this, we define a novel continual learning task for CBMs: simultaneously handling concept-incremental and class-incremental learning. This task requires models to continuously acquire new concepts (often representing cross-modal attributes) and classes while robustly preserving previously learned knowledge. To tackle this challenging problem, we propose CONceptual Continual Incremental Learning (CONCIL), a novel framework that fundamentally re-imagines concept and decision layer updates as linear regression problems. This reformulation eliminates the need for gradient-based optimization, thereby effectively preventing catastrophic forgetting. Crucially, CONCIL relies solely on recursive matrix operations, rendering it highly computationally efficient and well-suited for real-time and large-scale multimodal data applications. Experimental results compellingly demonstrate that CONCIL achieves "absolute knowledge memory" and significantly surpasses the performance of traditional CBM methods in both concept- and class-incremental settings, thus establishing a new paradigm for continual learning in CBMs, particularly valuable for dynamic multimodal understanding.
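
The idea of replacing gradient descent with recursive matrix operations can be illustrated with textbook recursive least squares (RLS). This is a generic sketch, not the CONCIL update rule: the single-output setting, the ridge value, and the names `rls_init` and `rls_update` are assumptions. Its key property is that processing data in phases yields the same weights as solving the ridge regression on all data at once, so earlier samples are not forgotten.

```python
def rls_init(dim, ridge=1e-6):
    """Start with zero weights and P = (ridge * I)^-1."""
    P = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    return [0.0] * dim, P


def rls_update(w, P, x, y):
    """Absorb one sample (x, y) with a Sherman-Morrison rank-one update."""
    dim = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
    gain = [v / denom for v in Px]
    err = y - sum(w[i] * x[i] for i in range(dim))
    w = [w[i] + gain[i] * err for i in range(dim)]
    P = [[P[i][j] - gain[i] * Px[j] for j in range(dim)] for i in range(dim)]
    return w, P


# Toy target y = 2*x0 - 1*x1, delivered in two incremental phases.
phase_1 = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
phase_2 = [([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]

w, P = rls_init(dim=2)
for x, y in phase_1 + phase_2:
    w, P = rls_update(w, P, x, y)
```

No past samples are stored between phases; only `w` and the matrix `P` are carried forward.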

Abstract:
Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the Lifelog Search Challenge (LSC) dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at https://bit.ly/lsc-adl-annotations.

Abstract:
The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

Abstract:
The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves an AUC of 92.78% on the deepfake classification task and an IoU of 0.3536 on temporal localization using only the audio modality.
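
The IoU figure reported above refers to overlap between predicted and ground-truth manipulated segments. A minimal sketch of temporal IoU for a single pair of segments follows; the challenge's official metric may aggregate over segments and thresholds differently, and the function name `temporal_iou` is an assumption.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two time intervals given as (start, end) in seconds."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0


overlap = temporal_iou((1.0, 3.0), (2.0, 4.0))   # 1 s shared out of 3 s covered
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))  # no shared time
```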

Abstract:
Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification tasks, such as in Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.

Abstract:
The proliferation of multimedia content on social media has transformed how information is produced and consumed, enabling real-time coverage of global events but also accelerating the spread of misinformation, particularly during crises such as wars, natural disasters, and elections. The rise of synthetic media and the reuse of authentic content in misleading contexts further underscore the need for robust verification tools. In this paper, we present a comprehensive system developed for the ACM Multimedia 2025 Grand Challenge on Multimedia Verification. Our system evaluates both the authenticity and contextual accuracy of multimedia content in multilingual settings, producing expert-oriented verification reports alongside accessible summaries for the public. We propose a unified verification pipeline that integrates visual forensics, textual analysis, and multimodal reasoning, with a hybrid approach to detecting out-of-context (OOC) media via semantic similarity, temporal alignment, and geolocation cues. Extensive evaluations on the challenge benchmark demonstrate the system's effectiveness across diverse real-world scenarios. Our contributions advance the state of the art in multimedia verification while providing practical tools for journalists, fact-checkers, and researchers addressing information integrity in the digital age.

Abstract:
Predicting online video popularity faces a critical challenge: prediction drift, where models trained on historical data rapidly degrade due to evolving viral trends and user behaviors. To address this temporal distribution shift, we propose an Anchored Multi-modal Clustering and Feature Generation (AMCFG) framework that discovers temporally-invariant patterns across data distributions. Our approach employs multi-modal clustering to reveal content structure, then leverages Large Language Models (LLMs) to generate semantic Anchor Features (high-level concepts such as audience demographics, content themes, and engagement patterns) that transcend superficial trend variations. These semantic anchors, combined with cluster-derived statistical features, enable prediction based on stable principles rather than ephemeral signals. Experiments demonstrate that AMCFG significantly enhances both predictive accuracy and temporal robustness, achieving superior performance on out-of-distribution data and providing a viable solution for real-world video popularity prediction.

Abstract:
2D images and 3D point clouds are foundational data types for multimedia applications, including real-time video analysis, augmented reality (AR), and 3D scene understanding. Class-incremental semantic segmentation (CSS) requires incrementally learning new semantic categories while retaining prior knowledge. Existing methods typically rely on computationally expensive training based on stochastic gradient descent, employing complex regularization or exemplar replay. However, stochastic gradient descent-based approaches inevitably update the model's weights for past knowledge, leading to catastrophic forgetting, a problem exacerbated by pixel/point-level granularity. To address these challenges, we propose CFSSeg, a novel exemplar-free approach that leverages a closed-form solution, offering a practical and theoretically grounded solution for continual semantic segmentation tasks. This eliminates the need for iterative gradient-based optimization and storage of past data, requiring only a single pass through new samples per step. It not only enhances computational efficiency but also provides a practical solution for dynamic, privacy-sensitive multimedia environments. Extensive experiments on 2D and 3D benchmark datasets such as Pascal VOC2012, S3DIS, and ScanNet demonstrate CFSSeg's superior performance.

Abstract:
For skeleton-based action recognition, Graph Convolutional Networks (GCNs) are effective models. Still, their reliance on floating-point computations leads to high energy consumption, limiting their applicability in battery-powered devices. While energy-efficient, Spiking Neural Networks (SNNs) struggle to model skeleton dynamics, leading to suboptimal solutions. We propose Signal-SGN (Spiking Graph Convolutional Network), which utilizes the temporal dimension of skeleton sequences as the spike time steps and represents features as multi-dimensional discrete stochastic signals for temporal-frequency domain feature extraction. It combines the 1D Spiking Graph Convolution (1D-SGC) module and the Frequency Spiking Convolution (FSC) module to extract features from the skeleton represented in spiking form. Additionally, the Multi-Scale Wavelet Transform Feature Fusion (MWTF) module is proposed to extract dynamic spiking features and capture frequency-specific characteristics, enhancing classification performance. Experiments across three large-scale datasets show that Signal-SGN exceeds state-of-the-art SNN-based methods in accuracy and computational efficiency while attaining performance comparable to GCN methods and significantly reducing theoretical energy consumption.

Abstract:
Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high-level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine-grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.

Abstract:
Sparse Mixture of Experts (sMoE) has become a pivotal approach for scaling large vision-language models, offering substantial capacity while maintaining computational efficiency through dynamic, sparse activation of experts. However, existing routing mechanisms, typically based on similarity scoring, struggle to effectively capture the underlying input structure. This limitation leads to a trade-off between expert specialization and balanced computation, hindering both scalability and performance. We propose Input Domain Aware MoE, a novel routing framework that leverages a probabilistic mixture model to better partition the input space. By modeling routing probabilities as a mixture of distributions, our method enables experts to develop clear specialization boundaries while achieving balanced utilization. Unlike conventional approaches, our routing mechanism is trained independently of task-specific objectives, allowing for stable optimization and decisive expert assignments. Empirical results on vision-language tasks demonstrate that our method consistently outperforms existing sMoE approaches, achieving higher task performance and improved expert utilization balance.

Abstract:
Multi-task model merging offers an efficient solution for integrating knowledge from multiple fine-tuned models, mitigating the significant computational and storage demands associated with multi-task training. As a key technique in this field, Task Arithmetic (TA) defines task vectors by subtracting the pre-trained model (θ_pre) from the fine-tuned task models in parameter space, then adjusting the weight between these task vectors and θ_pre to balance task-generalized and task-specific knowledge. Despite the promising performance of TA, conflicts can arise among the task vectors, particularly when different tasks require distinct model adaptations. In this paper, we formally define this issue as knowledge conflicts, characterized by the performance degradation of one task after merging with a model fine-tuned for another task. Through in-depth analysis, we show that these conflicts stem primarily from the components of task vectors that align with the gradient of task-specific losses at θ_pre. To address this, we propose Task Arithmetic in Trust Region (TATR), which defines the trust region as dimensions in the model parameter space that cause only small changes (corresponding to the task vector components with gradient orthogonal direction) in the task-specific losses. Restricting parameter merging within this trust region, TATR can effectively alleviate knowledge conflicts. Moreover, TATR serves as a plug-and-play module compatible with a wide range of TA-based methods. Extensive empirical evaluations on visual and visual-language tasks robustly demonstrate that TATR improves the multi-task performance of several TA-based model merging methods.
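
The task-vector arithmetic and the trust-region restriction described above can be sketched on toy parameter lists. This is a hedged illustration, not the authors' code: the conflict score used here (the absolute product of one task's gradient and another task's vector, summed over task pairs) is an assumed stand-in for the paper's criterion, and the names `task_vector`, `conflict_scores`, `trust_region_mask`, `merge`, `lam`, and `num_keep` are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def conflict_scores(task_vectors, grads):
    """Assumed per-dimension score: how strongly task j's update moves
    along task i's loss gradient at the pre-trained weights (i != j)."""
    dim = len(task_vectors[0])
    scores = [0.0] * dim
    for i, grad in enumerate(grads):
        for j, vec in enumerate(task_vectors):
            if i != j:
                for d in range(dim):
                    scores[d] += abs(grad[d] * vec[d])
    return scores


def trust_region_mask(task_vectors, grads, num_keep):
    """Keep the num_keep dimensions with the smallest conflict score."""
    scores = conflict_scores(task_vectors, grads)
    order = sorted(range(len(scores)), key=lambda d: scores[d])
    keep = set(order[:num_keep])
    return [1.0 if d in keep else 0.0 for d in range(len(scores))]


def merge(pretrained, task_vectors, lam=1.0, mask=None):
    """Task Arithmetic: pre-trained weights plus scaled (optionally masked) task vectors."""
    merged = list(pretrained)
    for vec in task_vectors:
        for d, delta in enumerate(vec):
            m = 1.0 if mask is None else mask[d]
            merged[d] += lam * m * delta
    return merged


theta_pre = [1.0, 2.0, 3.0]
tv_a = task_vector([1.5, 2.0, 3.0], theta_pre)  # task A changes dimension 0
tv_b = task_vector([1.0, 2.0, 2.0], theta_pre)  # task B changes dimension 2
grads = [[0.1, 0.0, 2.0], [0.0, 0.0, 0.1]]      # toy loss gradients at theta_pre

plain = merge(theta_pre, [tv_a, tv_b])
mask = trust_region_mask([tv_a, tv_b], grads, num_keep=2)
masked = merge(theta_pre, [tv_a, tv_b], mask=mask)
print(plain, mask, masked)
```

In this toy case, task B's update on dimension 2 lies along a direction where task A's loss has a large gradient, so that dimension falls outside the trust region and is left at its pre-trained value.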

Abstract:
Older adults tend to encounter challenges when learning to use new smartphone apps due to age-related cognitive and physical changes. Compared to traditional support methods such as video tutorials, trial-and-error allows older adults to learn to use smartphone apps by making and correcting mistakes. However, it remains unknown how trial-and-error should be designed to empower older adults to use smartphone apps and how well it would work for older adults. Informed by the guidelines derived from prior work, we designed and implemented ExplorAR, an AR-based trial-and-error system that offers real-time and situated visual guidance in the augmented space around the smartphone to empower older adults to explore and correct mistakes independently. We conducted a user study with 18 older adults to compare ExplorAR with traditional video tutorials and a simplified version of ExplorAR. Results show that the AR-supported trial-and-error method enhanced older adults' learning experience by fostering deeper cognitive engagement and improving confidence in exploring unknown operations.

Abstract:
Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance, with an Object-level AUROC of 79.9% and 79.5% and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.

Abstract:
We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods which could lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM) which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust clinical application.

Abstract:
Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotations. However, previous methods have been limited to generating single-instance images, as the generation of multiple instances with Stable Diffusion has proven unstable and masks can be significantly affected by occlusion between different objects. To overcome this limitation and broaden the variety of synthetic datasets, we propose a novel framework, Free-Mask. It combines a Diffusion Model for segmentation with advanced image editing capabilities, allowing the insertion of multiple objects into images through text-to-image models. In addition, we introduce a new active learning paradigm that benefits both model generalization and data optimization. Our method enables the creation of realistic datasets that closely reflect open-world environments while generating accurate segmentation masks. Our code is released on GitHub.

Abstract:
The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This hinders their usability in real-world decision-making contexts, especially for non-expert users. We present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned LLM to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems for media forensics.

Abstract:
Multimedia systems underpin modern digital interactions, facilitating seamless integration and optimization of resources across diverse multimedia applications. To meet growing personalization demands, multimedia systems must efficiently manage competing resource needs, adaptive content, and user-specific data handling. This paper introduces Generative Flow Networks (GFlowNets, GFNs) as a brave new framework for enabling personalized multimedia systems. By integrating multi-candidate generative modeling with flow-based principles, GFlowNets offer a scalable and flexible solution for enhancing user-specific multimedia experiences. To illustrate the effectiveness of GFlowNets, we focus on short video feeds, a multimedia application characterized by high personalization demands and significant resource constraints, as a case study. Our proposed GFlowNet-based personalized feeds algorithm demonstrates superior performance compared to traditional rule-based and reinforcement learning methods across critical metrics, including video quality, resource utilization efficiency, and delivery cost. Moreover, we propose a unified GFlowNet-based framework generalizable to other multimedia systems, highlighting its adaptability and wide-ranging applicability. These findings underscore the potential of GFlowNets to advance personalized multimedia systems by addressing complex optimization challenges and supporting sophisticated multimedia application scenarios.

Abstract:
To enable robots to comprehend high-level human instructions and perform complex tasks, a key challenge lies in achieving comprehensive scene understanding: interpreting and interacting with the 3D environment in a meaningful way. This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability by linking multimodal object embeddings, and supporting object-level retrieval of geometric, visual, and semantic information. The retrieved data are then loaded into a robotic task planner for downstream execution.

Abstract:
Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and to provide insights into marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding, analysis, and generation. Additionally, we highlight the effectiveness of video splitting for detecting salient object transitions in scene changes, which significantly enriches the semantics of the captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

Abstract:
Progress in remote PhotoPlethysmoGraphy (rPPG) is limited by the critical issues of existing publicly available datasets: small size, privacy concerns with facial videos, and lack of diversity in conditions. The paper introduces a novel, comprehensive, large-scale multi-view video dataset for rPPG and health biomarkers estimation. Our dataset comprises 3600 synchronized video recordings from 600 subjects, captured under varied conditions (resting and post-exercise) using multiple consumer-grade cameras at different angles. To enable multimodal analysis of physiological states, each recording is paired with a 100 Hz PPG signal and extended health metrics, such as electrocardiogram, arterial blood pressure, biomarkers, temperature, oxygen saturation, respiratory rate, and stress level. Using this data, we train an efficient rPPG model and compare its quality with existing approaches in cross-dataset scenarios. The public release of our dataset and model should significantly speed up the progress in the development of AI medical assistants.

Abstract:
Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues that can be exploited to handle viewpoint variations, and capture finer and meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains an impressive mean rank-1 accuracy on the challenging datasets.

Abstract:
Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and object concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on object concepts. (3) A Cognition-Inspired Mask Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2, 28.2, 15.4, 59.2, 18.7, and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free image segmentation, offering enhanced flexibility in recognizing unseen categories.

Abstract:
Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. This enables the visual encoder to robustly recognize a broader range of image features even under adversarial noise, thereby enhancing robustness across diverse downstream tasks. QT-AFT overcomes the key weaknesses of prior methods (overfitting in supervised AT and lack of semantic awareness in unsupervised AT), achieving state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets. Furthermore, our comprehensive study uncovers several key insights into the role of language in enhancing vision robustness; for example, describing object properties in addition to object names further enhances zero-shot robustness. Our findings point to an urgent direction for future work: centering high-quality linguistic supervision in robust visual representation learning.

Abstract:
Our research reveals a new privacy risk associated with the vision language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.

Abstract:
Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

Abstract:
The efficiency and scalability of graph convolution networks (GCNs) in training recommender systems remain critical challenges, hindering their practical deployment in real-world scenarios. In the multimodal recommendation (MMRec) field, training GCNs requires more expensive time and space costs and exacerbates the gap between different modalities, resulting in sub-optimal recommendation accuracy. This paper critically points out the inherent challenges associated with adopting GCNs during the training phase in MMRec, revealing that GCNs inevitably create unhelpful and even harmful pairs during model optimization and isolate different modalities. To this end, we propose FastMMRec, a highly efficient multimodal recommendation framework that deploys graph convolutions exclusively during the testing phase, bypassing their use in training. We demonstrate that adopting GCNs solely in the testing phase significantly improves the model's efficiency and scalability while alleviating the modality isolation problem often caused by using GCNs during the training phase. We conduct extensive experiments on three public datasets, consistently demonstrating the performance superiority of FastMMRec over competitive baselines while achieving efficiency and scalability.

Abstract:
Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and improves overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.

Abstract:
Moving target selection in multimedia interactive systems faces unprecedented challenges as users increasingly interact across diverse, dynamic contexts, from live streaming in moving vehicles to VR gaming in varying environments. Existing approaches rely on probabilistic models that relate endpoint distribution to target properties (size, speed). However, these methods require substantial training data for each new context and lack transferability across scenarios, limiting their practical deployment in diverse multimedia environments where rich multimodal contextual information is readily available. This paper introduces MAGNeT (Multimodal Adaptive Gaussian Networks), which addresses these problems by combining classical statistical modeling with a context-aware multimodal method. MAGNeT dynamically fuses pre-fitted Ternary-Gaussian models from various scenarios based on real-time contextual cues, enabling effective adaptation with minimal training data while preserving model interpretability. We conduct experiments on self-constructed 2D and 3D moving target selection datasets under in-vehicle vibration conditions. Extensive experiments demonstrate that MAGNeT achieves lower error rates with few-shot samples by applying context-aware fusion of Gaussian experts from multi-factor conditions.
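
Context-aware fusion of pre-fitted Gaussian experts can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the paper: each expert is a plain 1-D Gaussian rather than a Ternary-Gaussian model, the fusion weights are a softmax over negative Euclidean distance between context features, and the names `context_weights` and `fuse_gaussians` are hypothetical.

```python
import math


def context_weights(context, expert_contexts):
    """Softmax over negative distance: experts fitted in similar contexts weigh more."""
    exps = [math.exp(-math.dist(context, c)) for c in expert_contexts]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_gaussians(weights, experts):
    """Moment-match a weighted mixture of (mean, std) experts to one Gaussian.

    Returns the mixture mean and variance.
    """
    mean = sum(w * mu for w, (mu, _) in zip(weights, experts))
    second_moment = sum(w * (sd ** 2 + mu ** 2) for w, (mu, sd) in zip(weights, experts))
    return mean, second_moment - mean ** 2


# Two hypothetical experts fitted under different vibration levels (context feature).
experts = [(0.0, 1.0), (2.0, 1.0)]   # (endpoint mean, endpoint std)
expert_contexts = [[0.0], [1.0]]     # context in which each expert was fitted
weights = context_weights([0.5], expert_contexts)
fused_mean, fused_var = fuse_gaussians(weights, experts)
print(weights, fused_mean, fused_var)
```

Because the experts are fitted in advance and only the weights depend on the current context, adapting to a new scenario needs little or no new data, and every fused parameter can still be traced back to an interpretable expert.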

Abstract:
Multimodal Large Language Models (MLLMs) demonstrate exceptional performance in cross-modality interaction, yet they also suffer from adversarial vulnerabilities. In particular, the transferability of adversarial examples remains an ongoing challenge. In this paper, we specifically analyze the manifestation of adversarial transferability among MLLMs and identify the key factors that influence this characteristic. We discover that transferability among MLLMs exists in cross-LLM scenarios with the same vision encoder, and we identify two key factors that may influence it. We provide two semantic-level data augmentation methods, Adding Image Patch (AIP) and Typography Augment Transferability Method (TATM), which boost the transferability of adversarial examples across MLLMs. To explore the potential impact in the real world, we utilize two tasks that can have both negative and positive societal impacts: (1) Harmful Content Insertion and (2) Information Protection.

Abstract:
Phone scams remain a pervasive threat to both personal safety and financial security worldwide. Recent advances in large language models (LLMs) have demonstrated strong potential in detecting fraudulent behavior by analyzing transcribed phone conversations. However, these capabilities introduce notable privacy risks, as such conversations frequently contain sensitive personal information that may be exposed to third-party service providers during processing. In this work, we explore how to harness LLMs for phone scam detection while preserving user privacy. We propose MASK (Modular Adaptive Sanitization Kit), a trainable and extensible framework that enables dynamic privacy adjustment based on individual preferences. MASK provides a pluggable architecture that accommodates diverse sanitization methods, from traditional keyword-based techniques for high-privacy users to sophisticated neural approaches for those prioritizing accuracy. We also discuss potential modeling approaches and loss function designs for future development, enabling the creation of truly personalized, privacy-aware LLM-based detection systems that balance user trust and detection effectiveness, even beyond the phone scam context.

Abstract:
Driven by the rapid progress in vision-language models (VLMs), the responsible behavior of large-scale multimodal models has become a prominent research area, particularly focusing on hallucination detection and factuality checking. In this paper, we present our solution for the two tracks of the Responsible AI challenge. Inspirations from the general domain demonstrate that a smaller distilled VLM can often outperform a larger VLM that is directly tuned on the downstream tasks, while achieving higher efficiency. We thus jointly tackle the two tasks from the perspective of knowledge distillation and propose a progressive hybrid knowledge distillation framework termed HKD4VLM. Specifically, the overall framework can be decomposed into Pyramid-like Progressive Online Distillation and Ternary-Coupled Refinement Distillation, hierarchically moving from coarse-grained knowledge alignment to fine-grained refinement. In addition, we introduce mapping shift-enhanced inference and diverse augmentation strategies to enhance model performance and robustness. Extensive experimental results demonstrate the effectiveness of our HKD4VLM. Ablation studies provide insights into the critical design choices driving performance gains.

Abstract:
Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large annotated datasets and often act as "black boxes". However, clinical datasets are frequently small or lack labels, limiting the effectiveness of these methods. Here, we introduce REMEMBER (Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning), a machine learning framework that enables zero- and few-shot Alzheimer's diagnosis from brain MRI scans via reference-based reasoning. Specifically, REMEMBER first contrastively trains a vision-text model on expert-annotated reference data, using pseudo-text modalities to encode abnormality types, diagnosis labels, and composite clinical descriptions. At inference time, it retrieves similar, human-validated cases from a curated dataset and integrates their contextual information via an evidence encoder and attention-based inference head. This evidence-guided design allows REMEMBER to mimic clinical decision-making by grounding predictions in retrieved imaging and textual context. It outputs diagnostic predictions with an interpretable report, including reference images and clinically aligned explanations. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework for neuroimaging-based diagnosis in the real world, especially under limited data.

Abstract:
Aligning large vision-language models (LVLMs) with human preferences is challenging due to the scarcity of fine-grained, high-quality, and multimodal preference data without human annotations. Existing methods relying on direct distillation often struggle with low-confidence data, leading to suboptimal performance. To address this, we propose CaReVL, a novel method for preference reward modeling that reliably uses both high- and low-confidence data. First, a cluster of auxiliary expert models (textual reward models) leverages image captions as weak supervision signals to filter high-confidence data. The high-confidence data are then used to fine-tune the LVLM. Second, low-confidence data are used to generate diverse preference samples using the fine-tuned LVLM. These samples are then scored and selected to construct reliable chosen-rejected pairs for further training. CaReVL achieves performance improvements over traditional distillation-based methods on the VL-RewardBench and MLLM-as-a-Judge benchmarks, demonstrating its effectiveness.
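
The final step, turning scored samples into chosen-rejected pairs, can be sketched as below. This is only an illustration under assumptions: the abstract does not say how samples are scored or selected, so the rule used here (best versus worst candidate, discarded unless the score gap reaches `min_margin`) and the name `build_pair` are hypothetical.

```python
def build_pair(candidates, min_margin=0.2):
    """Build one chosen-rejected pair from scored responses to a single prompt.

    candidates: list of (response_text, score) tuples; the scores are assumed
                to come from some reward model and are not defined by the source.
    Returns a dict with 'chosen' and 'rejected', or None if the pair is unreliable.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    # Skip pairs whose scores are too close to trust the preference direction.
    if chosen[1] - rejected[1] < min_margin:
        return None
    return {"chosen": chosen[0], "rejected": rejected[0]}


pair = build_pair([("answer A", 0.9), ("answer B", 0.4), ("answer C", 0.1)])
too_close = build_pair([("answer A", 0.50), ("answer B", 0.45)])
print(pair, too_close)
```

The margin filter is one simple way to keep noisy, low-confidence comparisons out of the training set.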

Abstract:
Offering diverse perspectives on a museum artifact can deepen visitors' understanding and help avoid the cognitive limitations of a single narrative, ultimately enhancing their overall experience. Physical museums promote diversity through visitor interactions. However, it remains a challenge to present multiple voices appropriately while attracting and sustaining a visitor's attention in the virtual museum. Inspired by recent studies that show the effectiveness of LLM-powered multi-agents in presenting different opinions about an event, we propose SimViews, an interactive multi-agent system that simulates visitor-to-visitor conversational patterns to promote the presentation of diverse perspectives. The system employs LLM-powered multi-agents that simulate virtual visitors with different professional identities, providing diverse interpretations of artifacts. Additionally, we constructed 4 conversational patterns between users and agents to simulate visitor interactions. We conducted a within-subject study with 20 participants, comparing SimViews to a traditional single-agent condition. Our results show that SimViews effectively facilitates the presentation of diverse perspectives through conversations, enhancing participants' understanding of viewpoints and engagement within the virtual museum.

Abstract:
Humans are innately social creatures with a need to interact with others. The advent of AI has profoundly advanced technologies for human interaction. It has not only opened the doors to seamless human-to-machine interaction but has also helped to significantly enhance human-to-human digital interaction, bringing people together in more engaging and immersive ways. This talk will trace a decade-long progression of the development of cutting-edge AI-mediated technologies for human-to-machine and human-to-human interaction. There are two fundamental aspects to solving this problem -- human understanding and human synthesis. This talk will seek to answer the question of how far along we've come in solving both, along with what the underlying AI building blocks are that have helped us get here. It will further highlight the successful application areas of AI-mediated human interaction technologies that are positively impacting people's lives. Lastly, it will conclude with thoughts on what is left on the table in terms of research and innovation, and speculate on what the next frontier will be for profound real-world impact in this space.

Abstract:
Spiking Neural Networks (SNNs) have garnered significant attention due to their biological plausibility and low power consumption. While spiking transformers enhance performance by combining SNNs with transformer architecture, most rely on rate coding, limiting energy efficiency. Temporal coding methods, such as Time-To-First-Spike (TTFS) coding, offer a more efficient alternative by encoding information based on the timing of a single spike. However, integrating TTFS with transformer architecture faces challenges due to incompatibility with batch normalization (BN) and residual connections (RC), which disrupt the precise spike firing times. In this paper, we propose temporal-coded BN (tBN) and temporal-coded RC (tRC) to address these issues. Building on tBN and tRC, we develop temporal-coded spiking attention (TSA) and temporal-coded spiking transformer (T-SpikeFormer), the first to combine TTFS coding with transformer architecture. Experimental results show our model achieves state-of-the-art performance for temporal-coded SNNs and comparable results to rate-coded SNNs while significantly reducing power consumption.
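
To illustrate the TTFS coding idea mentioned above, the following is a minimal sketch, assuming a linear intensity-to-latency mapping in which stronger inputs fire earlier. The function names and the window length `t_max` are hypothetical and are not taken from the paper; the sketch does not cover tBN, tRC, or TSA.

```python
def ttfs_encode(values, t_max=1.0):
    """Map each intensity in [0, 1] to a single spike time in [0, t_max].

    Assumption: linear latency coding, where larger values fire earlier.
    """
    times = []
    for v in values:
        v = min(max(v, 0.0), 1.0)  # clamp to the valid intensity range
        times.append(t_max * (1.0 - v))
    return times


def ttfs_decode(times, t_max=1.0):
    """Invert the linear latency mapping to recover intensities."""
    return [1.0 - t / t_max for t in times]
```

Each input is represented by the timing of exactly one spike, in contrast to rate coding, which would need many spikes per input to convey the same value.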

Abstract:
Spatial-spectral fusion offers a promising alternative to expensive equipment in high-resolution hyperspectral (HrHs) imaging. However, training separate models for different scaling factors remains costly. To address this, we propose the Arbitrary-scale Fusion Neural Operator (AFNO), a lightweight solution for HrHs fusion across arbitrary scalings. AFNO treats low-resolution hyperspectral (LrHs) and high-resolution multispectral (HrMs) images as functions rather than as entities and performs meticulously designed integrations as the mapping operator. The key components include Attention-Driven Convolution Integration (ADCI) to restore discretization invariance disrupted by convolutions, Implicit Neural Functional Integration (INFI) for cross-domain interaction of spatial degradations, and Galerkin-type Integration as a decoder for high-frequency details. Additionally, the bonded activation operator is improved in line with the principle of continuous-discrete equivalence. Extensive experiments validate the superiority of our approach over cutting-edge methods. Notably, AFNO generalizes significantly better across arbitrary scaling factors while requiring only 0.07M parameters.

Abstract:
With the rapid development of diffusion models, numerous exquisitely generated images have significantly increased the risk of image misuse and abuse. Although various AI parties and companies have devoted themselves to embedding watermarks into generated images to curb the potential harm, embedding the watermark separately from the generation process makes it vulnerable to watermark removal networks. To address this issue, we propose a novel generative image watermark scheme, dubbed Diffusion Visible Watermark (DVW), which can generate watermarked images in one step without additional training or fine-tuning of the diffusion models. Specifically, DVW introduces a masked distribution alignment strategy to fuse the watermark distribution with a Gaussian noise distribution. By iteratively denoising the fused, aligned distribution with pretrained diffusion models, watermarked images with a coordinated and unified distribution can be generated with natural robustness against removal. In addition, we design and integrate a dynamic transparency module to adaptively control the watermark coverage degree for better visual quality. Comprehensive experiments and analyses are conducted on two representative kinds of diffusion models, GLIDE and StableDiffusion, to demonstrate the superior and generic robustness of DVW against watermark removal without sacrificing the generation ability of the diffusion models.

Abstract:
Person re-identification (Re-ID) models have achieved remarkable advancements with the advent of deep learning. However, their performance often degrades in diverse scenarios, such as variations in viewing angles, lighting conditions, and environmental changes. These limitations arise from the difficulty in generalizing across multiple factors, including environments and subject appearances. Multimodal Large Language Models (MLLMs) offer a promising alternative to address these challenges by leveraging generalized knowledge, as demonstrated in biometric tasks like face and iris recognition. This study explores the Re-ID capabilities of MLLMs by comparatively evaluating six representative MLLMs on the most challenging scenarios, including angle variation, illumination differences, clothing changes, image corruption, and visually fine-grained scenarios in Re-ID. We find that GPT-4o outperforms other MLLMs in handling angle variation, illumination differences, corruption resistance, and fine-grained detail disturbances, demonstrating high accuracy and robustness in challenging Re-ID scenarios. However, further optimization is required for robustness against illumination variation, corruption handling, and fine-grained identification across all tested MLLMs. Additionally, the Re-ID performance of MLLMs can be improved by applying several prompt templates. Our research suggests potential directions for integrating MLLMs into Re-ID systems to enhance performance and robustness, underscoring their promising potential in this field.

Abstract:
Traditional Video Object Detection (VOD) is limited by pre-defined closed-set categories, restricting its ability to detect novel objects in real-world scenarios. To address this limitation, we make three key contributions. First, we formally define Open-Vocabulary Video Object Detection (Open-Vocabulary VOD) as the task of detecting objects in video streams from open-set categories, including novel categories unseen during training. Second, we establish an evaluation benchmark by utilizing existing datasets (LV-VIS, BURST, and TAO) to bridge the data gap for this new task. Third, we propose OV-VOD, an Open-Vocabulary VOD method that detects objects in videos beyond pre-defined training categories and addresses the shortcomings of image-level open-vocabulary detectors, which generally neglect the essential temporal and spatial information. Specifically, we design a Semantic-Presence Memory Tracking (SPMT) module that propagates object features across frames through a memory bank to leverage temporal consistency. Moreover, we propose a Spatial Object Relationship Distillation loss (LSR) that captures inter-object spatial dependencies and enhances knowledge transfer during feature distillation. Experiments on multiple video datasets demonstrate that our OV-VOD exhibits superior zero-shot generalization capability compared to existing image-level open-vocabulary object detection methods.

Abstract:
Cross-Modal Hashing (CMH) has gained significant attention for its ability to learn semantic category discrimination and enable efficient retrieval. However, in practical applications, the massive amounts of multi-modal data collected from the internet often contain coarse annotations, which inevitably introduce noisy labels and degrade retrieval performance. To address this challenge, this paper proposes a dynamic optimization-based training framework, namely Dynamic Optimization Noisy Cross-Modal Hashing (DONCMH). Firstly, to alleviate the issue of overfitting to noisy labels during training, we propose a novel regularization-based noise-robust strategy that updates the target distribution with momentum to optimize clustering learning, thus avoiding over-emphasizing noisy samples. Secondly, to more accurately select high-quality training samples, we introduce ClusterOT, a novel Optimal Transport formulation explicitly tailored for Noisy Cross-Modal Hashing (NCMH), which integrates center representation learning and cross-modal alignment into a unified structure. By leveraging the spatial distribution of samples, ClusterOT effectively mitigates distribution imbalances inherent in center representation learning, thereby significantly improving the model's robustness to noisy label predictions. Finally, a robust feature learning module is employed to enhance the extraction of informative and discriminative representations from both modalities. Extensive experiments conducted on four widely used benchmark datasets demonstrate that the proposed method effectively mitigates the impact of noisy labels and significantly improves cross-modal retrieval performance.
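
The momentum-based update of the target distribution mentioned above can be sketched as follows. This is a minimal illustration, assuming a plain exponential moving average followed by renormalization; the function name, the `momentum` value, and the exact update form are assumptions and are not confirmed by the abstract.

```python
def momentum_update(target, prediction, momentum=0.9):
    """Blend the previous target distribution with the current prediction.

    Assumption: exponential moving average, then renormalize to sum to 1.
    A high momentum keeps the target stable, so a single (possibly noisy)
    prediction cannot shift it much.
    """
    mixed = [momentum * t + (1.0 - momentum) * p
             for t, p in zip(target, prediction)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

For example, with a uniform two-class target and a confident prediction for the first class, one update with `momentum=0.9` moves the target only slightly toward that class.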

Abstract:
Cross-Domain Recommendation (CDR) leverages auxiliary information from different domains to enhance the target domain. However, most CDR models capture coarse-grained user preferences, as they focus on learning fixed domain-invariant embeddings that overlook semantics in user-generated texts. In this paper, we propose a novel framework, LLM-grounded diffusion for CDR (LLDCDR), which integrates LLMs and diffusion models to model fine-grained user preferences by learning multi-aspect domain-invariant representations. First, we leverage the advanced understanding abilities of LLMs to capture multi-faceted common semantics influencing user preferences across different domains. Second, we present an LLM-grounded conditional diffusion to reduce noise in domain-invariant representations by performing multi-step noise diffusion and denoising. To further disentangle multi-aspect semantics, we conceptualize domain-invariant representation learning as a conditional diffusion process, guiding the diffusion using distinct semantic aspects derived from the LLM. Finally, we encapsulate LLDCDR into a plug-in framework by modularizing the above strategies. This allows LLDCDR to be integrated into any existing CDR model.

Abstract:
Vision Transformers (ViTs) show promising potential in various multimedia application scenarios. To facilitate their deployment on resource-constrained devices, token pruning and merging have been introduced. However, existing token compression methods focus solely on the abstract features exhibited by high-dimensional tokens after patch embedding, resulting in information loss when evaluating token importance. In this paper, we propose a novel Training-Free Adaptive Token Merging (TF-ATM) method by exploring the intrinsic properties of images themselves. Our TF-ATM is inspired by the observation that the characteristics of patches can intuitively reflect the redundancy level of images. Based on this observation, we develop a method that is mathematically formulated to merge tokens corresponding to patches that are close to the Median presentation in the Frequency domain (MF). The principle behind our merging is that patches close to MF can be replaced by their surrounding ones, and thus removing them does not impair performance. Besides, we experimentally show that patches farther from MF contain more important information, which can be leveraged to capture the object of interest accurately. Without any retraining, TF-ATM leads to significant improvements over state-of-the-art (SOTA) methods with similar FLOPs (floating point operations). For example, we achieve a 44.5% FLOPs reduction with only a small loss of 0.39% in top-1 accuracy for the MAE-H model on the ImageNet dataset, superior to comparison approaches that require meticulous fine-tuning.

Abstract:
Existing action counting methods typically rely on pixel-based changes within videos, leading to high computational redundancy and low accuracy due to the limited spatial sensitivity. To address these challenges, we introduce MoCount, the first framework that leverages 3D motion representations for counting tasks. MoCount significantly reduces computational overhead and improves counting accuracy, benefiting from the simplicity of motion representation and strong spatial sensitivity. Specifically, we utilize a motion estimator to convert video subjects into 3D motion data. A motion encoder, combined with a Sparse Spatial-Temporal module, is then applied to extract robust human body representations, yielding precise counting results. Extensive experiments on the RepCount and UCFRep datasets show that MoCount achieves state-of-the-art performance, reducing inference latency by approximately 2-3 times compared to existing video counting models. These advantages position MoCount as a leading solution for real-world action counting applications.

Abstract:
The widespread adoption of Low-Rank Adaptation (LoRA) modules in parameter-efficient fine-tuning has revolutionized the deployment of large-scale deep neural networks. However, the intellectual property protection of LoRA modules remains a critical challenge. White-box watermarking is a more effective solution than black-box watermarking in the multi-bit verification scenario of protecting and tracing intellectual property. However, existing white-box watermarking methods for LoRA lack both flexible multi-bit capacity and merging robustness, leaving LoRA modules vulnerable to unauthorized use and redistribution. In this paper, we propose a novel merging-resistant watermarking method for LoRA modules. Our method embeds watermarks into the increment matrix generated during LoRA merging and decomposes the watermark-induced modifications into LoRA's standard matrices, achieving reliable watermark extraction and preserving LoRA's efficiency. Specifically, we adopt quantization index modulation to embed watermarks in the low-frequency components of selected increment matrix weights. Extensive experiments demonstrate the effectiveness, imperceptibility, and robustness of our method, making it a practical solution for safeguarding LoRA modules in real-world applications. This work responds to the limited attention given to intellectual property protection for LoRA, contributing to the secure and sustainable development of deep learning technologies.
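
Quantization index modulation (QIM), the embedding technique named above, can be illustrated on scalar values. The following is a minimal sketch, assuming a scalar quantizer with a hypothetical step size `step` applied directly to individual weights; it does not reproduce the paper's selection of low-frequency components or its decomposition back into LoRA matrices.

```python
def qim_embed(weight, bit, step=0.1):
    """Quantize `weight` onto the lattice associated with `bit` (0 or 1).

    Bit 0 uses multiples of `step`; bit 1 uses the same lattice
    shifted by step / 2.
    """
    offset = bit * step / 2.0
    return round((weight - offset) / step) * step + offset


def qim_extract(weight, step=0.1):
    """Return the bit whose lattice has the point nearest to `weight`."""
    d0 = abs(weight - qim_embed(weight, 0, step))
    d1 = abs(weight - qim_embed(weight, 1, step))
    return 0 if d0 <= d1 else 1
```

In this sketch, embedding changes each value by at most `step / 2`, and extraction recovers the bit as long as later perturbations stay below `step / 4`, which is the usual trade-off between imperceptibility and robustness in QIM.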

Abstract:
Cataract surgery is the most frequently performed surgical procedure worldwide, involving the replacement of a patient's clouded eye lens with a synthetic intraocular lens to restore visual acuity. Although typically brief, the operation consists of distinct phases that demand precision and extensive training, traditionally constrained by the limitations of real-time observation under a microscope. To enhance learning and procedural accuracy, modern advancements in stereoscopic video capture and head-mounted displays (HMDs) offer a promising solution. This paper demonstrates the application of stereoscopic cataract surgery videos, visualized through Apple Vision Pro (AVP) and Meta Quest 3, to provide immersive 3D perspectives that enhance depth perception and spatial awareness. An expert evaluation study with experienced surgeons indicates that stereoscopic visualization significantly improves comprehension of spatial relationships and procedural maneuvers, suggesting its potential to revolutionize surgical education and real-time guidance in ophthalmic surgery.

Abstract:
Recent advances in diffusion-based image editing models have demonstrated remarkable success. However, these models primarily rely on high-quality textual prompts to guide image manipulation, creating a significant barrier for non-expert users. In this demonstration, we present an exemplar-based image editing framework named Edit-by-Example, which eliminates the reliance on textual prompts and requires only a single pair of before-and-after images to encapsulate the desired editing effect, which can then be readily applied to the user-provided query image without any model fine-tuning. Technically, our framework comprises two components: an Adaptive Editing Policy Module (AEPM) and a Generation Module (GM). The AEPM jointly analyzes cross-image relationships in exemplar pairs and query image content to derive optimal editing directions, while the GM executes these policies through an off-the-shelf image editor with optional semantic alignment verification. We introduce EEdBench, a comprehensive benchmark for exemplar-based image editing containing 1,500 test cases across 15 categories. Experiments demonstrate that our framework outperforms existing prompt-free methods in editing direction accuracy (S-Visual) and fidelity (FID).

Abstract:
In multimodal fact-checking, advanced large multimodal models (LMMs) struggle to capture and integrate the complex relationships between text and images. A potential solution is to generate reasoning support text to optimize reasoning and integrate evidence. However, existing generation approaches rely heavily on high-quality data annotations for training, which are costly and limited in scalability, hindering responsiveness to evolving misinformation. To address these issues, we propose CoReS, a novel zero-shot multimodal fact-checking model based on Conceptual Reasoning Support, which leverages key concepts from evidence to guide the reasoning process and improve decision-making. This model includes a reasoning support text generation module that extracts key concepts (critical elements that significantly impact the judgment outcome) from raw textual evidence via retrieval and filtering. By using a Conceptual Reasoning LM, CoReS generates reasoning support texts framed around core key concepts that are semantically consistent with multimodal evidence, linking key clues, thus replacing redundant and complex evidence for fact-checking. The reasoning support texts generated by CoReS effectively distill complex evidence relationships and integrate important reasoning information, allowing the judgment model to provide clear and accurate judgments. Evaluations on benchmark datasets and the new multi-domain MultiVerify dataset demonstrate that CoReS excels in accuracy, generalization, and scalability.

Abstract:
The integration of Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) offers a new paradigm for complex decision-making tasks, especially through methods that optimize actor populations and critics collaboratively (e.g., DBCEM-TD3, QD-RL), collectively referred to as the Actor-Population Critic (APC) framework. However, existing methods still face two major challenges: traditional feature inputs fail to fully capture spatial relationships, and the design of the critic struggles to balance the quality and diversity of the policies. To address these issues, we propose Multimodal Dual Population Evolutionary Reinforcement Learning (M-DPERL), which achieves breakthroughs through cross-modal feature augmentation and dual-population coevolution. On one hand, a feature-image bimodal input enhancement mechanism is proposed, which dynamically encodes environmental features into spatial heatmaps. On the other hand, this method introduces a critic population into APC and, for the first time, proposes a population-guided fitness metric to optimize the critic's ability to guide the actor population in balancing quality and diversity. Additionally, we design the Flow-Fix Dynamics (FFD) mechanism to regulate the update rhythm of the dual populations and alleviate the evolutionary chaos in their coevolution. The results across a series of MuJoCo tasks demonstrate that M-DPERL significantly outperforms the baselines, with a 19.2% improvement in sample efficiency and a 17.1% increase in final performance.

Abstract:
Few-shot object detection (FSOD) is an important problem in computer vision, aiming to accurately detect objects with only a few annotated examples. Prototype learning has been widely explored in this field. Some methods extract visual prototypes from support images, but the limited sample size often leads to unrepresentative features. Others use textual prototypes generated by pre-trained vision-language models such as CLIP, which lack visual detail and may suffer from language ambiguity. Visual and textual prototypes offer complementary strengths (detail and generalization, respectively), so single-modal prototypes struggle to balance both. To address this issue, we propose MP-DETR, the first multi-modal prototype guided method for FSOD. We design an adaptive multi-modal prototype fusion module to combine visual and textual prototypes from foundation models using a gating mechanism, producing multi-modal class prototypes that retain both general semantics and visual specificity. These prototypes are then deeply integrated into the DETR detection pipeline for guiding potential region selection, enhancing corresponding object queries, and constructing a prototype similarity-based classifier enhanced by contrastive learning to improve discrimination among similar classes. By incorporating multi-modal guidance into the detection process, MP-DETR achieves better performance than existing single-modal methods. Extensive experiments on MS-COCO and Pascal-VOC show that MP-DETR achieves SOTA results in various few-shot settings, confirming its effectiveness and superiority.
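
A gating mechanism for combining a visual and a textual prototype, as described above, might look like the following. This is a minimal sketch, assuming a single scalar sigmoid gate computed from the concatenated prototypes; the function name, `weights`, and `bias` are hypothetical, and the paper's fusion module may differ (for example, by using per-dimension gates or learned projections).

```python
import math


def gated_fusion(visual, textual, weights, bias=0.0):
    """Fuse two prototype vectors with a scalar sigmoid gate.

    Assumption: gate = sigmoid(weights . [visual; textual] + bias), and
    the fused prototype is gate * visual + (1 - gate) * textual.
    """
    concat = list(visual) + list(textual)
    logit = sum(w * x for w, x in zip(weights, concat)) + bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * v + (1.0 - gate) * t for v, t in zip(visual, textual)]
    return fused, gate
```

With all-zero weights and zero bias the gate is 0.5 and the result is the plain average of the two prototypes; a large positive logit pushes the fused prototype toward the visual one.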

Abstract:
Text-based person search (TBPS), aiming to retrieve target pedestrian images with natural language descriptions, has seen significant progress in recent years. However, severe domain shift remains a key challenge in this field, causing source-domain-trained models to degrade significantly when applied to an unseen target domain. To address this, we propose the Identity-preserving Cross-modal Alignment and Adaptation (ICAA) model, a novel test-time adaptation framework for TBPS that enables seamless domain adaptation using only unlabeled target samples. Our method tackles two key challenges: 1) Cross-modal domain-shift misalignment: textual and visual modalities exhibit inconsistent distributional shifts across domains. To this end, our Cross-Modal Alignment adaptation (CMA) module identifies pseudo-positive image-text pairs and minimizes their matching discrepancies in the target domain, adapting to new cross-modal distribution relationships. 2) Identity semantic absence: crucial identity annotations are usually unavailable in both target text and image data. To mitigate this, we introduce the Identity-Preserving Dynamic adaptation (IPD) module, which dynamically associates image-text pairs with potential identity prototypes to enhance identity consistency in cross-modal alignment during adaptation. Our method is simple yet effective, establishing new state-of-the-art cross-domain results for TBPS on three public benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, and RSTPReid.

Abstract:
Continual Learning (CL) enables models to sequentially acquire new knowledge while retaining previous knowledge. However, the challenge of catastrophic forgetting arises when new tasks interfere with previously acquired knowledge. Prompt-based approaches, leveraging pre-trained models, show promise in adapting to new tasks and reducing the risk of overfitting while mitigating catastrophic forgetting. However, existing approaches operate primarily in the spatial domain, neglecting the spectral entanglement between style-biased amplitude components and semantics-preserving phase components in feature representations. In this work, we propose the Spectral-Decomposed Prompting (SDP) method, a novel prompt-based approach that dynamically generates prompts based on the current input using a spectral decomposition strategy. By employing the Fast Fourier Transform (FFT), the query feature and the token embedding are transformed and decomposed into amplitude and phase spectra. SDP suppresses style-sensitive amplitude variations via spectral normalization while adaptively modulating phase components through task-aware attention mechanisms. It minimizes the disturbance of stylistic variations and enhances semantic representation learning for prompts. Extensive experiments demonstrate that SDP significantly improves adaptability and performance in continual learning tasks, outperforming state-of-the-art methods while mitigating catastrophic forgetting.
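
The decomposition of a feature vector into amplitude and phase spectra via the FFT, as described above, can be sketched as follows. This is a minimal illustration using a small recursive radix-2 FFT written in pure Python; the helper names are hypothetical, and the sketch covers only the decomposition step, not SDP's spectral normalization or task-aware attention.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def decompose(x):
    """Split the spectrum of `x` into amplitude and phase lists."""
    spectrum = fft(x)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def recombine(amplitude, phase):
    """Rebuild the complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
```

Amplitude and phase together carry the full spectrum, so `recombine(*decompose(x))` matches `fft(x)` up to floating-point error. Scaling the input by a positive constant scales only the amplitude, which is one way to see why amplitude can be treated separately from phase.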

Abstract:
Commanding robots to do chores using natural language instructions has long been a dream. Navigation, as one of the key foundational abilities for achieving this goal, has garnered significant attention. When human users instruct an intelligent agent, the instructions they give sometimes deviate slightly from navigable ones, as the user's understanding of the scene may not be up to date due to sudden changes in the environment. This paper investigates three common scenarios where instructions and navigation scenes are imperfectly aligned: change of navigability, incorrect landmark references, and incorrect direction descriptions. We then propose the ImperfectVLN task and dataset for evaluating an agent's navigation performance when instructions and environments are imperfectly matched. Evaluation results indicate significant performance fluctuations in existing state-of-the-art models under modification scenarios including referred landmark removal and original path blockages. We also provide a series of result analyses and further insights. We aim for this new dataset to become a valuable benchmark, enhancing practical VLN tasks. We further design a reflection module based on our insights, allowing an agent to review its history and identify potential errors. Experiments show that this module improves the performance on ImperfectVLN by 4.4%.

Abstract:
Efficient and precise open-vocabulary 3D scene segmentation remains a critical challenge in computer vision. While current leading methods encode CLIP language features into 3D Gaussians to achieve high segmentation accuracy and fast inference speeds, they suffer from point ambiguity issues caused by separately training on multi-level 2D semantic masks. This approach not only compromises time and space efficiency but also degrades accuracy when selecting optimal semantic levels. To overcome these limitations, we propose Voxel-Aware Fusion Language Gaussian Splatting (VaF-LangSplat), a novel framework that jointly optimizes geometric and semantic representations. Our approach first voxelizes 3D Gaussians using sparse point clouds and lightweight MLP decoders, effectively disentangling language features from geometric attributes. This enables simultaneous training across arbitrary semantic levels with minimal overhead. Crucially, we introduce Fusion Language Splatting, which aligns geometric and multi-level semantic distributions to sharpen boundary definitions while eliminating redundant Gaussian expansions. The voxel-aware representation further enhances robustness against motion blur and lighting variations. Experiments on open-vocabulary 3D localization and segmentation tasks demonstrate that VaF-LangSplat outperforms LangSplat (the prior state-of-the-art) with significant improvements in both segmentation/localization accuracy and efficiency: 4X faster training and 15X reduced storage requirements.

Abstract:
Emotion recognition, as a core technology in mental health monitoring, has long been constrained by the intrusive nature of data collection methods relying on physiological signals and behavioral cues. Although existing motion-based approaches enable non-intrusive data acquisition, they often overlook the societal dimensions inherent in human behavior. As a result, they often exhibit a significant performance drop in real-world scenarios compared to laboratory settings. In this study, we analyzed the spatial distribution of participants' spatiotemporal trajectories and their visited Points of Interest (POIs), and observed significant differences under varying emotional states. Building on this observation, we propose a novel emotion recognition framework, SE2E, which innovatively incorporates the semantic information of POIs into the emotion recognition task. Specifically, SE2E employs a category-aware semantic embedding mechanism combined with a masked prediction task to ensure that the POI embeddings capture both categorical semantics and contextual information. It then structurally represents individual societal event patterns through a personalized spatiotemporal flow. Finally, a temporal-region consistency attention module is employed to extract continuous representations of societal events, thereby enabling a robust mapping from societal behavior to emotional state. Extensive experimental results demonstrate that SE2E outperforms state-of-the-art methods across multiple benchmarks. To the best of our knowledge, this is the first study to leverage societal events for emotion recognition, offering a new technical direction, benchmark, and insight for future research in the field.

Abstract:
Online cross-modal hashing has recently gained significant attention due to its remarkable capability to handle cross-modal streaming data retrieval. Despite promising progress, existing methods still face challenges in fully exploiting the intricate relations across heterogeneous modalities and streaming data chunks, limiting the retrieval performance. In this paper, a novel Online Cross-modal Hashing method with Multi-level Memory (OCH-MM) is proposed. OCH-MM captures the cross-modal consistency and sample semantic correlations for discrete hash learning with latent feature disentanglement, and designs a multi-level memory framework for effective knowledge transfer. Specifically, for discriminative hash learning, OCH-MM maps the multi-modal data into a latent feature space that is further disentangled into a common Hamming space and a modality-specific feature space. The semantic correlations among samples are also preserved in discrete hash codes without relaxation in a nonlinear manner. To learn effectively from streaming data, OCH-MM designs an intra-space feature association memory, an inter-space feature association memory, and a hash code memory, which encode the historical feature correlations within the original multi-modal spaces, the feature correlations between the original and latent spaces, and a subset of hash codes, respectively. By dynamically updating and utilizing the multi-level memory, the data correlations between different chunks are well explored and the historical knowledge is effectively reused to guide future learning. The proposed model is solved by an efficient discrete optimization algorithm. Experimental results on three benchmark datasets demonstrate that our proposed method achieves better retrieval accuracy than the state-of-the-art baselines.

Abstract:
The goal of generic multimodal summarization is to extract the most important information from different modalities to form summaries. Yet the importance of scenes and text in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. However, existing methods for fully automatic multimodal summarization have not exploited available language models, which can serve as an effective prior for saliency. To address this issue, we introduce Query-Focused Multimodal Summarization (QFSumm), a single framework for addressing both generic and query-focused multimodal summarization, which are typically approached separately in the literature. In addition, we propose a novel gate-guided mixture-of-experts that uses an expert gate module to organize three experts (video expert, text expert, and shared expert) to model the correlations between multimodal information. Furthermore, we propose two novel contrastive losses to represent consistency and diversity. Extensive experiments on a query-focused video summarization dataset (QFVS), two standard video summarization datasets (TVSum and SumMe), and three multimodal summarization datasets (CNN, Daily Mail, and BLiSS) demonstrate the superiority of QFSumm, achieving state-of-the-art performance on all datasets.

Abstract:
Road surface reconstruction is crucial for autonomous driving, providing accurate and up-to-date road geometry for navigation, safety assessment, and infrastructure maintenance. Camera-based methods have become increasingly cost-effective and scalable for this task. However, achieving high-quality large-scale reconstruction remains challenging due to inconsistent observations of the same road surface points. These inconsistencies arise both within single sessions (caused by factors such as vehicle shadows and exposure shifts) and across multiple sessions, where changes in lighting conditions and viewpoints further exacerbate the problem. To address these challenges, we propose MS-Road, a camera-based approach for large-scale road surface reconstruction with strong geometric consistency. MS-Road leverages self-supervised learning and multi-view consistency constraints to tackle two key issues: inconsistency in road appearance across different observations and inaccurate road height localization. By enforcing spatiotemporal consistency in both geometric and visual aspects, our method produces more reliable reconstructions within and across sessions. Experiments on two public datasets and a real-world dataset demonstrate that our approach achieves robust and high-fidelity reconstruction under diverse and challenging conditions.

Abstract:
Cloth-changing Person Re-Identification (CCReID) aims to recognize individuals across clothing variations by learning clothing-invariant representations. However, obtaining sufficient samples of the same person in diverse outfits is often impractical. While synthesizing realistic person images provides an effective solution, existing augmentation methods require labeled data and external priors (e.g., pose skeletons, semantic maps), resulting in high costs and limited generalization. To this end, we propose a Prior-Free Augmentation method for Cloth-changing person re-identification (PFAC), which leverages text guidance to synthesize images with clothing variations while maintaining identity consistency. Our approach features: (1) a truncated diffusion model that preserves clothing-invariant structural cues from intermediate noisy images, (2) a dual-branch denoising network that decouples text-guided clothing synthesis from identity consistency via cross-modal alignment, and (3) a joint optimization strategy with identity-focused losses and image filtering to enhance realism and discriminability. Experimental results on PRCC, LTCC, and Celeb-reID datasets demonstrate that PFAC achieves state-of-the-art CCReID performance, effectively generating high-fidelity, identity-consistent images for robust augmentation without external priors.

Abstract:
Recently, diffusion models have demonstrated powerful capabilities in image generation. However, the repetitive and sequential denoising process adds significant time and computational costs, limiting their application. In this paper, we propose a training-free and universal method, Horizontal-Vertical Accelerated Denoising (HVAD). It mainly utilizes the inherent temporal redundancy in the diffusion model to parallelize the time steps of the denoising process horizontally and reduces the computation of individual denoising time steps vertically. It achieves simultaneous inference acceleration from both the horizontal and vertical directions of the denoising process. Thus, significant inference acceleration is achieved without sacrificing generation quality. Experimental results based on the MS-COCO validation set show that the method achieves a practical speedup of 1.97× on Stable Diffusion v1.5, and achieves a theoretical speedup of 2.40× or even higher with guaranteed generation quality. To verify the generality of the method, better results were achieved on other versions of Stable Diffusion, such as Stable Diffusion v2.1 and Stable Diffusion XL.

Abstract:
In this paper, we propose a Dual-Constraint Diffusion Model (DCDM) to contrast aging appearance for facial age estimation, addressing the key issue of noisy labels. Existing methods for face age estimation are plagued by class imbalance and noisy supervision signals, which disrupt the ordinal relationships between age categories and hinder effective feature decoupling in existing models. To overcome these challenges, the proposed DCDM develops a label-independent Paired Comparison, ensuring accurate sample labeling and maintaining continuity in age estimation. Moreover, we incorporate a Dual-Constraint Diffusion Model to effectively separate and recombine age-related and unrelated features, thus facilitating the generation of high-fidelity and continuous age-progressed facial representations. Lastly, we optimize our model parameters by exploiting the age difference information via an active learning framework. Comparative evaluations on several in-the-wild datasets demonstrate that our DCDM achieves significantly better results than existing state-of-the-art methods in facial age estimation.

Abstract:
Text-guided inpainting models are widely used for image editing, restoration, and content generation due to their ability to produce high-fidelity results aligned with natural language prompts. However, these models remain vulnerable to jailbreaking attacks, where adversaries manipulate inputs to generate pornographic or violent content. While prior attacks rely on adversarial text prompts, they are increasingly mitigated by advanced text-based safety filters and manual review. In this work, we propose a new attack paradigm that bypasses these defenses by leveraging the image modality alone. Specifically, we inject imperceptible adversarial perturbations into the input image, enabling successful jailbreaks even when paired with clean prompts (e.g., "a woman"). To achieve this, we address two key challenges: (1) stabilizing the optimization of adversarial perturbations via a novel gradient estimator, and (2) ensuring visual imperceptibility through a diffusion-based perturbation generator. Extensive experiments show that our method successfully compromises the Stable Diffusion Inpainting model, despite its built-in image and text safety checkers, achieving an average attack success rate (ASR) of 85.7%, significantly outperforming baselines (58.7%). Moreover, our attack exhibits strong transferability across models and maintains robustness against common image pre-processing defenses. Warning: Blurred or masked NSFW imagery is contained.

Abstract:
Remote rendering enables high-fidelity virtual reality (VR) experiences on standalone headsets by offloading intensive graphics workloads to remote servers. However, streaming high-quality VR graphics imposes substantial bandwidth and latency challenges. Spatial compression is a form of foveation that addresses this challenge by leveraging the human visual system's varying acuity, allocating higher visual quality around the user's gaze while reducing resolution in the periphery. In this work, we implement three gaze-adaptive foveation methods: Dynamic Axis-Aligned Distortion Transmission (D-AADT2 and D-AADT3) and Dynamic Foveated Radial Warp (D-FRW), of which only D-AADT2 has been previously presented. These methods dynamically adapt spatial compression based on gaze-tracking input, ensuring optimal perceptual quality. We integrate these methods together with their static counterparts into the open-source Air Light VR (ALVR) remote-rendering framework, enabling native (72 FPS) framerates. We conduct a comprehensive objective evaluation across diverse VR games and demonstrate that the dynamic methods significantly outperform traditional static approaches in both encoding efficiency and perceptual quality metrics. A complementary subjective user study further validates these findings, confirming that dynamic gaze-adaptive foveation substantially enhances visual quality, immersion, and user interaction experience.

Abstract:
Video is an essential part of sports interaction among sportspeople. Athletes benefit from self-visualization to correct movement patterns, coaches leverage video analysis to review past events and spectators engage with video to stay connected with their preferred sports. To meet these distinct user needs, it is crucial to establish a holistic approach that considers the intricacies of human interactions within sports contexts. This PhD research adopts a user-centered approach that iteratively creates, develops and evaluates intelligent video-based systems to support meaningful sports interaction. In line with previous work, this research emphasizes how these systems can facilitate visual analysis and knowledge sharing while addressing interaction challenges posed by deploying computer vision in real-world sports scenarios. The findings from this research contribute to the emerging field of SportsHCI, providing design implications for intelligent video-based systems that enhance learning, game analysis and overall interaction across sportspeople. Additionally, it recognizes the significance of machine learning methods in supporting interpersonal collaboration (e.g., athlete-coach), game understanding and personalization in video-based interactions. Situated at the intersection between HCI, computer vision and SportsHCI, this work aligns with core tenets of research in the field that contribute to enhancing user experiences across multimedia applications.

Abstract:
The pervasive spread of multimedia misinformation and disinformation presents a significant challenge to information integrity, demanding robust and efficient verification methodologies. Accurately assessing the authenticity and context of complex multimedia content across diverse platforms requires advanced analytical capabilities. This work introduces Ægis, a novel AI-enhanced solution developed as the authors' submission to the ACMMM'25 - Grand Challenge on Multimedia Verification, aimed at improving the efficiency of multimedia verification. The proposed solution integrates both state-of-the-art and traditional verification tools to comprehensively address various aspects of verification tasks, including event summarization, forensic analysis, and evidence validation. In addition to selectively applied verification modules, large language models (LLMs) are leveraged extensively to fuse findings, perform reasoning, and generate structured verification reports, significantly streamlining the verification process. Demonstrated on the competition multimedia dataset, Ægis accurately validates content integrity, extracts geospatial and temporal information, and identifies content origin across platforms, offering reliable and ethical support for real-world fact-checking.

Abstract:
With the explosive growth of video-centric social media platforms, understanding and predicting video popularity has become a crucial problem in both academia and industry. This year, the Social Media Prediction (SMP) Challenge expands its scope by introducing a dedicated video track, shifting the focus from static images to dynamic, multimodal video content. We introduce the Social Media Prediction for Videos (SMPV) task and release a large-scale, multimodal benchmark dataset, SMPD-Video, with more than 6K short-form videos, including vision language metadata, user profiles, and popularity labels. This challenge invites global researchers to develop predictive algorithms that integrate spatial-temporal dynamics, multimodal learning, and user-video interactions to forecast video popularity in real-world social temporal streams. With the participation and contribution of top teams around the world, the challenge has seen continuous performance improvements in recent years, driven by technological advancements. SMP Challenge Homepage: www.smp-challenge.com.

Abstract:
Estimating momentary conversational engagement is central to assistive, socially aware AI systems, yet models are typically trained and evaluated within a single domain, limiting real-world robustness. The MultiMediate '25 challenge advances engagement estimation to more challenging, cross-cultural, and multi-domain settings. Building on prior challenge editions, we expand beyond NOXI as the sole training source by introducing NOXI-J, a new multilingual corpus covering Japanese and Chinese interactions, enabling both training and evaluation in diverse linguistic contexts. Although NOXI-J conceptually extends NOXI, we treat it as a distinct domain because linguistic, cultural, capture, and annotation differences induce measurable distribution shifts. In this paper, we present new annotations, precomputed multi-modal features (visual, vocal, and verbal), baseline evaluations, and an analysis of the best performing challenge solutions. Beyond accuracy, we quantify fairness using Conditional Demographic Disparity for gender and language. Our baselines confirm strong in-domain performance (e.g., paralinguistic eGeMAPS and video-transformer features) and reveal notable cross-domain drops, underscoring the challenge of cultural, linguistic, and interactional shifts. Fairness analyses indicate generally small discrepancies for our baselines. We observe the largest disparities for the proposed challenge solutions on the Chinese language test set. All annotations, features, code, and leaderboards are made publicly available to foster sustained progress on robust and fair engagement estimation.

Abstract:
In the wave of Artificial Intelligence, along with the proliferation of mobile devices, digital content has been generated and published in an explosive way. Digital content is inherently multimodal: text, image, audio, video, etc. How to effectively and efficiently automate the entire multimodal content lifecycle, from idea to creation and distribution, is therefore of great importance. In this talk, we will first delve into the core technological breakthroughs in multimodal content generation, particularly the latest advancements in image and video generation tasks. Then, we will present an agent-based solution from HiDream.ai to interpret user intention and manage content creation, from ideation to final distribution, using three core, interconnected agents. Among them, the Content Creation Agent takes a simple user prompt, instantly comprehends the creator's intention, and accurately sources relevant assets from the content platform to create multimodal content. This frees creators to focus on the story, not the software. The Self-evolving Platform Agent translates global trends and user preferences into strategic directives, guiding models to autonomously generate high-impact content and continuously enriching the platform's ecosystem. The Distribution Agent adapts multimodal content for various social media platforms and analyzes its performance after publishing. Together, these intelligent agents create a seamless ecosystem to connect creators, content and consumers.

Abstract:
Image manipulation localization (IML) refers to the task of identifying regions in images that have been altered by specific tampering techniques, such as copy-move, splicing, or inpainting. Transformers have been applied to IML tasks due to their ability to model long-range correlations between pixels through the self-attention mechanism. However, the inherent self-attention mechanism may not accurately model or sufficiently enhance these correlations, particularly in detailed edge traces, due to its limitations. In this paper, we propose an Edge-Aware Affinity Enhancement approach for the IML task. Specifically, we introduce an Affinity Regularization Module to establish inter-patch correlations for feature regularization via random walk propagation. Based on the extracted correlation representation, we propose an Edge-Affinity Guidance strategy to further refine the correlation accuracy, particularly in ambiguous edge regions. Extensive experimental results demonstrate that our method outperforms state-of-the-art image manipulation localization techniques in terms of localization accuracy.
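
The idea of regularizing patch features through random walk propagation can be sketched as follows: a patch affinity matrix is row-normalized into transition probabilities, and each patch feature is replaced by the transition-weighted average of all patch features. The function names, the plain row normalization, and the toy affinity values are assumptions for illustration and are not taken from the paper's Affinity Regularization Module.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(affinity: Matrix) -> Matrix:
    """Turn non-negative affinities into random-walk transition probabilities."""
    normalized = []
    for row in affinity:
        total = sum(row)
        if total <= 0:
            raise ValueError("each patch needs at least one positive affinity")
        normalized.append([value / total for value in row])
    return normalized


def propagate(features: Matrix, affinity: Matrix, steps: int = 1) -> Matrix:
    """Apply `steps` rounds of random-walk propagation to per-patch features."""
    transition = row_normalize(affinity)
    current = features
    for _ in range(steps):
        current = [
            [
                sum(transition[i][j] * current[j][d] for j in range(len(current)))
                for d in range(len(current[0]))
            ]
            for i in range(len(current))
        ]
    return current


# Toy example: patches 0 and 1 are strongly related, patch 2 is isolated.
affinity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
features = [[1.0], [3.0], [10.0]]
smoothed = propagate(features, affinity, steps=1)
```

In this toy case the two related patches are pulled to a shared value while the isolated patch is left unchanged, which is the smoothing behavior one would want away from manipulation boundaries.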

Abstract:
Malicious Deepfakes pose serious security risks by producing highly realistic forged faces. While numerous countermeasures have been developed to train binary Deepfake classifiers, their limited generalization capacity restricts practical deployment. To proactively defend against Deepfakes, we propose SVS-WM, a Self-Verifiable Semantic Watermarking strategy. The core idea behind SVS-WM is to embed pairs of correlated watermarks within facial semantics, leveraging the inherent fragility of these features, i.e., any semantic modification will disrupt the watermark correlation, thereby enabling robust Deepfake detection. SVS-WM employs a facial semantic disentanglement and reconstruction network, allowing semi-fragile watermarks to be embedded concurrently across multiple semantic levels, including identity and multiple levels of attributes. Specifically, pairs of pseudo-random noise watermarks are adaptively injected into facial attribute and identity features. During the propagation stage, the protected image may encounter identity or facial attribute manipulations; we then detect Deepfakes by verifying the correlation result between the decoded attribute watermark and the extracted identity vector. This unique cross-verification mechanism enables authentication without requiring the original reference watermark, thereby realizing blind Deepfake detection. Extensive experiments validate the effectiveness of our approach, achieving an average detection accuracy of 98.19% across diverse Deepfake manipulations.
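
The blind cross-verification idea, checking that two signals recovered from the same image are still correlated without consulting a stored reference, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the seeded +1/-1 sequences, the normalized-correlation test, the 0.5 threshold, and the use of an unrelated sequence to stand in for a watermark destroyed by a semantic edit. It does not reproduce SVS-WM's embedding or decoding networks.

```python
import math
import random
from typing import List, Sequence


def pn_sequence(seed: int, length: int) -> List[int]:
    """Generate a pseudo-random +1/-1 sequence from a seed."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(length)]


def normalized_correlation(a: Sequence[int], b: Sequence[int]) -> float:
    """Correlation in [-1, 1]; values near 0 indicate unrelated sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_authentic(wm_a: Sequence[int], wm_b: Sequence[int], threshold: float = 0.5) -> bool:
    """Accept the image only if the two recovered watermarks remain correlated."""
    return normalized_correlation(wm_a, wm_b) >= threshold


# Toy example: the intact pair is identical here; a semantic edit is modeled
# by replacing one watermark with an unrelated sequence.
identity_wm = pn_sequence(seed=7, length=256)
attribute_wm = list(identity_wm)
tampered_wm = pn_sequence(seed=99, length=256)
```

Calling `is_authentic(identity_wm, attribute_wm)` accepts the intact pair, while `is_authentic(identity_wm, tampered_wm)` rejects the modified one, and neither call needs the original watermark on file.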

Abstract:
Model merging techniques aim to consolidate multiple fine-tuned models into a single unified model, reducing both storage and computational overhead while retaining task-specific performance. However, existing methods face several limitations: monotonous compression techniques that fail to account for task-specific weight distribution characteristics, weight-magnitude-based compression that fails to consider functional importance revealed by activation patterns, and non-adaptive allocation strategies that ignore task-specific layer importance. To overcome these challenges, we propose OA-Merge, a novel Outlier-Aware Model Merging framework that leverages task activation outliers to enable adaptive compression and resource allocation across tasks. OA-Merge comprises three key components: (1) a dynamic hybrid decomposition technique that formulates task vectors as tailored combinations of low-rank and sparse components adapted to task-specific statistical distributions, (2) an activation-informed compression methodology that incorporates task-specific activation statistics to prioritize functionally important weights, and (3) a task-related allocation that optimizes the distribution of compression resources according to layer-specific importance metrics derived from activation outlier analysis. These hybrid outlier-aware strategies adapt dynamically to each task's intrinsic characteristics, avoiding the pitfalls of one-size-fits-all approaches. Extensive experiments on both vision models (e.g., ViT) and language models (e.g., RoBERTa, Qwen) demonstrate that OA-Merge outperforms state-of-the-art baselines, achieving average performance gains of 3.2% on vision tasks and 2.8% on language tasks.

Abstract:
Multi-view clustering is one of the fundamental unsupervised multimedia analysis tasks. Recent studies have mainly focused on developing multi-view clustering approaches, which can achieve state-of-the-art clustering performance. However, most of the existing works just focus on multi-view clustering with fixed views, which lacks the flexibility that guidance from dynamically generated views would provide. Besides, these works do not integrate dynamic view generation and the learning of a common representation shared by different views into a unified framework. To this end, we propose the Flexible Multi-view Clustering with Dynamic Views Generation (FMCDVG). Specifically, FMCDVG adopts the graph convolutional network and auto-encoder to dynamically generate the topological graph representation and node attribute representation as two different views, respectively. FMCDVG introduces the latent representation shared by different feature representations and integrates multiple feature representations based on node attributes and graph structure into the latent representation with reconstruction through reconstructed encoding networks (REN). FMCDVG jointly performs dynamic view generation and learning of the common representation shared by different views in a unified optimization framework. We demonstrate that FMCDVG is able to consistently achieve better clustering performance than the state-of-the-art methods through comprehensive experiments.

Abstract:
Multi-view clustering aims to effectively integrate data from multiple views to uncover the underlying clustering structure. However, existing methods typically adopt direct fusion strategies for multiview data, neglecting the issues of view gap-induced heterogeneity and the imbalance in view quality. Particularly, when there are significant differences between views, such direct fusion often leads to the loss of critical information and a decline in clustering performance. To address these challenges, we propose a novel Dynamic Progressive Fusion Multi-View Clustering (DPFMVC). DPFMVC employs a view-adaptive fusion mechanism that dynamically selects the most similar views, reducing conflicts between views while preserving complementary information. Additionally, DPFMVC introduces a dual contrastive loss module and a progressive fusion loss, which effectively align sample features with clustering centers, promoting efficient integration of multi-view information. Specifically, the dual contrastive loss compares the similarity between sample features and cluster centers, ensuring cross-view feature consistency and thus enhancing the discriminability of clustering. Meanwhile, the progressive fusion loss progressively adjusts the fusion order of views, effectively reducing the negative impact of low-quality views on the clustering results, strengthening the synergy between views, and facilitating more effective information fusion. Comprehensive experiments on multiple public benchmarks show that DPFMVC delivers superior clustering results and exhibits overall great effectiveness compared to state-of-the-art techniques.
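
A contrastive term that compares a sample feature with cluster centers can be sketched as a softmax cross-entropy over cosine similarities, where the sample's own center is the positive and the remaining centers are negatives. The InfoNCE-style form, the temperature value, and all names below are assumptions for illustration; the abstract does not specify the exact formulation of the dual contrastive loss.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def center_contrastive_loss(
    feature: Sequence[float],
    centers: Sequence[Sequence[float]],
    positive: int,
    temperature: float = 0.5,
) -> float:
    """Negative log-probability of the positive center under a similarity softmax."""
    logits = [cosine(feature, center) / temperature for center in centers]
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_denominator - logits[positive]


# Toy example: the feature lies exactly on center 0 and is orthogonal to center 1.
feature = [1.0, 0.0]
centers = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = center_contrastive_loss(feature, centers, positive=0)
loss_wrong = center_contrastive_loss(feature, centers, positive=1)
```

The loss is small when the feature is assigned to the center it resembles and large otherwise, so minimizing it pulls features toward their own cluster center and away from the others.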

Abstract:
Accurate nuclei classification serves as a critical cornerstone for disease diagnosis and treatment, yet it is challenged by the heterogeneity of tissue types, staining procedures, and imaging techniques. Recently, vision-language models (VLMs) have demonstrated impressive success in the natural image field and advanced potential in medical imaging. However, the adaptation of VLMs to nuclei classification still poses several challenges, including limited generalization capability and coarse image-text feature alignment. In this paper, we propose SIGNPrompt, a domain-Specific Interactive Prompt learning framework for Generalized Nuclei classification. Specifically, to unleash the generalization capability of VLM-based models, we introduce a prior-guided domain adapting module that integrates nuclei prior information from the large language model (LLM), enabling flexible and robust adaptation to the inherent heterogeneity across pathological domains. Moreover, we develop a multi-modal interactive prompting mechanism to refine image-text feature alignment by leveraging the interdependence between visual and language prompting, thus enhancing the discriminability of nuclei categories. In addition, a simple yet effective noise-adding strategy is proposed to mitigate the overfitting problem in prompt learning. Extensive experiments on diverse public benchmarks and challenging zero-shot scenarios validate that SIGNPrompt consistently outperforms state-of-the-art (SOTA) methods in both accuracy and generalization.

Abstract:
Multi-modal visual tracking leverages complementary sensor information to enhance robustness under challenging conditions. However, the security of multi-modal tracking systems remains largely unexplored. Existing attacks primarily target single-modal trackers or independently disrupt each modality, failing to exploit the inherent feature interactions and fusion mechanisms that define multi-modal tracking. As a result, these methods exhibit limited attack effectiveness and fail to assess multi-modal tracking systems' vulnerabilities accurately. Understanding these security risks is crucial, as adversarial threats could lead to severe failures in safety-critical applications. To address these challenges, a feature-aware adversarial attack, termed FA3T, is proposed. It is designed to explicitly disrupt feature extraction and cross-modal alignment, thereby weakening the fusion process that multi-modal trackers rely on. To achieve this, a Frequency-Spatial Feature Separation (FSFS) module is constructed to perturb feature representations at multiple levels, weakening the modality-complementary advantages of multi-modal tracking. Furthermore, a Target Confusion Attack (TCA) module is devised to manipulate the target-background-template relationships, making it increasingly difficult for the tracker to distinguish the true target, significantly impairing tracking performance. Extensive experiments on five benchmark datasets (i.e., LasHeR, RGBT234, DepthTrack, VOT-RGBD2022, VisEvent) across three different modalities (RGB-T, RGB-D, and RGB-E) demonstrate that our attack substantially degrades state-of-the-art multi-modal trackers, exposing their susceptibility to adversarial threats.

Abstract:
By clustering pixels with locally similar values, superpixel-based approaches have shown great potential in processing hyperspectral images (HSI), thereby reducing the computational burden associated with large spatial dimensions. However, specific for spatial-spectral fusion (SSF), superpixel segmentation is inherently non-differentiable and irreversible; hence it is inapplicable. To address the issues, we propose a semantic transformer-based solver, namely SpecSolver, which is basically inspired by the benefits of superpixel-based approaches, yet with the inner mechanism completely improved. The core idea lies in learning the intrinsic semantic states of HSIs hidden behind discretized pixel representations. Specifically, we propose a new Semantic-Attention to adaptively split the image domain into a series of learnable slices of flexible shapes, where image pixels under similar semantic states will be ascribed to the same slice. By calculating attention to the Semantic-Superpixel tokens encoded from slices, SpecSolver can effectively capture intricate semantic correlations from the vast number of pixels, which also empowers the solver with an endogenous capacity for modeling different magnification scales and allows for efficient computation in linear complexity. On that basis, we elaborate a SpatialNet module, which extracts multiscale local spectral information, and a FreqNet module, which supplements global information, capturing subtle details and variations across different spectra. Experiments on two benchmark SSF datasets verify the state-of-the-art (SOTA) performance of the proposed method, both visually and quantitatively. Also, ablation studies validate the mentioned contributions.

Abstract:
Deep multi-view graph clustering seeks to integrate diverse graph feature sets and uncover consistent information across multiple views. While extensive prior research has utilized various neural network architectures to address multi-view graph clustering challenges, these approaches exhibit notable limitations: 1) The "black-box" nature of deep learning models, which obscures their internal mechanisms and impedes interpretability; 2) Insufficient efforts aim to capture low-dimensional representations through graphs that reflect intuitive clustering structures and reduce computational cost. To address these limitations, this paper introduces an interpretable multi-view graph clustering framework constructed with optimization-inspired modules. The proposed approach formulates low-dimensional clustering representation learning from graph matrices as an optimization problem, deriving an iterative solution rooted in this formulation. By seamlessly bridging this optimization process to a deep network architecture, the model learns a low-dimensional clustering representation for graph-structured data across multiple views while adhering to the iterative optimization principles and reducing computational costs. This transparent network design enhances the interpretability of multi-view clustering, enabling intuitive and human-understandable learning of clustering structures. Extensive experimental evaluations validate the proposed framework's superiority over state-of-the-art methods in multi-view clustering tasks while ensuring interpretability and reducing computational costs.

Abstract:
Diffusion models have recently shown strong capabilities in image generation. This paper investigates their potential for semantic segmentation, with a focus on RGB-D tasks that demand precise pixel-level predictions. In particular, we delve into the intermediate activations generated during the reverse Markov step of the diffusion process, discovering that these activations can effectively capture the semantic information of an input image, making them outstanding representations for addressing segmentation challenges. This paper proposes Diffusion-Enhanced Multi-Modal Segmenter (DiffuSeg), which innovatively combines RGB features with those generated by an additional diffusion model, facilitating the extraction of comprehensive and nuanced semantic features. Furthermore, we propose the Cross Attention-and-Aggregation Module (CAAM), which not only fosters long-range interactions between RGB and diffusion-derived features but also recalibrates both feature sets before integration, enhancing multi-modal synergy. Additionally, our model incorporates a Dynamic Cascade Kernel (DCK) architecture that exploits local and intricate multi-scale geometric details. As a part of DCK, the Spatial Interaction Module (SIM) dynamically encodes spatial information by establishing pixel-level correlations, thereby enhancing the spatial feature representation capacity. Extensive experiments on two benchmark datasets demonstrate the strong capability of DiffuSeg in handling challenging semantic segmentation tasks.

Abstract:
Over the past few decades, Internet video streaming has seen explosive growth, pushing network resources to their limits. Video Super-Resolution (VSR) technology, which enhances video quality while reducing bandwidth usage, offers a promising solution to replace traditional video delivery frameworks. However, existing real-world VSR approaches often struggle when faced with inevitable packet loss during network transmission, especially in bandwidth-constrained and low-latency environments. This packet loss introduces amplified noise and artifacts, significantly degrading visual quality. In this work, we decouple the various degrees of degradation caused by packet loss and comprehensively analyze the impact of different types of packet loss. To address these challenges, we propose ReinVSR, an efficient countermeasure strategy that mitigates the detrimental effects of packet loss without introducing additional network overhead. ReinVSR employs a two-pronged approach: a pre-restore module to mitigate missing pixel information and a Local Hidden State Attention module to rectify semantic distortions at the feature level by replacing corrupted hidden states with more accurate representations. Specifically, we leverage neighboring frames to generate a pool of hidden features, which are then refined using a novel spatial attention mechanism to aggregate more authentic and accurate hidden states. Extensive experiments demonstrate that ReinVSR outperforms state-of-the-art methods, achieving significant improvements in visual quality. It offers a robust and effective solution for high-quality video streaming in bandwidth-limited environments.

Abstract:
Dense video captioning aims to generate descriptive sentences for each temporally localized event in a video. This task comprises two subtasks: event detection and event captioning. Existing methods commonly adopt a DETR-like (Detection Transformer) architecture to perform both subtasks in parallel. These methods assume that both subtasks require the same visual information and thus extract a single event representation for each event using a shared query. We observe that event detection and event captioning emphasize different regions of a video. In particular, compared to event captioning, event detection tends to focus more on the boundary regions of event proposals. Therefore, relying on shared queries may hinder the ability of the model to meet the specific needs of each subtask, leading to suboptimal performance. In this paper, we propose decoupling the two subtasks by assigning distinct queries to each, enabling more accurate capture of task-specific features. Specifically, we introduce a task-specific query transformation module. This module utilizes two sets of task-specific prompts to transform shared queries into queries tailored for each subtask. These task-specific queries enable each subtask to attend to the video regions that are most beneficial to its respective objectives. By integrating our method into several state-of-the-art frameworks, we achieve superior performance on both event detection and event captioning.

Abstract:
Vision-Language Models (VLMs) such as CLIP have demonstrated outstanding performance in cross-modal tasks, but their prohibitive computational cost hinders practical deployment. Although Knowledge Distillation (KD) provides a promising compression paradigm, most existing methods rely heavily on feature imitation and contrastive relations without explicit fine-grained alignment. Additionally, they do not fully leverage the multimodal interaction knowledge from the teacher model, restricting cross-modal semantic alignment. To address these challenges, we propose KAID, a Knowledge-Aware Interactive Distillation method for VLMs. Specifically, we first pretrain a large CLIP teacher model with domain few-shot labels and store text features as category vectors. Then, an Image Feature Matching (IFM) module is introduced to calculate the feature distribution of teacher-student models with improved cosine similarity, which achieves hierarchical knowledge transfer from global to local levels and enhances the fine-grained perception of the student model through joint optimization. Moreover, a Pixel-Wise Alignment (PWA) module is constructed between the teacher's text features and the student's image features, employing a cross-modal attention mechanism to establish semantic associations, while a Text-guided Pixel alignment Loss function (TPloss) is concurrently designed to enhance the student's comprehension capabilities. Ultimately, the well-trained student model is used for inference. Extensive experiments on 11 datasets validate the effectiveness of our method. Specifically, our method achieves average improvements of 2.14% and 2.40% on the base and new classes across these datasets.

Abstract:
Recent advances in Generalized Zero-Shot Learning have demonstrated promising results by leveraging cross-modal attention mechanisms to model the nuanced relationships between semantic attributes and visual regions. However, we find that inherent attribute imbalance leads to significant attention disparities, i.e., attributes with lower sample weights often exhibit lower attention confidence and poorer localization accuracy. This attention illusion can lead to degraded or even counterproductive contributions to category determination. Conventional rebalancing approaches like resampling or reweighting are ineffective for attention calibration due to the complex interdependencies among attributes in one object and the absence of region-level annotation guidance. In this paper, we propose a novel coarse-to-fine framework termed Hierarchical Progressive Attention Network (HPAN), which leverages latent attention consistency across attributes to rectify the focus deviation of minor attributes. HPAN comprises two synergistic components, Super-Attribute Guided Generable Attention (SGGA) and Rebalanced Attribute Attention Calibration (RAAC). SGGA employs super-attributes to establish unified attention regions for both major and minor attributes, subsequently propagating these attention masks to RAAC for calibration. RAAC specializes in capturing fine-grained attribute-region interaction relations, with its output attention masks serving as supervisory signals to iteratively optimize the coarse attention of SGGA. Extensive experiments on three benchmark datasets verify the effectiveness of the proposed method. Code is available at github.com/zjrao/HPAN.

Abstract:
Event-driven eye-based emotion recognition has attracted increasing attention due to the high temporal resolution and dynamic range inherent to event cameras. The intrinsic spatial sparsity of event data, combined with the eye-based emotion recognition task's reliance on localized features such as eyebrows and eyelids, makes it intuitive and efficient to discard less informative regions. However, integrating such sparsification into CNNs remains challenging due to their reliance on dense grid-based operations. In this paper, we propose an efficient vision transformer framework for eye-based emotion recognition with event cameras. Specifically, we present window selection and token selection schemes tailored for event data and eye-based emotion recognition, which can diminish computing demands while enhancing performance. Firstly, we estimate the importance of all local windows and discard those with limited information, reducing computational cost while emphasizing attention on the periocular region. Secondly, we further introduce an adaptive token pruning mechanism that jointly evaluates the input event data and tokens to predict a binary decision mask, identifying and discarding uninformative tokens. Extensive experiments validate that the proposed approach outperforms existing state-of-the-art methods in accuracy by a significant margin.
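The window selection step described above can be illustrated with a minimal sketch. The abstract does not say how window importance is estimated, so this example assumes, purely for illustration, that importance is the raw event count per non-overlapping window; the function names `window_event_counts` and `select_windows` and the `keep_ratio` parameter are hypothetical and not taken from the paper.

```python
import math
from typing import Iterable, List, Tuple


def window_event_counts(
    events: Iterable[Tuple[int, int]], height: int, width: int, win: int
) -> List[List[int]]:
    """Count events falling into each non-overlapping win x win window.

    Assumption: each event is an (x, y) pixel coordinate; polarity and
    timestamps are ignored in this sketch.
    """
    rows = math.ceil(height / win)
    cols = math.ceil(width / win)
    counts = [[0] * cols for _ in range(rows)]
    for x, y in events:
        counts[y // win][x // win] += 1
    return counts


def select_windows(counts: List[List[int]], keep_ratio: float) -> List[Tuple[int, int]]:
    """Keep the fraction of windows with the highest event counts.

    Returns the (row, col) indices of the kept windows in sorted order.
    """
    scored = [(count, r, c) for r, row in enumerate(counts) for c, count in enumerate(row)]
    keep = max(1, math.ceil(keep_ratio * len(scored)))
    # Stable sort: ties keep their original raster order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted((r, c) for _, r, c in scored[:keep])
```

Only the kept windows would then be passed to the attention layers; a learned importance estimator could replace the event count without changing the selection logic.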

Abstract:
The rich pretrained knowledge in vision-language models (VLMs) endows them with the ability to discriminate common objects given only category names, but this ability may be challenged by out-of-distribution unlabeled samples. To address this limitation, test-time adaptation (TTA) dynamically adjusts VLMs to target distributions during inference. Current TTA frameworks rely heavily on unsupervised data augmentations to enhance sample informativeness, but remain vulnerable to naive augmented views. This work introduces Patch Augmentation (PatAug), a pixel-level perturbation framework that optimizes the benefits of informative augmentations and mitigates negative transformation impacts. Implemented as trainable pixels, PatAug patches are prepared given only category names before inference, introducing little additional overhead. The patches encode class-related semantic information. They assist VLMs in emphasizing the compatible visual information in the images and restoring perturbed image details, while retaining unrecognized information. Such merits inspire the design of an augmentation-of-augmentation framework, where PatAug is applied to standard augmentation views for reliable TTA inference results. To better fit the target distributions, we adjust patches with a cross-modal similarity alignment loss and learnable patching weights. Experiments on natural and specialized domain shifts confirm the effectiveness of PatAug.

Abstract:
Unsupervised hashing is applied in large-scale multimodal retrieval by mapping original data from heterogeneous modalities into compact binary codes. Transformer-based retrieval augmented generation possesses significant advantages in retrieval accuracy and context-awareness, yet faces scalability challenges due to the computational overhead of dense embedding. Thus, the integration of hash learning and Transformer provides a feasible improvement scheme, which can achieve efficient retrieval preserving semantic association. This paper proposes a novel Unsupervised Similarity-Fusion Transformer Hashing for multimodal retrieval, denoted as USFTH. Initially, the modal fusion similarity matrix based on Gaussian kernel, sigmoid function, and Laplacian transformation is introduced to construct a discriminative similarity matrix, ensuring that semantic correlation among samples can be captured precisely. Then, cross-modal multiplex joint construction via Transformer-based attention mechanisms is designed, realizing effective integration of heterogeneous modalities in the similarity matrix through multi-path fusion. Furthermore, the consensus fusion strategy is proposed to ensure that hash codes generated under unsupervised conditions possess a uniform distribution and achieve accurate retrieval. In addition, comprehensive experiments on MIRFlickr, NUS-WIDE, and IAPR-TC12 datasets demonstrate the superior performance of USFTH to state-of-the-art hashing approaches.
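As a rough illustration of the similarity-fusion idea above, the sketch below builds a Gaussian-kernel similarity matrix per modality and fuses two of them by a convex combination. This covers only the Gaussian-kernel ingredient; the sigmoid and Laplacian transformations and the actual fusion rule of USFTH are not specified in the abstract and are not reproduced here. The names `gaussian_similarity` and `fuse_similarities` and the parameters `sigma` and `alpha` are assumptions made for this example.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def gaussian_similarity(features: Sequence[Sequence[float]], sigma: float = 1.0) -> Matrix:
    """Pairwise Gaussian (RBF) kernel similarity: exp(-||xi - xj||^2 / (2 sigma^2))."""
    n = len(features)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            sim[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return sim


def fuse_similarities(sim_a: Matrix, sim_b: Matrix, alpha: float = 0.5) -> Matrix:
    """Assumed fusion rule: convex combination of two modality-wise similarity matrices."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(sim_a, sim_b)
    ]
```

For instance, `fuse_similarities(gaussian_similarity(image_feats), gaussian_similarity(text_feats))` yields a symmetric matrix with ones on the diagonal that could serve as an unsupervised target for hash-code similarity.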

Abstract:
Multi-domain and multi-task learning enhance the efficiency and performance of industrial recommendation systems by integrating information from different domains/tasks to model user interests uniformly. However, existing methods suffer from the problem of representation entanglement, which limits the effective handling of commonality and specificity among various domains/tasks. In this paper, we propose a Prototype-guided Representation Projection (PRP) model to address this issue, which explores a novel direction of applying prototype learning to deal with complex domain/task relationships in the recommendation field. To identify inter-domain/task commonality, PRP initially uses a shared Mixture of Experts (MoE) architecture to learn representations for each sample, projecting them into a common prototype space across all domains/tasks. For domain/task specificity, specific feature extraction experts are employed, and sample representations are projected to the corresponding prototype spaces, constrained by an orthogonal loss to ensure the independence of those spaces. Moreover, PRP utilizes Optimal Transport (OT) to guide the correct representation projection within the prototype spaces, employing the linear combination of prototypes as the new sample representation. We conduct offline experiments on two open-source datasets and deploy our approach in an online system for A/B testing. Extensive experimental results consistently demonstrate that our approach outperforms existing methods.
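The orthogonality constraint between prototype spaces mentioned above can be sketched as follows. The abstract does not give the exact form of the loss, so this example assumes a common choice: the squared Frobenius norm of the cross-Gram matrix between two prototype sets, which is zero exactly when every prototype in one set is orthogonal to every prototype in the other. The function name `orthogonal_loss` is hypothetical.

```python
from typing import Sequence


def orthogonal_loss(
    protos_a: Sequence[Sequence[float]], protos_b: Sequence[Sequence[float]]
) -> float:
    """Sum of squared dot products between prototypes of two spaces.

    Equivalent to ||A B^T||_F^2 when the prototypes are the rows of A and B.
    Assumed form of the constraint, not necessarily the one used by PRP.
    """
    total = 0.0
    for a in protos_a:
        for b in protos_b:
            dot = sum(x * y for x, y in zip(a, b))
            total += dot * dot
    return total
```

Minimizing this term alongside the recommendation objective would push the domain-specific or task-specific prototype spaces toward mutually orthogonal subspaces.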

Abstract:
Molecular retrieval is critical in drug discovery and molecular design. Traditional discriminative methods often model the conditional probability distribution of retrieving candidates, treating the query text as a deterministic input. However, these approaches have notable limitations: (1) They often overlook the statistical properties of the original data distributions of queries and candidates, preventing the recognition of out-of-distribution data. (2) They struggle to balance retrieval accuracy and diversity when processing open-ended semantic queries. To address these challenges, we introduce DiffTMR, a novel framework that reformulates text-molecule retrieval as a reverse denoising process, progressively generating the joint distribution of candidates and queries from noises. DiffTMR uniquely integrates hierarchical diffusion alignment with dynamic perturbation embedding mechanisms. By employing text-anchored perturbations, it enhances the diversity of molecular representations, and through global-local progressive denoising, it achieves cross-modal hierarchical alignment. This leads to significant improvements in retrieval accuracy and out-of-domain generalization. Evaluations on benchmark datasets ChEBI-20 and PCdes demonstrate that DiffTMR surpasses current leading baselines by 4.2%-5.4% in Hits@1 metrics and exhibits superior performance in out-of-domain retrieval tasks.
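For readers unfamiliar with the Hits@1 metric reported above, a minimal sketch of Hits@k follows. It assumes a square score matrix in which the correct candidate for query i sits at index i; this layout and the function name `hits_at_k` are assumptions for illustration, not details from the paper.

```python
from typing import Sequence


def hits_at_k(score_matrix: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    score_matrix[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(score_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(score_matrix)
```

With k=1 this is the share of text queries for which the top-ranked molecule is the correct one.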

Abstract:
Learning user preferences in recommendation systems is enriched by multimodal features, such as textual and visual content, and amplified by multi-interest modeling with Variational AutoEncoders (VAEs). However, prior efforts are limited by single modality focus and cumbersome, parameter-heavy architecture designs. To address these limitations, we introduce an innovative solution that blends the semantic richness of multimodal data with the representational power of multi-representation VAEs. Drawing inspiration from Mixture of Experts (MoE), we cast each VAE as an expert tailored to a specific modality, then fuse them via a novel parameter-merging function into a lean, unified model. This approach efficiently captures diverse user preferences behind multimodal data with minimal complexity. Rigorous experiments on real-world benchmarks show our method outshines state-of-the-art baselines while slashing parameter counts. Our work sets a new, streamlined standard for multimodal, multi-interest recommendation systems.

Abstract:
Generative models have emerged as powerful tools capable of generating photorealistic images, spawning a wide range of applications across various domains. However, effectively integrating generative models into image classification tasks remains an open problem. Our analysis reveals that current generative data augmentation methods, as well as traditional data augmentation techniques, have limitations in simultaneously ensuring both fidelity (faithful foreground) and diversity (rich background contexts). To address this challenge, we propose Decomposition-Recomposition Data Augmentation (DRMix), an innovative intra-class data augmentation method. DRMix decomposes images into foreground-background and foreground parts, then performs diversified background recomposition and intra-class foreground recomposition, achieving dual diversity enhancement at both the image and part levels, and strikes a better trade-off between fidelity and diversity. Experimental results demonstrate that DRMix significantly improves performance across multiple tasks, including image classification, few-shot learning, and weakly-supervised object localization (WSOL).

Abstract:
Fake document detection is an important area in image forensics. Most of the existing techniques focus on the detection of digitally forged documents. In this paper, we look into the forgery of generalized physical occlusion, which is a simple and effective strategy to generate fake document images. We propose an Adversary Decomposition Network (ADDNet) to effectively extract generalized physical occlusion features from various types of documents, where two adversarial classifiers are designed and trained for feature decomposition. On top of the ADDNet, we further propose a lightweight Document Adapter (DA) for flexible and scalable fake document detection, which works well when we encounter a new type of document with limited samples for fine-tuning. To facilitate the research, we newly construct a dataset for physical occlusion detection on different types of documents. Various experiments are carried out to demonstrate the advantage of our proposed scheme over the existing schemes for physical occlusion detection, especially when the document is unseen or has limited samples in training.

Abstract:
Video instance segmentation presents significant challenges in complex and dynamic environments, where instances experience progressive occlusion, either from objects obstructing each other or due to changes in the camera's viewpoint. Current state-of-the-art methods rely on memory bank mechanisms, but we still look forward to new paradigms that have the ability to capture and utilize structural information, the ability to model complex relationships, and the flexibility to adapt to dynamic scenarios. To this end, we propose the Weighted Structure Inference method for Video Instance Segmentation. We build on high-order structural relationships by constructing hypergraphs for each video frame, enabling the capture of complex interactions that go beyond traditional pairwise methods. To model intricate dynamics, we introduce Weighted Sheaf Hypergraph Convolution, which enhances the hierarchical and structural information embedded in the hypergraph. Furthermore, we ensure spatio-temporal consistency by employing a dynamic inference mechanism based on Weighted Sliced Wasserstein distance to compare structural features across adjacent frames. Our method preserves the topological characteristics of occlusion instances and improves the reliability of instance tracking across frames. Experimental results demonstrate that our method outperforms existing video instance segmentation frameworks in both Video Instance and Panoptic Segmentation tasks.
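To make the distance used for cross-frame comparison more concrete, here is a minimal sketch of the standard (unweighted) sliced 1-Wasserstein distance between two equal-size sets of feature vectors. The weighting scheme of the paper's Weighted Sliced Wasserstein distance is not described in the abstract and is not reproduced; the function name, the number of projections, and the fixed seed are assumptions.

```python
import math
import random
from typing import Sequence


def sliced_wasserstein(
    xs: Sequence[Sequence[float]],
    ys: Sequence[Sequence[float]],
    num_projections: int = 64,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    Each random unit direction reduces both point sets to 1-D, where the
    optimal transport plan between equal-size sets matches sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch requires equal-size point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        proj_x = sorted(sum(a * d for a, d in zip(p, direction)) for p in xs)
        proj_y = sorted(sum(a * d for a, d in zip(p, direction)) for p in ys)
        total += sum(abs(a - b) for a, b in zip(proj_x, proj_y)) / len(proj_x)
    return total / num_projections
```

A small distance between the structural features of adjacent frames would indicate that the instance structure is preserved over time, while a large one flags a change such as an occlusion.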

Abstract:
Online face recognition systems usually store face features in the server database for authentication, where they are vulnerable to face reconstruction attacks. Various face privacy protection approaches have been proposed to address this issue, where transformation-based schemes are shown to be promising. However, the existing transformation-based schemes are all hand-crafted and struggle to balance privacy protection and face recognition. In this paper, we propose to learn a set of discrepant convolutional neural networks (DCNNs) to protect the privacy of face features. We randomly split the original face features into different sub-features. Each of the DCNNs transforms an original sub-feature into a protected one. We adopt appropriate strategies to make the DCNNs as diverse as possible to improve the ability of our protected features to resist different face reconstruction attacks, where a face recognition loss and a privacy protection loss are designed for training. The former ensures that the protected feature can be matched directly using the existing face recognizers, while the latter incorporates a shadow face reconstruction model to interrupt the correlation between the protected features and the face images. Experimental results demonstrate the advantage of our method over existing schemes for face privacy protection. Our protected features can be accurately matched using existing face recognizers and are capable of resisting both black-box and white-box face reconstruction attacks.

Abstract:
Text-to-Motion Retrieval (TMR) is the challenging task of retrieving relevant motion sequences given a natural language description. Existing TMR methods primarily utilize single embeddings to represent and align text and motion sequences. However, real-world motion sequences typically contain multiple sequential actions with intricate semantics, which are hard to capture precisely with a single embedding. Additionally, methods that rely solely on naive contrastive training to capture high-level semantics may struggle to perceive the fine-grained action details necessary for precise text-motion alignment. In this work, we propose a novel Sequence-Event Semantic Consistent Learning (SECL) framework for 3D human motion retrieval. Specifically, we introduce a self-supervised learning strategy to incorporate fine-grained action details into the motion representations via generative feedback from a diffusion model. We design a parameter-free sequence-level interaction to explore coarse-grained alignment and an event-level interaction that utilizes several learnable queries to capture event semantics in a shared learning manner for fine-grained alignment. Furthermore, an inter-consistency loss is introduced to align the event semantics between the motion and corresponding text, and an intra-diversity loss is designed to encourage event features to attend to different contents, effectively capturing the rich action information. Finally, we modify the traditional contrastive alignment objective and propose an importance-sampling strategy to emphasize harder negatives for discriminative representation learning. Extensive experiments show that our method significantly outperforms existing methods in text-to-motion retrieval and other challenging tasks, e.g., human interaction recognition and motion temporal localization.

Abstract:
Text-conditioned motion generation has significant applications across various domains. However, generating natural motion remains challenging due to the vast solution space and the accumulation of errors during motion generation. To address these challenges, a novel hierarchical motion intention decoding-based motion synthesis model named Hi-Motion is proposed, which disentangles human motion into temporal intents of pivot joints and skeleton synthesis guided by intention from a new perspective. Specifically, Hi-Motion first parameterizes pivot joint motion with high-order Bézier curves and constructs a Bézier decoder to generate their trajectories, which serve as motion intention to guide skeleton generation. Secondly, we formulate the generation of skeletons as a graph node transformation problem under the condition of determined edge connections. By incorporating hierarchical joint motion intentions into the graph node features, the spatial details of each frame can be precisely synthesized. The proposed Hi-Motion effectively decouples motion generation into temporal and spatial dimensions through hierarchical motion intention decoding, ensuring coordination and naturalness in the generated motion. Extensive experiments on HumanML3D and KIT-ML datasets substantiate the motion generation capabilities of Hi-Motion. Further analysis demonstrates that Hi-Motion can accurately predict the motion intention of pivot joints and synthesize skeletal details.
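The Bézier parameterization of pivot-joint trajectories mentioned above can be illustrated with de Casteljau's algorithm, which evaluates a Bézier curve of any order by repeated linear interpolation of its control points. This is a generic sketch of Bézier evaluation, not the paper's Bézier decoder; the function names and the idea of sampling one point per frame are assumptions for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def bezier_point(control_points: Sequence[Point], t: float) -> List[float]:
    """Evaluate a Bezier curve at parameter t in [0, 1] (de Casteljau)."""
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        # Interpolate between each pair of consecutive points.
        pts = [
            [(1.0 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_trajectory(control_points: Sequence[Point], num_frames: int) -> List[List[float]]:
    """Sample one joint position per frame, evenly spaced in the curve parameter."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [
        bezier_point(control_points, i / (num_frames - 1))
        for i in range(num_frames)
    ]
```

A curve with n + 1 control points has order n, so a model only needs to predict a handful of control points per pivot joint to describe a smooth trajectory over many frames.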

Abstract:
Image retargeting (IR) with text regions is a challenging yet underexplored task that focuses on resizing an image's aspect ratio while preserving both semantic objects and the legibility of textual content. This task introduces three primary challenges, which can be summarized as follows: (1) the distinct probability distributions between text and non-text regions in images; (2) the lack of dedicated mechanisms in existing IR methods for handling text regions, often leading to text distortion or blurring; (3) the absence of paired datasets specifically designed for IR tasks involving text regions. To tackle these challenges, we propose SSIR, a unified framework that reformulates IR as a joint Semantic Segmentation and Image Retargeting (SS-IR) task, leveraging an attention mechanism to bridge these components. Specifically, we first employ a semantic segmentation sub-network that extracts text region features using text segmentation techniques to improve text-awareness in retargeting tasks. Then, we integrate text features into the image's visual representation through an attention-driven module designed to preserve both textual and semantic content during retargeting. Finally, we address the absence of paired datasets with an unsupervised learning paradigm based on a Cycle-IR framework, which employs cyclic consistency reconstruction, enabling effective learning without the need for paired training data. Experimental results show that the SSIR algorithm effectively preserves text information and delivers high-quality visual retargeting results.

Abstract:
In recent years, Text-to-Music (T2M) generation models have rapidly emerged as powerful tools in content creation across fields. While existing models have made notable progress in sound quality, instrument identification, and stylistic alignment, they still exhibit clear limitations in modeling musical structure and musicality, particularly in terms of harmonic coherence and rhythmic alignment. To address these issues, we propose Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation (TCSA), which introduces explicit local condition controls to enhance structural fidelity in music generation. Specifically, we design a music theory enrichment strategy based on GPT-2 that transforms input text into detailed descriptions with embedded music theory knowledge, from which accurate chord progressions and rhythmic patterns are extracted as generation conditions. To synchronize these local features effectively, we develop a temporal alignment feature fusion mechanism. Additionally, we propose a layer-skipping fine-tuning strategy to avoid overfitting and enable fine-grained structural modeling. Finally, we introduce a perception-driven loss function based on Mel spectrograms to optimize the harmonic consistency and structural coherence of the generated music. Experimental results demonstrate that TCSA achieves competitive generation quality while offering significantly improved controllability over musical structure, making it well-suited for professional music production and refined content creation.

Abstract:
Co-speech gestures are generally categorized into rhythmic and semantic gestures: rhythmic gestures align with speech rhythm and intonation, while semantic gestures convey specific meanings or emotions, enriching verbal communication. Most previous studies have focused on synthesizing rhythmic gestures, while recent methods have explored integrating large language models (LLMs) to retrieve semantic gestures and merge them with rhythmic ones. However, existing approaches primarily rely on textual context for retrieval, which may not fully capture the emotional and tonal nuances of speech, sometimes leading to semantic gestures that do not align with the speaker's intended expression. Additionally, common gesture fusion techniques often merge rhythmic and semantic gestures directly, causing discontinuity due to differences in their movement styles. To address these challenges, we propose SemGesture, a system designed to generate smooth and semantically accurate gestures. Our approach incorporates Context-aware Retrieval powered by a Large Audio Language Model, enabling precise retrieval of gestures that align with both the semantic and emotional aspects of speech. Additionally, the Gesture Fusion Module dynamically adjusts semantic gestures to harmonize with rhythmic gestures, ensuring seamless and coherent motion transitions. Extensive experiments demonstrate that SemGesture significantly outperforms existing methods in generating contextually accurate and visually natural gestures.

Abstract:
Existing synthetic image detection approaches can be categorized into three paradigms: spatial, frequency, and fingerprint-based methods. Our analysis reveals a fundamental commonality across these paradigms: a significant reliance on high-frequency image components. This observation highlights the discriminative power of high-frequency information for this task and provides a strong rationale for learning generalized artifact representations based on multi-modal fusion strategies. Building on this insight, we introduce a multi-modal high-frequency interactive detection framework for general synthetic image detection. This framework explicitly integrates high-frequency information from both the spatial and frequency domains. Specifically, its spatial processing branch incorporates a novel high-frequency self-enhancement module to bolster local high-frequency representations. Concurrently, the frequency processing branch utilizes a multi-scale frequency information enhancement module to capture diverse contextual cues. At the feature fusion stage, we propose a pooling-guided cross-modal high-frequency interaction module, which dynamically weights cross-modal information to further reinforce salient high-frequency representations. Extensive experiments on public datasets demonstrate that our proposed framework achieves state-of-the-art performance in real-world detection scenarios.

Abstract:
The widespread availability of publicly accessible data on the internet accelerates the progress of deep learning but also raises concerns about unauthorized data usage for training neural networks. Early safeguard methods introduce small, carefully crafted perturbations into data via a surrogate model to generate unlearnable data, aiming to prevent models from learning meaningful patterns. However, these methods lack robustness against adversarial training. Later works introduce adversarial examples to solve this problem, but at the cost of increased surrogate-model overhead. Recently, Convolution-based unlearnable data (CUDA), a surrogate-free method, has been proposed to address this issue using manually designed class-wise convolution kernels. Despite its success, CUDA suffers from high-frequency detail loss, perturbation hash collisions, and vulnerability to frequency filtering attacks. In this paper, we propose KBS (K-Space Bispectrum Steganography), which embeds class-specific information into the magnitude and phase components of the Fourier domain while preserving visual fidelity under reconstruction constraints. By directly performing steganography in the frequency domain, KBS preserves high-frequency details and avoids hash collisions with compact binary codes, enabling scalability to large-class datasets. Furthermore, KBS resists frequency filtering attacks by embedding perturbations in a way that remains imperceptible in the pixel space. Experimental results on public benchmarks demonstrate that KBS outperforms state-of-the-art methods.

Abstract:
The volume of haptic data is growing rapidly, driven by the enormous number of interaction points in human-computer interaction and embodied AI. In the near future, massive haptic signals, encompassing both kinesthetic and vibrotactile signals, will place significant demands on both communication and computing resources. To address this challenge, we propose the first task-oriented semantic codec for low-delay vibrotactile transmission, namely, the vibrotactile semantic codec (VTSC). Specifically, we design a perception-based vibrotactile semantic extraction mechanism (PSEM) that considers the high and low thresholds of vibrotactile perception in effective semantic coding while adhering to the low-delay constraint. Inspired by this principle, we then propose a vibrotactile semantic encoder (VSE) with local and global semantic extractors, which can efficiently extract and preserve semantic features within the short frame context. In addition, we present a semantic distribution loss function to enhance the learning of meaningful representations. Comprehensive experiments demonstrate the superiority of our VTSC, which achieves significantly higher task accuracy than state-of-the-art vibrotactile codecs at the same compression ratio (CR), e.g., a 60% improvement when CR=256. When compared to transferred audio-visual semantic codecs, our VTSC also shows promising improvements, validating the effectiveness of our approach.

Abstract:
Line sketches serve as the visual DNA of fashion design, forming the essential foundation where concepts take shape, yet today's digital tools often lack the fluidity, personalization, and intelligence needed to truly support this creative process. We present FashSketch, an interactive, multimedia-driven system that reimagines fashion sketching through the lens of generative AI. Designed with a layer-based creative interface, FashSketch empowers designers to ideate, customize, and iterate on sketches seamlessly. By integrating state-of-the-art generative models, sketch-based retrieval, and large language models, the system supports advanced functionalities such as text-to-sketch generation and context-aware sketch recommendation. FashSketch not only enhances the sketching experience but also opens new multimodal pathways for creative expression, making it a powerful co-creative partner in the early stages of fashion design. Video demo is available at https://youtu.be/BX-Edz7Z7ZY.

Abstract:
This paper presents doctoral research on integrating Retrieval-Augmented Generation (RAG) into video-related multimodal tasks. Existing RAG studies predominantly target text, images, or tabular data, overlooking the unique value of video as a knowledge carrier. We address this gap by: 1) proposing AdaVideoRAG, a framework that adaptively allocates retrieval strategies based on query complexity for long-video understanding; 2) developing REViG (RAG-Enhanced Video Generation) to optimize prompt engineering via retrieved knowledge for controllable video synthesis; 3) constructing the UltraVideo dataset (UHD-4K/8K resolution, 100+ themes, 10 structured captions per video) and HiVU/HiVG benchmarks to evaluate RAG-driven video tasks. Experiments validate the effectiveness of our methods, and we outline future plans to unify video understanding and generation through Agentic RAG for AGI-oriented research.

Abstract:
Predicting the popularity of social media videos involves estimating user engagement based on rich multimodal information embedded within the posts. Unlike static images, videos incorporate temporally evolving visual signals that, alongside associated metadata such as descriptions, hashtags, timestamps, and user attributes, offer valuable insights into their potential audience reach. Prior approaches typically extract features from different modalities independently and merge them via naïve concatenation, which overlooks the semantic discrepancy and interaction dynamics across modalities. To address these limitations, we propose a feature fusion framework that encodes and aligns video content and associated textual cues into a shared semantic space. By jointly modeling temporally structured visual features with context-aware textual embeddings, our method effectively captures cross-modal correlations that are crucial for discerning content virality patterns. In addition, we incorporate user-centric behavioral profiles and content creation dynamics, enriching the representation with personalized signals that reflect audience-specific preferences. Notably, our method achieves top-tier performance in the 2025 SMP challenge, ranking among the highest-performing entries. This strong empirical result underscores the value of deep semantic alignment across video, text, and user domains in accurately forecasting social media video popularity.

Abstract:
Recent years have witnessed an unprecedented growth of multimodal data in healthcare, ranging from distributed sensors and medical imaging devices (MRI, CT, X-rays) to digital health platforms that integrate audio, video, 3D geometry, and clinical text. The increasing availability of such data presents significant opportunities for computer-aided diagnosis and intelligent healthcare solutions, yet also poses substantial challenges in multimodal integration, large-scale analysis, and real-world deployment. The 2nd International Workshop on Multimedia Computing for Health and Medicine (MCHM'25), held in conjunction with ACM Multimedia 2025, focuses on advanced multimedia computing techniques, including mobile and hardware solutions, for tackling real-world problems in healthcare. The workshop brings together researchers and practitioners in multimedia computing, artificial intelligence, and medicine to explore emerging methods, applications, and systems that have a direct impact on human health.

Abstract:
The evolving media landscape increasingly demands immersive, non-linear formats supported by innovative tools for content creation and distribution. The Horizon Europe XReco project addresses this need by providing a unified, data-driven ecosystem for next-generation media production, with a focus on extended reality (XR) and virtual production. The platform integrates ingestion of diverse media types (text, images, audio, video, 3D), cross-modal search, 3D content creation, sharing, and monetization. Central to XReco is a metadata-driven ingestion system that overcomes archive fragmentation by enabling efficient organization of and access to content from sources such as broadcasters, online news, and open repositories. This capability was demonstrated through a short TV documentary on Guglielmo Marconi, created using historical materials assembled via the XReco platform. The platform's Orchestrator module provides cross-modal semantic search, leveraging neural descriptors to enable queries across different media formats. Editorial teams can retrieve relevant content by searching with keywords such as "telegraph" or perform reverse image searches to identify and contextualize visual assets such as images and 3D models. This unified search functionality significantly enhances content discovery and reuse. A major innovation of the platform is a set of tools for enhancing the quality of the ingested content and for generating 3D models from 2D assets using state-of-the-art techniques (video super resolution, blind face restoration, NeRF, Gaussian Splatting, Structure from Motion). These services are accessible and tunable via a unified interface, which provides a streamlined user experience and hides the complexity of the underlying technologies. For the Marconi documentary, detailed 3D models of key technological artifacts were created, enabling viewers to interactively explore these objects from multiple perspectives.
XReco also supports seamless integration with third-party tools to enrich production workflows. Our documentary incorporated photorealistic digital avatars created with Unreal MetaHuman, animated via motion capture, and featured holoported human experts alongside real presenters within dynamic virtual environments. A noteworthy example is the virtual reconstruction of the RAI Radio Museum in Turin based on Gaussian Splatting, in which avatars from remote locations are developed with Unity and rendered using 4D Gaussian Splatting and Free Viewpoint Video (FVV) technologies. Compatibility with platforms such as Unity and Unreal Engine further facilitates the creation of visually compelling XR experiences. In summary, the XReco platform represents a robust end-to-end solution that effectively tackles the technical and commercial complexities of modern XR and virtual production, paving the way for innovative storytelling in the evolving media ecosystem. During the demo, attendees will be given a walkthrough of the platform's functionalities, highlighting key technologies for content search, filtering, and processing. They will also be able to watch a short documentary about Guglielmo Marconi, produced by our editorial team using XReco technology. After the walkthrough, attendees will have the opportunity to interact directly with the XReco platform and explore its features hands-on, such as testing the search capabilities, creating 3D assets, and experimenting with other available tools. This will provide a more engaging and comprehensive experience of the platform's functionalities. Link to the video: https://drive.google.com/drive/folders/15XTkg-x1U62hQ2dRo2ABiT3LG94CJXO5

Abstract:
The integration of AI/ML technologies into medical imaging is revolutionizing radiology, offering transformative benefits in clinical workflows. AI-powered Software as a Medical Device (SaMD) solutions not only reduce workload and optimize image interpretation but also unlock critical insights previously undetectable by human eyes, catching the unseen and enabling earlier, more accurate diagnoses. Lung cancer, the leading cause of cancer-related mortality worldwide, is often diagnosed at a late stage, when curative treatment is no longer viable. Early detection is paramount. Traditional screening methods rely heavily on nodule size and growth as indicators of malignancy. However, these criteria alone are insufficient for identifying cancer at its earliest, most treatable stage. eyonis® LCS, the flagship clinical development program of Median Technologies, represents a next-generation AI/ML-based SaMD designed specifically for lung cancer screening [1][2]. It combines Computer-Aided Detection (CADe) and Computer-Aided Diagnosis (CADx) [3][4] capabilities to support clinicians in identifying malignant nodules with greater precision. By leveraging specific architectural choices and deep learning models, eyonis® LCS enhances diagnostic accuracy beyond the current standard of care [5], offering a paradigm shift in early lung cancer detection [6]. This presentation will delve into some of the architectural foundations of eyonis® LCS, highlight its clinical impact, and demonstrate how it empowers radiologists to diagnose lung cancer when patients can still be cured. Through this pioneering technology, Median Technologies is redefining the future of cancer screening and patient outcomes.

Abstract:
Live-cell imaging is a powerful tool for studying dynamic subcellular processes by capturing the spatiotemporal organization of the biological microenvironment. However, limitations due to phototoxicity and photobleaching prevent microscopes from achieving high frame rates and high-quality images. Although current deep learning methods can enhance both frame rates and image resolution without compromising cell health, they often overlook the continuity of subcellular trajectories, which leads to discontinuous temporal modeling. They also incur prohibitive computational costs due to exhaustive correlation computation, which hinders real-time applications. To address these issues with high efficiency, we propose Trajectory Space-Time Super-Resolution (T-STSR), a method designed to boost frame rates and resolution in fast subcellular imaging while significantly reducing computational overhead. Our approach incorporates Spatial-Temporal Trajectory Modeling (STTM), which learns a state-space model over spatiotemporal slices to reconstruct particle trajectories at low cost. In addition, our novel Trajectory-Aware Loss randomly subsamples trajectory data during training, promoting continuous trajectory representation and mitigating noise with minimal additional computation. We validated T-STSR on both synthesized and real-world datasets with various particle types and noise conditions, demonstrating that our method achieves superior restoration results while reducing inference time by 75% compared to the previous SOTA model.

Abstract:
Transparent object segmentation from a single image has been investigated for several years. However, detecting transparent areas in video has not been well explored, especially for transparent categories other than glass, due to the scarcity of suitable datasets. Therefore, in this paper, we propose the video-based transparent object segmentation task and introduce the first corresponding dataset, named TransVid, which contains nearly 400 videos with a total of 18,523 frames. Based on TransVid, we further propose a new method called TranSeg, in which we introduce Graph Neural Networks into the temporal segmentation task and combine them with a novel Diffusion Model to make the model's segmentation results more accurate. Experimental results show that TranSeg achieves higher accuracy with fewer parameters than previous state-of-the-art models, demonstrating the effectiveness of our method. Moreover, comprehensive ablation analyses reveal several insights and suggest viable paths for further research.

Abstract:
Event cameras, as emerging bio-inspired sensors, endow us with a unique scene perception capability with sub-millisecond latency in challenging environments, such as those with high dynamic range and motion blur; however, a plausible yet efficient exploration of the spatiotemporal characteristics of sparse, asynchronous event data remains an open problem. Event-based pedestrian detection, considered a promising alternative for road safety in autonomous driving, is chosen as the testbed in this paper for pursuing a specific event-tailored spatiotemporal model. Note that heterogeneous architectures are generally used in the literature, such as a CNN/Transformer-style model for capturing spatial features combined with an RNN/LSTM model for mining temporal coherence. However, existing methods still face significant limitations, particularly when deployed in multi-rate dynamic environments characterized by pronounced sparsity patterns in slow-motion or other scenarios. As such, a homogeneous neural network for robust pedestrian detection is proposed, with the Event-tailored Recurrent Spatiotemporal State-Space Module (ERS3M) as the core innovation, for joint, meticulous modeling of spatiotemporal sparsity and dynamics over event data. On the one hand, inspired by Vision Mamba, ERS3M is equipped with adaptive spatiotemporal state propagation as well as multi-directional compensatory scanning, enabling elaborate detection even for observation intervals in which extremely few events are triggered. On the other hand, ERS3M is augmented with an additional block termed Temporal-Entropy Synergy, offering a collaborative spatiotemporal event purification mechanism that enhances the probabilistic credibility of event streams in visual semantics given their complicated dynamics. Finally, ERS3M ends with an aliasing-alleviated S5 block to pass information between consecutive time steps, facilitating temporally consistent pedestrian detection.
Evaluations on the PEDRo dataset demonstrate that the proposed detection method, with ERS3M as its backbone, achieves performance comparable or even superior to state-of-the-art approaches in terms of both accuracy and efficiency.

Abstract:
Light Field (LF) semantic segmentation relies on leveraging redundant information across multiple views to assign a semantic label to each pixel of the central view. Recent approaches typically feed the views into a pre-trained backbone and utilize an estimated depth map to aggregate semantic representations for label prediction. However, these methods ignore the correlation between the structural cues encoded in the LF and the semantic labels. On the one hand, it is challenging to identify matching points for regions that are occluded in some views. This broken view consistency emphasizes object edge localization, facilitating more precise edge labeling. On the other hand, the depth continuity of a single object ensures semantic consistency in adjacent regions. Therefore, effectively extracting structural cues and integrating them into semantic segmentation are key to LF semantic segmentation. In this paper, we propose an Epipolar Consistency-based network for structure-aware LF semantic segmentation, termed ECNet. First, we explore the epipolar consistency between views to characterize the edge and depth cues of the input. Based on the embedded edge information, we design an edge-semantic correlation transformer to generate fine-grained representations of object edges. Furthermore, the proposed depth-semantic correlation transformer maps the semantic features of one object closer together according to the depth information. Extensive experiments demonstrate that ECNet achieves state-of-the-art performance, reducing computational cost by 33.3% (in terms of FLOPs) while maintaining high segmentation accuracy.

Abstract:
Multi-view classification has demonstrated its ability to integrate diverse sources of information to significantly boost classification accuracy. To further enhance the reliability of these results, trusted multi-view learning methods have been developed. However, these approaches are designed for closed-set scenarios and fail when novel or unknown categories appear in open-world contexts. To address this limitation, we introduce the concept of Open Multi-View Learning, with the objective of detecting unknown categories with low confidence scores. We propose a Trusted Open-World Multi-View Classification method for this problem. Specifically, we employ subjective logic to measure the uncertainty of data views. On these grounds, we propose a dynamic opinion aggregation strategy based on the views' uncertainty measurements and theoretically prove that this strategy can effectively detect unknown multi-view categories. An inter-view opinion consistency regularization is also adopted to mitigate conflicts between views. Experiments conducted on various multi-view datasets validate the reliability and robustness of our method.
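
As a rough illustration of how subjective logic turns per-view evidence into an uncertainty score, the sketch below uses the standard Dirichlet-based formulation (belief b_k = e_k / S and uncertainty u = K / S, with S the sum of e_k + 1 over K classes). This is an assumption about the general technique, not the authors' implementation; the function name and the example evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (beliefs, uncertainty).

    Standard subjective-logic mapping (assumed here, not taken from the paper):
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


# Hypothetical evidence vectors for one view over three known classes.
confident_beliefs, confident_u = opinion_from_evidence([40.0, 1.0, 1.0])
unsure_beliefs, unsure_u = opinion_from_evidence([0.2, 0.1, 0.3])
```

A view with little total evidence ends up with uncertainty close to 1, which is the kind of signal an open-world method could use to flag a sample as belonging to an unknown category.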

Abstract:
Open-vocabulary object detection seeks to recognize objects from arbitrary language inputs, extending detection beyond fixed training categories. While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories during the inference stage, hindering practical deployment in open-world scenarios. To overcome this crucial limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. It not only excels at open-vocabulary object detection but is also capable of generating labels for target objects in the absence of predefined vocabularies, and can be adapted to a broad range of vision-language tasks simply by modifying the language instructions. UniPerception seamlessly integrates three key innovations: 1) a robust visual detector trained on diverse data sources to capture rich and generalizable visual representations; 2) a language model with interleaved cross-modality fusion layers to interpret instructions and generate fine-grained responses conditioned on visual features; and 3) a tailored multi-stage training strategy that effectively bridges detection-specific learning with general vision-language understanding. We conduct extensive experiments on multiple benchmarks for open-vocabulary object detection (COCO, LVIS, ODinW), referring expression comprehension (RefCOCO/+/g, D3), and vision-language understanding (Flickr30k, VQAv2, GQA). The results show that UniPerception achieves strong open-world generalization and multi-modal understanding, outperforming the existing state-of-the-art methods and establishing itself as a unified, instruction-driven perception system.

Abstract:
Multi-modal learning that combines whole-slide images (WSIs) and genomic data has recently emerged as a promising paradigm for improving cancer survival prediction. However, existing methods either utilize genomic data as guidance to integrate WSI features or treat both modalities as equally important across all patients, overlooking individual variations in modality importance. As critical survival-related features can reside in different modalities for different patients, prioritizing the modality with more discriminative information for each patient, referred to as individual modality preference, is crucial for enhancing prediction accuracy. In this paper, we propose a novel Individual PREference-aware Multi-modal CooperatIon framework for Survival PrEdiction (PREMISE), which combines a uni-modal and a cross-modal preference learner to fully exploit individual modality preference. Specifically, the uni-modal preference learner adopts a task-aware preference estimator to dynamically assess the importance of each modality for each patient, thereby identifying the preferred modality for each input individual. To promote cross-modal learning, the cross-modal preference learner embeds the obtained preferences as biases to construct a preference-aware mutual-attention module, enabling individually adaptive focus and interactions between modalities. Meanwhile, inspired by clinical practice where doctors reference prior cases for survival evaluation, we introduce dual-level cross-modal alignment, incorporating both patient-level and group-level preferences. This alignment emphasizes the more discriminative modality and improves risk group separation during cross-modal knowledge transfer. Experiments validate the superiority of our method.

Abstract:
Sonar image recognition is a key technology in underwater exploration systems. Compared with natural images, sonar images have fewer texture details and are easily affected by heavy noise, making it more challenging for specialists to distinguish the subtle differences among classes. In view of this, studying fine-grained classification methods for sonar images with scarce annotations is of significant importance. To address this issue, we propose a Physics-Guided Teacher-Student (PGTS) framework to explore the unique physical information of sonar images while simultaneously mitigating the effects of limited annotations. First, PGTS reconstructs sonar signals through physical simulation and a specially designed physics-guided feature generation module, which allows it to bypass the time-consuming physical simulation during inference. Then, we design a multi-modal teacher model that combines the reconstructed sonar signals and sonar images to extract discriminative features and generate robust pseudo labels for fine-grained target categories. Finally, the knowledge is transferred to a single-modal student model through a consistency loss. Under the joint constraints of the teacher model and the reconstructed sonar physical signals, the student model continuously improves its performance in annotation-scarce scenarios. Notably, when merely 1% of the data is labeled, our method outperforms other state-of-the-art approaches by 12.46% in terms of accuracy.

Abstract:
Recently, as diffusion models have shown great potential in image generation, many image composition methods based on pretrained diffusion models have been proposed for image illumination harmonization. However, they face two key challenges: 1) effective preservation of the foreground appearance (e.g., content structure and texture details); and 2) reasonable generation of the shadow cast by the foreground. To this end, we propose a novel Image Illumination Harmonization Diffusion model, called I2HDiffuser, to achieve image illumination harmonization with high-fidelity foreground appearance and reasonable cast shadows. I2HDiffuser mainly consists of a frequency domain feature enhancement branch (FDFEB) and an illumination-shadow consistency generation branch (ISCGB). Specifically, FDFEB first introduces the Wavelet Transform Module (WTM), which decomposes composite image features into low-frequency (e.g., illumination features) and high-frequency (e.g., texture and content structure features) components using the Haar wavelet transform. Then the Multi-Condition Guidance Mechanism (M-CGM) is proposed to let these components interact as prior conditions, which are further injected into the ISCGB with a noise-to-denoise process to guide high-fidelity content and background illumination-aware foreground regeneration. Meanwhile, a shadow mask step-wise iterative optimization strategy is introduced into the ISCGB to explicitly provide a reasonable shadow generation space for foreground objects. Extensive experiments on the public image harmonization datasets DESOBAv2 and iHarmony4 and the real illumination harmonization dataset IH-SG show that I2HDiffuser achieves superior performance.
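
The low-/high-frequency split that a Haar wavelet transform provides can be sketched as follows. This is a generic single-level 2D orthonormal Haar decomposition on a plain grid of numbers, not the WTM described above; the function name and the band names are hypothetical.

```python
def haar_2d(image):
    """Single-level orthonormal 2D Haar transform of an even-sized grid.

    Returns (approx, horiz_diff, vert_diff, diag_diff); each band has half
    the input height and width. `approx` is the low-frequency band, the
    other three are high-frequency detail bands.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    approx, horiz_diff, vert_diff, diag_diff = [], [], [], []
    for r in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block [[p, q], [s, t]] is mapped to four coefficients.
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((p + q + s + t) / 2)  # scaled local average
            h_row.append((p - q + s - t) / 2)  # left-minus-right difference
            v_row.append((p + q - s - t) / 2)  # top-minus-bottom difference
            d_row.append((p - q - s + t) / 2)  # diagonal difference
        approx.append(a_row)
        horiz_diff.append(h_row)
        vert_diff.append(v_row)
        diag_diff.append(d_row)
    return approx, horiz_diff, vert_diff, diag_diff


# A perfectly flat patch has no high-frequency content.
flat = [[3.0] * 4 for _ in range(4)]
flat_bands = haar_2d(flat)

# A small textured patch spreads its energy across the bands.
patch = [[1.0, 2.0], [3.0, 4.0]]
patch_bands = haar_2d(patch)
```

Because the transform is orthonormal, the sum of squared coefficients across the four bands equals the sum of squared input values, so no information is lost by the split.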

Abstract:
Structural variant (SV) calling plays a critical role in understanding genome diversity and disease mechanisms. Although deep learning techniques have been increasingly applied to SV identification, existing general-purpose models still face significant challenges, including incomplete extraction of alignment signals, limited accuracy and efficiency, and poor performance in highly polymorphic or structurally complex genomic regions. These limitations lead to suboptimal detection accuracy in current SV callers. In this work, we present MMF-SV, a multi-modal feature fusion-based model (MMF) for SV calling. MMF-SV integrates matching patterns and statistical information from CIGAR signals with textual features extracted from alignment information, enabling comprehensive representation of diverse SV signals. We trained MMF-SV using CLIP, and the trained model achieved over 96% F1 score for classifying various types of variations. We validated the stability and robustness of the MMF-SV model through 5-fold cross-validation. Compared to existing long-read SV callers, MMF-SV achieves higher accuracy and can be effectively integrated with them to significantly reduce the number of false positives in the calling results.

Abstract:
Creating digitized hand-object interaction scenes plays a crucial role in recent advancements, enabling viewers to understand how human dexterity influences and shapes the world. In this paper, we present HandCraft, a framework designed to capture and render hand-object interactions with exceptional precision and realism. Our Gaussian models build on digital representations of hands, objects, and scenes, derived from data captured using multi-modal sensing systems. By combining motion capture with IMU-based data gloves equipped with tactile sensors, HandCraft ensures precise hand pose tracking and reliable contact fidelity. HandCraft includes a novel method that uses hand motions to resolve object occlusions, effectively reconstructing missing interaction details. For enhanced physical feasibility, HandCraft incorporates optimization techniques to resolve object penetration issues and enforce temporal consistency. Using these techniques, we introduce a high-quality dataset of hand-object interaction sequences, featuring complex and prolonged daily activities. This dataset demonstrates HandCraft's ability to capture and reproduce subtle, dynamic interactions in rich detail. HandCraft holds promise for creating realistic virtual environments and advancing world modeling in both graphics and robotics research.

Abstract:
Recently, multimedia data analysis based on non-negative tensor factorization (NTF) has become a hot research topic, but existing methods mainly focus on 2-factor factorization and cannot effectively explore the complex structures hidden in multimedia data, especially graph multimedia data. In this paper, a detailed analysis of the 3-factor NTF X = U C G^T is provided. Specifically, constrained 3-factor NTF provides new features compared with constrained 2-factor NTF. We study the bi-orthogonal constraint because it leads to a rigorous clustering interpretation. We then apply it to multimedia data label learning and produce a novel co-multi-view label learning method based on bi-orthogonal 3-factor NTF. Extensive experiments show the capability of bi-orthogonal 3-factor NTF to simultaneously cluster the anchors and samples of the input data matrix.

Abstract:
Incomplete multi-view multi-label learning faces significant challenges arising from semantic heterogeneity across modalities and incomplete modality availability. Traditional fusion approaches typically emphasize superficial feature alignment, neglecting high-order semantic interactions among modalities and labels, thus resulting in redundant or conflicting information integration. To address these limitations, we propose a novel Label Semantic Guided Adaptive Fusion framework. Specifically, we leverage pretrained language models to generate semantic embeddings for both multi-view data and associated labels, facilitating unified semantic understanding. Subsequently, we construct dual-domain hypergraphs separately within the modality and label semantic spaces to explicitly model complex high-order semantic correlations. Based on these hypergraphs, we employ hypergraph neural networks to mine intrinsic semantic relationships and dynamically assess semantic consistency between each modality and the label space. Finally, an adaptive weighting strategy guided by this semantic consistency measure is introduced to fuse modalities effectively, assigning higher weights to modalities with greater semantic alignment. Extensive experiments demonstrate that our LSGMM improves fusion accuracy and robustness over state-of-the-art IMvML methods, confirming the effectiveness of integrating label semantics and high-order semantic relationships into adaptive multi-view fusion.

Abstract:
Incomplete multi-view clustering (IMVC) remains a challenging problem, as missing views significantly hinder the learning of comprehensive and consistent representations. Existing imputation-based approaches often rely on features of neighboring samples to perform view imputation, which can lead to high reconstruction errors and limited semantic integration. To address these challenges, we propose a novel framework, Deep Variational Incomplete Multi-View Clustering with Information-Theoretic Guidance (DVIMC-ITG). Our approach employs a deep variational autoencoder (VAE) to learn shared latent representations and addresses view missingness by leveraging a mixture of Wasserstein barycenters, effectively capturing the joint distribution of multiple views in a unified latent space. To preserve cross-view consistency while minimizing redundancy, we impose an information-theoretic constraint on the view-specific representations. We formulate a robust Evidence Lower Bound (ELBO) that guides the optimization process toward more informative representations and improved clustering performance. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art incomplete multi-view clustering methods in both clustering accuracy and robustness.

Abstract:
Incomplete Multi-view Clustering (IMvC) aims to perform effective clustering in the presence of missing views by exploiting the available information. While many existing approaches demonstrate satisfactory performance, their failure to adequately optimize the recovered data often limits the quality of learned representations and thus hampers clustering performance. To address this challenge, we propose a novel method, Dual-Level Distribution Alignment for Deep Incomplete Multi-View Clustering (DDAIMVC). To effectively address missing data, DDAIMVC employs a fusion-fill strategy to recover incomplete views. The recovered data from each view are then concatenated and processed through an attention mechanism to generate a unified high-level representation. To ensure consistent information across views, the framework performs distribution alignment at both the instance and cluster levels. Specifically, instance-level distribution alignment is conducted by minimizing the maximum mean discrepancy among views, while cluster-level distribution alignment is enhanced via prototypical contrastive learning, which encourages coherent cluster assignments across different modalities. Through the co-optimization of dual-level distribution alignment, the common representation reveals a clear clustering structure. Experimental results on benchmark multi-view datasets demonstrate that DDAIMVC consistently achieves state-of-the-art clustering performance.
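
To make the instance-level alignment term more concrete, here is a minimal sketch of a squared maximum mean discrepancy (MMD) between the representations of two views. It uses the biased estimator with a Gaussian (RBF) kernel; the kernel choice, the bandwidth, the function names, and the toy data are all assumptions, since the abstract does not specify them.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D representations of the same three instances in two views.
view_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
view_b_near = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
view_b_far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]

mmd_same = mmd_squared(view_a, view_a)
mmd_near = mmd_squared(view_a, view_b_near)
mmd_far = mmd_squared(view_a, view_b_far)
```

The value is zero when the two sample sets coincide and grows as their distributions move apart, so using it as a loss term pulls the per-view representation distributions together.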

Abstract:
Leveraging pre-trained Vision-Language Models (VLMs) for downstream tasks has gained significant attention recently, particularly in Multi-Source Domain Adaptation (MSDA). However, most existing VLM-based MSDA approaches rely on domain-specific text prompts, which struggle to capture domain-invariant representations. In addition, multiple domains in MSDA introduce significant distribution discrepancies, complicating the design of effective text prompts. To address these challenges, we propose a Domain-aware Visual Context Prompt (DVCP) method, which leverages domain-level features to bridge the domain gaps. Specifically, we design a domain-aware text prompt (DTP) module that maps global visual information into the textual prompt embedding space, creating trainable text prompts that incorporate domain-level visual information. Then, we construct a domain-aware visual tuning (DVT) module that collaboratively leverages domain-level and instance-level features to align distributions across multiple domains. Extensive experiments conducted on four popular MSDA benchmarks, including Office31, ImageCLEF-DA, Office-Home, and DomainNet, demonstrate the superiority of the proposed method.

Abstract:
Non-rigid registration is essential for reconstructing dynamic and incomplete 3D human meshes, yet traditional methods often fail to achieve robust alignment in sequences with high-motion deformations and missing geometry. We propose a domain crossover non-rigid registration (DCNRR) framework that addresses these challenges by effectively transferring informative features from 2D image space into the 3D mesh domain through three key stages: multi-view projection, hierarchical non-rigid registration, and topology-consistent completion. In the first stage, multi-view projections are used to extract 2D joint locations and deep features, which guide deformation in 3D space. In the second stage, hierarchical joint priors and deep features collaboratively guide mesh alignment, enabling more accurate deformation in distal regions and complex poses. In the final stage, we apply a diffusion-based completion process in UV coordinates to reconstruct incomplete surface normals and refine missing mesh areas with topological consistency. Our approach achieves highly detailed and perceptually accurate mesh deformation. To validate our approach, we evaluate performance on a newly constructed dynamic human motion (DHM) dataset, as well as on public datasets. Our method demonstrates state-of-the-art results in both geometric accuracy and stability, showing particular robustness on dynamic and incomplete mesh sequences.

Abstract:
Few-shot parameter-efficient tuning methods demonstrate promising potential for Vision-Language (V-L) models in downstream tasks. However, existing approaches primarily focus on class-level alignment between image and text features, overlooking crucial instance-specific semantic information. This limitation leads to suboptimal performance on challenging tasks and restricted generalization capability to unseen data. To address these issues, we propose Class- and Instance-aware Adaptation (CIA), a novel framework that simultaneously optimizes both class-level and instance-level alignments. Specifically, CIA introduces a novel instance encoder that leverages cross-modal self-attention to generate instance-specific text features, accompanied by a carefully designed regularization mechanism to maintain consistency between class-level and instance-level representations. Extensive experiments across 15 benchmark datasets demonstrate that CIA significantly improves the downstream adaptation of V-L models.

Abstract:
Despite the effectiveness of Segment Anything Model (SAM) based methods in Few-Shot Segmentation (FSS) tasks, our closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. Consequently, they suffer from semantic granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces an innovative multi-modal prompt encoder, enabling SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. This approach generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide enhanced representations of the target class, capturing both abstract semantics and specific details, while ensuring granularity appropriateness. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, the overall model is referred to as MM-Prompt. To validate its effectiveness, we conducted extensive empirical studies on the PASCAL-5^i and COCO-20^i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance in FSS tasks, highlighting its substantial potential and value in this domain.

Abstract:
Out-of-distribution (OOD) detection is crucial for safe ML deployment, yet neural networks often exhibit overconfidence on unseen data. While activation norms provide useful OOD signals, they remain vulnerable: OOD inputs can artificially inflate norms through sparse, high-magnitude activations, while valid in-distribution samples with moderate norms may be misclassified. We propose that activation distributional shape, not just magnitude, is essential for robust detection. Our method, Activation Norm and Entropy Weighting (ANEW), combines L2-norm (strength) with Shannon entropy (spread) to distinguish between genuine in-distribution patterns and OOD samples, including adversarial examples mimicking high norms via low-entropy spikes. ANEW requires only a single forward pass without retraining, making it highly practical. Extensive experiments across diverse architectures and benchmarks show ANEW significantly outperforms norm-only baselines, reducing both false positives and false negatives in challenging scenarios. Code available upon acceptance.
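
The idea of pairing activation strength with activation spread can be illustrated with a minimal sketch. The abstract does not state how ANEW weights the two quantities, so the product used in `norm_entropy_score` below is an assumption made only for illustration, as are all function names. Activations are assumed to be non-negative (for example, post-ReLU).

```python
import math

def l2_norm(activations):
    """Strength of the activation vector."""
    return math.sqrt(sum(a * a for a in activations))

def shannon_entropy(activations, eps=1e-12):
    """Entropy (in nats) of the activations normalised to sum to one.

    Negative values are clipped to zero before normalising.
    """
    clipped = [max(a, 0.0) for a in activations]
    total = sum(clipped)
    if total <= eps:
        return 0.0
    probs = [c / total for c in clipped]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def norm_entropy_score(activations):
    """Hypothetical combined score: L2 norm scaled by normalised entropy.

    This weighting is an assumption, not the rule used by ANEW.
    """
    n = len(activations)
    max_entropy = math.log(n) if n > 1 else 1.0
    return l2_norm(activations) * shannon_entropy(activations) / max_entropy

# Two vectors with the same L2 norm but very different shapes.
spread = [1.0] * 16          # moderate values across all units
spiky = [4.0] + [0.0] * 15   # one high-magnitude, low-entropy spike

print(l2_norm(spread), l2_norm(spiky))
print(norm_entropy_score(spread), norm_entropy_score(spiky))
```

A norm-only detector cannot separate `spread` from `spiky`, whereas any score that also accounts for entropy ranks the spiky vector lower.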

Abstract:
Low-resolution (LR) depth maps captured by depth sensors often suffer from structural distortions, noise, and blurring, limiting their practical usability. While most existing depth super-resolution (DSR) methods rely on synthetic datasets, they fail to accurately model real-world degradations, leading to poor performance on real-world data. To address this limitation, we identify two key challenges in the real-world DSR task: structural contour inconsistency and regional degradation inconsistency. The former arises from structural distortions in LR depth maps, while the latter stems from varying degradation levels in smooth regions. Building on this, we propose a Semantics-Driven Contrastive Learning (SDCL) pipeline for real-world DSR, leveraging semantic priors from the SAM model to enhance structural contour reconstruction and region-wise degradation handling. We introduce two novel contrastive loss functions: Structural Contour Alignment (SCA) loss, which aligns depth contours with semantic boundaries, and Regional Degradation Discrimination (RDD) loss, which optimizes smooth region restoration through region-level contrastive learning. Our approach is model-agnostic and can be seamlessly integrated into existing DSR frameworks. Experiments demonstrate that our method significantly enhances DSR performance on real-world DSR datasets.

Abstract:
Open-ended visual storytelling presents a formidable challenge for current text-to-image models, which frequently struggle to preserve both narrative coherence and consistent character depictions across generated sequences. To address this, we introduce StoryCrafter, a multi-character diffusion model that leverages a novel instance-level cross-attention module with supervised fine-tuning to ensure precise text-character alignment and consistent multi-character interactions throughout the narrative. Further, we propose Direct-Diffusion Group Relative Policy Optimization (D2GRPO), a novel RLHF stage that optimizes denoising strategies using automated story-aligned rewards, selecting the best candidate frames from a generated group. We evaluate our approach through human assessments and vision language model (VLM) scoring, measuring text-to-image alignment, style and character consistency, and fine-grained detail quality. Experiments on three benchmarks demonstrate that StoryCrafter outperforms existing methods, achieving 7% improvements in storytelling consistency and 10% in character accuracy, while outperforming baselines in both human and VLM evaluations.
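
The group-relative scoring that the abstract alludes to can be sketched as follows. This shows the standard group-relative advantage normalisation (each reward is centred and scaled by its group's mean and standard deviation) together with best-candidate selection. Whether D2GRPO uses exactly this normalisation is not stated in the abstract, so treat it as an assumption; the function names and reward values are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Centre and scale rewards within one group of candidates."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def best_candidate(rewards):
    """Index of the highest-reward candidate in the group."""
    return max(range(len(rewards)), key=lambda i: rewards[i])

# Hypothetical story-alignment rewards for four candidate frames.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

print(advantages)
print(best_candidate(rewards))
```

Because the advantages are relative to the group, a candidate is reinforced only when it scores better than its siblings generated from the same prompt.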

Abstract:
Learning with Noisy Labels (LNL) reduces reliance on high-quality labeled data but often overlooks open-set noise, where noisy samples belong to unknown classes, unlike closed-set noise within known categories. This paper advances LNL by reformulating the problem to incorporate open-set noise through a complete noise transition matrix, enabling a theoretical comparison of its impact on classification error rates against closed-set noise. Our analysis reveals that open-set noise induces smaller error increases, with distinct effects from 'hard' (semantically similar to inliers) and 'easy' (dissimilar) variants. We evaluate entropy-based detection, finding it effective only for easy open-set noise, and propose solutions leveraging vision-language models and self-supervised learning to address hard noise challenges. For empirical validation, we introduce CIFAR100-O, ImageNet-O, and a WebVision open-set test set, enabling robust benchmarking of LNL methods under open-set noise conditions. Recognizing classification accuracy's limitations in capturing model robustness, we advocate out-of-distribution (OOD) detection as a complementary metric. Our theoretical and empirical results highlight the unique challenges of open-set noise, offering new tools and evaluation frameworks to enhance LNL robustness in real-world scenarios.

Abstract:
Large Language Models (LLMs) can significantly enhance the prompting capabilities of Vision-Language Models (e.g., CLIP) by generating detailed and comprehensive prompts. However, LLMs are prone to generating hallucinated text prompts, leading to misalignment between visual and textual representations in the semantic space. Moreover, sparse sampling of images in low-shot scenarios often leads to incomplete or biased concept representations between the two modalities, as the limited data struggles to capture the full range of concept attributes. To this end, we propose Chain-of-Thought Guided Low-shot Debiasing (CoLD), which addresses two key challenges: the textual biases caused by hallucination and the visual biases due to insufficient discriminative features. Specifically, a multi-granularity chain-of-thought (CoT) prompting strategy is proposed that reduces the negative impact of irrelevant textual information, thereby effectively mitigating hallucinations and alleviating the issue of textual bias. Additionally, we propose visual anchoring that utilizes CoT prompts with class attributes and characteristics to generate auxiliary visual features. The anchors optimize the existing semantic space by introducing additional visually-grounded concepts, thereby mitigating visual biases and enhancing low-shot performance. Extensive experiments demonstrate the effectiveness of the proposed method, showcasing notable performance improvements across various datasets.

Abstract:
In this paper, we present a novel Self-Supervised Learning (SSL) framework tailored for Multi-View Clustering (MVC), which learns cross-view semantic representations with clear clustering boundaries and derives balanced clustering in an end-to-end manner. Concretely, we propose a generative SSL module that learns high-level semantic representations by recovering randomly masked views from observed views. Then the extracted representations are unified via a sample-level local fusion mechanism and projected into a unit-hypersphere space with evenly distributed cluster prototypes such that the pseudo labels can be directly retrieved using cosine similarity. For each sample, we define highly credible positive pairs of the same cluster and negative pairs of different clusters and design a contrastive SSL module to force the sample to move toward its cluster prototype while farther from the other prototypes in the embedding space. Consequently, the representations exhibit clearer clustering boundaries, and the two SSL modules benefit each other. Finally, we further introduce a clustering regularizer to prevent trivial solutions and derive balanced clustering with theoretical guarantees. Comprehensive evaluations over eight benchmark datasets validate the effectiveness of our proposals against ten state-of-the-art MVC methods.
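
Retrieving pseudo labels by cosine similarity to prototypes on a unit hypersphere can be sketched as below. This is a minimal illustration, not the paper's implementation: the two-dimensional embedding, the three evenly spaced prototypes on the unit circle, and all function names are assumptions chosen to keep the example small.

```python
import math

def normalize(vector):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_pseudo_label(embedding, prototypes):
    """Return the index of the most similar prototype and all similarities."""
    z = normalize(embedding)
    sims = [sum(a * b for a, b in zip(z, normalize(p))) for p in prototypes]
    label = max(range(len(sims)), key=lambda k: sims[k])
    return label, sims

# Three prototypes spaced evenly (120 degrees apart) on the unit circle.
prototypes = [
    [math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
    for k in range(3)
]

label, sims = cosine_pseudo_label([0.9, 0.2], prototypes)
print(label, sims)
```

With evenly distributed prototypes, every direction on the sphere has a unique nearest prototype except on the boundaries between clusters, which is what makes direct label retrieval possible.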

Abstract:
Click-through rate (CTR) prediction lies at the heart of the online advertising ecosystem and recommendation systems, helping to improve user engagement and platform revenue. With the advent of Pretrained Language Models (PLMs), researchers focus on incorporating textual features to enhance semantic understanding in CTR prediction. However, existing methods typically aggregate a wealth of textual features and encode the informative text into a single semantic embedding. This mechanism leads to entangled embedding that fails to capture fine-grained feature interactions, ultimately limiting CTR prediction performance. To address this issue, we propose Multi-faceted Semantic Disentanglement for CTR prediction (MSD-CTR), a novel framework designed to disentangle and leverage multi-faceted textual information. MSD-CTR consists of two key components: Disentangled Semantic Topic Model (DSTopic) and Topic Guided Disentangled Representation Learning (TopicDRL). DSTopic employs a disentangled generative process to extract multi-faceted knowledge from the entangled textual information. Meanwhile, TopicDRL integrates the extracted multi-faceted knowledge into CTR prediction and introduces two alignment losses to guide disentangled semantic embedding learning. Extensive experiments on four real-world datasets demonstrate that MSD-CTR outperforms existing CTR models, highlighting the effectiveness of disentangling textual information for better click-through rate prediction.

Abstract:
Combinatorial Medication Recommendation (CMR) based on multimodal Electronic Health Records (EHRs) is a promising yet challenging frontier in AI-driven healthcare. Existing approaches usually rely on feature extraction from individual modalities without explicitly aligning information across different data sources. As a result, they may ignore complementary information from other modalities, leading to suboptimal representations for CMR. To this end, we propose MedAlign, a novel combinatorial Medication recommendation framework with multi-modality Alignment. Specifically, we first design a distribution-aware multimodal medication alignment module. This aligns distinct modality distributions of medications within a unified latent space, generating consistent medication representations. Furthermore, we introduce a longitudinal multi-view patient aggregation module, which aggregates the historical visits of patients with multi-view information to form informative patient representations. Finally, we propose a combinatorial medication recommendation module, enabling an accurate and safe medication recommendation combination for each patient. Extensive experiments on two real-world multimodal EHR datasets demonstrate the effectiveness of our MedAlign.

Abstract:
Multi-view clustering (MVC) has gained extensive attention for its capacity to handle heterogeneous data. However, current autoencoder-based MVC methods suffer from a limitation: the embedding space exhibits severe imbalances in the efficacy of feature directions, creating a long-tailed singular value distribution where a few directions dominate. To mitigate this, we introduce a novel Activate-Then-Eliminate Strategy for Multi-View Clustering (AEMVC), inspired by the observation that balanced feature directions can enhance the discrimination of learned representations. AEMVC dynamically adjusts the contributions of different feature directions through two key components: a Feature Activation Module that narrows singular value discrepancies to prevent dominant directions from controlling clustering decisions, and an Inter-view Mutual Supervision strategy that filters redundant information by adaptively determining view-specific thresholds based on cross-view consistency. By activating more feature directions and eliminating each view's adverse factors, AEMVC achieves more balanced and discriminative embedding representations. Extensive experiments on seven multi-view benchmarks validate AEMVC's effectiveness, demonstrating substantial improvements over state-of-the-art methods.

Abstract:
Due to data collection limitations and annotation reliability, the lack of multi-view data will weaken the comprehensive understanding of samples, and incomplete multi-view multi-label classification faces severe challenges. To address this problem, we propose a multi-view complementary learning framework MC-IVLC to explore the complementary information between views fully. Specifically, MC-IVLC proposes compensating for the collapse of reconstructed features and explicitly using fused features as supervisory signals to guide the completion of missing views. In addition, MC-IVLC fully utilizes the complementary relationship between views from both instance and semantic levels. Instance-level contrastive learning aims to promote the clustering of similar features in the same view to enhance the complementarity of cross-view features. Semantic-level contrastive learning utilizes pseudo-labels to infer missing labels in label embeddings. It combines pseudo-label semantic information with feature embeddings to guide the semantic relevance of cross-view features. Finally, MC-IVLC explicitly encodes view identity and introduces a view-label prediction loss term to enhance the perception of view information and align single views and multiple views, further exploring the intrinsic connection between views and labels. We conduct experiments on five widely used datasets. Experimental results show that MC-IVLC achieves excellent performance compared with state-of-the-art methods. Ablation studies further validate the effectiveness of each component.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) aims to enable the model to continuously learn new categories with only limited samples and maintain the recognition ability of old categories. However, this task faces two core challenges: catastrophic forgetting and overfitting due to data scarcity. Existing methods mostly rely on fine-tuning strategies or the transfer ability of vision-language models, which makes it difficult to fundamentally solve the problem of insufficient image samples. To this end, we propose an incremental learning method called Imagining Vision From Language (IVFL), which consists of a base session and multiple incremental sessions. In the base session, the model generates imagined visual features through language descriptions, learns the mapping relationship from language to vision, and estimates the feature distribution of each category; in the incremental sessions, the model jointly uses old-class pseudo-image features, new-class images, and language descriptions to perform effective category expansion while maintaining old-class knowledge. Our IVFL achieves an effective balance between new and old knowledge and achieves significant performance improvements on miniImageNet, CIFAR100, and CUB200 datasets, verifying its superiority in scenarios with data scarcity and knowledge retention.

Abstract:
Image de-reflection is a critical task in computer vision. Existing methods for de-reflection using monocular cameras face challenges due to the lack of depth cues to separate the transmission and reflection layers, particularly under strong illumination or multi-layer reflection scenarios. Although recent advances, such as 3D Gaussian Splatting (3DGS), utilize novel view-synthesis capabilities to separate transmitted and reflected layers, they still encounter difficulties in practice with monocular images. In this paper, we simplify the de-reflection task by combining dual-pixel (DP) technology with 3DGS, forming the first unsupervised de-reflection framework. Specifically, we propose the Dual-View Coordinated Reflection Removal (DCRR) Framework, which integrates depth cues from DP sensors with the rendering capabilities of 3DGS. The DCRR utilizes a dual-view approach that estimates the image transmission layer and opacity via differentiable rasterization with 3DGS and reconstructs the reflection layer through a lightweight multi-layer perceptron. We then present the Dual-Pixel-Driven Reflection Gaussian Pruning (DPRGP) to refine the separation process. By using the physical properties of DP sensors, DCRR achieves significant accuracy improvements in complex reflection scenarios. A real-world DP-based dataset that includes paired reflection/reflection-free images has been collected. Extensive experiments demonstrate our competitive performance compared to state-of-the-art de-reflection approaches.

Abstract:
Underwater scenes present significant challenges for modern 3D scene reconstruction techniques due to absorption, in-scattering, and out-scattering effects, which alter light transport and degrade reconstruction quality, especially under sparse-view conditions. We present AtlantisGS, a novel underwater scene reconstruction method, which only requires sparse-view inputs. It incorporates a scattering decomposition method that separates medium and object contributions during rendering, and a sparse Gaussian proliferation strategy that adaptively densifies the scene representation to improve structural accuracy. These components jointly enhance both geometric reconstruction and medium modeling, enabling accurate and efficient scene recovery with limited observations. Extensive experiments on real-world underwater datasets demonstrate that AtlantisGS outperforms existing NeRF- and 3DGS-based methods across various metrics. AtlantisGS achieves higher reconstruction fidelity with significantly fewer input views and real-time rendering capability. These results establish AtlantisGS as an effective solution for sparse-view underwater 3D scene reconstruction.

Abstract:
Deep neural networks on cloud platforms face growing security threats, with AI services increasingly relying on heterogeneous models for the same task to meet diverse user needs. Existing methods fail to distinguish benign modifications from malicious attacks in cross-model scenarios. To address this challenge, we propose a non-intrusive cross-model watermarking method that generates discriminative samples as universal keys, enabling authentication without altering model parameters or architectures. Specifically, we introduce a margin enhancement loss to amplify confidence gaps between benign and malicious behaviors, ensuring high transferability across models. Both theoretical analysis and experimental results demonstrate the high efficacy of our proposed method. The generated samples maintain high visual fidelity (SSIM > 0.99), achieve over 3 times higher discriminability than existing methods, retain over 93% accuracy under benign modifications, and detect malicious attacks with accuracy dropping below 9%. Overall, our proposed method provides a robust, transferable, and non-intrusive solution for cross-model authentication, making it ideal for real-world applications where security is critical.

Abstract:
Unified Anomaly Detection (UAD) aims to identify anomalies across diverse domains without access to target domain data during training. Unlike traditional anomaly detection methods that rely on training separate models for each domain, UAD employs a single model to generalize across multiple categories. A key challenge lies in the domain shift between seen and unseen data, which requires capturing invariant discriminative patterns between reference and query images across different domains during in-context learning for unified anomaly detection. To tackle this, we propose a novel UAD framework to learn the invariant discriminative patterns through pre-, in- and post-processing modules. First, a pre-processing VLM-guided data augmentation module generates diverse and semantically consistent images, followed by a latent-space filtering mechanism. Second, an in-processing Adaptive VQ memory module stores representative discriminative patterns to enable robust residual comparison. Third, a post-processing GUR (Geometric distributions Upgrade Representation) feature augmentation module models geometric feature distributions to synthesize informative prompts, improving the quality of feature delta estimation for anomaly scoring. Extensive experiments on benchmark datasets demonstrate that our method achieves superior generalization in detecting anomalies across unseen domains, outperforming existing state-of-the-art approaches.

Abstract:
Lately, the academic community has been showing growing interest in multi-domain fake news detection, and in particular, incorporating multimodal information into this field has emerged as a highly promising research direction. However, existing methods often struggle with: (1) insufficient intrinsic domain adaptation during representation learning; (2) amplified negative transfer from entangled domain style and content representations; and (3) neglect of domain-varying modality uncertainty. To address these issues, we propose Domain-Aware Prompt Tuning (DAPT), an innovative framework for multimodal multi-domain fake news detection. DAPT leverages Multimodal Prompt Tuning for parameter-efficient domain adaptation of pretrained models. An Adaptive Domain Debias Module disentangles domain features from veracity signals guided by content to mitigate negative transfer. Furthermore, inspired by the Variational Information Bottleneck, an Uncertainty-Aware Multimodal Fusion mechanism adaptively aggregates modalities based on domain-specific reliability. Extensive experiments demonstrate that DAPT significantly outperforms state-of-the-art baselines on benchmark datasets.

Abstract:
Perceptual hashing has garnered significant attention for its wide-ranging applications in image retrieval and authentication domains. However, existing algorithms often struggle to detect subtle manipulations confined to small regions of an image. In this paper, we introduce a novel framework, Manipulation-Aware Deep Perceptual Hashing (MADPHash), which leverages feature consistency to enhance sensitivity to such subtle manipulations. MADPHash explicitly treats tampered images as a distinct category, incorporates a tampering detection objective into the perceptual hash generation process, and employs a Consistency Constraint Module to amplify discrepancies between tampered and untampered regions. Comprehensive experiments conducted on five benchmark datasets demonstrate that MADPHash significantly improves the detection of subtle manipulations while maintaining robustness against content-preserving transformations, outperforming several state-of-the-art perceptual hashing methods.

Abstract:
We present NIVM, a lightweight and efficient view morphing framework that learns coordinate transforms between views, enabling real-time, user-controlled perspective shifts on resource-constrained devices. Unlike existing view interpolation methods that compromise visual quality or require high data overhead, NIVM integrates seamlessly into multi-view video streams as compact metadata per frame, enabling the synthesis of high-quality intermediate views and interactive transitions from sparse viewpoints. To avoid dependence on explicit 3D geometry, which may be unavailable, we introduce a dual-branch training strategy: a teacher network operates in rectified stereo space to supervise the morpher in the original image domain. By inheriting the monotonicity constraints of epipolar geometry, our morphing network produces visually plausible pixel flows while avoiding the reprojection artifacts prevalent in depth-based methods. Compared to recent pose-free sparse-view Gaussian Splatting approaches, NIVM achieves competitive results without the need to construct or transmit volumetric representations. Experiments show that NIVM achieves the lowest memory footprint, highest inference efficiency, and top-tier visual quality across multiple benchmark datasets.

Abstract:
Previous research on Multimodal Relation Extraction (MRE) has primarily focused on identifying textual relations enhanced by static visual clues from images, benefiting fields such as multimedia analysis and knowledge graphs. With the rapid rise of video content on social media platforms, MRE systems face new challenges. To bridge this gap, we introduce Video-level Multimodal Relation Extraction (VMRE), a novel task aimed at extracting relational facts from videos. To advance this research, we present Vid-MRE, a new dataset containing 32 relation types and 12,402 multimodal relational facts, annotated across 3,970 pairs of textual news titles and corresponding videos. Since this task demands precise event and entity grounding to filter out excessive noise in the video, we propose an Event-Entity Semantic Consistency Network (E2SCN) to capture relational clues in the video effectively. Experimental results demonstrate that incorporating video content into the model significantly improves relation identification performance but also introduces more noise. Our E2SCN method effectively reduces the noise, enhancing fine-grained multimodal event and entity alignments while achieving state-of-the-art (SOTA) performance.

Abstract:
End-to-end Multiple Object Tracking (MOT) frameworks integrate detection and tracking into a unified model, avoiding intermediate information loss and complicated post-processing. However, existing end-to-end MOT trackers rely on track queries of the previous frame to provide prior information. Their limited short-term temporal modeling struggles to cope with highly dynamic tracking scenarios, where inter-frame target variations exhibit significant heterogeneity. To address these shortcomings, we propose a scene-perception MOT framework (SP-MOT) that encodes scene context understanding into long-term embedding and adaptively complements it with short-term cues, enabling discriminative and flexible instance representations. Specifically, SP-MOT introduces: (1) a learnable scene query that globally profiles foreground and background to capture short-term scene-level features; (2) a context understanding module to uncover long-term stable relationships across dynamic scenes based on multiple historical scene features; (3) scene-adaptive augmented decoding that leverages scene information as guidance, adaptively aggregating long-term and short-term information into object embeddings, improving the model's association ability and fault-tolerance. Extensive experiments on MOT benchmarks demonstrate that SP-MOT outperforms state-of-the-art end-to-end trackers across multiple metrics, particularly in challenging scenarios with high dynamics.

Abstract:
Food nutrition assessment plays a crucial role in maintaining health, preventing diseases, and promoting scientific dietary habits. However, existing nutrition assessment methods often fail to fully consider the relationships between tasks, leading to limited overall performance. Specifically, these methods suffer from three major challenges: (1) task conflicts, where different tasks compete during joint optimization, leading to suboptimal overall performance; (2) varying training difficulties among tasks, leading to imbalanced learning and subpar model generalization; and (3) the small-scale and complex distribution of datasets, which limits the robustness of learned representations. To address these issues, we propose a novel method that reduces interference between tasks, dynamically focuses on more challenging tasks, and incorporates 3D spatial awareness to enhance multi-modal feature representation. First, we decouple the prediction network from the backbone and introduce a CAMTH (Cross-Attention-Based Multi-Task Head Module), effectively mitigating task interference and fully leveraging each task's learning potential. Second, we improve the loss function to adaptively focus on more challenging tasks, improving overall model performance. Third, we design a 3D-FEM (3D Feature Extraction Module) and MMFF (Multi-Modal Feature Fusion Module), enabling the model to fully exploit the spatial information of food and enhance the food's multi-modal feature representation. We validate our method through extensive experiments on the Nutrition5K dataset, comparing it with state-of-the-art (SOTA) models. The results show that our method achieves superior performance in nutrition estimation, demonstrating the effectiveness of our method.

Abstract:
The attention mechanism is the key to the success of transformers in different machine learning tasks. However, the quadratic complexity of the vanilla softmax-based attention mechanism with respect to the sequence length becomes the major bottleneck when applying it to long-sequence tasks, such as vision tasks. Although various efficient linear attention mechanisms have been proposed, they need to sacrifice performance to achieve high efficiency. Moreover, memory-efficient methods, such as FlashAttention-1-3, still have quadratic computation complexity, which can be further improved. In this paper, we propose a novel efficient linear fast attention (ELFATT) mechanism to achieve low memory input/output operations, linear computational complexity, and high performance at the same time. ELFATT offers 4-7x speedups over the vanilla softmax-based attention mechanism in high-resolution vision tasks without losing performance. ELFATT is FlashAttention friendly. Using FlashAttention-2 acceleration, ELFATT still offers 2-3x speedups over the vanilla softmax-based attention mechanism on high-resolution vision tasks without losing performance. Even in some non-vision tasks of long-range arena, ELFATT still achieves leading performance and offers 1.2-2.3x speedups over FlashAttention-2. Even on edge GPUs, ELFATT still offers 1.6x to 2.0x speedups compared to state-of-the-art attention mechanisms in various power modes from 5W to 60W. Furthermore, ELFATT can be used to enhance and accelerate diffusion tasks directly without training.
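
The difference between quadratic and linear attention cost can be seen in a small sketch. The code below is not ELFATT, whose design the abstract does not describe. It shows generic kernelised linear attention with the feature map elu(x) + 1, which replaces the softmax kernel rather than reproducing it. The point is only that reordering the matrix products yields the same output without forming the N x N weight matrix. All names and input values are illustrative.

```python
import math

def feature_map(x):
    """elu(x) + 1: a strictly positive feature map used in linear attention."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def quadratic_attention(Q, K, V):
    """Kernelised attention via explicit N x N weights: O(N^2) in length."""
    out = []
    for q in Q:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same output, but keys and values are summarised once: O(N) in length."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]   # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                      # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            k_sum[i] += fk[i]
            for c in range(d_v):
                kv[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = sum(a * b for a, b in zip(fq, k_sum))
        out.append([sum(fq[i] * kv[i][c] for i in range(d)) / denom
                    for c in range(d_v)])
    return out

Q = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]]
K = [[0.2, 0.1], [-0.4, 0.6], [0.3, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

print(quadratic_attention(Q, K, V))
print(linear_attention(Q, K, V))
```

The two functions agree up to floating-point rounding. The linear form pays a fixed cost in the feature dimensions instead of a cost that grows with the square of the sequence length, which is the trade-off that efficient attention mechanisms exploit.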

Abstract:
Food plays a vital role in human health, and accurate nutrition estimation is crucial for guiding healthy dietary choices. Traditional biochemical-based assessment methods are often inefficient, costly, and impractical for daily use. With the continuous progress in computer vision, some vision-based nutrition estimation approaches have emerged, typically relying on RGB images alone or in combination with depth images to infer nutritional information. These methods have achieved promising performance and garnered considerable attention. However, these methods often ignore visually imperceptible ingredients such as oil, sugar, and salt, which may significantly influence the estimation of nutritional content. Besides, existing methods lack explicit mechanisms for modeling nutrient-specific information and guiding attention toward nutrition-relevant semantics. To solve the above two issues, we propose a novel ingredients-guided and nutrients-prompted nutrition estimation method. Our method adopts multi-scale feature fusion and integrates RGB and depth modalities to enhance visual representation learning. To account for invisible ingredients, we introduce an ingredients-guided strategy, which enhances the sensitivity to non-visible nutritional factors. Moreover, a nutrient-prompt mechanism is introduced to explicitly guide the focus of the model toward nutrient-relevant attributes during estimation. We validate our method on Nutrition5k, where it consistently outperforms existing state-of-the-art methods, demonstrating its efficacy.

Abstract:
Text-to-image person retrieval aims to identify target person images using natural language descriptions. Current state-of-the-art methods predominantly rely on single-round retrieval frameworks, where retrieval accuracy heavily depends on the quality of the initial textual descriptions. However, users sometimes struggle to provide detailed and distinctive descriptions in a single attempt, resulting in generic initial queries that lack discriminative details. This fundamental limitation of the single-round retrieval framework frequently leads to the misinterpretation of user intent and suboptimal retrieval performance. To address this limitation, we propose Dialogue-driven Interactive Dynamic Learning (DIDL) for text-to-image person retrieval. Specifically, we first introduce Collaborative Query Refinement (CQR), which progressively refines retrieval conditions through multi-round dialogues. Then, we design Dynamic Context Resampling (DCR) based on a bi-granular mask strategy that enhances the model's adaptation to dialogue-style contexts and effectively balances its attention between initial descriptions and supplementary information. Based on these components, we further propose cross-modal Probabilistic Context Matching Modeling (ProCMM) that establishes effective associations between static visual features and dynamic contextual semantics. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across all three benchmark datasets.

Abstract:
In this work, we propose a novel framework for 3D garment generation, named Retrieval Augmented 3D Garment Generation (RAG2), capable of simultaneously generating a high-quality mesh and a texture highly consistent with the input image. Specifically, we decouple the 3D garment generation task into garment modeling and texturing to address the issues of low-quality meshes and poor texture consistency caused by using a single model in previous approaches. For garment modeling, we build a base garment mesh database and introduce a Retrieval Augmented Deformer to obtain a high-quality mesh with a clothing style similar to the input image. To generate high-fidelity texture, we propose TextureNet, which combines a high-fidelity UV generation module to ensure consistency with the input image; a multi-view consistent branch to ensure geometric and logical coherence; and a DiT-based main branch to support efficient and dedicated information interaction among the branches. Extensive experiments validate that RAG2 surpasses existing methods in both mesh quality and texture fidelity.

Abstract:
Current reaction generation studies often assume the homogeneity of all reactor body joints in the end-to-end motion generation while neglecting the physical contact information, resulting in evident joint mismatches in both temporal and spatial dimensions. In this paper, we introduce our method, Reactffusion, which addresses the reaction joint mismatch issue by explicitly leveraging the guidance from the actor-reactor physical contacts. At the mathematical modeling level, we reformulate the contact-guided reaction generation as a multi-task problem, divided into two sub-problems: contact information learning and reaction generation with physical constraints. Specifically, given the actor motion sequence, we first introduce a Contact Prediction Module (CPM), which adopts a spatial and temporal attentive mechanism to forecast the contact map, indicating the timing and the location of the potential joint collisions. Then, we employ the contact map as an explicit guide to rectify the sampling distribution in the denoising process of the proposed diffusion network. The comprehensive evaluations prove our method can achieve state-of-the-art performance compared with other reaction generation methods across multiple public benchmarks. Furthermore, the contact map predicted by the CPM can also effectively boost other baselines as an extra plug-in.

Abstract:
Recent DDS-based video editing methods have presented remarkable potential by enhancing traditional diffusion models. However, these methods are limited by the MSE-based isolated comparison of noises, leading to issues such as numerical sensitivity and local structure perception deficiency. Additionally, the inherent uncertainty introduced by the noise injection process in diffusion models further hinders the improvement of editing performance. To address these limitations, we propose an Evidential Video Editing (EVE) framework, which normalizes noise vectors into probability distributions, enhancing the comparability of element relationships. By leveraging evidential deep learning, EVE employs Dirichlet distributions to establish distribution-based probabilistic modeling, overcoming the constraints of single deterministic normalization probabilities. Furthermore, we introduce an uncertainty-guided local optimization strategy to capture the local uncertainty of noise and preserve local structural details, thereby improving editing precision. Extensive experiments demonstrate that our method achieves state-of-the-art performance in video editing.

Abstract:
Text-guided visual editing aims to modify visual content according to a target prompt while faithfully preserving the structure and identity of the source image or video. However, existing methods ignore confounding effects introduced by the pretrained model, i.e., harmful biases learned from the pretraining datasets, leading to spurious correlations during the editing process. To address this issue, we introduce CausalCtrl, a novel training-free framework that reformulates text-guided visual editing from a causal inference perspective. The core idea is to leverage frontdoor adjustment to estimate the interventional distribution of the output, effectively blocking the influence of hidden confounders introduced by the pretrained model. Specifically, we first design a dual-branch inversion mechanism that disentangles the source content and target semantics into two separate latent embeddings to simplify the sampling space of the interventional operation, and perform unbiased denoising through their controlled interaction. In addition, we propose a Structured Attention Injection Module (SAIM) that adaptively identifies and amplifies dominant attention heads using a lightweight SVD-based top-K selection strategy. Extensive experiments on several challenging image and video editing benchmarks demonstrate that CausalCtrl consistently outperforms existing methods in both target semantic alignment and source content preservation, validating the effectiveness of causal intervention in this task.

Abstract:
Although current work on text-to-image generation can preliminarily generate images from descriptions of human-object interactions, it fails to consider the emotions involved in those interactions, even though people often experience emotions when using or interacting with objects. Therefore, in this paper, we propose the Emotional Interaction Generation task, a novel image generation task that generates emotionally expressive human-object interaction images from given prompts, human-object interaction (HOI) regions, and emotions. First, we construct a new emotional interaction dataset, called EmotionHOI, which includes 47,776 images with content prompts, emotions, and human-object interaction bounding boxes. Second, we propose an emotion-aware text-to-image diffusion model, named EmIT, for emotional interaction generation. Specifically, EmIT consists of three components: (1) an emotion interaction tokenizer that encodes subject, object, action, and emotion into structured tokens; (2) an Emo-Interaction Self-Attention that preliminarily guides the latent space to conduct hybrid learning with emotional interaction tokens; and (3) a Hierarchical Emotion-Visual Cross-Attention that further focuses on grounding affect (such as pose, gaze, or interaction intensity) into specific spatial regions and captures subtle emotional variations. These components jointly model interaction semantics and emotional context, enabling EmIT to generate images that are both behaviorally coherent and emotionally expressive. Experimental results on the EmotionHOI dataset demonstrate the superiority of the proposed model.

Abstract:
Conditional text-to-image diffusion models enhance the controllability of text-to-image generation by incorporating additional visual conditions. However, they often encounter two main challenges when dealing with complex visual conditions (i.e., those containing multiple different objects): semantic leakage among objects and conflicts between visual inputs and text descriptions. To address these issues, we propose an innovative object-level conditional image generation method. It associates visual features with object semantic information, ensuring that generated objects are accurately positioned in their expected locations within the visual inputs. To address semantic leakage, we design an Object-level Structure Controller (OSC) module. This module utilizes an attention mechanism to fuse bounding box annotations, object prompts, and visual conditional inputs, allowing the model to learn essential object-level structural features. In addition, we propose an Object-level Control Relaxation (OCR) module to predict object-level scale features, which can reconcile conflicts between object semantics and visual features. Finally, the scaled backbone features are fused with structural features to form the final output features. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of text-image alignment, structural similarity, and spatial fidelity.

Abstract:
Achieving high-quality output alongside enhanced controllability is crucial in video-to-music generation, especially for optimizing user experience in real-life application scenarios. Most existing studies emphasize generative quality but often overlook the vital aspect of controllability. As a result, the generated music cannot be easily fine-tuned or modified to meet users' expectations. In this paper, we delve into spatial-temporal decomposition and alignment in controllable video-to-music generation. We first introduce a novel video-music decomposition and transformation approach in both the spatial and temporal domains, and enhance the cross-modal correspondence through feature alignment and flow-matching-based alignment. Furthermore, our method attains unsupervised controllability during training via feature-free guidance. Experimental results demonstrate that our model achieves state-of-the-art results in overall generative quality. Moreover, its controllability significantly outperforms existing models, making it exceptionally well-suited to accommodate users' flexible and diverse control requirements.

Abstract:
Automated vector floorplan generation is valuable for designers to explore potential spatial designs. However, existing learning-based methods rely on complex post-processing or optimization to obtain plausible vector floorplans, which disrupts the end-to-end design flow. In this paper, we propose FloorplanSBS, a patch-based segmentation framework for directly synthesizing vector floorplans. Our method leverages the strengths of box-based representation and segmentation-based generation, following a division-and-labeling scheme. The framework operates in two stages: given input design constraints, a division model first divides the design space into rectangular patches, followed by a labeling model that assigns semantic labels to each patch. FloorplanSBS supports constraints such as boundaries and layout graphs. Extensive evaluations show that it surpasses state-of-the-art methods in generating high-quality vector floorplans. With its end-to-end neural framework, FloorplanSBS eliminates the need for post-processing, offering a simple, efficient, and user-friendly tool for vector floorplan design.

Abstract:
Achieving accurate predictions with limited samples is a key challenge in artificial intelligence for biomedical imaging. Previous methods rely on pre-trained image foundation models with supervised fine-tuning (SFT) or zero-shot inference to enhance small-data performance. However, SFT is time-consuming and prone to overfitting, whereas zero-shot inference fails to fully exploit available data. Inspired by recent tabular foundation models, which show superior performance on small-sample tasks with in-context learning (ICL), we propose TabiMed, a novel framework that transforms visual representations into structured tabular data, leveraging pre-trained tabular models for fast and accurate analysis on small data. TabiMed consists of three key components: a dynamic modality-aware representation engine, a tabularization adapter, and an in-context inference module. Experiments on 10 datasets from different fields demonstrate three major advantages of TabiMed: 1) excellent performance on small datasets, with an average AUC 14.1% higher than zero-shot inference; 2) high efficiency, with training 250x faster than SFT; 3) scalability to larger datasets through our tabularization adapter. TabiMed offers a novel pathway to address the challenges of analyzing biomedical images with few samples.

Abstract:
The success of diffusion models in text-to-image (T2I) generation has made it urgent to remove unwanted concepts, such as copyrighted, offensive, and unsafe ones, from pre-trained models in an accurate, timely, and cost-effective manner. However, limited by the inherent optimization perspective, existing methods have two major problems. Firstly, they overlook maintaining the global visual style during the erasure process, leading to significant style shifts. Secondly, excessive concept erasure causes relevant content to disappear or generates substitutes unrelated to the original object's attributes. Compared to other methods, our proposed ICE has unique advantages, as it can generate diverse visual features and achieve a balance between concept erasure and maintaining the semantic content of the target object. This is mainly achieved through our well-designed non-erasable features protector (NEFP) and augmented invariant constraints (AIC). Specifically, we enhance the protection of feature information by embedding an augmented orthogonal anchor concept matrix. Meanwhile, under controlled constraints, we introduce invariants into the embedding space to retain key semantics. This work specifically emphasizes the importance of focusing on feature expression and semantic protection in the concept erasure task for fully unleashing the performance of T2I models.

Abstract:
This paper addresses the critical challenge of detecting codec-based audio deepfakes in multilingual and dynamically evolving adversarial scenarios. While existing detection systems exhibit performance degradation against codec-generated forgeries and unseen linguistic environments, we propose a novel audio deepfake detection framework, "WhiADD", enhanced by semantic-acoustic fusion and cross-modal generalization. Our methodology introduces three key innovations: (1) The Union CodecFake (UCF) dataset, synthesized by extending the CodecFake generation pipeline to the multilingual Common Voice corpus, significantly expands acoustic diversity with 1.9M samples across varied phonetic, channel, and codec manipulation patterns. (2) A semantic-prompted Whisper architecture that integrates full-transcript linguistic constraints into decoder fine-tuning, enabling detection of semantic inconsistencies. (3) A gated cross-attention mechanism that dynamically fuses multi-source audio features with the proposed model's frozen encoder outputs, enhancing artifact detection through adaptive attention to pre-trained representations. Extensive experiments demonstrate state-of-the-art performance, achieving 0.55% EER on UCF testing data and less than 3% EER in zero-shot cross-lingual detection (German, French, Italian). The framework reduces false negatives by up to 24% compared to conventional models through improved semantic-acoustic alignment. These advancements establish a robust paradigm for combating evolving codec-based forgeries, bridging the critical gap between acoustic feature engineering and semantic coherence analysis in audio forensics.
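
For readers unfamiliar with the equal error rate (EER) reported above, the sketch below shows one common way to estimate it from detector scores: sweep a decision threshold and take the point where the false acceptance rate and the false rejection rate are closest. The function name, the score convention (higher means more likely bona fide), and the toy scores are assumptions for illustration. The abstract does not say how the authors compute EER, and practical toolkits often interpolate between thresholds instead.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    # Assumed convention: a higher score means "more likely bona fide".
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A sample is accepted as bona fide when its score is >= t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)        # spoofs accepted
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)   # bona fide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores (hypothetical): one spoof sample scores above one bona fide sample.
eer_overlap = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65])
# Perfectly separable toy scores give an EER of zero.
eer_separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

On the first toy set the two error rates meet at 25%. On the separable set both are zero.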

Abstract:
Federated Learning (FL) has become a powerful technique for collaborative model training across decentralized entities while preserving data privacy. Despite its potential, FL faces significant challenges, including communication overhead, resource heterogeneity, and data heterogeneity. Existing solutions fall short in addressing disparities in client resources and the errors introduced by direct model aggregation across heterogeneous clients. To tackle these issues, we propose DynFed, a novel federated learning framework that incorporates dynamic quantization bit-width allocation and multi-teacher knowledge distillation for model aggregation. DynFed dynamically assigns quantization bit-widths to clients based on their resource heterogeneity, adapting these allocations according to variations in the local loss function during training. This adaptive quantization strategy optimizes resource utilization while preserving model performance. For model aggregation, DynFed utilizes a dynamic multi-teacher knowledge distillation approach, assigning the most suitable teacher model to each data sample based on a comprehensive evaluation score, thereby ensuring effective knowledge transfer even in the presence of quantization-induced errors. This method not only mitigates the negative effects of heterogeneous bit-widths but also leverages client model diversity to enhance the robustness of the global model. Extensive experimental results demonstrate the superiority of DynFed over state-of-the-art methods.
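
The sketch below illustrates why bit-width matters for the quantization-induced errors mentioned above. It is a generic uniform quantizer applied to random toy weights, not DynFed's allocation policy, which the abstract does not specify. The function names, the bit-widths tried, and the weight distribution are all assumptions.

```python
import random

def quantize_uniform(values, bits):
    # Map each value to the nearest of 2**bits evenly spaced levels in [min, max].
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values), 0.0
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values], step

def max_abs_error(original, quantized):
    return max(abs(a - b) for a, b in zip(original, quantized))

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # toy "model weights"
errors, steps = {}, {}
for bits in (2, 4, 8):  # hypothetical per-client bit-widths
    quantized, step = quantize_uniform(weights, bits)
    errors[bits] = max_abs_error(weights, quantized)
    steps[bits] = step
```

The worst-case rounding error is bounded by half the step size, and the step size shrinks roughly by half with each extra bit. A client given fewer bits therefore sends a cheaper but noisier update.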

Abstract:
Graphs have recently enabled substantial advances for the Web. Processing worldwide graphs with millions to billions, even trillions of edges in large-scale high-performance systems is pressing, but current graph processing engines are designed for small-scale graph processing beyond a few tens of computing nodes and are unable to scale well to large parallel systems because they are oblivious to imbalanced communication across the communication grid. Therefore, we present GraphWorld, a better approach to optimizing graph search in large parallel systems for world-wide web crawling and indexing. GraphWorld (i) features a new graph partitioning method to achieve better load balancing and minimize communication overhead across the row and column directions; (ii) designs an efficient hardware prefetching and caching mechanism that can gather, traverse, and scatter pipeline vertices to accelerate graph processing; and (iii) proposes υBFS: vectorization-based BFS for leveraging vectorization units equipped in modern high performance processors to further improve graph search. In addition, we use real-world graphs and benchmarks to demonstrate the effectiveness of GraphWorld. In particular, the GraphWorld-based Graph 500 tests on the Tianhe supercomputer are superior to the fastest systems in the latest Graph 500 lists. We finally apply GraphWorld to real-life graphs for the worldwide search of the Web, where it outperforms the state-of-the-art graph partitioning and graph system by orders of magnitude.

Abstract:
Immersive virtual reality (VR) experiences require transmission and rendering of large-scale 3D content, often represented as point clouds or polygon meshes. Unfortunately, existing networked VR systems often fail to fully exploit the flexibility of VR data representations. To address this problem, we propose a cross-layer design that elevates a network data unit to a usable rendering unit for VR applications. Our aim is to bridge the gap between networks and applications in order to enhance visual quality, especially over constrained and variable networks. Our approach, Rendering Unit that is Network-aware (RUN), with two variants, RUN-Packet and RUN-Hybrid, includes mechanisms to effectively utilize network data units when encoding, transmitting, decoding, and rendering. Specifically, we develop additive detail refinement mechanisms and address streaming challenges such as head-of-line (HoL) blocking. We prototype our system in Unity 3D and evaluate it using synthetic network environments and real network traces. Our results with both static and dynamic point clouds demonstrate that RUN significantly reduces stalls and delivers smoother frame updates, enhancing visual quality.

Abstract:
Multimodal sentiment analysis (MSA) traditionally assumes a unified emotional signal across modalities such as text, audio, and video. However, recent findings suggest that each modality may convey distinct affective perspectives. Motivated by perspectivist theories from cognitive science and natural language processing, this paper introduces Label Divergence Weighting (LDW), a modality-weighting strategy that dynamically adjusts trust in each modality based on its alignment with the overall sentiment label. The LDW framework leverages training-time supervision from the divergence between unimodal and multimodal sentiment annotations to learn modality reliability, and applies this learning to unseen data without requiring unimodal labels at inference time. Integrated into a multitask variant of the Tensor Fusion Network (MTFN), the proposed LDW-MTFN model achieves state-of-the-art results on both the acted Chinese dataset CH-SIMS and the authentic English dataset UniC. Extensive experiments and ablation studies demonstrate the robustness and generalizability of LDW across datasets with different cultural, linguistic, and environmental characteristics.

Abstract:
Concept-based models have been proposed as a new line of research for explainable-by-design deep learning models. However, those models show their full power when applied to benchmarks where the concepts are well defined and the concepts' attributes are easily extractable from the raw data. In this paper, we challenge the most recent concept-based models, initially developed for image classification, on more complex interpretative tasks from a recently proposed video benchmark where they perform poorly. We conduct a root cause analysis of the poor performance of state-of-the-art explainable concept-based models on these multimodal interpretative tasks, and propose adaptations to design robust explainable models for detecting character objectification in this novel challenging video benchmark. We show that the optimal architectural choice may vary depending on the modality setting, thereby showing that designing multimodal concept-based approaches remains an open challenge and calls for further investigation.

Abstract:
Scientific Figure Analysis (SFA) aims to derive analytical insights from figures while incorporating background instructions. Unlike conventional tasks such as figure captioning or description generation, which focus on extracting surface-level information from the visual modality alone, SFA requires an intelligent system to summarize key patterns, infer implications, and contextualize scientific findings from visual and textual inputs. It demands not only visual recognition but also the integration of scientific knowledge, multimodal understanding, and contextual reasoning. In this work, we introduce an SFA dataset, AnaFig, comprising 2,000 high-quality samples across 56 domains. All samples are evaluated using human-aligned five-dimensional scoring criteria, resulting in 10,000 human-annotated score labels. The AnaFig dataset facilitates the assessment of three critical capabilities of multimodal large language models (MLLMs): adherence to complex instructions, multimodal perception, and analytical summarization. By building a new benchmark with widely used MLLMs, this study contributes to scientific knowledge discovery and reasoning, fostering the alignment of MLLMs and human experts in scientific analysis.

Abstract:
The rapid advancements in Artificial Intelligence (AI) increasingly underscore the critical importance of multi-modal datasets for training robust and versatile models. However, the dominance of English-centric resources and the scarcity of datasets capturing diverse languages and cultural nuances limit AI's inclusivity and global applicability, posing a key challenge. Critically, language's intrinsic link to cultural context demands models possess profound cultural insight beyond mere translation to genuinely grasp societal norms and unique cultural expressions. Culturally rich broadcast content offers a solution; its systematic curation can mitigate linguistic/cultural imbalances, fostering culturally deep multi-modal datasets. This paper introduces the HAN (Korean Heritage Augmented Narrative Visual-Language Description) dataset, a new resource for multilingual image captioning and retrieval. HAN comprises 41,000 images captured from Korean broadcast video clips with 410,000 Korean/English narrative-style captions offering multifaceted perspectives on each visual instance. By incorporating Korean heritage, HAN reflects cultural diversity, addressing limitations of existing datasets focused on simple descriptions. Furthermore, this work analyzes HAN's caption diversity impact on retrieval, proposing strategies to enhance efficacy. These findings underscore HAN's potential to significantly advance multi-modal and multilingual processing, supported by its value as a rich resource for learning approaches that integrate diverse data types (like vision and text), natural language processing across various languages, and Korean heritage studies.

Abstract:
In this paper, we present PrivEdit, a zero-shot, interactive image privacy editing system specifically designed for automated sensitive information desensitization. As social networks and smart devices proliferate, the risk of unintended privacy leakage grows, driving demand for personalized, controllable protection tools. PrivEdit is powered by natural-language instructions and integrates a Recognize-Anything model for robust detection and classification of sensitive objects (e.g., faces, license plates, ID cards), followed by GroundingDINO and SAM for high-precision mask extraction. User intents are parsed and disambiguated via GPT-4o, enabling selective target confirmation and iterative refinement. Finally, our editing module performs localized edits such as adjustable blurring, mosaicking, or replacement via generative editing. With support for multi-round feedback and real-time modification, PrivEdit seamlessly handles both pre-recorded images and live streams, making it ideal for social-media pre-publishing, privacy data desensitization in enterprise or healthcare contexts, and intelligent surveillance applications. By unifying detection, segmentation, intent parsing, and localized editing into one coherent interface, PrivEdit delivers an end-to-end solution for safeguarding visual data. Supplementary materials including the demo video and slides are available at: https://drive.google.com/file/d/13jFBmYgZgxYQLPIAqCeaQhcyzZTHpf7N/view?usp=sharing

Abstract:
We present a physics-driven 3D dart-throwing interaction system for Apple Vision Pro (AVP), developed using Unity 6 engine and running in augmented reality (AR) mode on the device. The system utilizes the PolySpatial and Apple's ARKit software development kits (SDKs) to ensure hand input and tracking in order to intuitively spawn, grab, and throw virtual darts similar to real darts. The application benefits from physics simulations alongside the innovative no-controller input system of AVP to manipulate objects realistically in an unbounded spatial volume. By implementing spatial distance measurement, scoring logic, and recording user performance, this project enables user studies on quality of experience in interactive experiences. To evaluate the perceived quality and realism of the interaction, we conducted a subjective study with 10 participants using a structured questionnaire. The study measured various aspects of the user experience, including visual and spatial realism, control fidelity, depth perception, immersiveness, and enjoyment. Results indicate high mean opinion scores (MOS) across key dimensions.

Abstract:
FaceCluster is an interactive photo management system that leverages our enhanced KP-RPE face recognition model with Embedding Statistical Regularization to organize personal photo collections automatically. Unlike existing cloud-based systems that raise privacy concerns, FaceCluster operates entirely locally while demonstrating high performance across multiple challenging benchmarks, including IJB-C (97.25% TAR@0.01%), TinyFace (74.14% Rank-1), and AgeDB (97.78% accuracy) when trained on the WebFace4M dataset. The demo showcases real-time face detection, clustering, and organization capabilities through an intuitive web interface, enabling users to effortlessly manage large photo collections with a single-command Docker deployment.

Abstract:
This paper presents MindSpeak, a real-time brain-computer interface (BCI) system for recording, processing, and decoding silent speech to enable online multimodal communication between the human brain and a computer, involving both noninvasive multichannel EEG signals and text output. To enable hands-free and brain-only control, our system incorporates steady-state visual evoked potential (SSVEP) for users to select incomplete sentences from a predefined pool and confirm the correctness of decoded words. An intuitive graphical interface is designed for natural communication. We evaluate the effectiveness of our real-time BCI system, which achieves 77.3% accuracy in decoding silent speech and 98.9% accuracy in SSVEP-based selection and confirmation of correct sentences. The presented MindSpeak system significantly expands the application scope of existing BCI systems by enabling users to express complete thoughts through a fully BCI-controlled interactive interface. Our demonstration video is available at: https://youtu.be/B1wt1dmCCrg.

Abstract:
This study presents an intelligent planning system (News Video to Propagation Rules Strategy, NV2PRS). The system is based on event chain modeling to achieve automated generation of event-level communication strategies for video news. It consists of two modules: video style feature extraction and knowledge chain matching. First, a multimodal feature analysis engine is used to obtain text semantics, visual features, and communication features. Then, template-based knowledge chain matching is employed to realize the mapping between events and strategies. To optimize the system's practicality, a hierarchical architecture design is adopted, integrating a feature visualization interface and an end-to-end workflow for strategy generation.

Abstract:
As Picasso said, a painting lives only through the one who looks at it. To materialize this thought, we propose to automatically produce artworks that visually transform paintings by amplifying and distorting the areas most observed by viewers. Our work is based on a study conducted at the Caen Museum of Fine Arts in France. During the study, 151 participants were equipped with eye-tracking glasses and observed various paintings, first alone and then in pairs. From the recorded fixation and gaze-path data, we first generate saliency maps that reflect the visual attention given to each painting. These maps are then used to fine-tune the UNETRSal model, a neural network designed to predict saliency maps, in order to align its outputs with the human visual patterns observed during the experiment. The generated saliency maps are subsequently used to create deformations of the original painting. This overall process gives rise to a new artwork born from the interaction between human gaze and AI prediction.
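
The step from fixations to a saliency map can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each fixation is a hypothetical `(x, y, duration)` tuple in pixel coordinates and accumulates a duration-weighted Gaussian around each one. The `sigma` value and the peak normalization are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical fixation record: (x, y, duration). The abstract does not specify a format.
Fixation = Tuple[int, int, float]


def fixation_saliency(fixations: Sequence[Fixation], width: int, height: int,
                      sigma: float = 2.0) -> List[List[float]]:
    """Accumulate a duration-weighted Gaussian per fixation, then scale the peak to 1."""
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy, duration in fixations:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += duration * math.exp(-dist_sq / (2.0 * sigma * sigma))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[value / peak for value in row] for row in grid]
    return grid


saliency = fixation_saliency([(2, 3, 1.0)], width=5, height=5)
```

A map of this kind would then serve as a training target for a saliency predictor, as the abstract describes for UNETRSal.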

Abstract:
Amidst the swift advancement of 3D vision technology, Multi-view Compression (MVC) has become a crucial technique, widely applied in fields such as virtual reality, augmented reality, autonomous driving, telemedicine, and security surveillance. The technology effectively handles views from multiple cameras, utilizing the inter-view correlations to compress data efficiently. It substantially decreases the data transmission and storage requirements, enabling a richer and more realistic visual experience within the same bandwidth constraints. To further enhance compression performance, new methods continue to emerge. However, the absence of a unified benchmark library capable of effectively evaluating existing algorithms poses significant challenges to the further development of the field and the practical deployment of algorithms. To address this issue, we introduce OpenMVC, an Open-Source Library for Learning-based Multi-view Compression. We provide a comprehensive description and analysis of the performance advantages of existing algorithms. Furthermore, we conduct extensive benchmark testing of nine typical algorithms from the last five years, evaluating them in a consistent environment across various metrics. The open-source library for OpenMVC is available at https://openi.pcl.ac.cn/OpenAICoding/OpenMVC.

Abstract:
Multimodal knowledge graphs often separate easily represented information (text) from that which is not (multimedia documents like images, videos, or audio). This severely limits query expressiveness, as the engines lack access to the node contents stored externally. We present MeGraS, the MediaGraph Store, a novel storage and query engine for multimodal knowledge graphs. By storing multimedia documents directly in the graph, MeGraS allows the query engine to leverage their content for enhanced capabilities, making it natively capable of performing operations such as k-NN, segmentation, or deriving non-materialized relations based on visual features. To demonstrate this, we incorporate and extend the pattern-matching query language SPARQL, resulting in a unified framework for storing and managing multimodal knowledge graphs with advanced expressiveness. MeGraS is available as open-source software: http://megras.org

Abstract:
Recently, with the rapid advancement of multimodal large language models (MLLMs), intent-oriented video captioning has received increasing attention due to its potential for controllable and grounded visual understanding. Fine-grained localized video captioning presents unique challenges due to the need for controllability, object grounding, and temporal precision. In this paper, we propose MGVC, a two-stage framework for intent-oriented controllable video captioning in the IntentVC 2025 Challenge. Our pipeline first leverages a fine-tuned MLLM to generate diverse preliminary captions. These candidate captions are then refined by another fine-tuned MLLM for further semantic alignment and stylistic coherence. We introduce a video-text matching (VTM) module, further fine-tuned on the IntentVC dataset, which filters out semantically misaligned candidate captions. For caption selection, we train category-specific regressors that predict caption quality scores based on VTM similarity, textual features, intra-caption BLEU, and CLIP-based retrieval correlations. The caption with the highest predicted alignment score is chosen as the final output. Our method achieves 1st place in the IntentVC 2025 Grand Challenge, which demonstrates its effectiveness and generalization.

Abstract:
We present a unified vision-language framework for the Responsible Multimodal AI Challenge 2025, tackling two related tasks: multimodal hallucination detection and factuality verification. Our approach builds on a Dynamic ViLBERT model enhanced with adaptive multi-branch attention fusion to effectively integrate visual and textual data. For Task A (Hallucination Detection), we employ four parallel answer-encoding branches with co-attentional transformers to compare AI-generated captions or answers against corresponding images, enabling accurate hallucination identification. For Task B (Factuality Verification), we fuse visual and textual features via element-wise addition, multiplication, and concatenation, feeding them into a classifier to assess claim validity. Inputs include visual features from Detectron2 and text embeddings from BERT tokenization. Experiments show notable accuracy gains over baselines, validated by official F1 scores. Extensive ablations, comparative studies, and architectural visualizations confirm the value of cross-modal attention and customized fusion strategies. Our final system ranks 3rd in hallucination detection (F1 = 0.80) and 2nd in factuality verification (F1 = 0.84), demonstrating the strength of our unified approach.
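
The Task B fusion can be illustrated with a small sketch. The abstract names the three operations (element-wise addition, multiplication, and concatenation) but not how their outputs are combined. The version below assumes they are joined into one vector before the classifier, and the function name `fuse_features` is hypothetical.

```python
from typing import List, Sequence


def fuse_features(visual: Sequence[float], textual: Sequence[float]) -> List[float]:
    """Join element-wise sum, element-wise product, and raw concatenation.

    Assumption: both inputs were already projected to a shared dimension.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must share a dimension")
    added = [v + t for v, t in zip(visual, textual)]
    multiplied = [v * t for v, t in zip(visual, textual)]
    concatenated = list(visual) + list(textual)
    # The fused vector has 4x the input dimension and would feed a classifier.
    return added + multiplied + concatenated


fused = fuse_features([1.0, 2.0], [3.0, 4.0])
```

Under this assumption the fused vector is four times the input dimension: the sum captures agreement, the product captures feature-wise interaction, and the concatenation preserves each modality unchanged.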

Abstract:
This study proposes an LLM-augmented hierarchical fusion framework to enhance multimodal personality and ability assessment for the ACM MULTIMEDIA AVI CHALLENGE 2025, addressing semantic sparsity and cross-modal interaction limitations. We leverage large language models (e.g., Qwen, DeepSeek) to generate psychologically enriched text descriptions, bridging raw transcripts with expert evaluations, and integrate them with audio-visual features through early fusion and multi-path MLP ensembles. Track 1 (personality regression) employs dual-text inputs while Track 2 (multi-label ability prediction) uses parallel regression. Results show significant improvements: 23.1% MSE reduction over text-only baselines in Track 1, and 10.7%/12.5% gains over state-of-the-art fusion in Tracks 1/2, with 31.2% average improvement for cognitive traits (Q3-Q5). The framework demonstrates the effectiveness of semantic enhancement and adaptive fusion, with future work focusing on overfitting mitigation and feature optimization.

Abstract:
Micro-Actions (MAs) are a crucial form of non-verbal communication in social interactions, with promising applications in human emotion analysis. Although the topic has attracted considerable research interest, progress has been hindered by the lack of publicly available benchmark datasets. To address this gap, the Micro-Action Analysis Grand Challenge (MAC) is organized annually. This paper presents an overview of the 2nd Micro-Action Analysis Grand Challenge, held in conjunction with ACM Multimedia 2025. We provide a comprehensive summary of the challenge, including its dataset, evaluation protocol, results, and discussion. The top-ranked solutions are highlighted to offer valuable insights for researchers, and potential future directions are outlined to guide ongoing developments in this area. The goal of this grand challenge is to foster innovative research in micro-action analysis and advance research in the human-centric action understanding community.

Abstract:
In contrast to traditional action recognition, Micro-Action Recognition focuses on identifying subtle, low-amplitude movements and is constrained by two challenges. The first is spatial imbalance, where small, critical action regions are easily overwhelmed by vast, irrelevant backgrounds, leading to a low signal-to-noise ratio. The second is class distribution imbalance, where the natural occurrence of actions follows a long-tailed distribution, causing models to be biased towards common actions. To address these issues, our framework introduces two targeted solutions. To mitigate spatial imbalance, a YOLOv12-based detection module localizes and crops salient body parts, forcing the model to focus on action-relevant regions. Concurrently, to mitigate class imbalance, we implement a dynamic oversampling strategy combined with temporal data augmentation, effectively re-weighting the training process to improve performance on rare categories. Integrated with a V-JEPA2 backbone and a multi-classifier ensemble, our approach demonstrates its efficacy by securing second place in the ACM MM'25 Micro-Action Analysis Challenge with an F1-score of 76.98%.
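
One common way to oversample rare classes is to draw training samples with probability inversely related to class frequency. The sketch below shows that idea only. The abstract does not describe the authors' exact scheme, so the function name and the `power` parameter (which could be varied over training to make the sampling "dynamic") are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def oversampling_weights(labels: Sequence[Hashable], power: float = 1.0) -> List[float]:
    """Per-sample sampling probabilities proportional to (1 / class_count) ** power.

    power = 0 gives uniform sampling; power = 1 gives every class equal total mass.
    """
    counts = Counter(labels)
    raw = [(1.0 / counts[label]) ** power for label in labels]
    total = sum(raw)
    return [weight / total for weight in raw]


# Three samples of a common class and one of a rare class.
weights = oversampling_weights(["nod", "nod", "nod", "shrug"])
```

With `power=1.0` the single rare sample is drawn as often as the three common samples combined. These probabilities could be passed to `random.choices(indices, weights=weights, k=batch_size)` to build a batch.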

Abstract:
Micro-actions are subtle, low-intensity non-verbal behaviors that can provide insights into an individual's underlying emotions and intentions. Because they are brief and overlap significantly, identifying micro-actions poses a challenge for current models. In response, this paper proposes a novel multi-feature fusion framework that extracts coarse-grained body features and fine-grained action features separately. Specifically, we present Temporal Contextualization for fine-grained learning, a cross-frame injection mechanism designed to capture essential spatio-temporal information, and introduce a 3D-ResNet Adapter for coarse-grained learning, which aggregates temporal data and facilitates parameter-efficient fine-tuning. To account for the long-tailed distribution of the task dataset, we apply Feature Decoupling with a two-stage training strategy. Experiments show that this hierarchical multi-feature extraction and aggregation approach yields substantial improvements in Micro-Action Recognition. Our method attains an F1-mean score of 77.75% on the MA-52 dataset, ranking 1st in the 2nd Micro-Action Analysis Grand Challenge in conjunction with ACM MM'25.

Abstract:
This workshop is part of the ACM Multimedia 2025 Conference and is organized by the ACM I2M Chapter, consisting of both industry and academia members. The rapid convergence of Artificial Intelligence (AI), Human-Computer Interaction (HCI), and immersive multimedia is redefining the landscape of intelligent and adaptive digital experiences. As ACM Multimedia 2025 emphasizes cutting-edge multimedia systems, this workshop directly contributes to its vision by exploring AI's transformative role in immersive media. Through AI-driven multimedia interaction, adaptive virtual environments, and intelligent content generation, this workshop will showcase how AI is enhancing the creation and experience of digital worlds. The workshop proceedings can be found at: https://dl.acm.org/doi/proceedings/10.1145/3728487

Abstract:
MMFood'25, the 1st International Workshop on Multi-modal Food Computing, brings together researchers and practitioners at the intersection of artificial intelligence, computer vision, natural language processing, and sensory modeling to advance the study of food. The workshop highlights how multimodal methods can be applied to food recognition, recommendation, analysis, and monitoring to address pressing challenges in health, nutrition, sustainability, and food culture. It features a rich program including a keynote, an invited talk, paper presentations, a poster session, and a panel discussion on the role of multimodal AI in preserving cultural heritage, fostering sustainable food futures, and enabling personal well-being. By convening experts from academia, industry, and healthcare, MMFood'25 provides a unique platform for fostering interdisciplinary collaboration and for shaping the emerging field of multimodal food computing. The workshop proceedings can be found at: https://dl.acm.org/doi/proceedings/10.1145/3746264.

Abstract:
Human behavior is mutually dependent, which requires human-robot interactive systems to predict surrounding agents' trajectories by modeling complex social interactions, avoiding collisions, and executing safe path planning. While many trajectory prediction methods exist, most of them do not incorporate the ego agent's own motion and only model interactions based on static information. Inspired by the human theory of mind during trajectory selection, we propose a Cross time domain intention-interactive method for conditional Trajectory prediction (CiT). CiT conducts joint analysis of behavior intentions over time and achieves information complementarity and integration across different time domains. The intention in its own time domain can be corrected by the social interaction information from the other time domain to obtain a more precise intention representation. In addition, CiT is designed to integrate closely with robotic motion planning and control modules, and can generate a set of optional trajectory prediction results for all surrounding agents based on potential motions of the ego agent. Extensive experiments demonstrate that CiT significantly outperforms existing methods, achieving state-of-the-art performance on the benchmarks.

Abstract:
In the field of computer vision, Vision Graph Neural Networks (ViG) have demonstrated significant potential in image understanding. By treating the divided image patches as nodes and constructing connection relationships based on neighbor attributes, ViG can efficiently model global dependencies within images with the help of graph attributes. However, most existing ViG methods incur high computational complexity in graph construction and may not fully explore the graph structure information. Besides, the heavy reliance on manual annotation labels limits the application potential of ViG in practical scenarios. To this end, we propose a novel self-supervised vision graph contrastive learning method (S2ViG) based on an image mixing strategy for efficient vision graph representation learning. It uses self-supervision to alleviate the dependence on manual annotation and enhances the understanding of the global structure of the graph using two different vision graph construction methods. Specifically, we first employ an image mixing strategy to uncover latent semantic relationships among multiple images. Then, we construct dynamic graph structures for image patches from local and global perspectives to obtain augmented contrastive samples. Finally, a multilevel contrastive loss is constructed to optimize the network. Experimental results show that our method achieves excellent performance on multiple datasets such as ImageNet-1K and CIFAR.

Abstract:
RAW-domain image super-resolution faces two critical challenges: the physical impossibility of capturing native high-quality RAW references with a resolution-limited camera and the limitations of neural networks, including inefficient residual layer utilization and spectral bias in feature learning. This paper proposes a strategy combining physics-based imaging simulation and neural networks to jointly address these challenges. First, we develop a rapid imaging simulation system based on our proposed subgraph decomposition technology. It generates camera-specific degraded and clean RAW image pairs at multiple resolutions. Second, we design a LatentKAN network, featuring an iterative feature fusion network that extracts additional beneficial information through stage-wise supervision and a multi-layer Kolmogorov-Arnold network that suppresses spectral bias via learnable activation functions. Our strategy achieves an average 0.8 dB PSNR improvement across all SR scales compared to state-of-the-art methods, thereby establishing a new paradigm for camera-specific super-resolution tasks.

Abstract:
Long video understanding, which leverages Video-LLMs to analyze and interpret extended video content to extract meaningful information, insights, or summaries, is a fundamental task in the multimedia domain. Chain-of-thought (CoT) methods are widely adopted to enhance long video understanding by incorporating intermediate reasoning steps. However, the iterative nature of CoT, which requires a lengthy sequence of internal thoughts, significantly increases the latency of video object description. To address this challenge, we design Compressed Scene Graph-enabled CoT (CSGCoT), a novel approach that facilitates efficient and accurate long-video object description. Inspired by video codec principles, we propose a compressed scene graph composed of two components: Key-SG for key frames and Delta-SG for delta frames, enabling efficient encoding of scene information across video segments. Specifically, CSGCoT comprises three major modules: (1) a Key-SG Detector that identifies representative segments, (2) a Delta-SG Generator that produces compensated representations for delta segments, and (3) an SG-Query Manager that converts scene graphs into natural language prompts for video object description. Experiments show that CSGCoT achieves accuracy comparable to SOTA methods while reducing latency by over 62.6% on hour-long videos.
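
The codec analogy behind Key-SG and Delta-SG can be sketched with plain set arithmetic. This is an illustration under stated assumptions, not the CSGCoT implementation: a scene graph is modeled as a set of hypothetical `(subject, relation, object)` triples, and a delta stores only the triples added to or removed from the key graph.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

# Assumed scene-graph edge format: (subject, relation, object).
Triple = Tuple[str, str, str]


class DeltaSG(NamedTuple):
    """Changes relative to a key scene graph (hypothetical structure)."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def encode_delta(key_sg: Set[Triple], frame_sg: Set[Triple]) -> DeltaSG:
    """Store only what differs from the key graph, like a delta frame in a codec."""
    return DeltaSG(added=frozenset(frame_sg - key_sg),
                   removed=frozenset(key_sg - frame_sg))


def decode_delta(key_sg: Set[Triple], delta: DeltaSG) -> Set[Triple]:
    """Rebuild the full scene graph of a delta segment from its key graph."""
    return (set(key_sg) - delta.removed) | delta.added


key = {("person", "holds", "cup"), ("cup", "on", "table")}
later = {("person", "holds", "cup"), ("person", "near", "door")}
delta = encode_delta(key, later)
```

When consecutive segments share most of their content, the delta is much smaller than the full graph, which is the property that would shorten the prompts built from it.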

Abstract:
While current multimodal anomaly detection methods predominantly employ intermediate fusion strategies, they often suffer from inadequate cross-modal interaction and irreversible information loss during feature alignment processes. To overcome these limitations, we propose Hierarchical Geometry-Color Fusion (HGCF), a novel framework that establishes deep synergistic relationships between RGB texture features and point cloud geometric representations. Firstly, we propose a bidirectional cross-modal early fusion mechanism that enables complementary information exchange between point cloud and RGB modalities at the input level. Secondly, we introduce a local self-supervised geometric color reconstruction network with group-wise feature alignment, enhancing fine-grained feature extraction through joint color-geometry reconstruction tasks. Finally, we propose a local window spatial-consistent attention fusion, which achieves semantic consistency and spatial consistency by emphasizing local mutation features to improve the detection of subtle anomalies. Extensive experiments show our model achieves 99.1% I-AUROC on MVTec 3D-AD and 91.7% on Eyecandies, both surpassing state-of-the-art methods.

Abstract:
In recent years, multi-view graph clustering (MVGC) has attracted increasing attention from researchers. However, many existing MVGC methods focus on view-level integration, for example by assigning weights to different views, and ignore cross-view interactions between nodes. In fact, cross-view interactions at the node level are crucial for the extraction and fusion of semantic information. Additionally, some methods separate representation learning from clustering, which results in suboptimal clustering performance. To address these problems, we propose a novel unsupervised cross-view message passing method for MVGC. The core of our method is the cross-view interaction mechanism, which dynamically constructs node-specific cross-view edges based on node features and structural information. The mechanism enables adaptive interactions of informative nodes from different views, which promotes the extraction and propagation of complementary information. Besides, our method unifies representation learning and hyperspherical clustering in an end-to-end framework, which projects node representations into a hypersphere space, thereby enabling direct acquisition of balanced clustering results without dependence on external clustering methods. We provide comprehensive analyses of our method and evaluate it on six multi-view datasets. The results show that our method consistently outperforms existing state-of-the-art multi-view clustering methods.

Abstract:
Anchor-based multi-view clustering has received a lot of attention due to its efficiency in handling large-scale datasets. However, existing methods rely on penalty-based regularization terms in anchor graphs to handle noise and outliers but overlook the role of consistent semantics in label contributions, failing to effectively mitigate their impact and potentially deviating from actual data distributions. In addition, most strategies use adaptive anchor learning without considering the veracity of anchor selection and the lack of sufficient semantic support in modeling semantic consistency, which leads to anchors deviating from the clustering center. To solve the above problems, we propose a novel method called Prior-oriented Anchor Learning with Coalesced Semantics for Multi-View Clustering (PALCS). Specifically, PALCS strips out inconsistent semantics from anchor graphs to be processed separately through coalesced semantics and highlights consistent semantics to reveal the underlying shared structure of the data. Moreover, PALCS enhances the semantic consistency and discriminative properties of anchors by directing them to be evenly distributed across clusters via the prior matrix. Finally, the clustering labels are directly obtained by non-negative matrix decomposition, avoiding additional post-processing steps. Extensive experimental evidence demonstrates the superiority of our method compared to state-of-the-art methods.

Abstract:
The task of spatiotemporal dynamic modeling of multi-modal high-dimensional neuroimaging data presents a significant challenge in the field of neuroscience. Recent works often integrate attention mechanisms for hierarchical modeling, but the gradual extraction of spatiotemporal features leads to feature isolation. Moreover, attention-based fusion mechanisms (such as cross attention) tend to focus on learning the self-similarity between different modalities, lacking sufficient exploration of the complementary information across modalities. To address these challenges, we propose the cross swin 4D transformer (CrosST), which can efficiently learn the spatiotemporal patterns of multi-modal high-dimensional neuroimaging data in an end-to-end manner. The unique diffusion cross attention fusion mechanism of CrosST connects features from different modalities through a diffusion strategy during the attention computation, enabling the transfer of differential information between modalities and achieving deep fusion of multi-modal coupled features. Additionally, a voxel interaction strategy is employed to alleviate the computational burden during the fusion process. Furthermore, CrosST utilizes a 4D shifted window technique to effectively combine local and global information, and introduces the innovative 4D-Mamba algorithm to enhance computational efficiency. We validate the model using a large-scale Alzheimer's disease dataset and design a multi-granularity cognitive stage task for evaluation. The results demonstrate the effectiveness of CrosST.

Abstract:
Graph Neural Networks (GNNs) are effective for processing graph data; however, their heavy reliance on label information limits their generalization across domains. At the same time, Large Language Models (LLMs) have made significant progress across diverse domains, sparking growing interest in their potential for processing and understanding graph data. Nevertheless, their effectiveness is constrained by challenges such as space misalignment and global topology blindness, which arise because LLMs are trained primarily on Euclidean data rather than non-Euclidean structures. To address these challenges, we propose BridgeGLM, a novel framework that bridges graph and language spaces to improve domain generalization. BridgeGLM integrates topology-aware graph representations to capture higher-order structural relationships and employs a semantic-aware tokenizer to generate enriched node representations. Furthermore, we introduce three contrastive learning strategies based on graph interactions to effectively align graph and language representations. During testing, task-specific instruction templates facilitate zero-shot node classification. Extensive experiments on six datasets, covering both academic and recommendation graphs, show that BridgeGLM consistently outperforms state-of-the-art baselines across in-dataset, intra-domain, and cross-domain settings. In cross-domain settings, BridgeGLM achieves a 2-7% improvement in key performance metrics compared to the existing state-of-the-art method.

Abstract:
Deep long-tailed recognition (DLTR) has garnered increasing attention due to the inherent imbalance in many real-world problems (e.g., multimedia processing). Recently, some multi-objective optimization (MOO)-based solutions have been proposed to address conflicts during representation learning in DLTR. However, these methods face two primary challenges: (1) their effectiveness is subject to the power of MOO, which is arguable in recent literature, and (2) MOO approaches are resource-intensive due to frequent gradient operations. In this paper, we propose a novel approach: conflict-Buffering OptimizatiOn by Symmetry Teleportation (BOOST), which avoids altering complicated gradient combinations as previous methods did. A major challenge in this approach is the absence of off-the-shelf symmetry teleportation algorithms suitable for modern deep neural networks. To address this, we cast symmetry teleportation as the optimization of low-rank adaptation (LoRA). Specifically, we first divide categories into multiple groups and detect conflicts among them. When a conflict arises, we employ LoRA to identify an alternative point on the same loss level set, reducing conflicts and facilitating balanced optimization. To achieve this, we decouple symmetry teleportation into two objectives (loss invariance and balanced gradient maximization) and design corresponding objectives for LoRA optimization. Besides, we propose a trajectory reuse strategy to continually benefit from advanced optimizers. Extensive experiments demonstrate that BOOST achieves state-of-the-art performance across multiple mainstream DLTR datasets.

Abstract:
Handwritten Mathematical Expression Recognition (HMER) remains a challenging task due to the structural complexity of mathematical notation and the ambiguity of handwritten symbols, e.g., "ρ" vs. "p" or "B" vs. "β". While stroke-based models offer disambiguation via temporal cues, most existing methods are constrained by coarse modality fusion and a lack of fine-grained cross-modal alignment, further hindered by limited annotated data. We introduce Art for Math (Art4Math), a novel framework that leverages the structural richness of human sketches to enhance HMER through fine-grained, modality-aware learning. Art4Math follows a two-stage training paradigm: Art Grounding (A-Grd) and Math Decoding (M-Dec). In A-Grd, the model is trained to reconstruct masked regions of sketches via joint modeling of visual and stroke-level features, encouraging sensitivity to local structural cues and inter-modality alignment. This Art Grounding cultivates a strong inductive bias for parsing abstract, sparse visual forms. M-Dec then adapts this representation to the HMER domain, enabling more precise symbol disambiguation and structural decoding with limited supervision. Extensive experiments across sketch and handwriting-related tasks, including sketch recognition, retrieval, and HMER, demonstrate that Art4Math significantly outperforms existing self-supervised methods, revealing the overlooked synergy between artistic abstraction and mathematical expression.

Abstract:
Multi-view clustering aims to extract and integrate semantic information from multiple views to improve clustering performance. While deep learning-based approaches have shown promising results, they suffer from noisy view dependency (NVD) and dominant view dependency (DVD), limiting their robustness and effectiveness. NVD arises when models fail to filter out irrelevant variations, treating noise as semantic information. DVD occurs when models over-rely on dominant views, neglecting complementary information from other perspectives. To address these challenges, we propose CausalMVC, a causal content-style representation learning method for deep multi-view clustering. To mitigate NVD, we incorporate causal content-style disentanglement via a dual differential content-style network that separates semantic information from noise. Meanwhile, to reduce DVD, we introduce causal content consistency that aligns semantic content from both intra-view and cross-view perspectives. Besides, we design a content-centered style receptive field for contrastive learning, enhancing the semantic association between positive sample pairs while preventing over-alignment to dominant views. Extensive experiments on ten benchmark datasets demonstrate that CausalMVC outperforms state-of-the-art methods, validating its effectiveness.

Abstract:
Multimodal Anomaly Detection (MMAD) has attracted significant attention in industrial defect inspection as it can simultaneously leverage the complementary information from different modalities to achieve higher-precision detection. Among existing MMAD approaches, dual-branch reverse distillation is widely adopted because of its efficiency in avoiding large-scale data storage. However, it suffers from two key issues. First, the alignment of cross-modal features can lead to a loss of modality-specific characteristics. Second, when one modality indicates normal while another shows anomalies, anomaly detection may be misled by this modality ambiguity. To address these challenges, we propose a Frequency-Aware Multimodal Reverse Distillation (FAMRD) framework from the frequency domain perspective. Specifically, we introduce a frequency spectral feature alignment module that aligns the low- and medium-frequency components across modalities to preserve global shape consistency, while maintaining high-frequency modality-specific details. In addition, we design a frequency spectral anomaly synthesis module. It perturbs the normal feature of one modality to create modality-consistent anomalies, fuses the perturbed feature with the normal feature of another modality to mimic modality-ambiguous anomalies, and adds both to the reverse distillation process for decision boundary optimization. Extensive experiments on standard MMAD benchmarks demonstrate that FAMRD achieves competitive performance in both anomaly detection and localization, outperforming state-of-the-art methods.

Abstract:
Extrinsic calibration is a fundamental step in sensor fusion systems. However, existing methods often lack generalization capabilities when facing diverse hardware configurations, sensor poses, and environmental conditions, hindering their large-scale deployment. To address this limitation, we propose a general extrinsic calibration method, CalibWorkflow. Our core innovation lies in positioning multimodal large language models (MLLMs) as "visual guides" for the calibration process, leveraging their powerful vision-language understanding capabilities to guide parameter search and refinement. This reliance on visual scene understanding, rather than specific geometric features or sensor characteristics, enables the method to generalize effectively across diverse hardware and environmental conditions. Specifically, CalibWorkflow employs a three-stage calibration pipeline: initial parameter search, coarse optimization, and fine optimization. First, it utilizes the MLLM to assess the visual consistency between the projected point cloud and the image, rapidly determining an initial range for the extrinsic parameters. Next, the MLLM serves as a differential evaluator, giving simple "better" or "worse" feedback on parameter changes to guide the search through the parameter space. Finally, the method refines the calibration by matching edge features and performing non-linear optimization. Extensive experiments are conducted across six diverse scenarios and four heterogeneous sensor combinations. CalibWorkflow achieves state-of-the-art sub-degree and centimeter-level accuracy on four datasets and demonstrates highly competitive performance on others. These results validate the generalization and robustness of the method across various scenarios. Code will be made available.
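
The coarse-optimization stage, in which an evaluator only reports "better" or "worse", resembles derivative-free comparison search. The sketch below illustrates that general idea and is not CalibWorkflow's algorithm. The `is_better` callback stands in for the MLLM judgment and is entirely hypothetical, as are the step sizes.

```python
from typing import Callable, List, Sequence

# Hypothetical comparator: True if the candidate parameters look better than the current ones.
Comparator = Callable[[Sequence[float], Sequence[float]], bool]


def comparative_search(initial: Sequence[float], is_better: Comparator,
                       step: float = 1.0, min_step: float = 0.01) -> List[float]:
    """Coordinate search driven only by pairwise better/worse feedback."""
    current = list(initial)
    while step >= min_step:
        improved = False
        for index in range(len(current)):
            for direction in (1.0, -1.0):
                candidate = list(current)
                candidate[index] += direction * step
                if is_better(candidate, current):
                    current = candidate
                    improved = True
                    break
        if not improved:
            step /= 2.0  # no neighbor was judged better, so refine the step
    return current


# Stand-in evaluator for the sketch: compares squared distance to a known target.
target = [0.3, -1.2, 2.0]


def closer_to_target(candidate: Sequence[float], current: Sequence[float]) -> bool:
    def error(params: Sequence[float]) -> float:
        return sum((p - t) ** 2 for p, t in zip(params, target))
    return error(candidate) < error(current)


estimate = comparative_search([0.0, 0.0, 0.0], closer_to_target)
```

Because the search needs only an ordering between two parameter sets and never a numeric loss, a qualitative visual judgment can drive it. A real evaluator would be noisy, which this sketch does not handle.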

Abstract:
Whole-body PET tumor segmentation remains challenging due to limited training data and substantial tumor heterogeneity, which impact segmentation accuracy and clinical utility. Tumor distribution information is usually contained in patient medical records and routinely utilized in medical image interpretation, and it can also be used to improve segmentation accuracy. This study introduces a novel 3D PET/MR tumor segmentation framework that integrates tumor distribution priors extracted from medical records. The proposed Tumor Localization Priors (TLP) are generated from medical records using Large Language Models (LLMs) together with MRI-based organ localization. Furthermore, the Region-Aware Fusion Module (RAFM) is designed to fuse TLP and encoded PET information through attention within the Anatomically-Consistent Multitask Model (ACMM). Moreover, an Anatomical Consistency Loss (AC Loss) is introduced, integrating tumor localization and its anatomical distribution to enhance segmentation performance. Our method achieves a 9.30% Dice improvement over the baseline nnU-Net v2, with a particularly notable 23.06% gain in precision while maintaining high recall (+6.19%). Clinical evaluations confirm superior detection of both primary and metastatic lesions, alongside reduced physiological uptake artifacts.

Abstract:
Hybrid-modal table understanding (HMTU), which targets leveraging multi-modal table evidence for multi-hop reasoning, has garnered widespread attention. Existing models primarily focus on effectively integrating multi-modal table evidence to enhance the table understanding capabilities of multi-modal large language models (MLLMs). However, these models ignore the fact that different types of table understanding questions lean toward different modalities of table evidence. Consequently, these models suffer from low utilization efficiency and poor interpretability. To address these issues, in this paper, we propose a modality preference alignment model, called ESTJ, which Enhances Structured Tendency Judgment in HMTU. Specifically, ESTJ first samples modality preference data from the responses generated by MLLMs. Then, it alleviates modality preference imbalance by adhering to the principle of least modality priority. Finally, ESTJ performs direct preference optimization (DPO) training based on structured tendency judgment to align modality preference effectively. Experimental results on TableQA and TableFV tasks demonstrate that our proposed model outperforms state-of-the-art baselines. Additionally, these results present fascinating phenomena and unveil profound insights into modality preference for table understanding.

Abstract:
Multimodal models leverage complementary information across modalities to enrich feature representations. While visual information shows potential in representing structure for some combinatorial optimization problems (COPs), its application to complex scheduling like the Flexible Job Shop Scheduling Problem (FJSP) remains underexplored. Current learning-based FJSP solvers predominantly rely on handcrafted state features. This dependence can lead to inconsistencies and may not fully capture the problem's intricate dynamics. Crucially, these methods overlook visual modalities. Visual representations offer a distinct advantage by inherently capturing the global topological structure and complex resource interactions within the FJSP state. Unlike localized handcrafted features, this holistic, structural view provides a richer foundation for understanding scheduling complexity and making informed decisions. To overcome these limitations by leveraging visual information, which is known for representing topological structures and providing richer state representations, we introduce the AO-framework. This multimodal feature fusion approach enhances handcrafted state features by integrating insights from visual data. Our core contribution is a novel fusion mechanism utilizing orthogonal projection and local attention. Unlike traditional methods that often rely on simple concatenation of visual data, our method uniquely reduces redundancy by projecting global image-derived features onto local handcrafted features. This process extracts distinct information inherent to the visual modality, significantly improving the quality and complementarity of the resulting state features and enabling more informed scheduling decisions. To our knowledge, the AO-framework represents the first multimodal framework applied to scheduling problems, demonstrating the significant potential of visual information in this domain. Extensive experiments across various FJSP solvers and datasets confirm that our framework yields substantial enhancements in solution quality, decision-making capabilities, and generalization.
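
One way to read the orthogonal projection step is that the part of the image-derived feature already explained by the handcrafted feature is removed, leaving only the complementary component. The sketch below shows that reading on plain vectors. It is an assumption for illustration, and the name `orthogonal_residual` is hypothetical; the abstract does not specify the exact operator or how it combines with local attention.

```python
def orthogonal_residual(global_feat, local_feat):
    """Return the component of `global_feat` orthogonal to `local_feat`.

    The projection of the global (visual) feature onto the local (handcrafted)
    feature is treated as redundant and subtracted.
    """
    dot_gl = sum(g * l for g, l in zip(global_feat, local_feat))
    dot_ll = sum(l * l for l in local_feat)
    if dot_ll == 0:
        return list(global_feat)  # nothing to project onto
    scale = dot_gl / dot_ll
    return [g - scale * l for g, l in zip(global_feat, local_feat)]


visual = [3.0, 4.0]
handcrafted = [1.0, 0.0]
complementary = orthogonal_residual(visual, handcrafted)
```

The residual has zero inner product with the handcrafted feature, so appending it to the state adds information the handcrafted feature does not already carry.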

Abstract:
Multi-modal feature fusion under conditions of image misalignment remains a significant challenge in multispectral object detection. Existing approaches predominantly rely on cross-attention mechanisms; however, when local features are sparse, inadequate feature capture hinders accurate alignment and results in distorted fusion outcomes. To address this problem, we propose a novel multispectral fusion detection network, CSSFDet, which leverages the intrinsic correlations among image regions in visual recognition to dynamically enhance local features via global semantic constraints during the fusion process. Specifically, we introduce a Contextual Region Feature Fusion Module (CRFM) that regulates the fusion process through a selective state-space formulation, adaptively incorporating surrounding context to compensate for local feature degradation caused by misalignment. Moreover, we design a Complementary Enhancement Module (CoE) to ensure both distinctiveness and completeness of modality-specific features. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance across multiple datasets, attaining 84.1% mAP50 on the DroneVehicle dataset, a 20% improvement over the baseline. It also shows strong performance on the misaligned CVC-14 dataset, and sensitivity analysis on data shifts further underscores its robustness to misalignment.

Abstract:
Existing medical vision-language contrastive pretraining methods aim to bring the paired image-report embeddings close together while pushing the unpaired ones apart. However, medical images often exhibit high inter-class visual similarity with only subtle differences, leading to the presence of hard negative samples that are semantically distinct from the anchor but incorrectly close to it in the embedding space, making it challenging to distinguish semantically dissimilar samples. Previous methods consider only the embedding similarity between samples to identify hard negatives, often wrongly treating false negatives as hard negatives. To address this issue, we design a simple yet effective approach called Semantic-Aware Hard Negative mining (SAHN), distinguishing hard negatives from false negatives and encouraging the model to pay greater attention to hard negatives. Specifically, hard negatives are identified as samples with high embedding similarity but low semantic similarity to the anchor and assigned greater importance weights. By integrating these importance weights into the InfoNCE loss, SAHN enhances the model's ability to separate semantically dissimilar samples while clustering semantically similar ones. We further conduct a gradient-based theoretical analysis to validate the effectiveness of SAHN. Extensive experimental results on four downstream medical tasks covering image classification, object detection, semantic segmentation, and cross-modal retrieval demonstrate the superiority of our approach.
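
The weighting idea can be made concrete with a small sketch of an importance-weighted InfoNCE term. The weighting rule used here (larger weight when embedding similarity is high and semantic similarity is low) follows the description above, but its exact form, the parameter `beta`, and the function names are assumptions rather than SAHN's published formulation.

```python
import math


def negative_weights(embed_sims, semantic_sims, beta=1.0):
    """Assumed rule: weight grows with embedding similarity and with semantic dissimilarity.

    Semantic similarities are assumed to lie in [0, 1].
    """
    return [1.0 + beta * max(e, 0.0) * (1.0 - s)
            for e, s in zip(embed_sims, semantic_sims)]


def weighted_info_nce(pos_sim, neg_embed_sims, weights, tau=0.1):
    """InfoNCE for one anchor, with a per-negative importance weight in the denominator."""
    pos = math.exp(pos_sim / tau)
    neg = sum(w * math.exp(e / tau) for w, e in zip(weights, neg_embed_sims))
    return -math.log(pos / (pos + neg))


# Three negatives: a hard negative, a likely false negative, and an easy negative.
embed_sims = [0.8, 0.8, 0.1]
semantic_sims = [0.1, 0.9, 0.1]
weights = negative_weights(embed_sims, semantic_sims)

plain_loss = weighted_info_nce(0.9, embed_sims, [1.0, 1.0, 1.0])
weighted_loss = weighted_info_nce(0.9, embed_sims, weights)
```

The first negative is close in embedding space but semantically different, so it receives the largest weight; the second is equally close but semantically similar (a likely false negative), so it is emphasized far less.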

Abstract:
Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. Existing EVC methods perceive global emotional cues through visual features at first, and then combine them with the video features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task that emotional cues have intrinsic motivational causes reflected in the video content. Such video causes have a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, a multi-round mutual emotion-cause pair extraction network (MM-ECPE) is proposed in this paper for the joint extraction of emotional cues and visual causes through iterative mutual refinement. Specifically, in the 1st-round mutual learning, we propose a spatio-temporal disentangled visual adaptive refinement (ST-DVAR) and a multi-level video-guided emotion affine transformation (MV-EAT) to achieve preliminary refinement on video features and emotion lexicon to eliminate the noise caused by emotion-irrelevant visual information and video-irrelevant emotional information. Then, in the 2nd-round mutual learning, we exploit the cross-attention of the preliminary refined features and the original features to obtain the ultimate emotional cues and visual causes, and couple them in pair-wise extraction through contrastive loss. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., improving the latest records by +97.5% and +76.2% w.r.t. CIDEr and CFS, respectively, on the EVC-MSVD dataset.

Abstract:
In existing Sign Language (SL) research, most datasets and backbone models focus on sentence-level samples. However, annotated sentence-level SL datasets are rather limited, and there is a pressing need to expand them. Considering the large-scale long SL videos with captions, we propose a new task, i.e., Sentence-level Sign Language Segmentation (SSLS), which splits long videos into consecutive sentence-level videos. SSLS is an important and meaningful task, which can greatly reduce the labor costs of data annotation for sentence-level SL datasets. However, SSLS is a very challenging task, since it is rather difficult to accurately find the boundary of each sentence in a long video. To address this issue, we formalize, learn, and optimize the boundaries of sentences step by step. First, to distinguish the boundary and the inside of a sentence, we formalize SSLS as a frame-level classification task and design a boundary annotation scheme. Second, to learn the boundary of each sentence from the long video, we design a multimodal framework, SignBD, which correlates the local features and global features through dual dilated attention, while aligning visual and textual (i.e., sentences) modalities through gated cross-attention. Third, to alleviate the widespread over-segmentation and under-segmentation problems in segmentation tasks, we propose a boundary optimization strategy, which utilizes the number of sentences provided by captions to optimize (i.e., insert or delete) boundaries based on information uncertainty. Extensive experimental results demonstrate the superiority of our solution.

Abstract:
Cross-category object perception is one of the essential upstream tasks for generalizable robot object interaction and manipulation. Recently, an increasing number of researchers have focused on visual understanding of Generalizable and Actionable Parts (GAParts) for cross-category perception. However, these works are built upon RGB-D or point cloud input, which relies on captured depth information. When depth camera performance is limited, e.g., with transparent or light-absorbing materials, perception algorithms that do not require depth information are urgently needed. In this paper, we propose DFGAP, a novel depth-free framework for RGB-based GAParts segmentation and pose estimation. Specifically, we independently model the ill-posed problems arising from the absence of depth for GAPart segmentation and pose estimation by explicitly quantifying the pixel-wise segmentation probability and relative depth, which reduces uncertainty and benefits learning in both tasks. The experimental results demonstrate the superior performance and robustness of DFGAP. Our work provides a new research paradigm for GAParts perception, and we believe it has great potential to be applied in many areas of embodied AI systems.

Abstract:
Multispectral image demosaicing aims to reconstruct full-band multispectral images from compressed spectral mosaic images. Although existing learning-based methods have made progress in multispectral image demosaicing, intrinsic performance bottlenecks remain due to the heavy undersampling imposed by the mosaic pattern. To address this issue, we propose a Polarity memory network with quant attention to establish global correlation, thus reconstructing high-quality multispectral images from compressed spectral mosaic images. The proposed Polarity memory network adaptively encapsulates reconstruction-oriented representations, then amplifies relevant ones and reduces noise from irrelevant ones in a polarity-aware manner, better catering to the enhancement of different spectral information with linear computational complexity. Moreover, considering existing methods' inability to adequately compensate for long-distance interactions in reconstruction, we introduce a quant attention paradigm that categorizes tokens into semantic-aware groups using an efficient quant operation for attention computation. Experimental results show that our method achieves state-of-the-art performance on various simulation datasets and better visual results on real-world datasets.

Abstract:
Recently, 3D Gaussian Splatting (3DGS) has achieved remarkable results in 3D reconstruction and view synthesis tasks. However, single-view feed-forward 3DGS still faces significant challenges. Current state-of-the-art (SOTA) single-view 3DGS methods typically employ a small number of layers (1-2 layers) with Gaussian Splatting (GS) representations at the same resolution as the input image to address the irregularity of GS data. However, such shallow and uniform GS primitive distributions struggle to represent occluded regions and important spatial details. Inspired by multi-plane images, this paper proposes a Multi-Layer Gaussian Splatting (MLGS) representation, which consists of shallow base GS layers for visible content and multiple occlusion GS layers dedicated to reconstructing occluded regions. The proposed MLGS representation explicitly decouples the learning processes of visible and occluded content while enhancing occlusion prediction through the following components. First, spatial stratification of GS is achieved by estimating the depth distribution range of GS primitives across different layers, forcing GS to learn spatial content reconstruction at different depths. Second, a mask-guided mechanism is proposed to effectively isolate occlusion regions and guide inpainting using spatially context-aware features. Finally, a gated convolution block is designed to dynamically modulate feature fusion to enhance reconstruction fidelity. With separate loss supervision for base and occlusion layers, MLGS enables geometrically plausible scene completion. Experiments on RealEstate10K, KITTI, and NYUv2 datasets demonstrate that the proposed method achieves SOTA performance for single-image spatial scene reconstruction.

Abstract:
Referring video object segmentation (RVOS) focuses on segmenting target objects in a video based on natural language descriptions. However, existing methods typically rely on text cues that are unrelated to video content, and the target entity is only recognized in the pixel space. This often leads to ambiguous cross-modal understanding and fragmented perception across space and time, resulting in inaccurate or incomplete segmentation of the target objects. To address these challenges, a novel wavelet calibration learning (WaveCL) framework is proposed to unify cross-modal understanding and preserve spatial-temporal integrity of the target object. The WaveCL framework is built on two core components: semantic-calibrated entity perception (SEP) and wavelet-guided integrity perception (WIP). SEP aligns the textual semantics with video content, enabling more accurate and context-aware cross-modal understanding. WIP, on the other hand, leverages wavelet representations to capture fine-grained details of the target object from a global spatial-temporal perspective. By refining wavelet clues with the guidance of text queries, WIP enhances the integrity of segmentation. Through the collaboration of SEP and WIP, WaveCL enables precise, target-specific segmentation with detailed boundaries and consistent spatial-temporal perception. Extensive experiments on four benchmark datasets (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) show that WaveCL outperforms existing state-of-the-art methods. The source code of this work can be found at https://mic.tongji.edu.cn.

Abstract:
Deep Neural Networks (DNNs) have become increasingly prevalent in various applications, yet they remain vulnerable to adversarial attacks, particularly through the use of adversarial examples (AEs). This paper introduces the concept of the stealthy AE, which is benign before transmission through Online Social Networks (OSNs) but becomes adversarial after processing. The inherent transformations applied by OSNs, such as image compression and format conversion, can activate adversarial properties that are originally hidden. We present a suite of stealthy AE generation frameworks. Our scheme involves quality factor calculation, a diffusion model with differential JPEG layers to simulate OSN transmission, and the Lagrange multiplier method for optimizing AE generation. Extensive experiments demonstrate that our method consistently outperforms seven state-of-the-art adversarial example generation techniques across multiple OSNs and victim models. Moreover, resistance detection evaluation and extended experiments with different attack settings also demonstrate the scalability of our scheme.

Abstract:
Visual reasoning is a key capability that significantly impacts the performance of multimodal tasks, such as compositional visual question answering and visual grounding. These tasks often require complex, multi-step reasoning processes. In recent years, several training-free methods for Vision-Language Models (VLMs) have emerged, with visual programming methods being proposed to enhance the capability of VLMs in visual reasoning tasks. While these methods have made some progress, they still face two primary challenges due to the lack of verification and refinement mechanisms for each action's output during the reasoning process: error accumulation and feedback delay, as well as insufficient utilization of multimodal contextual information. To address these challenges, we propose VL-DynaRefine, a training-free approach consisting of three modules: a planner, a verifier, and a refiner. The planner generates programmatic actions to solve the problem and executes each action in sequence, which is inspected by a verifier that reassesses the actions via confidence scores and determines whether refinement is necessary based on the evaluation results. In the refiner module, we incorporate a context-aware local refinement mechanism and a global refinement mechanism based on visual and action trajectories to reduce the impact of reasoning errors on the outcome. We evaluate our approach on multiple visual reasoning datasets, and the experimental results show that our method outperforms existing visual programming methods in both reasoning accuracy and efficiency, further validating its effectiveness in visual reasoning tasks.

Abstract:
The Who-What-Where (3W) composite-semantics video Instance Search (INS) task aims to find video shots about a person doing an action in a location. The state-of-the-art (SOTA) methods decompose 3W INS into three 2W INS, i.e., who-what, what-where and where-who semantic correlation modeling, and directly multiply three 2W INS results to produce the final 3W INS result. Obviously, overlapping semantics exist among the above 2Ws, e.g., who-what and what-where share the action component. The semantic overlap indicates that the 2Ws are mutually interdependent rather than independent. According to probability theory, the joint probability of interdependent variables cannot be obtained by directly multiplying their individual probabilities, so such a direct product yields a suboptimal outcome. This interdependence exerts diverse influences on the 3W INS results. For instance, when fusing the two 2W INS results "Dr. Kelleher-provide medical guidance" and "provide medical guidance-in the hospital", "provide medical guidance" is a pivotal connection that positively enhances the rationality of both person and location. Conversely, while both "Ross-lifts heavy objects" and "lift heavy objects-Ross" are individually coherent, combining them by overlapping the shared element "Ross" creates a conflict between the hazardous setting and strenuous labor, ultimately undermining the overall plausibility. Inspired by quantum interference theory, we propose a Quantum Interference Partial Decomposition (QIPD) method to model the diverse influences of semantic overlap from 2W to 3W INS. Specifically, QIPD incorporates two core modules, i.e., semantic interference and temporal interference. The former derives the 3W amplitude by converting 2W samples into amplitudes and phases and performing interference, while the latter sets the current shot's phase as baseline, amplifying the influence of adjacent shots while attenuating distant shots. Extensive evaluations on three large-scale 3W INS datasets demonstrate that QIPD outperforms SOTA baselines.
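
The contrast between a direct product and an interference-style combination can be illustrated with the standard two-path interference formula, in which two complex amplitudes are added before the squared magnitude is taken. The sketch assumes each 2W score is mapped to an amplitude by a square root and that a single phase gap encodes how compatible the two results are. Both choices, and the name `interfere`, are illustrative assumptions; the abstract does not state how QIPD derives amplitudes and phases.

```python
import cmath
import math


def interfere(score_a, score_b, phase_gap):
    """Combine two scores as interfering amplitudes.

    Equals score_a + score_b + 2 * sqrt(score_a * score_b) * cos(phase_gap),
    so the cross term can reinforce (gap near 0) or cancel (gap near pi).
    The result is unnormalized and may exceed 1.
    """
    amp_a = math.sqrt(score_a)
    amp_b = math.sqrt(score_b) * cmath.exp(1j * phase_gap)
    return abs(amp_a + amp_b) ** 2


constructive = interfere(0.25, 0.25, 0.0)         # compatible overlap
neutral = interfere(0.25, 0.25, math.pi / 2)      # no cross term
destructive = interfere(0.25, 0.25, math.pi)      # conflicting overlap
direct_product = 0.25 * 0.25                      # phase-blind baseline
```

Unlike the direct product, which gives the same value regardless of how the two results relate, the interference term lets the shared element either strengthen or weaken the fused score.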

Abstract:
Visual impairment affects over 200 million individuals globally, creating significant challenges in daily visual tasks. While vision-language models offer transformative assistive potential, existing systems based on Multimodal Large Language Models (MLLMs) face a serious cross-contamination problem when processing real-world images captured by blind and low-vision (BLV) users: when jointly processing imperfect images and specific questions, current models are often misled by question assumptions rather than adhering to visual facts, generating hallucinations about objects not present in the image. We introduce DR-VQA (Decompose-then-Reconstruct Visual Question Answering), a novel framework that balances user intent with visual facts. Our approach prevents cross-contamination through structured reasoning: it deliberately separates image processing from question analysis, ensuring model-generated descriptions are strictly based on image facts without being influenced by questions. Subsequently, through a structured decomposition mechanism, the system generates targeted sub-questions relevant to user intent, gradually aligning visual descriptions with user needs while minimizing question bias. During final synthesis, a memory-reset LLM reconstructs the reasoning chain with detailed information to generate responses that either provide evidence-supported conclusions or transparently acknowledge information limitations. Experimental evaluations demonstrate our framework's effectiveness in reducing hallucination risks while improving answer accuracy. By systematically balancing user intent with factual visual evidence, this work advances BLV-assistive technologies from probabilistic outputs to reliable visual assistance services.

Abstract:
Decoding dynamic visual information from brain activity remains challenging due to inter-subject neural heterogeneity, limited per-subject data availability, and the substantial temporal resolution gap between fMRI signals (0.5Hz) and video dynamics (30Hz). Current approaches face persistent challenges in addressing these temporal mismatches, demonstrate limited capacity to integrate subject-specific neural patterns with shared representational frameworks, and lack adequate semantic granularity for aligning neural responses with visual content. To bridge these gaps, we propose CrossMind-VL, a framework addressing these limitations through three innovations: (1) a Dynamic Temporal Alignment module that resolves temporal mismatches via exponentially decayed multi-frame fusion with adaptive decay coefficients; (2) a Brain Mixture-of-Experts architecture that combines subject-specific extractors with shared expert layers through parameter-efficient tri-modal contrastive learning; and (3) a Multi-perspective Semantic Hyper-Anchoring module that resolves cross-subject attention bias via multi-dimensional semantic decomposition, leveraging multimodal LLMs for fine-grained video semantic extraction. This enables the model to match individual attention patterns, as different subjects naturally focus on distinct aspects of the same visual stimulus. This module boosts Top-10/Top-100 retrieval by 17.7%/6.6%. Experiments on two video-fMRI datasets demonstrate state-of-the-art performance, with 39%/30% improvements in Top-10/Top-100 accuracy over single-subject baselines and 27% gains against multi-subject models. The framework exhibits remarkable few-shot adaptability, retaining 97% performance when using only 10% training data for new subjects. Visualization analysis confirms this generalization capability stems from effective disentanglement of subject-specific and shared neural representations.
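
Exponentially decayed multi-frame fusion can be sketched as a normalized weighted average in which a frame's weight falls off with its temporal distance from the fMRI sample. This is a minimal illustration under assumptions: the decay coefficient is a fixed argument here, whereas the abstract describes adaptive coefficients, and the name `decayed_fusion` is hypothetical.

```python
import math


def decayed_fusion(frames, decay):
    """Fuse per-frame feature vectors with weights exp(-decay * k).

    `frames` is ordered by temporal distance from the fMRI sample, nearest
    first, so index k acts as the distance. Weights are normalized to sum to 1.
    """
    weights = [math.exp(-decay * k) for k in range(len(frames))]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) / total
            for d in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0]]
uniform = decayed_fusion(frames, decay=0.0)    # no decay: plain average
focused = decayed_fusion(frames, decay=50.0)   # strong decay: nearest frame dominates
```

Setting the decay to zero recovers a plain average of the frames, while a large decay makes the fused feature follow the nearest frame almost exclusively.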

Abstract:
Weakly Supervised Object Localization (WSOL) relies only on image-level labels to realize object localization, significantly reducing the cost of fine-grained annotations. While traditional CAM-based methods excel at identifying the most prominent regions of objects, they frequently neglect other essential components, resulting in partial or incomplete object localization. The foreground prediction map (FPM) generates finer-grained activation maps using underlying features to address the shortcomings of CAM, but it may still have coverage blind spots. To this end, this paper proposes a collaborative optimization framework based on cross-modal semantic alignment that deeply integrates the saliency awareness of CAM with the refined representation capabilities of FPM. It introduces a multimodal pretrained model (CLIP) to construct a semantic-driven WSOL paradigm. By letting CLIP's text embeddings dynamically interact with the semantics of image categories, a semantic-enhanced FPM based on similarity measurement is generated. Leveraging CLIP's cross-modal alignment capabilities, a targeted generation scheme is designed. On the one hand, the CLIP model is frozen and its features are refined through a decoder to obtain richer semantic representations; on the other hand, by using knowledge distillation, the CAM generated by CLIP is taken as a reference benchmark, guiding the network to learn more accurate target localization. Additionally, to enhance FPM's focus on foreground regions, the Exponential Decay Foreground Emphasis (EDFE) module is designed, which uses a differentiated excitation strategy to effectively suppress background interference and highlight target areas. Experimental results show that our method significantly improves the completeness and boundary accuracy of target localization under weak supervision, laying a solid foundation for subsequent downstream tasks.

Abstract:
Long-form video understanding (LVU) addresses the challenge of answering complex questions over extended video length, where informative cues are sparse and easily overwhelmed by redundant content. To tackle this, it requires selecting a small set of question-relevant keyframes and reasoning over long-range, temporally dispersed visual evidence. However, current methods typically extract frame-level features with limited temporal context and store them in sequential memory structures. As a result, they struggle to capture the evolving relations among entities and fail to maintain identity consistency when entities temporarily leave and later reappear in the video. These limitations prevent accurate keyframe localization and coherent reasoning.

Abstract:
Advances in automatic graphical user interface (GUI) agents have brought significant privacy and security challenges, especially for cloud-based solutions that may leak sensitive data and be vulnerable to man-in-the-middle attacks. To address these issues, we propose MMPro, a novel GUI agent framework that adopts a separated perception-thinking-execution architecture. The perception module and the execution module process inputs locally to generate outputs to ensure security, and the thinking module operates on abstract representations to ensure the effectiveness of the GUI agent. By modularizing each stage while introducing a hybrid description language (HDL), MMPro transforms screen images into abstract structured representations, minimizing the risk of sensitive information leakage. Experimental results on the OSWorld benchmark show that MMPro outperforms existing GUI agent methods while ensuring strong privacy protection and real-time interaction capabilities. This work presents a pioneering approach to developing efficient, privacy-conscious, and explainable automated GUI agents.

Abstract:
Class incremental medical segmentation (CIMS) aims to sequentially learn new classes while preserving knowledge of previously learned categories in the absence of old-class labels. Current methods suffer from performance degradation under class imbalance and require additional segmentation heads to accommodate new categories. Inspired by recent prototype learning that leverages prototypes to achieve robust recognition of new categories under limited-data regimes, we introduce a Prototype-dRIven class increMEntal (PRIME) method. PRIME replaces the incremental segmentation heads with prototypes to mitigate class imbalance, allowing new class learning with the simple addition of new prototypes. Based on prototype learning, PRIME further involves three tailored techniques. First, prototype structure alignment imposes structural constraints on inter-prototype relations to maintain consistent relative distances in the feature space, improving the model's ability to distinguish distinct classes. Second, a pixel-wise contrastive loss term groups embeddings of similar samples while separating those of different classes, enhancing segmentation accuracy across all categories. Finally, a consensus-based prototype update mechanism refines the old prototypes during the learning of new classes, preventing performance degradation on the old classes. Extensive experiments on two public multi-organ segmentation datasets demonstrate that our approach significantly outperforms state-of-the-art methods, validating the effectiveness of the proposed PRIME.
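
The claim that new classes can be learned by simply adding prototypes can be illustrated with a toy prototype head: each class is represented by the mean of its embeddings, and a pixel embedding is assigned to the most similar prototype. This is a generic prototype-classification sketch, not PRIME itself. The class name `PrototypeHead`, the use of cosine similarity, and the example organ labels are assumptions, and the three tailored techniques from the abstract are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class PrototypeHead:
    """Classifies an embedding by its nearest class prototype."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, name, embeddings):
        """Register a class as the mean of its example embeddings."""
        dim = len(embeddings[0])
        self.prototypes[name] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]

    def predict(self, embedding):
        """Return the class whose prototype is most similar to the embedding."""
        return max(self.prototypes,
                   key=lambda name: cosine(embedding, self.prototypes[name]))


head = PrototypeHead()
head.add_class("liver", [[1.0, 0.0], [0.9, 0.1]])
head.add_class("kidney", [[0.0, 1.0]])
before = head.predict([0.8, 0.2])

# Incremental step: a new class is added without touching existing prototypes.
head.add_class("spleen", [[-1.0, 0.0]])
after = head.predict([0.8, 0.2])
new_class = head.predict([-0.9, 0.1])
```

No output layer is resized or retrained when the third class arrives, which is the property that makes a prototype head attractive for class-incremental settings.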

Abstract:
Deep learning-based medical image segmentation, with its precise lesion localization capabilities, serves as a core component in multimedia medical applications and intelligent diagnostic assistance systems. Recent innovations in network architectures have significantly improved segmentation performance. However, the effect of Non-Effective Samples (NES) on model optimization has received little attention. These samples are characterized by minimal variation in the loss gradient during training and contribute little to model optimization. They encompass well-segmented samples with near-zero loss values and challenging samples with consistently high loss values. In particular, when NES accumulate, the model falls into an optimization trap, causing optimization to stagnate. To address this issue, we propose a lightweight plug-and-play Gradient-Aware Sample Selection and Reactivation Strategy (GA-SRS) that efficiently identifies and revitalizes the training potential of NES. First, GA-SRS filters out NES based on the historical training information of samples and the variations in the loss gradient during the training process. Then, GA-SRS revitalizes the training value of these samples through strong data augmentation. Extensive experiments on four public datasets and three general models demonstrate the effectiveness of GA-SRS. For example, GA-SRS helps improve the IoU metric of U-KAN from 67.20% to 71.24% on the BUSI dataset and from 81.20% to 82.51% on the ISIC dataset, achieving state-of-the-art results.
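
The selection step can be sketched as a filter over each sample's training history. The sketch uses a simplified proxy, the range of the per-epoch loss over a recent window, in place of the loss-gradient variation named in the abstract. The function name, the window length, and the threshold are assumptions for illustration.

```python
def find_non_effective(loss_history, window=3, eps=1e-3):
    """Flag samples whose loss has barely changed over the last `window` epochs.

    `loss_history` maps a sample id to its list of per-epoch losses. Both
    near-zero and persistently high losses are flagged, since either pattern
    means the sample is no longer driving optimization.
    """
    flagged = []
    for sample_id, losses in loss_history.items():
        recent = losses[-window:]
        if len(recent) < window:
            continue  # not enough history to judge
        if max(recent) - min(recent) < eps:
            flagged.append(sample_id)
    return flagged


history = {
    "easy": [0.01, 0.01, 0.01],      # well segmented, near-zero loss
    "hard": [2.3, 2.3, 2.3],         # consistently high loss
    "learning": [1.0, 0.7, 0.4],     # still improving
    "new": [0.5],                    # too little history
}
to_reactivate = find_non_effective(history)
```

The flagged samples would then be passed through strong data augmentation before re-entering training, as the abstract describes for the reactivation step.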

Abstract:
Monocular 3D lane detection is a challenging task for autonomous driving systems. Recent advances primarily focus on one-step methods for lane detection based on front-view features, which show promising results on straight lanes. However, curved lanes are difficult to handle with one-step prediction, which performs prediction in a single leap without gradual refinement. To address this issue, we propose a novel Denoising Diffusion Model for 3D Lane Detection framework (D3L). The main idea is to leverage the progressive generation capability of the diffusion model to generate accurate 3D curved lanes while ensuring lane continuity through curvature constraints. The framework includes three components: a coarse-to-fine denoiser (CFD), a curvature-constrained loss (CCL), and a multi-sampling aggregation strategy (MSAS). In CFD, both lane-level and point-level transformer blocks are integrated to accurately denoise 3D lanes, which effectively captures both global and local features. CCL is designed to reduce deviations in lane curvature, resulting in smoother lane continuity. This loss enhances both the accuracy and geometric consistency of lane detection, especially in complex curved scenes. MSAS is proposed to select the optimal lane point-by-point from multiple candidates, which significantly improves the robustness of lane prediction. Extensive experiments on two popular 3D lane detection benchmarks demonstrate that D3L outperforms the state-of-the-art methods.
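To make the idea of penalizing curvature deviation concrete, the sketch below approximates curvature with discrete second differences along a lane polyline and measures how far a predicted lane departs from a reference lane. This is an assumed formulation for illustration only; the abstract does not give the actual form of CCL, and the function names are hypothetical.

```python
def second_differences(points):
    """Discrete second difference p[i-1] - 2*p[i] + p[i+1] for each interior point."""
    return [
        tuple(a - 2 * b + c for a, b, c in zip(p0, p1, p2))
        for p0, p1, p2 in zip(points, points[1:], points[2:])
    ]

def curvature_deviation(pred, target):
    """Mean squared gap between second differences of two equally sampled lanes.

    Both lanes must contain the same number of points, and at least three.
    """
    dp = second_differences(pred)
    dt = second_differences(target)
    total = sum((x - y) ** 2 for u, v in zip(dp, dt) for x, y in zip(u, v))
    return total / len(dp)

straight = [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)]
wobbly = [(0, 0, 0), (1, 1, 0), (0, 2, 0), (1, 3, 0)]
penalty = curvature_deviation(wobbly, straight)  # 4.0 for these points
```

A prediction that zigzags around a straight reference is penalized even though each point is only one unit off, which is the kind of behavior a curvature term adds on top of a point-wise loss.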

Abstract:
Image captioning bridges the gap between visual perception and natural language understanding by transforming image content into descriptive text. While existing methods have made significant progress in visual feature extraction, encoding, and cross-modal semantic alignment, challenges remain in terms of fine-grained feature representation, cross-modal alignment efficiency, and suboptimal search strategies. To address these issues, a multimodal-guided and search space-optimized image captioning model is proposed. In the visual encoding stage, we construct a hierarchical network that integrates regional and grid features through a geometry-constrained multi-layer feature aggregation mechanism, which enhances the model's capability to jointly capture global semantics and local details. In the decoding stage, we introduce a dynamic grouped beam width adjustment strategy to improve semantic path exploration. Additionally, a diversity-driven scoring function is designed to enforce intra-group diversity rewards and inter-group similarity penalties, encouraging the generation of more diverse captions. Finally, we incorporate a two-level pruning algorithm based on syntactic and spatial logic constraints to refine the search space from both hard and soft constraint perspectives, improving both the accuracy and diversity of generated captions. A 3% improvement in CIDEr is achieved by the proposed method over state-of-the-art (SOTA) models, as demonstrated by experiments on the COCO and Flickr30k datasets.

Abstract:
Generating accurate recipes from dish images is a challenging task that requires a deep understanding of food categories, ingredient combinations, cooking methods, and context. Current works mainly rely on the two-stage training method or supervised fine-tuning of vision-language models (VLMs). Two-stage models typically first predict ingredients from images and then generate recipes based on both ingredients and images. However, accumulated errors in ingredient prediction often lead to inaccurate recipes. Fine-tuned VLMs only fit the statistical patterns of the training data and lack deep reasoning capabilities, which leads to severe hallucinations in the generated recipes. In this paper, we introduce a novel reinforced retrieval-augmented generation framework named RecipeRAG for recipe generation, and compare the supervised fine-tuning (SFT) paradigm and the reinforcement fine-tuning (RFT) paradigm. To effectively retrieve recipes relevant to the query image, we improve CLIP to obtain IR-CLIP as both our retriever and re-ranker by integrating metric learning and contrastive learning. The retrieved recipes are then used to enhance the generated results, improving accuracy and reducing hallucinations. However, the SFT VLM often fails to judge the quality of the retrieved recipe information and to perform complex recipe generation. Therefore, we further investigate a two-phase RFT training framework. First, the cold-start phase uses generated Chain-of-Thought (CoT) data for SFT to activate the reasoning capabilities of VLMs. Then, the reinforcement learning phase utilizes Group Relative Policy Optimization (GRPO) to generate multiple reasoning-answer pairs, further enhancing the generalization ability of VLMs in recipe generation tasks. Extensive evaluations on the large-scale Recipe1M dataset demonstrate that RecipeRAG outperforms all previous methods in recipe generation and exhibits strong generalization ability under the RL paradigm.
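The group-relative part of GRPO can be sketched in a few lines: several answers are sampled for the same prompt, each gets a scalar reward, and each answer's advantage is its reward normalized by the group's mean and standard deviation. The reward values below are made up, the population standard deviation and the small `eps` are choices made for this sketch (implementations differ), and nothing here reflects how RecipeRAG defines its rewards.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled reasoning-answer pairs of one query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above the group average get a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.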

Abstract:
Current Text-to-Image (T2I) generation methods struggle to accurately create images with complex object relationships and scene compositions. To overcome these challenges, we propose KAIG, a novel text-to-image generative model that integrates a knowledge graph into the image generation process. Unlike traditional models, KAIG uses structured knowledge to enhance the retrieval of relevant information, enabling the generation of high-quality, contextually rich, and semantically consistent images from multi-modal inputs. We introduce a two-stage training strategy: first, condition adapters are trained to align multi-modal inputs, followed by fine-tuning the entire diffusion model. This approach ensures precise alignment between retrieved conditions and the image generation process, leading to an efficient and scalable pipeline. Our experiments on two popular datasets, MS-COCO and CUB-200-2011, show that KAIG consistently outperforms existing methods in both image quality and consistency. Notably, KAIG can seamlessly integrate with any pre-trained diffusion model, requiring minimal additional training while achieving superior results. Ultimately, KAIG demonstrates strong potential for addressing key limitations in current T2I models and advancing the field of image synthesis.

Abstract:
The classification of Egyptian hieroglyphs remains a challenging problem due to the vast variability in writing styles across time periods, regions, and individual scribes. In this work, we present a comprehensive evaluation of hieroglyph classification performance across diverse stylistic domains, highlighting the limitations of current models in generalizing beyond a single style. We introduce a dataset that spans multiple writing styles, ranging from monumental inscriptions to handwritten manuscripts, and assess several near state-of-the-art recognition models. Our analysis reveals significant discrepancies in model performance when exposed to unseen styles, underscoring the need for style-aware learning strategies. This study provides a framework for future research on hieroglyph recognition with a focus on stylistic diversity and serves as a first step toward building vision-language systems capable of analyzing Egyptian hieroglyphic writings.

Abstract:
The continuous advancement of multimedia technology has led to the exponential accumulation of massive time-stamped data. However, accurately identifying anomalies in such data remains a major challenge. Current anomaly detection methods still face serious limitations, including the difficulty in handling complex time series data and the anomaly masking phenomenon caused by overlapping temporal patterns. To overcome these limitations, we propose DMemAD, an unsupervised time series anomaly detection algorithm based on a dual-domain memory module. First, we design an STD Mamba structure that can effectively extract trend and seasonal components in the series and enhance the connection between elements in each component through bidirectional learning. Second, we design a dual-domain memory module to avoid anomaly masking by independently storing trend and seasonal patterns. Additionally, we propose a residual-based memory update mechanism to enhance the accuracy of memory updates, ensuring that prototype patterns are stored precisely. Extensive experiments on four datasets from different domains show that DMemAD achieves an average F1 score of 96.81%, outperforming 17 baseline methods and establishing state-of-the-art performance.

Abstract:
Same-product identification serves as a critical infrastructure in e-commerce systems, enabling accurate product matching across heterogeneous marketing representations for key applications such as price comparison and personalized recommendation. Conventional approaches typically depend on manual feature engineering and extensive rule tuning, which limits their adaptability to varying identification criteria across different product categories and inconsistent business scenarios. To overcome these challenges, we propose an end-to-end same-product identification model powered by multimodal large language models (MLLMs) that inherently support multimodal alignment and exhibit strong generalization across diverse real-world settings. We first introduce a novel group-wise annotation pipeline to construct a high-quality dataset, consisting of diverse product pairs with multimodal presentations and labeled at the SKU level. Building on this dataset, we incorporate task-specific training recipes from the perspective of data augmentation, resulting in our SaP-Bot, which demonstrates advanced performance and generalization capabilities. Moreover, we identify a strong correlation between the output logits of MLLMs and the product similarity, enabling interpretable confidence estimation that benefits both data annotation and downstream applications.
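One plausible way to turn output logits into a confidence score is a two-way softmax over the logits of a "yes" and a "no" answer token. The sketch below assumes that setup; the function name and the two-token framing are hypothetical and are not described in the abstract.

```python
import math

def match_confidence(logit_yes, logit_no):
    """Softmax probability of the 'yes' answer given two answer-token logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

uncertain = match_confidence(0.0, 0.0)   # equal logits give 0.5
confident = match_confidence(3.0, -1.0)  # large margin gives a score near 1
```

A score like this could be thresholded to route low-confidence pairs to human annotators, which is one way such an estimate might benefit data annotation.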

Abstract:
We focus on approximate nearest neighbor search (ANNS) in high-dimensional space, which is a fundamental technique in computer vision and multimedia databases. Among the ANNS solutions, graph-based approaches achieve excellent performance by executing a routing algorithm on a proximity graph to retrieve the nearest neighbors. However, most of their routing strategies are heuristic-based greedy routing, leading to suboptimal search results with a large number of hops. In this paper, we propose a novel routing paradigm on graphs for the ANNS problem based on deep reinforcement learning. We design a reinforcement model to learn the routing policy by making use of both global and local graph topology information. A hops-optimized reward mechanism is devised to make the model more efficient and effective. The final search algorithm with the learned model is able to find the nearest neighbors without any backtracking in a small number of hops. Comprehensive experiments on real-world datasets demonstrate the superiority of the proposed method over the state-of-the-art ANNS approaches.
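For context, the heuristic greedy routing that the abstract contrasts against can be sketched as follows: starting from an entry node, repeatedly move to the neighbor closest to the query and stop when no neighbor improves. The tiny chain graph and the helper name are assumptions for illustration; this is the baseline idea, not the learned routing policy proposed in the paper.

```python
import math

def greedy_route(graph, vectors, query, entry):
    """Walk the proximity graph greedily; return (final node, number of hops)."""
    current, hops = entry, 0
    while True:
        # Closest neighbor of the current node; stay put if it has none.
        best = min(graph[current],
                   key=lambda n: math.dist(vectors[n], query),
                   default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current, hops  # local minimum: no neighbor is closer
        current, hops = best, hops + 1

# Hypothetical 4-node chain graph 0 - 1 - 2 - 3 on a line.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
result = greedy_route(graph, vectors, query=(2.9, 0.0), entry=0)
```

On this chain the walk needs three hops to reach node 3; a routing policy that is rewarded for fewer hops would aim to shorten such paths on graphs that offer longer-range edges.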

Abstract:
Text-based person retrieval (TBPR) aims to retrieve pedestrian images based on textual descriptions, facing challenges due to the inherent heterogeneity and uncertainty between visual and textual modalities. Most existing methods focus on addressing heterogeneity while neglecting the issue of uncertainty. To tackle the uncertainty arising from the diverse textual expressions, including both structural and semantic content variations, we propose a novel Dual Uncertainty-Guided Feature Alignment Learning (DUAL) approach, utilizing instance-level and identity-level uncertainty estimations to mitigate these impacts. Specifically, for the uncertainty caused by textual structure variations, DUAL first introduces an uncertainty Gaussian modeling module that represents image and text features as Gaussian distributions in a learnable manner, and estimates instance-level uncertainty coefficients to quantify structural differences within the text. Subsequently, DUAL leverages ShareGPT4V to standardize the text structure, dynamically aligning the original text features with structure-invariant generated text features through adaptive knowledge distillation guided by the instance-level uncertainty coefficients, effectively reducing structural diversity's impact while minimizing noise. Moreover, for the uncertainty caused by the diversity of textual semantic content, DUAL designs an alignment loss that utilizes identity-level uncertainty coefficients, estimated via a Gaussian Mixture Model based on the distances between image and text features of the same identity, effectively mitigating the impact of semantic content diversity. Experimental results demonstrate that DUAL outperforms existing methods on TBPR benchmarks, highlighting its superiority in multimodal person retrieval.

Abstract:
Frame extrapolation, as a typical low-latency frame generation method, improves the frame rate of real-time rendering by predicting future frames based solely on historical data. To guarantee the high quality of predictions, existing methods rely heavily on G-buffers of the target frames. However, these G-buffers are not always accessible, and enabling them in certain rendering engines can incur considerable costs. To tackle this challenge, we introduce a G-buffer free frame extrapolation framework that can achieve comparable quality with state-of-the-art G-buffer based methods. In contrast to existing learning-based approaches that handle motions of new frames implicitly and jointly, we design a decoupled strategy that predicts explicit motions for geometry, shading and disoccluded regions separately. In our framework, we first extract the geometric motion using a dual-space method, and then leverage a lightweight motion inpainting network (OccNet) to fill in the disoccluded regions. The shading motion is extracted between two historical frames and then used to propagate shading variations to new frames. Through extensive experiments across various scenes, we demonstrate that our decoupled approach can generate high-quality motions for a wide range of geometric and shading variations in a scene, thereby significantly improving the accuracy of extrapolated frames at a very low computational expense.

Abstract:
Point cloud compression (PCC) is indispensable for the upcoming holographic communication, enabling efficient transmission and real-time interaction with high-fidelity 3D data. As a mature international PCC standard, MPEG G-PCC holds promise for widespread applications due to its support for various point cloud types and ease of implementation. However, the geometry compression performance of G-PCC is limited, which severely impacts the user quality of experience (QoE). To address this, we propose GeoQE, an enhancement model that seamlessly integrates with the G-PCC decoder to mitigate compression artifacts and improve QoE. GeoQE introduces two key techniques: (1) a quantizer-guided expansion operation that adaptively handles distortions at varying levels, and (2) a spatiotemporal mechanism that leverages correlations within the current frame and across adjacent frames, allowing effective enhancement even with a lightweight network. Experiments show that GeoQE delivers state-of-the-art performance on both dense (e.g., VR/AR) and sparse (e.g., LiDAR for autonomous driving) point clouds. Operating at around 4 fps on a 3090Ti GPU with a compact 1.6 MB model, it achieves much lower computational complexity than existing methods, which is attractive for practical applications.

Abstract:
Federated learning remains vulnerable to backdoor attacks through malicious parameter updates, with existing defenses limited by homogeneous data assumptions or reliance on gradient anomaly detection. We reveal that FedAvg's critical flaw lies in malicious feature extractor propagation: aggregating poisoned extractors degrades defense accuracy to <70% across five benchmarks, while benign extractors with poisoned headers retain an average of 89.36% defense accuracy. Therefore, we propose FeatShield, a feature-space isolation framework that prevents backdoor propagation via non-aggregated local extractors trained on clean client data. FeatShield introduces 1) variance-aware alignment, adaptively balancing client-specific features and global consistency using local variance metrics, and 2) adversarial feature synthesis, generating non-linear synthetic features via GAN to enhance the global prediction header's generalization on main tasks. Extensive experiments on eight real-world datasets show that FeatShield achieves the best defense performance. For instance, under heterogeneous data (Dirichlet β=0.5) and strong attacks (50% malicious clients), FeatShield achieves 99.26-99.89% defense accuracy and main task accuracy exceeding FedAvg by 1.32-5.70%, demonstrating its superior resistance to backdoor attacks without sacrificing the benign performance.

Abstract:
Motion style transfer is a significant research area in computer vision, enabling the rapid switching of stylistic variations for the same motion in virtual digital humans. This dramatically enhances the richness and realism of motions, making it widely applicable in multimedia contexts such as film, gaming, and the Metaverse. However, most existing methods employ a two-stream structure, which often overlooks the intrinsic relationships between content and style motions, resulting in information loss and misalignment. Additionally, these methods struggle to capture temporal dependencies in long-range motion sequences, resulting in less natural outputs. To address these limitations, we propose a Unified Motion Style Diffusion (UMSD) Framework that simultaneously extracts features from content and style motions, achieving comprehensive information interaction. We also introduce the Motion Style Mamba (MSM) denoiser, which, for the first time in motion style transfer, leverages Mamba's powerful sequence modelling capability to produce more temporally coherent stylized motion sequences. Furthermore, we design a diffusion-based content consistency loss and a style consistency loss to ensure that the framework preserves content motion while effectively learning style motion features. Extensive experiments demonstrate that our approach outperforms State-Of-The-Art (SOTA) methods qualitatively and quantitatively, achieving more realistic and coherent motion style transfer.

Abstract:
Monocular 3D Gaussian Splatting (3DGS) SLAM methods demonstrate outstanding performance in rapid dense 3D reconstruction. Yet former methods frequently exhibit suboptimal localization and mapping quality when processing indoor objects characterized by weak textures, dark colors, and high reflectivity (e.g., leather furniture), primarily due to insufficient surface feature information, even with the aid of depth sensors. To overcome these limitations, this work pioneers the integration of polarization information into the 3DGS SLAM framework. Specifically, we introduce a polarization integrated SLAM front-end that leverages the abundant planar features inherent in indoor environments. By incorporating a Chroma Boost mechanism, our approach effectively enhances the spectral multi-view consistency during the SLAM process, while the integration of a Gaussian-visible polarization difference improves the robustness of keyframe registration in low-texture scenarios. We further propose a flattened Gaussian regularization coupled with normal consistency constraints to capture the local geometric features of surfaces more accurately. Moreover, a novel integration of Pol-RGB hierarchical density plane segmentation and multi-scale plane self-constraint substantially enhances the quality of scene surface reconstruction, with further azimuth refinement achieved through the angle of linear polarization (AoLP). Extensive experiments demonstrate that, compared with previous SLAM methods, our approach significantly improves surface reconstruction quality.

Abstract:
Human-Object Interaction (HOI) detection serves a broad spectrum of applications. Despite significant progress, current approaches encounter difficulties in effectively handling Non-Contact Human-Object Interaction (NCHOI) scenarios, where humans and objects remain physically apart. To address these challenges, this paper proposes a novel approach, named Latent Interactiveness Field Modeling (LIFM), which enhances HOI detection by capturing long-range contextual dependencies. Specifically, the Latent Interactiveness Field (LIF) is introduced to define potential interactive relationships between humans and objects. To complement this, the LIF Fusion Encoder is designed to adaptively fuse visual features with LIF, resulting in more informative and discriminative feature representations. The Mobile Scanning HOI Dataset (MSHD) is introduced as a comprehensive benchmark to systematically assess the robustness of existing methods on both common HOI and NCHOI in real-world applications. Extensive experimentation indicates that the proposed approach outperforms existing state-of-the-art techniques. It offers substantial improvements, particularly in NCHOI scenarios, which highlight its effectiveness in resolving issues related to long-range interactions.

Abstract:
Despite recent advances in dynamic scene reconstruction, challenges from imbalanced camera distribution and inaccurate pose estimation in real-world datasets still persist, undermining the spatiotemporal consistency of reconstruction. In this paper, we propose HybridPlane, a novel representation that leverages the complementary advantages of cylindrical and Cartesian coordinate systems to achieve high-quality dynamic scene synthesis. Unlike Cartesian projection, which shares identical features in symmetric regions, cylindrical projection explicitly disentangles features from different viewpoints, thereby improving robustness against imbalanced camera distributions. Moreover, the synergy between these two coordinate systems in both projection and representational capacity enhances the model's ability to capture complex motions and fine-grained details. We further adopt the dynamic positional encoding strategy to enhance the smoothness of temporal interpolation under inaccurate camera poses by progressively regulating high-frequency signals without incurring additional computational overhead. Extensive experiments demonstrate that our versatile representation can be seamlessly integrated into various rendering pipelines, outperforming the previous methods in reconstruction quality while reducing computational and memory costs by approximately one-third.

Abstract:
Sketching is a quick ideation and multimedia tool for effectively expressing design intent. By translating simple strokes into CAD models, it allows non-expert users to create editable designs, reducing the learning curve associated with traditional CAD software. However, current sketch-based CAD modeling methods are often limited to basic shapes and require structured inputs, making them less robust when dealing with varied sketch styles. To overcome these challenges, we propose a novel sketch-based modeling framework, DAFU-CAD, that is both efficient and robust. Our approach features a Depth-Assisted and Feature-Unraveling sketch classification module that categorizes sketches into corresponding modeling operations, independent of their drawing style. A parameter regression and optimization module then estimates the modeling parameters, ensuring consistent and stable model reconstruction across different sketch inputs. To support this, we compile a diverse sketch dataset with a range of modeling categories and abstraction levels. Experimental results show that our method outperforms existing approaches in terms of both robustness and versatility.

Abstract:
Mental stress assessment is crucial for mental and physical well-being. However, it faces limitations due to domain fragmentation, in which contextual variations in stress triggers and demographics hinder the generalization of assessment models across real-world scenarios. Additionally, mental stress assessment is sensitive and human-centric due to its implications for mental health interventions, emphasizing the need for model transparency and trustworthiness. To address this gap, we propose Retrieval-Augmented Reasoning, a novel framework that bridges domain gaps in mental stress assessment through transparent step-by-step reasoning and dynamic in-context example retrieval. Our framework introduces two key components: (1) a "detect-then-assess" reasoning chain that decouples stress-relevant facial action units (AUs) from domain-specific noise by first generating textual descriptions as an intermediate reasoning step (e.g., "eyebrow: inner portions raised"). The model then reflects on and learns to refine these descriptions via Direct Preference Optimization (DPO), ensuring faithfulness and helpfulness; (2) a dual-encoder multimodal retriever that dynamically selects proper in-context examples from the source domain to enhance target-domain assessments, leveraging feedback from the assessment model to optimize retrieval. Experimental results demonstrate that our framework consistently outperforms large multimodal foundation models, stress assessment baselines, and domain generalization methods.

Abstract:
In industrial scenarios, diverse anomalous images are difficult to acquire, significantly limiting the performance of industrial anomaly detection methods. Automatically generating anomalous images for anomaly detection has the potential to solve this problem. However, existing anomaly generation models are still not satisfactory regarding the authenticity and controllability of anomaly generation. In this paper, we propose a controlled anomaly generation model named AnomalyControl to generate realistic anomalous images that are highly aligned with both text prompts and anomaly masks. First, we introduce a CLIP-guided anomaly prompt generator that leverages a CLIP text encoder to find anomaly text prompts most aligned with real anomalous images. Second, we propose an anomaly appearance and shape decoupling mechanism, which designs an embedding similarity loss to enforce the alignment between the anomaly text prompt and anomalies generated with different shapes at the same location, making the appearance of generated anomalies better maintain semantic consistency when the anomaly shape changes. Then, a training-free local control enhancement strategy is employed to provide stronger control intensity to anomaly regions during inference for finer alignment with anomaly masks. Finally, a hard sample generation module is proposed to create anomalous samples with subtle shapes and imperceptible anomaly appearances, enabling the downstream anomaly detection model to focus on learning low-saliency anomaly features. Extensive experiments demonstrate that anomalous images generated by our model outperform the state-of-the-art anomaly generation methods in terms of authenticity and consistency, and can significantly improve the performance of downstream anomaly detection tasks, especially anomaly localization.

Abstract:
Deep learning-based tumor segmentation methods typically require precise pixel-level annotations, which are costly in clinical practice. While bounding box supervision offers a more efficient alternative, existing approaches assume unrealistically tight box annotations, leading to performance degradation when applied to the loose boxes commonly produced by medical annotators. To address this challenge, we propose LooBox, a novel 3D segmentation framework that utilizes loose box annotations through a self-correction and bidirectional rectification paradigm. For the self-correction part, we propose a noise cleaner that comprehensively utilizes deterministic outer box information by integrating three complementary perspectives for predictive self-rectification: entropy mapping, gradient monitoring, and foreground-background affinity measurement. For the bidirectional rectification part, we introduce an augmentation-driven comprehensive consistency constraint strategy. Specifically, the framework incorporates: an asymmetric co-teaching architecture comprising a basic UNet and an enhanced UNet variant with a noise adapter, and an augmentation-driven consistency mechanism that computes pairwise loss between self-corrected predictions after each training iteration to ensure robust tumor feature extraction. Comprehensive evaluations on LIDC-IDRI, MSD-Lung, and MSD-Pancreas datasets demonstrate that LooBox achieves superior segmentation accuracy compared to state-of-the-art box-supervised methods.

Abstract:
Recent advances in 4D Gaussian Splatting have boosted dynamic scene reconstruction and real-time rendering. However, current methods remain retrospective, lacking the ability to forecast future states, which limits their utility in tasks like autonomous navigation and robotics. To address these limitations, we propose FutureGS, a novel Gaussian-based dynamic scene representation framework tailored for continuous 3D future scene prediction and view synthesis. FutureGS introduces a dual-domain decoupled representation, consisting of a static 3D Gaussian base to maintain spatial consistency and a dynamic deformation field to explicitly model temporal motion evolution. To capture long-range dependencies and complex motion dynamics, we design a multi-window collaborative prediction strategy that leverages a sliding temporal window and a bidirectional LSTM-based temporal encoder for robust future motion estimation. Furthermore, we propose a KNN-based local rigidity-aware fusion mechanism, which adaptively regulates the prediction consistency based on local deformation intensity, enhancing the geometric stability and physical plausibility of future scenes. Extensive experiments on standard dynamic scene benchmarks, including D-NeRF and NeRF-DS, demonstrate that FutureGS achieves superior performance in terms of visual fidelity and spatiotemporal consistency, enabling real-time and photorealistic rendering from arbitrary viewpoints at future time steps.

Abstract:
Multimodal Sentiment Analysis (MSA) aims to integrate textual, audio, and visual data to capture nuanced sentimental cues. Although text dominates in existing approaches, audio and visual modalities inherently contain both shared semantics (overlapping with text) and private semantics. Existing methods struggle to precisely find semantic boundaries and lack explicit mechanisms for modeling interaction between shared/private semantics and different modalities. To address this, we propose DiffuFuse, a framework that uses a diffusion denoising model, guided by textual information, to predict shared semantic features and to dynamically and adaptively delineate semantic boundaries for non-textual features, and that employs a dual-stream fusion strategy to accurately model the interactions between different modalities and semantic types. Finally, we adopt an orthogonal projection method to reduce redundancy and eliminate overlapping information between the two streams. DiffuFuse is evaluated on the MOSI and MOSEI datasets, and the experimental results demonstrate that it achieves superior performance.
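The effect of an orthogonal projection can be shown with plain vectors: subtracting from one feature its component along another leaves a remainder with zero dot product against that other feature. The sketch below is a generic vector-rejection example under that assumption; the abstract does not specify how DiffuFuse applies the projection, and the feature values are made up.

```python
def remove_component(vec, direction):
    """Return vec minus its projection onto direction (vector rejection)."""
    dot_vd = sum(v * d for v, d in zip(vec, direction))
    dot_dd = sum(d * d for d in direction)
    if dot_dd == 0:
        return list(vec)  # nothing to project onto
    scale = dot_vd / dot_dd
    return [v - scale * d for v, d in zip(vec, direction)]

shared = [1.0, 0.0]    # hypothetical shared-semantics feature
private = [3.0, 4.0]   # hypothetical private-semantics feature
decorrelated = remove_component(private, shared)
overlap = sum(a * b for a, b in zip(decorrelated, shared))
```

After the projection the private feature keeps only the part that the shared feature cannot explain, which is the sense in which overlapping information between two streams is removed.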

Abstract:
Image steganalysis is a detection task to distinguish whether a secret message is embedded in a digital image. Due to the domain inconsistency caused by Cover Source Mismatch (CSM) and Steganographic Algorithm Mismatch (SAM), most steganalysis methods suffer from significant performance degradation. Recent mismatched steganalysis has focused on extracting domain-invariant features by domain adversarial training or feature alignment. However, these schemes show unstable performance in diverse domain mismatch scenarios and are even ineffective in some cases. In this paper, we propose PCD-UMS, a universal mismatched steganalysis method based on pair-wise confidence difference pseudo-label selection, from the perspective of optimizing target training data. Specifically, we reveal a strong positive correlation, common across various mismatch scenarios, between pair-wise confidence difference and the detection performance of steganalysis. Based on this, a novel pseudo-label selection strategy consisting of a maximum confidence difference first (MCDF) rule and a pair-wise label differential storage (PLDS) rule is designed to select and filter the reliable target pseudo-labels. Furthermore, a multi-perspective pair-wise feature alignment loss is designed to initially transfer the classification ability of source steganalysis, thus solving the problem that source steganalysis fails completely under some domain mismatch scenarios. Comprehensive experiments show that our PCD-UMS outperforms the existing mismatched steganalysis by 12.07% and 3.40% in terms of detection performance under CSM and SAM scenarios, respectively.

Abstract:
We propose GSAPro, a Gaussian Splatting based 3D surface reconstruction framework that exhibits robustness across diverse scales of scenes. Previous research has leveraged photometric consistency constraints or prior information as guidance to enhance the reconstruction accuracy. However, estimation errors and noise inevitably exist in these priors. Applying a strict geometric filter removes a large amount of reliable information, resulting in a deterioration of the quality of guided reconstruction. To handle possible errors in the initial guidance, GSAPro continuously improves the accuracy of the guidance through a joint optimization strategy. The Gaussian Branch integrates reliable geometric and color constraints, thus providing more accurate geometric parameters for the Prior Branch compared to its current state guidance parameters. The Prior Branch, through photometric selection and propagation, obtains more accurate geometric parameters from the state geometric parameters and rendered parameters. Then GSAPro uses these parameters to guide the optimization of the Gaussian Branch. To handle noise in the guidance, we train the Semantic Aware Module to predict the noise from image information, thus improving the accuracy. Moreover, we also introduce a Distillation Module to mitigate the excessive splitting of Gaussians that is caused by the additional constraints. Experiments demonstrate that our method achieves SOTA performance and is more robust across scenes of different scales.

Abstract:
Medical time series, such as Electroencephalogram (EEG) and Electrocardiogram (ECG), are widely used for disease detection, with multiple electrodes or sensors recording simultaneously. Accurately modeling inter-channel relationships is crucial for improving detection performance. Current methods mainly rely on data-driven approaches to model channel relationships, facing two challenges: (1) insufficient integration of medical prior knowledge, hindering the accurate representation of physiological correlations between channels, and (2) high temporal pattern similarity across channels, leading to feature redundancy and degraded classification performance. To address these issues, we introduce KEMed, a knowledge-augmented model for medical time series classification. The model incorporates medical textual prior knowledge by generating natural language descriptions for each channel and leveraging a Pre-trained Language Model (PLM) for semantic representation, enabling precise identification of physiological and pathological similarities and differences between channels. Specifically, KEMed optimizes channel relationships through knowledge-guided clustering and weighting mechanisms and leverages a Large Language Model (LLM) to capture spatiotemporal dependencies, thereby enhancing classification performance. Experimental results on five medical time series datasets demonstrate that KEMed consistently outperforms state-of-the-art methods, validating the effectiveness of knowledge augmentation in medical time series classification.

Abstract:
This paper presents a novel approach to lifelogging by transforming it into a narrative-driven Virtual Reality (VR) experience that we call SLIVeR (Someone else's Lifelog in Virtual Reality). The system situates users in an immersive, emotionally charged environment where they explore lifelog video fragments as part of a memory recovery narrative. Users begin disoriented, responding to existential prompts such as 'Who am I?' and 'What was the last thing I did?' - each unlocking cinematic scenes that reconstruct a car accident central to the character's identity loss. The experience transitions into a metaphorical space representing the fragmented mind, where users interact with floating lifelog questions organised by theme (e.g., relationships, movement, work). These interactions simulate a lifelog search interface embedded within a story arc, encouraging reflection and engagement. To evaluate SLIVeR, we conducted a mixed-methods study with 30 participants, analysing engagement using the User Engagement Scale (UES) and thematic reflection questions. Results showed that narrative coherence, visual interactivity, and identity-driven content enhanced user engagement. Social and emotionally resonant question types were rated as more meaningful than routine-based ones. This work demonstrates how VR, narrative framing, and lifelog data can be fused into a reflective, game-like experience that deepens both interaction and emotional connection.

Abstract:
Human activity recognition (HAR) is an evolving technique that offers innovative solutions across various domains, such as healthcare, sports training, and human-computer interactions. This paper addresses the novel challenge of video-based activity recognition, focusing on detecting and classifying athletes' actions to enable precision sports training. Conventional HAR methods based on direct video analysis incur excessive computational overhead and constrained applicability. In contrast, our novel transformer-based framework, namely RSFomer, converts videos into multivariate time series, and then detects and classifies the athletes' actions. However, sports videos often suffer from severe occlusion, which introduces significant noise to the converted time series and thus deteriorates recognition performance. To address this challenge, we implement several innovative strategies to improve the robustness of our framework. First, we propose a dual-scale filtering mechanism that leverages the unscented Kalman filter and kinematic constraints to reduce noise and outliers in the converted time series. Second, we incorporate the masking mechanism and temporal slicing mechanism to enhance the transformer's ability to handle anomalies and extract multi-scale features for accurate action recognition. We perform extensive evaluations on our Boxing dataset as well as the UEA and FineGym datasets. The results demonstrate that our RSFomer is effective, outperforming existing state-of-the-art methods with significant advantages.

Abstract:
With the rapid evolution of multimedia technologies and their widespread integration into education, adaptive multimedia learning has gained significant prominence. Cognitive diagnosis (CD) is pivotal in this domain, as it models students' cognitive states using practice data captured by multimedia learning applications. However, existing methods often simplify these states to mere proficiency on knowledge concepts. Constructivism in education emphasizes learning as a continuous cognitive development process, during which students' cognitive states become increasingly complex, involving not only their construction of concepts but also their construction of relations between concepts that have long been overlooked. To this end, we propose the Hierarchical Disentanglement of Cognitive States for Enhanced Cognitive Diagnosis (HDCD). Inspired by the Structure of Observed Learning Outcomes (SOLO) taxonomy, which categorizes cognitive development into core hierarchical levels (Multistructural, Relational, Extended Abstract), we introduce a hierarchical disentanglement strategy to define cognitive states aligned with each SOLO level: Intra-Concept Cognitive States, Relational Cognitive States, and Extended Cognitive States. Specifically, (i) at the multistructural level, intra-concept cognitive states are sampled from each student's personalized cognitive distribution, representing the construction of individual concepts. (ii) At the relational level, inter-concept cognitive states are first sampled to represent the construction of relations between concepts. We then employ a hypergraph transformation to collaboratively update both intra-concept and inter-concept cognitive states, forming relational cognitive states. Considering that students' self-constructed knowledge systems involve multiple types of inter-concept relations, relational cognitive states are implemented under both undirected and directed relation views in this work, and then fed into local diagnostic functions, respectively. (iii) At the extended abstract level, outputs from the local diagnostic functions are fused using multi-view attention mechanisms, resulting in extended cognitive states, which integrate information from multiple relational views and are then fed into a global diagnostic function for final prediction. Extensive experiments on real-world datasets demonstrate the superior performance and interpretability of our HDCD.

Abstract:
Modern generative models often struggle to synthesize structured objects from detailed part specifications. They frequently produce anatomically implausible outputs or hallucinated components. We introduce PLATO, a novel two-stage framework that bridges this gap by enabling precise, part-controlled object generation. The first stage is PLayGen, our novel part layout generator which takes a list of parts and object category as input and synthesizes high-fidelity layouts of part bounding boxes. To enhance PLayGen's ability to learn inter-part relationships, we introduce novel structure-based loss functions. In the second stage, PLayGen's synthesized layout is used to condition a custom-tuned ControlNet-style adapter, enforcing spatial and connectivity constraints. This results in anatomically consistent, high-fidelity object generations containing precisely the user-specified parts. We further propose new part-level evaluation metrics to rigorously quantify adherence to part specifications. Extensive experiments show that PLATO significantly outperforms state-of-the-art generative models and produces structurally coherent objects in a controllable manner - marking a step forward in modular, part-driven asset generation.

Abstract:
Domain generalization (DG) plays a pivotal role in enabling models to maintain robust performance across heterogeneous environments. However, existing DG methods are fundamentally constrained by two intertwined limitations: (1) causal misalignment, which stems from undifferentiated feature encoding that entangles causal mechanisms with environmental biases; and (2) semantic conflict, which arises when conventional adaptation methods struggle to balance the preservation of class discriminability with the mitigation of domain-specific distribution discrepancies. To address these challenges, we propose a novel Causal-Driven Semantic Consistency Reasoning (CauRDG) method, which synergistically integrates Prototype-Guided Causal Disentanglement (PGCD) and Dual-Space Semantic Disambiguation (DSSD). Specifically, PGCD constructs a causal framework that identifies stable relationships and decouples invariant mechanisms from domain-specific variations, preserving causal consistency while adapting to contextual differences. DSSD harnesses a dual-space paradigm, enhancing local categorical clarity and maintaining global conceptual unity, thus balancing domain-specific precision with cross-domain coherence. CauRDG ensures robust extraction and interpretation of essential features by preserving invariant causal structures, thereby harmonizing discriminative semantics with domain-varying contexts. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness of our CauRDG over state-of-the-art baselines.

Abstract:
The Bitwise Vision AutoRegressive (BVAR) model, a distinguished newcomer, has taken the lead in text-to-image synthesis, which at the same time raises legal and ethical concerns such as copyright and authenticity. However, existing methods mainly focus on watermarking within diffusion models, which rely on the distinctive attributes of diffusion steps and cannot be directly transferred to new circumstances. To this end, we propose Safe-BVAR, the first watermark framework to embed bit strings during image generation in BVAR. Our study discovers the local similarity of the inferred latent features and the element-wise robustness of the image autoencoder. Therefore, combined with the residual-accumulative nature of BVAR, we propose a novel Late Stage Residual Implanter to embed the watermark, and extract the information with a Local Contextual Extractor. Furthermore, we propose a Distributed Rotational Arranger to strengthen the watermark against local distortions. Our method is training-free and plug-and-play. Meanwhile, it can be easily applied to flexible-sized images. We evaluate the robustness and invisibility of the watermark, showing that it can resist common image attacks and has a negligible effect on the image.

Abstract:
Democratic mediation serves as a vital mechanism for resolving social conflicts; however, current practices encounter three critical limitations: (1) inefficient operations, wherein traditional labor-intensive mediation processes are time-consuming; (2) theoretical gaps, as prevailing mediation theories fail to explore the underlying causes of conflicts; and (3) inadequate analysis, with existing digital tools lacking comprehensive conflict mediation capabilities and primarily focusing on singular data types. To address these limitations, we introduce the Normative Social Simulator for Democratic Mediation, referred to as Norm Mediat. This framework is specifically designed to simulate democratic mediation, incorporating social norms. Central to this framework is the integration of normative reasoning into the mediation process, which enhances the ability to understand individuals' intrinsic needs and identify the root causes of conflicts. The framework comprises two essential components: (1) Dynamic Multimodal Conflict Modeling (DMCM), which generates the initial dataset of conflict interactions; and (2) Norm-Aware Iterative Mediation (NAIM), which implements an iterative democratic mediation process through norm awareness. The results of our human evaluation underscore the effectiveness of our norm-driven mediation strategies. This research contributes to computational social science by providing a comprehensive methodological framework for simulating democratic processes and offering a benchmark dataset for conflict resolution studies.

Abstract:
While autonomous driving has made substantial progress, accurately predicting the trajectories of surrounding traffic agents remains a fundamental challenge for ensuring safety. Integrating both infrastructure-side and vehicle-side information has the potential to enhance perception and prediction capabilities. However, existing methods overlook the challenges in Vehicle-Infrastructure Cooperative Trajectory Prediction (VIC-TP). To bridge this gap, we propose ViTraj, a model-agnostic framework for VIC-TP that leverages infrastructure-side trajectories to mitigate the inherent limitations of vehicle-side forecasting. ViTraj introduces a Feature-Side Selection and a Cooperative Interaction to aggregate complementary features from both sides, effectively expanding the perceptual horizon of prediction models. In addition, we present a Vehicle-Infrastructure Knowledge Distillation strategy to enforce consistency between multi-side predictions, which enables efficient global-local feature alignment through a single backward pass. Extensive experiments on large-scale public datasets demonstrate that ViTraj consistently improves advanced trajectory prediction models, achieving state-of-the-art performance compared to existing vehicle-infrastructure cooperative methods. We believe this work provides a promising step toward the practical deployment of V2X-based autonomous driving systems.

Abstract:
Unknown object detection (UOD) aims to build detectors capable of identifying out-of-distribution objects, a critical need for applications like autonomous driving and traffic monitoring. However, limited device resources restrict existing methods from achieving accurate detection on the device side. Addressing this gap, this paper introduces a device-cloud collaborative framework named DCCUOD that enhances device model performance through efficient cloud collaboration. Our framework employs an energy-based sampling function on devices to target samples with unknown objects, coupled with a collaborative pseudo-labeling strategy to generate accurate pseudo-labels. Additionally, a two-stage training paradigm enables continuous improvements of device models on both known and unknown objects. Our study is the first to explore device-cloud collaborative learning for UOD tasks. Experimental results show that the device model is three times smaller and seven times faster than cloud models, with minimal performance trade-offs.
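
To make the idea of energy-based sampling concrete, here is a minimal sketch. The abstract does not define DCCUOD's sampling function, so this assumes the commonly used free-energy score from the out-of-distribution detection literature, E(x) = -T * logsumexp(logits / T), where a higher energy suggests a less familiar input. The function names and the threshold value are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score: -T * log(sum(exp(logit / T))).

    Computed with the usual max-shift for numerical stability.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def select_for_upload(batch_logits, threshold):
    """Return indices of samples whose energy exceeds the threshold.

    Assumption: such samples are the ones a device would forward to the cloud.
    """
    return [i for i, logits in enumerate(batch_logits)
            if energy_score(logits) > threshold]


# Hypothetical class logits from a device-side detector head.
batch = [
    [10.0, 0.0, 0.0],  # confident prediction, low energy
    [1.0, 1.0, 1.0],   # flat prediction, higher energy
]
selected = select_for_upload(batch, threshold=-5.0)
```

Only the second sample is selected here, since its flat logits give an energy of about -2.10 compared with about -10.00 for the confident one.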

Abstract:
Outdoor AR applications on mobile devices need accurate estimates for the pose of the device. In this paper, we develop SplatPose, a novel pose estimation technique that uses a data-driven 3D modeling technique called Gaussian Splatting. SplatPose uses a trained Gaussian Splatting model to render an image at an estimated device location, then matches features with the camera image to estimate pose. Because this matching can be fast, SplatPose can, in theory, estimate pose entirely on a mobile device, while existing approaches cannot. To this end, SplatPose trains Gaussian Splatting models to be robust to appearance changes, thereby improving accuracy. It also incorporates a novel fast renderer to improve rendering speed. Using an AR pose estimation benchmark dataset, we show that SplatPose outperforms the state-of-the-art in terms of accuracy, and is up to an order of magnitude faster on a mobile device.

Abstract:
Current 3D model Level of Detail (LOD) methods require multiple models with varying detail levels to reduce client computational load, transmitting different models based on the user's distance to the object. However, this process consumes excessive network bandwidth and strains the client's memory and storage. To address this, we propose BS3 (Bézier Slicing for 3Ds), a middleware-enabled method that slices 3D meshes and fits the contours using Bézier curves. Acting as an intermediate layer, the BS3 middleware handles slicing, vectorization, sampling and reconstruction, allowing .bs3 files to be streamed only once and adjusted dynamically at different sampling rates. Our experiments analyze the efficiency and performance of BS3 and show that it can reduce network and storage burdens while preserving display quality. We believe that BS3 will enhance 3D multimedia in games, exhibitions, digital museums, cultural heritage, and the metaverse.
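
The idea of storing a contour once as a Bézier curve and resampling it at different rates can be sketched as follows. This is not the BS3 middleware or its .bs3 format, only a generic de Casteljau evaluation; the control points and function names are hypothetical.

```python
def bezier_point(control_points, t):
    """Evaluate a Bézier curve at parameter t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_curve(control_points, n_samples):
    """Sample the curve at n_samples evenly spaced parameter values."""
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    return [
        bezier_point(control_points, i / (n_samples - 1))
        for i in range(n_samples)
    ]


# One hypothetical cubic segment of a sliced contour.
ctrl = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
coarse = sample_curve(ctrl, 5)   # e.g. for a distant object
fine = sample_curve(ctrl, 50)    # e.g. for a close-up view
```

The same four control points yield both the 5-point and the 50-point polyline, which is the property that lets a single transmitted representation serve several levels of detail.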

Abstract:
Video analytics pipelines migrating to edge deployments are facing performance bottlenecks under limited bandwidth. Non-uniform intra-frame encoding has emerged to further compress pixels without affecting the output of the server deep neural network (DNN), but it is inefficient for high-resolution video streaming at low bandwidth. The detail enhancement capability of neural super-resolution (SR) permits resolution downsampling and aggressive compression on edge devices for low-latency transmission. To exploit its accuracy potential, DNN-oriented non-uniform encoding should additionally be aware of SR models. However, traditional codecs struggle to cope with both quality optimization for SR and global semantic features for DNN. We advocate neural codecs for coordinated encoding and enhancement, enabling analytic-oriented video streaming with optimal accuracy-delay tradeoffs. Our system, VidIQ, achieves quality-enhanced real-time video analytics by 1) improving the network architecture of neural codecs (at two granularities) to integrate SR models into a DNN-oriented analytics pipeline, and 2) adapting the multi-scale encoder and SR-decoder to scene dynamics (i.e., content and bandwidth variations) with the help of the monolithic controller to hold a performance advantage. Extensive evaluations show that VidIQ reduces end-to-end delay by 35.8% and improves analytics accuracy by 21.2% compared to recent video compression, enhancement, and streaming baselines.

Abstract:
Volumetric videos are essential for immersive applications due to their engaging and realistic experiences. However, streaming them in real time over constrained, fluctuating networks remains challenging. Progressive streaming is an effective method to mitigate this issue by gradually enhancing video quality through incremental data transmission. However, existing progressive volumetric streaming solutions often rely on specific compression algorithms or require codec modifications, leading to poor compatibility with standard codecs. In this paper, we propose P2VS, a progressive partition-based volumetric video streaming framework, to achieve codec-independent progressive streaming. Specifically, P2VS leverages the unique structure of point cloud-based volumetric video to incrementally enhance video quality without being constrained by specific compression algorithms. Moreover, we propose adaptive streaming algorithms under this framework to enhance the quality of experience (QoE). Extensive simulations demonstrate that P2VS improves QoE by 21% on average compared to non-progressive streaming schemes. It also achieves better bandwidth efficiency and full compatibility with standard codecs. A prototype is built to verify the feasibility of P2VS.

Abstract:
Video streaming platforms and existing adaptive bitrate (ABR) algorithms traditionally assume uninterrupted sequential playback, yet users frequently skip to points of interest. This fundamental mismatch degrades quality of experience at high-interest segments while wasting bandwidth on skipped content. We address this through our Hotspot-Aware Joint Optimization framework, which reframes video streaming as a non-monotonic optimization problem with discontinuous state transitions caused by navigation events. Our framework jointly optimizes adaptive bitrate decisions and buffer management by leveraging viewer engagement patterns to predict navigation behavior. Our approach combines: (1) a mathematical formulation capturing state discontinuities in non-sequential viewing, (2) self-supervised models predicting navigation targets using only aggregate viewing data, and (3) hotspot-aware ABR and buffer management algorithms implemented through our Streaming Local Search (SLS) technique that dynamically prioritize quality for frequently-watched segments. Evaluation across diverse content and network conditions demonstrates our framework delivers 38.2% higher quality in hotspot regions, 32.5% reduced navigation delays, and 27.1% improved resource efficiency compared to traditional methods. These improvements establish a foundation for streaming systems that adapt to both network conditions and content structure, aligning resource allocation with actual viewing patterns.

Abstract:
The field of information retrieval, especially when targeting multimodal content, has found ways of satisfying a broad range of information needs, which can be expressed in a multitude of ways. In contrast to related fields, such as relational databases, no universal way of representing the queries to be answered by a retrieval system has emerged. In this paper, we present an initial proposal for a universal query representation mechanism for multimodal information retrieval. The proposed approach imperatively expresses arbitrary information needs, using a DAG of query primitives. We show how such a representation can be used for both feature extraction and query processing pipelines and how it can serve as a foundation towards a query language for information retrieval.
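
As an illustration of expressing a query as a DAG of primitives, the sketch below evaluates a small graph in dependency order. The abstract does not specify the proposed representation, so the node layout (a callable plus the names of its inputs), the primitive names, and the toy scores are all hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_query(nodes):
    """Evaluate a query DAG.

    nodes: dict mapping node name -> (function, list of input node names).
    Each function receives the results of its inputs as positional arguments.
    Returns a dict with the result of every node.
    """
    graph = {name: set(inputs) for name, (_, inputs) in nodes.items()}
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        func, inputs = nodes[name]
        results[name] = func(*(results[i] for i in inputs))
    return results


# Toy per-document similarity scores for two modalities (made-up values).
text_scores = {"a": 0.9, "b": 0.2, "c": 0.6}
image_scores = {"a": 0.3, "b": 0.8, "c": 0.7}

query = {
    "text": (lambda: text_scores, []),
    "image": (lambda: image_scores, []),
    # Late fusion: average the two score maps.
    "fuse": (lambda t, i: {k: (t[k] + i[k]) / 2 for k in t}, ["text", "image"]),
    # Keep the two best documents.
    "top2": (lambda s: sorted(s, key=s.get, reverse=True)[:2], ["fuse"]),
}
result = run_query(query)["top2"]
```

Because the graph only records which primitive feeds which, the same structure could describe either a feature extraction pipeline or a query processing pipeline, which is the dual use the abstract points to.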

Abstract:
Cold-start recommendation, which addresses the challenge of recommending items to new users without sufficient historical interaction data, remains difficult in personalized recommender systems. Cross-domain recommendation methods have gained attention in addressing this issue by transferring user preferences from a source domain to alleviate the cold-start problem in a target domain. Traditional embedding-based approaches typically rely on a one-to-one alignment over continuous user embeddings using overlapping user IDs, which often leads to severe overfitting and limited generalization, particularly when overlapping users between domains are sparse. In this paper, we propose a novel hierarchical recommendation framework specifically targeting the cold-start problem by modeling user preferences as hierarchical interests derived from textual embeddings and Residual Quantized Variational Autoencoders (RQ-VAE). Unlike traditional methods that directly align embeddings, our approach builds a mapping function transferring hierarchical user interest structures from the source to the target domain. This hierarchical mapping significantly enhances generalization, providing a more comprehensive and robust representation of user preferences. Additionally, we employ a generative rewriting mechanism utilizing Large Language Models to refine user-generated reviews into concise, semantically enriched summaries that explicitly highlight user interests. Extensive evaluations on Amazon review datasets demonstrate the effectiveness of our hierarchical preference modeling and generative rewriting approach, outperforming existing embedding-based methods consistently, especially in cold-start and sparsely populated scenarios. Our proposed method thus provides a robust, flexible solution for personalized recommendation tasks in cold-start conditions.
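
The way residual quantization turns an embedding into a coarse-to-fine sequence of codes can be sketched as below. An actual RQ-VAE learns its encoder, decoder, and codebooks jointly; this example shows only the quantization step with small fixed codebooks, and every value and name in it is hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level, each level encoding the previous residual.

    Returns (codes, residual): one code index per level and the remaining error.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen code so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Two hypothetical levels: a coarse codebook and a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, residual = residual_quantize([1.25, 0.75], codebooks)
```

The resulting code sequence reads from general to specific, which is what makes it a natural carrier for hierarchical user interests: users who share a first-level code agree at the coarse level even when their later codes differ.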

Abstract:
Obtaining textual descriptions of the visual content of images and videos is often required in multimedia analysis and retrieval. Traditional video captioning approaches are usually evaluated on very short captions using rather simple metrics from NLP, while multimodal large language model (MLLM)-based approaches are mostly evaluated with question answering, which is query specific. We provide a dataset (FM-V2T) with 258 video clips from a media archive, annotated with detailed manually curated descriptions in English and German (long and short). We propose an LLM-based metric, which assesses the entailment and contradiction of facts extracted from a description with a reference, addressing shortcomings of existing metrics in handling small changes with semantic impact and in comparing descriptions with substantially different lengths. We provide experimental results on the reliability of the metric, and apply it to baseline results of three MLLM-based approaches on the FM-V2T dataset, comparing it with other metrics.

Abstract:
Despite growing interest in Audio-Visual Question Answering (AVQA), existing datasets often suffer from limited diversity, rigid formats, and insufficient integration of audio and visual modalities. To address these limitations, we introduce Valor32k-AVQA v2.0, a large-scale dataset containing 28,863 real-world videos and over 225,000 QA pairs, designed to support diverse and realistic multimodal understanding. The dataset features both open-ended and multiple-choice questions, each annotated with the required modality (visual, audio, or audio-visual) and question category (description, action, count, temporal, location, or relative position). All annotations, including questions, answers, and metadata, are generated through a fully automated prompting pipeline using GPT-4o, with human validation performed on a representative sample to ensure quality. We benchmark a few state-of-the-art models, with additional evaluations available on the project page, and observe that incorporating audio consistently improves performance during fine-tuning without compromising visual reasoning capabilities. These findings highlight that the audio signals in our dataset are not only well integrated, but also informative and complementary, establishing Valor32k-AVQA v2.0 as a valuable resource for developing and evaluating robust audio-visual question answering systems.

Abstract:
Volumetric video is a key enabler of immersive extended reality (XR) experiences and is often represented using point clouds for their structural simplicity. However, capturing volumetric content through multi-view acquisition and depth sensing poses many challenges, such as occlusions and depth mismatches. To foster research in this field, we introduce a unique dual-quality point cloud dataset, named UVG-CWI-DQPC, which is designed to support the development of point cloud enhancement, compression, and quality assessment. Our dataset includes 12 dynamic sequences captured simultaneously by: 1) a high-end capture system producing high-fidelity point clouds with extensive processing; and 2) a consumer-grade capture system relying on affordable RGB-D cameras, lightweight processing, and open-source tools. For each sequence, our dataset provides ground-truth point clouds from the high-end capture system and raw RGB-D footage from the consumer-grade capture system, along with calibration data and tools for point cloud generation. This dual-quality setup enables direct comparison and benchmarking of algorithms for densification, occlusion removal, registration, and quality enhancement. Our dataset is publicly available under a permissive license to support reproducible research and standardization work in Moving Picture Experts Group (MPEG) and 3rd Generation Partnership Project (3GPP).

Abstract:
Although audio-augmented reality (AAR) has known applications in music, the use of wearables such as augmented reality (AR) glasses for egocentric audio data capture for music has not been investigated. Current egocentric datasets are mostly focused on speech research, neglecting music's unique demands for tasks such as real-time optimisation or assistive listening. This paper introduces EgoMusic, a multimodal dataset featuring synchronised egocentric audio-visual data captured with AR glasses during live performances, alongside studio-quality audio references. We investigate AR glasses' utility for music and baseline artificial intelligence (AI) approaches for hearing enhancement, positioning EgoMusic as the first dataset that enables research for egocentric music AAR.

Abstract:
We present UR-MAT, a multimodal, material-aware synthetic dataset for urban scene understanding and physics-based simulation. UR-MAT comprises seven diverse outdoor environments, ranging from historic districts to modern office areas, reconstructed from OpenStreetMap data and procedurally enhanced in Unreal Engine. Each scene includes semantically structured 3D meshes and physically based rendering (PBR) materials annotated with electromagnetic properties such as permittivity, reflectance, and attenuation. The dataset provides spatially aligned RGB images, material segmentation masks, depth maps, point clouds, camera poses, and 3D mesh files in .glb format. All data is generated through a deterministic, reproducible pipeline integrating OSM2World, Unreal Engine, and UnrealCV. UR-MAT supports a wide range of research tasks, from semantic segmentation and 3D reconstruction to material-aware electromagnetic simulation (e.g., mmWave propagation). We also release two utility scripts to extract mesh-material relationships and assign physical metadata, enabling dataset extension and reproducibility. By bridging computer vision and physical modeling, UR-MAT serves as a testbed for multimodal AI and signal-aware urban simulation.

Abstract:
Visual anomaly detection (VAD) aims to identify image regions that deviate from established normal patterns. Existing methods often rely on domain-specific training and follow a "one-class-one-model" paradigm, limiting scalability. We propose Omni-LLaMA-AD, the first unified multimodal large language model for open-set anomaly detection, capable of handling diverse domains with minimal supervision. Built on a pretrained LLaMA backbone, the model uses a VQGAN-based tokenizer and supports joint vision-language generation. Trained via vision-language alignment and instruction tuning, it achieves effective anomaly detection with only a few normal samples and no domain-specific fine-tuning. Our demo showcases the model's ability to generate high-quality anomaly masks across industrial, medical, and logical datasets, highlighting its strong cross-domain generalization and interactive dialogue-based user experience.

Abstract:
Key Opinion Leaders (KOLs) play a crucial role in modern marketing by shaping consumer perceptions and enhancing brand credibility. However, collaborating with human KOLs often involves high costs and logistical challenges. To address this, we present GenKOL, an interactive system that empowers marketing professionals to efficiently generate high-quality virtual KOL images using generative AI. GenKOL enables users to dynamically compose promotional visuals through an intuitive interface that integrates multiple AI capabilities, including garment generation, makeup transfer, background synthesis, and hair editing. These capabilities are implemented as modular, interchangeable services that can be deployed flexibly on local machines or in the cloud. This modular architecture ensures adaptability across diverse use cases and computational environments. Our system can significantly streamline the production of branded content, lowering costs and accelerating marketing workflows through scalable virtual KOL creation. Video demo is available at https://youtu.be/uXpXmEbjg3M.

Abstract:
We present Pask, a proactive AI agent that provides real-time, context-aware guidance and knowledge support in audio-centric media environments. Unlike passive assistants that follow the "you ask, I answer" model, Pask shifts toward "answering before asking" by continuously monitoring live audio, anticipating user needs, and proactively offering conceptual explanations and semantic clarifications. It integrates three core components: a silent copilot for in-situ explanation, a structured knowledge base for factual grounding, and a private memory module for personalized adaptation. Pask enhances comprehension and communication in scenarios such as online learning, media consumption, and live meetings through sustained, intelligent guidance. A live demo is available at https://www.youtube.com/watch?v=ki_CKiV9Oyk.

Abstract:
With the continuous development of networking and computing devices, immersive communication has become increasingly viable, enabling users to explore virtual worlds and interact with other users in 6 Degrees-of-Freedom (6DoF). Immersive communication has great potential not only in professional domains, such as medical diagnostics and distance education, but also for leisure activities, such as social networks and new media. This proposal aims to develop an immersive communication system leveraging cutting-edge dynamic 3D Gaussian Splatting (3DGS). Our objective is to design, implement, and evaluate a highly interactive, adaptive, and efficient immersive communication system that maximizes user experience and system performance while supporting heterogeneous hardware platforms, networks, and applications. We identify and tackle three critical challenges in developing such a system: (i) designing deformable 3DGS avatars to enhance real-time user representation, (ii) developing scalable dynamic 3DGS codecs to optimize data transmission and storage efficiency, and (iii) implementing adaptive streaming algorithms to ensure a smooth and responsive user experience across diverse usage scenarios. By solving these challenges, our research aims to push the boundaries of real-time 3D streaming and interaction while redefining the future of virtual worlds over next-generation networks.

Abstract:
Identity preservation is a critical capability in video generation and one of the core requirements for high-quality video synthesis. Existing approaches typically extract facial features from reference images as conditional inputs and inject them into the generation pipeline to maintain subject identity. However, in the IPVG Challenge 2025, state-of-the-art models such as ConcatID still fall short of delivering satisfactory identity preservation. To address this limitation, we propose a simple yet highly effective multi-branch video generation framework based on entity routing. Concretely, we integrate several fine-tuned dedicated models to compensate for the base model's weaknesses in identity preservation, dynamically selecting the appropriate branch according to each prompt. In addition, we employ enhanced prompts to further steer the generation process. Remarkably, using just a single NVIDIA RTX 3090 GPU for 120 hours of training, we boost the baseline's cur_score from 0.242 to 0.313.

Abstract:
Accurately identifying personality traits is of profound significance for gaining in-depth insights into human behavior, facilitating efficient human-computer interaction, and developing personalized intelligent systems. However, existing studies often treat personality trait prediction and emotion recognition as relatively independent tasks, neglecting the inherent correlation between them. This report proposes a multimodal fusion prediction framework under the research topic "On the Interaction between Personality and Emotion in Human Behavior and Social Interaction". The core goal of this framework is to explore and verify the positive gain of emotional analysis on the accuracy of personality prediction. We extract visual and audio features based on large-scale data pre-training and fine-tuning, aggregate video features at the character level to enhance the stability of personality prediction, and then integrate a video-level emotion prediction branch. By jointly optimizing the losses of personality prediction and emotion prediction, the generalization performance of the model is improved. Experimental results show that the multi-task learning method integrating emotional information can improve the prediction performance of personality traits to a certain extent, and the method achieved first place on the MER-PR validation set of the MER2025 Challenge. This provides empirical evidence for exploring the complex interaction between emotion and personality.

Abstract:
As social media becomes a dominant platform for sharing content, predicting the popularity of user posts has become increasingly important for applications such as content recommendation, trend forecasting, and user engagement. However, this task is challenging due to the diverse and multimodal nature of social media posts, which often include unstructured text, images, and structured metadata. To address this challenge, we propose Fusion-Aware Multi-modal Ensemble (FAME), a framework that effectively captures and integrates diverse information sources within social media content. Unlike prior approaches that rely on a single model to process all modalities, FAME leverages four specialized predictors. Three of them (CatBoost, LightGBM, and AutoGluon) are tree-based models that excel at handling structured metadata and its interactions with unstructured features. The fourth is a denoising autoencoder (DAE), which learns robust joint representations from unstructured text and image data. These models are combined through a weighted ensemble strategy, allowing FAME to leverage the complementary strengths of different architectures. Experiments on the Social Media Prediction Dataset demonstrate that FAME significantly outperforms existing baselines, achieving state-of-the-art results and validating its effectiveness in modeling the complex, multimodal nature of social media content.
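
The weighted ensemble step described above can be illustrated with a minimal sketch. It assumes each predictor emits one popularity score per post and that the blend is a normalized weighted average; the function name `weighted_ensemble`, the scores, and the weights are all hypothetical and are not taken from the FAME paper.

```python
from typing import Dict, List


def weighted_ensemble(predictions: Dict[str, List[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Blend per-model predictions with a normalized weighted average."""
    if set(predictions) != set(weights):
        raise ValueError("every model needs exactly one weight")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_posts = len(next(iter(predictions.values())))
    if any(len(p) != n_posts for p in predictions.values()):
        raise ValueError("all models must score the same posts")
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n_posts)
    ]


# Hypothetical popularity scores for three posts from each predictor.
preds = {
    "catboost": [6.0, 3.0, 8.0],
    "lightgbm": [5.0, 4.0, 8.0],
    "autogluon": [7.0, 3.0, 6.0],
    "dae": [6.0, 2.0, 10.0],
}
# Hypothetical weights; in practice these would be tuned on validation data.
weights = {"catboost": 0.4, "lightgbm": 0.3, "autogluon": 0.2, "dae": 0.1}
blended = weighted_ensemble(preds, weights)
```

Dividing by the weight total keeps the blend on the same scale as the individual predictions even when the weights do not sum exactly to one.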

Abstract:
Despite remarkable advances, current Extended Reality (XR) applications are mostly local and individual experiences. A plethora of interactive applications, such as teleconferencing, telesurgery, interconnection in new buildings project chain, cultural heritage, and museum contents communication, are well on their way to integrating immersive technologies. However, interconnected and interactive XR, where participants can virtually interact across vast distances, remains a distant dream. In fact, three great barriers stand between current technology and remote immersive interactive life-like experiences, namely (i) content realism, (ii) motion-to-photon latency, and (iii) accurate human-centric quality assessment and control. Overcoming these barriers will require novel solutions at all elements of the end-to-end transmission chain. This workshop focuses on the challenges, applications, and major advancements in multimedia, networks, and end-user infrastructures to enable the next generation of interactive XR applications and services. The workshop proceedings can be found at: https://dl.acm.org/doi/proceedings/10.1145/3746269

Abstract:
Surveillance systems such as bodycams and drones often operate under bandwidth constraints that limit video quality and degrade both human monitoring and AI-based analytics. Traditional compression techniques introduce artifacts that obscure critical details, especially in high-motion scenarios. We present a generative AI-powered video compression framework developed by Small Pixels, a spin-off of the University of Florence, designed to deliver Full HD video at significantly reduced bitrates. The system combines edge-side preprocessing for compression resilience with real-time receiver-side super-resolution, enabling up to 50% bandwidth savings while preserving perceptual quality and detection accuracy. Objective evaluations on EgoSeg and VisDrone datasets show +6.2 VMAF improvement and stable YOLOv11 detection performance with 30% less bitrate. Live trials in Singapore, within the Singapore Hatch-X Global Innovation Program, validated real-time operation with minimal latency, demonstrating clearer faces and motion in challenging conditions. The solution integrates seamlessly into existing infrastructures without hardware upgrades, offering a practical path to reliable, high-quality video streaming in bandwidth-limited environments.

Abstract:
The movement for Sovereign AI is accelerating. Meeting its promise requires vertically integrated AI stacks (spanning data, models, and reasoning systems) that remain sovereign while adhering to shared scientific principles around which global research communities can coalesce. This talk presents BharatGen as a sovereign-yet-shared effort to make AI work for all: creation of datasets, benchmarks, and models that natively support Indian languages, dialects, and code mixing across text, speech, and vision; data pipelines grounded in local realities; and frugal methods that reduce cost and lower barriers. We outline our journey to date across language infrastructure, efficient training and distillation, and early sector pilots. The R&D deep dive will draw from some of our recent work on cross-lingual knowledge distillation for low-resource languages, tokenization/phonetic design for code-mix robustness, or trustworthy document AI with visual grounding, focusing on robustness under dialect/code-mix shift and on latency/cost trade-offs. We hope to inspire other Sovereign-AI efforts, especially in the low-resource ecosystems of the Global South, and close by inviting international collaborations toward principled research to build people-serving AI.

Abstract:
In test-time adaptation, handling constant domain change using a single sample at a time presents two key challenges: efficiently stabilizing adaptation and effectively preventing catastrophic forgetting. This paper introduces a single-sample continual test-time adaptation (S-CoTTA) task to address these challenges. Existing works mainly either 1) apply continual test-time adaptation methods with an inefficient moving window that increases memory overhead or 2) filter out high-uncertainty samples to maintain stability. The former handles forgetting but fails to stabilize adaptation efficiently, while the latter neglects catastrophic forgetting. We argue that both efficient tuning stabilization and forgetting prevention should be addressed simultaneously. To this end, we propose a novel Efficient Buffer and Resetting (EBaR) method for S-CoTTA. EBaR employs a novel memory-efficient buffer to store samples based on their uncertainty levels and utilizes them to update the model with different losses to enhance stability. EBaR also incorporates a novel elastic resetting unit to dynamically reset the parameters based on their sensitivity to domain shift. The elastic resetting strategy effectively mitigates catastrophic forgetting while retaining useful target domain knowledge. Comprehensive experimental evaluations demonstrate the effectiveness and efficiency of both components. Combining their benefits, EBaR surpasses state-of-the-art methods across multiple datasets, including CIFAR10-C, CIFAR100-C, ImageNet-C, and CCC, for the S-CoTTA task.
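
One way to picture a buffer that stores samples by uncertainty level is sketched below. It assumes uncertainty is measured as the normalized entropy of the softmax output and that samples are split into two fixed-size buckets; the class `UncertaintyBuffer`, the threshold, and the capacity are illustrative assumptions, not the EBaR design.

```python
import math
from collections import deque
from typing import Deque, Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy divided by log(num_classes); needs at least 2 classes."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


class UncertaintyBuffer:
    """Hypothetical two-bucket buffer; each bucket keeps only the newest items."""

    def __init__(self, capacity: int = 4, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.buckets: Dict[str, Deque[int]] = {
            "low": deque(maxlen=capacity),
            "high": deque(maxlen=capacity),
        }

    def add(self, sample_id: int, logits: List[float]) -> str:
        """Route a sample id to the bucket matching its predictive uncertainty."""
        u = normalized_entropy(softmax(logits))
        key = "low" if u < self.threshold else "high"
        self.buckets[key].append(sample_id)
        return key


buf = UncertaintyBuffer(capacity=4, threshold=0.5)
confident = buf.add(0, [10.0, 0.0, 0.0])   # peaked prediction
uncertain = buf.add(1, [0.0, 0.0, 0.0])    # uniform prediction
```

Bounding each bucket with `deque(maxlen=...)` keeps memory constant regardless of how long the test stream runs; a separate loss could then be applied per bucket.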

Abstract:
Event-based image reconstruction has achieved remarkable progress, benefiting from the high temporal resolution and high dynamic range of event cameras. However, most event-based methods focus on enhancing sRGB image quality, neglecting the potential of leveraging event data for RAW-to-sRGB conversion. Due to the limitations of camera sensors, images processed through standard ISP pipelines often suffer from motion blur and color distortion in dynamic scenes. In contrast, RAW images preserve uncompressed scene information, so integrating event signals at this stage enables finer texture recovery and more accurate color correction. To tackle these challenges, we propose EvRAW, a novel event-assisted RAW-to-sRGB image reconstruction network that integrates event signals to promote high-fidelity sRGB image reconstruction. Specifically, we introduce a Motion-guided Structural Enhancement (MSE) module that extracts motion patterns from event streams and aggregates dynamic features to restore fine textures. Additionally, we propose an Adaptive Color Correction (ACC) module that performs region-wise gamma correction and channel-wise color decoding to enhance color fidelity under complex lighting conditions. To evaluate performance in challenging real-world scenarios, we collect a pixel-aligned RAW-Event dataset specifically for this task. Extensive experiments demonstrate that EvRAW achieves state-of-the-art performance in RAW-to-sRGB reconstruction on both synthetic and real-world datasets.
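
As a rough illustration of region-wise gamma correction, the sketch below applies the standard mapping out = in ** gamma with a different gamma per region. It assumes a single-channel image with intensities in [0, 1], a precomputed integer region map, and hand-picked gamma values; in a learned module such as ACC these quantities would presumably be predicted, so everything here is hypothetical.

```python
from typing import Dict, List

Image = List[List[float]]
RegionMap = List[List[int]]


def regionwise_gamma(image: Image, regions: RegionMap,
                     gammas: Dict[int, float]) -> Image:
    """Apply out = in ** gamma, with gamma chosen by each pixel's region id."""
    out: Image = []
    for img_row, reg_row in zip(image, regions):
        row = []
        for value, region in zip(img_row, reg_row):
            clipped = min(max(value, 0.0), 1.0)  # keep intensities in [0, 1]
            row.append(clipped ** gammas[region])
        out.append(row)
    return out


# Hypothetical 2x2 image: left column is region 0, right column is region 1.
image = [[0.25, 0.25], [0.5, 0.5]]
regions = [[0, 1], [0, 1]]
gammas = {0: 0.5, 1: 2.0}  # gamma < 1 brightens, gamma > 1 darkens
corrected = regionwise_gamma(image, regions, gammas)
```

With these values the left column is brightened and the right column is darkened, which is the kind of spatially varying adjustment a single global gamma cannot express.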

Abstract:
Event cameras with bio-inspired neuromorphic sensors are highly sensitive to brightness changes. When there are moving objects in a scene under constant lighting, event cameras only record motion information and output a sequence of events asynchronously. However, flickering light sources, such as fluorescent or LED lamps powered by alternating current, are common in real-world scenarios. When operating under a flickering light source, event cameras output numerous redundant event signals that are triggered by the flickering effect and overwhelm the useful signals that encode motion information. In this paper, we propose EDeF-Net, an Event streams DeFlickering Network that effectively leverages the spatio-temporal correlation of event streams by modeling both the inter-channel temporal attention and inter-patch spatial attention. To facilitate network training and evaluation, we synthesize the first dataset containing paired flickering and flicker-free event streams. Moreover, we demonstrate that event streams filtered by EDeF-Net yield performance improvements on downstream applications such as event-based optical flow estimation and object tracking.

Abstract:
Gait recognition has emerged as a promising biometric technology due to its ability to operate at a distance without subject cooperation. While pose-based methods offer advantages over appearance-based approaches in robustness and interpretability, their performance has been limited by the sparse keypoint representations of current pose estimation frameworks. We identify two critical limitations: (1) incomplete motion representation due to insufficient keypoints for dynamic body parts, and (2) lack of shape information from minimal skeleton points. This paper presents DPGait, a novel framework that addresses these challenges through innovations in both upstream processing and downstream modeling. First, we enhance pose estimation by extending the standard COCO keypoint format with additional motion-sensitive points and shape-descriptive keypoints inspired by human mesh estimation. Second, we propose a divide-and-conquer modeling strategy that processes dense keypoints through group convolution with cross-group attention, coupled with multi-granularity supervision for improved training. Our comprehensive experiments demonstrate state-of-the-art performance in pose-based gait recognition, achieving 85.8% rank-1 accuracy on SUSTech1K and surpassing leading silhouette-based methods for the first time. The results validate that dense pose representation combined with our novel modeling approach significantly advances the field of gait recognition.

Abstract:
Recent advances in 3D scene representation, particularly 3D Gaussian Splatting (3DGS), have demonstrated remarkable photorealistic rendering capabilities. However, the heavy reliance on dense and precisely calibrated camera configurations limits effectiveness in sparse-view and unposed scenarios. In this paper, we present Tri-consistency 3D Gaussian Splatting (dubbed TriGS), a novel framework that jointly optimizes 3DGS parameters and camera poses from only sparse and unposed images via triple consistency supervision coupled with an adaptive regularization strategy. We first estimate coarse camera poses by exploiting 3DGS's anisotropic properties through iterative relative pose optimization. Building upon this foundation, we introduce cross-view consistency enforcement through synchronized photometric color, geometric structure, and deep features, effectively resolving rendering ambiguities with auxiliary supervision. A unified rendering paradigm is also proposed to jointly refine Gaussian primitives and camera poses by transforming positions, covariances, and spherical harmonics. To combat overfitting inherent in joint optimization, we devise an adaptive regularization mechanism that strategically samples hard viewpoints based on baseline distances and training dynamics, enforcing projection consistency through deep feature priors. Extensive experiments on multiple challenging real-world datasets validate the effectiveness of TriGS, which sets a new state of the art using only sparse and unposed input views, without relying on external pose priors.

Abstract:
Video social relation recognition is a fundamental task in video understanding, which is dedicated to the construction of multi-modal knowledge graphs. Previous work mainly focuses on multi-modal fusion and the construction of special character graphs. However, these methods often treat all frames in the global sequence equally, ignoring the influence of key frame sequences on relation recognition. Specifically, the key frame sequences that significantly reflect character relationships in a video tend to be sparse and short. At the same time, the key frames have not only temporal but also strong causal relationships. Therefore, we propose a novel Video Local Causal Frame (VLCF) model to explore the causal relationships between frames. Inspired by Granger causality theory, we estimate inter-frame causal relationships by comparing the predicted result frames with and without masking the premise frame. We then construct global connections between video frames. Multiple local causal frame sequences and global frame sequences are extracted to capture the key information and global information in the video. Extensive experiments conducted on the ViSR dataset and the MovieGraphs dataset demonstrate that the proposed model achieves state-of-the-art performance.
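
The Granger-style comparison described above (predict a result frame with and without the premise frame, then compare) can be sketched on toy feature vectors. The sketch assumes frames are plain feature lists, uses a placeholder mean predictor in place of a learned model, and masks a frame by zeroing it; the names `causal_score` and `mean_predictor` are hypothetical and do not come from the VLCF paper.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def mean_predictor(context: Sequence[Frame]) -> Frame:
    """Placeholder predictor: element-wise mean of the context frames."""
    dim = len(context[0])
    return [sum(f[d] for f in context) / len(context) for d in range(dim)]


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def causal_score(frames: Sequence[Frame], premise: int, result: int,
                 predict: Callable[[Sequence[Frame]], Frame] = mean_predictor
                 ) -> float:
    """Increase in prediction error for frames[result] when frames[premise]
    is masked (zeroed) in the preceding context."""
    if not 0 <= premise < result < len(frames):
        raise ValueError("premise must precede result within the sequence")
    context = [list(f) for f in frames[:result]]
    masked = [[0.0] * len(f) if i == premise else f
              for i, f in enumerate(context)]
    err_full = mse(predict(context), frames[result])
    err_masked = mse(predict(masked), frames[result])
    return err_masked - err_full


# Toy sequence: frame 2 is the average of frames 0 and 1.
frames = [[1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
score_0 = causal_score(frames, premise=0, result=2)
score_1 = causal_score(frames, premise=1, result=2)
```

A larger score means that hiding the premise frame hurts the prediction more, which is the sense in which the premise is treated as causally relevant to the result frame.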

Abstract:
Given the scarcity of real data and the time-intensive nature of labeling, current multi-agent perception models often rely on simulated sensor data for training and validation. However, perception performance deteriorates significantly due to the domain gap between simulated and real data. Existing adaptation methods focus on domain-generalized feature extraction while neglecting multi-agent shift uncertainty and relational semantic loss. To address this issue, we propose a Selective Shift Domain Adaptation method in multi-agent collaborative perception, called SSDA. SSDA incorporates two essential components: the frequency-decoupled feature shift adjustment (FSA) and the entropy-driven staged adaptive alignment (SAA). To mitigate the relational semantic loss, the FSA is proposed to simplify the representation of correlation features and remove redundant information from the source domain, thereby mitigating interference for domain adversarial scenarios. To tackle the shift uncertainty, the SAA is designed to achieve adaptive alignment from global to local guided by information entropy, which dynamically adjusts weights for samples according to their level of uncertainty. The results demonstrate that SSDA significantly outperforms state-of-the-art methods, achieving improvements of up to 7.35% in AP@0.7.

Abstract:
Multi-modal knowledge graph reasoning (MKGR) seeks to conjecture plausible facts in MKGs by learning effective representations from various modalities (e.g., structure, text, and image). However, due to holistic redundancy (i.e., each modality carries task-irrelevant redundancy) and modality conflict (i.e., different modalities contain contradictory information), the reasoning performance of current methods is substantially impaired. In this paper, we propose a novel Consistency Discovery-guided Information Bottleneck (CDIB) framework to address the aforementioned challenges. Specifically, a modality compression module is first designed to learn modality-private entity representations that alleviate redundant information. Then, a consistency discovery module is developed to discover cross-modal consistency during multi-modal fusion and learn comprehensive entity representations. To retain task-relevant information, an information preservation module is devised to further enrich the comprehensive entity representations so that they are predictive for MKGR. Extensive experiments indicate that CDIB achieves state-of-the-art reasoning ability on two benchmark datasets over current MKGR baselines, and also exhibits promising robustness against noise.

Abstract:
NeRF-SLAM and GS-SLAM demonstrate excellent performance in high-fidelity rendering and real-time reconstruction in static scenes. However, real-world environments are often filled with dynamic objects, leading to tracking errors and mapping failures. Several dynamic SLAM approaches have been proposed, but they remain difficult to adopt due to challenges in deployment, framework compatibility, and generalization. To address these challenges, we introduce SLAM-X, the first plug-and-play module designed to universally enhance dynamic scene handling across a range of SLAM architectures. SLAM-X leverages zero-shot segmentation and adaptively tracked sparse optical flow to generate dynamic masks, enabling tracking and mapping correction through continuous scene learning, while removing dynamic artifacts without requiring any task-specific fine-tuning. Extensive experiments on multiple real-world datasets demonstrate that SLAM-X effectively mitigates dynamic disturbances and seamlessly integrates with various NeRF-SLAM and GS-SLAM frameworks, achieving state-of-the-art performance in dynamic environments.

Abstract:
Infrared-visible image fusion for object detection (IVIF-OD) aims to utilize complementary information in the two modalities to synthesize new images with richer information to serve object detection. Most existing works focus on how to better fuse pixel-level details while ignoring object-related information required for detection and introducing redundant and object-irrelevant information in the fused images. To address the limitations of previous studies, this paper proposes a diffusion-based denoising fusion for object detection in infrared-visible images, termed DDFD. Specifically, DDFD treats image fusion as a diffusion-based denoising process to generate fused images that are informative yet non-redundant. Since visible imaging is easily affected by adverse conditions, DDFD exploits an image-adaptive enhancement (IAE) module that adaptively improves visible images to achieve better fusion. To extract key fusion features and remove redundancy, DDFD uses an image-aware noise estimator (INE) to determine the noise in the input infrared-visible images for promoting the diffusion denoising network. To take advantage of both the fusion network and object detection network, DDFD jointly optimizes them such that the fusion network can receive object information to improve the fused images, and the improved images can provide high-quality features to enhance object detection performance. Extensive experiments on the M3FD, DroneVehicle, and VEDAI public datasets reveal the superior object detection performance of DDFD and confirm the effectiveness of IVIF-based object detection under challenging weather conditions.

Abstract:
Multimodal recommendation systems effectively address data sparsity with visual and textual features but face two primary challenges: (1) Modal noise contamination, where high-frequency noise introduced by feature extractors distorts the semantic representation of items, and (2) Limited multimodal feature fusion capability, where conventional static fusion strategies struggle to capture fine-grained user preferences across modalities. To address these challenges, we propose the Frequency-refined graph convolutional network with cross-modal wavelet denoising for Recommendation (FreRec). Specifically, for the first challenge, FreRec introduces a dual denoising framework that: (i) employs wavelet transforms to decompose multimodal features into low-frequency semantics and high-frequency details, applying adaptive soft-thresholding to remove noise while preserving essential information; and (ii) implements a degree-sensitive edge pruning strategy to eliminate noisy connections in the user-item interaction graph, preventing noise propagation during message passing. For the second challenge, FreRec proposes an adaptive multimodal fusion approach that: (i) constructs a dual-view graph convolutional framework that jointly models user-item interactions and intra-modal semantic similarities; and (ii) leverages the behavior-guided frequency fusion module to dynamically adjust modality importance based on user behavior patterns, achieving robust fusion that adapts to individual preferences. Extensive experiments on three benchmark datasets validate the superiority of our model.
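
The wavelet denoising idea in (i) can be illustrated with a minimal sketch: decompose a feature vector into low-frequency and high-frequency coefficients, soft-threshold the high-frequency part, and reconstruct. The sketch assumes a one-level Haar transform and a fixed threshold, whereas the abstract describes adaptive soft-thresholding; the feature values and the threshold are hypothetical.

```python
import math
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)


def haar_forward(x: List[float]) -> Tuple[List[float], List[float]]:
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Invert haar_forward."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def soft_threshold(value: float, t: float) -> float:
    """Shrink a coefficient toward zero by t; values within [-t, t] become 0."""
    return math.copysign(max(abs(value) - t, 0.0), value)


def denoise(x: List[float], t: float) -> List[float]:
    """Soft-threshold only the high-frequency (detail) coefficients."""
    approx, detail = haar_forward(x)
    return haar_inverse(approx, [soft_threshold(d, t) for d in detail])


# Hypothetical item feature vector: a small jitter (4.0 vs 4.2) and a large jump.
features = [4.0, 4.2, 1.0, 5.0]
denoised = denoise(features, t=0.5)
```

The small jitter falls below the threshold and is smoothed away, while the large jump is only shrunk, and the low-frequency coefficients are left untouched so the overall level of the features is preserved.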

Abstract:
The 4D millimeter wave radar has gained increasing attention in autonomous driving due to its robustness against environmental interference compared to other perception devices such as cameras or LiDAR. However, the practical deployment of radar remains challenging due to the noise and high sparsity of radar point clouds, as well as the resource limitations of edge computing. To address these issues, we propose Radar-Mamba, a lightweight and efficient radar enhancement approach for boosting radar perception. The proposed approach contains three main components: cross-modal alignment, radar enhancement architecture based on the Mamba model, and Doppler feature fusion. Specifically, the point clouds of radar and LiDAR are first refined and aligned to build more dense and rich occupancies. The aligned 4D radar data is then enhanced by capturing both local and global spatial-temporal features, while integrating radar-specific velocity and elevation information for further denoising. Experiments on two open-source datasets demonstrate that our method achieves state-of-the-art performance and generates high-quality 4D point clouds with a density comparable to LiDAR while maintaining a low parameter count friendly for practical deployment.

Abstract:
Low-light image enhancement aims to improve brightness, suppress noise, and recover accurate color and structure, requiring precise illumination modeling and reliable reflectance recovery. However, most Retinex-based methods adopt explicit, multi-stage pipelines prone to decomposition bias, error accumulation, and chromatic entanglement between illumination and reflectance. To tackle these issues, we propose IDAR (Implicit Decomposition, illumination Adjustment, and reflectance Restoration), a unified Retinex-inspired framework with two key innovations. First, we design an implicit decomposition strategy based on dual-branch feature learning: a low-frequency-constrained illumination branch models lighting with chromaticity awareness, while a contrast-guided reflection branch preserves details by decoupling reflectance from illumination. This implicit design avoids intermediate supervision and reduces decomposition bias. Second, we introduce the Illumination Chromaticity Expansion Module (ICEM), which employs text-guided chromaticity learning to enhance chromaticity perception. By learning a reflectance-independent spectral representation, ICEM reduces color shifts and improves fidelity under complex lighting. Experiments on multiple benchmarks validate the superior visual quality, quantitative performance, and physical interpretability of IDAR.

Abstract:
Multi-view multi-label learning (MVML) is a significant area of research in multimedia, providing a foundational framework for various real-world applications. However, the richness of its descriptive capabilities often results in high-dimensional data that contains redundant information, negatively affecting model performance. Most existing methods do not thoroughly explore the mapping of the distinctive parts while balancing common and distinctive information. Additionally, there has been limited focus on the relationships between different types of mappings and the high-order constraints among view-specific labels. In this paper, we propose a novel tensor-based method for view-specific label learning that integrates adaptive weight mechanisms into both global non-linear and local linear mappings. This method effectively captures high-order relationships among views and hybrid labels through hierarchical label correlation constraints. Central to our model is the "Opposing yet Complementary" procedure, which enhances feature weight representation at a finer-grained level. Extensive experiments on widely used multi-view multi-label datasets demonstrate significant performance improvements, underscoring the effectiveness of our proposed method.

Abstract:
Pan-sharpening aims to improve the spatial resolution of low-resolution multispectral (LRMS) images by integrating high-frequency information from the corresponding texture-rich panchromatic (PAN) images. While the RWKV architecture has demonstrated remarkable global perception with linear computational efficiency in vision tasks, its inherent sequential scanning mechanism critically compromises local spatial coherence, hindering high-frequency reconstruction. To bridge this gap, we tailor Freq-RWKV, the first spatial-frequency adaptive RWKV featuring dual-domain scanning where wavelet-guided path selection dynamically modulates scanning granularity and orientation according to spectral-spatial information density distributions. Building upon this innovation, we architect a hierarchical U-shaped fusion network that strategically coordinates granularity-aware scanning across spatial and frequency domains, enabling adaptive trade-offs between performance and complexity. The U-shaped architecture implements coarse-to-fine enhancement: in the encoding stage, the Coarse Structural Interaction (CSI-RWKV) module preserves geometric dependencies via window-constrained recurrent scanning while encoding structural priors into LRMS features; during decoding, the Fine-grained Frequency Interaction (FFI-RWKV) module performs edge-aware refinement through differentiable frequency-adaptive window partitioning, where multi-scale spectral wavelet attention prioritizes high-frequency PAN components extracted via discrete wavelet transform (DWT). This hybrid decomposition strategy maintains spectral integrity through approximation coefficients while detail coefficients regulate frequency-gated fusion thresholds. Extensive experiments on multiple satellite datasets validate the effectiveness of the proposed method.

Abstract:
Audio-visual speech recognition (AVSR) leverages complementary visual cues to improve speech recognition. However, in real-world scenarios, both modalities may suffer from noise or occlusion. In such scenarios, most existing fusion strategies overlook the variation in modality-specific quality under different degradation conditions. This limitation may lead to the corrupted modality dominating the fusion process, resulting in worse AVSR performance than unimodal systems, which we term Corrupted Modality Bias (CMB) in this work. To address this, we propose AV-RISE, a self-supervised speech representation learning framework that employs teacher-student self-distillation to robustly reconstruct clean speech representations from corrupted audio-visual inputs. A hierarchical fusion mechanism is designed to progressively refine audio and visual representations by integrating the Suppression and Enhancement Interaction (SEI) module into each layer of the pre-trained encoder. In the SEI module, cross-modal suppression and modality-oriented enhancement are performed to mitigate noise-induced feature inconsistencies, which strengthens the modeling of complementary semantic representations. Extensive experiments on the LRS2 and LRS3 datasets demonstrate that AV-RISE outperforms SOTA AVSR models, especially under extreme degradation. Most importantly, an evaluation of feature similarities between clean and noisy samples shows that the hierarchical SEI-based fusion effectively enhances reliable semantic representations and mitigates CMB.

Abstract:
Rapid urbanization has significantly increased building energy consumption and carbon emissions, making reliable electricity load forecasting crucial for energy management. However, accurate load forecasting faces three key challenges: (1) the complex impact of multimodal data, (2) inter-building semantic relationships, and (3) uncertainty modeling of load patterns. To address these, we propose MMLoad, a novel diffusion-based multimodal framework for multi-scenario building load forecasting with three innovations: (i) a Multimodal Data Enhancement Pipeline generating rich building descriptions using LLMs and integrating temporal factors to analyze multimodal impacts; (ii) a Cross-modal Relation Encoder discovering latent interdependencies through hierarchical fusion, projecting buildings into a unified spatio-temporal (ST) embedding space; and (iii) a Scenario-Conditioned Diffusion Generator employing transformer-based denoising with Scenario-Adaptive Normalization (SAN) for diverse trajectory generation with uncertainty quantification. Experiments show MMLoad outperforms state-of-the-art baselines in accuracy while generating plausible future scenarios, establishing a new paradigm for multimodal learning in smart energy systems.

Abstract:
Test-time adaptation (TTA) offers the potential to enhance model generalizability without relying on training data or retraining processes. However, TTA faces challenges under covariate shift, where discrepancies between the distributions of training and testing phases hinder model performance. This limitation stems from the fact that existing methods usually rely heavily on training data and fail to establish a good connection between the model and the marginal distribution of test data, resulting in reduced generalization ability. To mitigate this issue, we introduce a novel self-supervised framework that integrates latent score matching and pseudo-label refinement into the TTA paradigm to enhance the model's perception of the test data distribution. Our approach, Joint Test-time Adaptation with Refined Pseudo-labels and Latent Score Matching (TAPS), reinterprets a classifier as a score estimator and trains it using pseudo-label refinement. This enables the model to better align with the test distribution through latent score matching, while simultaneously preserving discriminative performance via pseudo-label refinement. Extensive experiments across diverse architectures and benchmarks demonstrate that TAPS consistently outperforms state-of-the-art methods in terms of generalization performance under various distribution shifts.

Abstract:
L1 loss is a classical regression loss function that has achieved remarkable success in image reconstruction (IR) tasks. Theoretically, the L1 loss is the maximum likelihood of images with the Laplace prior distribution, which is simple but not well suited to digital images with integer pixel values. This mismatch results in high reconstruction error in detailed areas (such as boundary areas). To address this, we investigate the effectiveness of classification loss functions like Cross Entropy (CE) loss in IR tasks, which is the maximum likelihood of the Multinomial (or Bernoulli) prior distribution, and propose a novel image reconstruction framework named DichotomyIR. To apply CE loss to image reconstruction, we adopt the dichotomy method to transfer integer pixel values into 8-bit labels and further design a dual-branch Dichotomy Decoder (D-Decoder) in DichotomyIR to reconstruct high-quality (HQ) images I_L1 and I_CE supervised with L1 loss and CE loss, respectively. Next, we analyze the reconstructed uncertainties of these two images with different prior distributions and design an iterative uncertainty elimination (IUE) process. Integrating the popular state space model with the IUE process, we propose the Uncertainty Elimination Mamba (UEM) to eliminate the reconstructed uncertainty iteratively. With the D-Decoder and UEM, DichotomyIR is flexible and can be embedded into any current popular IR method. Extensive experiments on IR tasks demonstrate the effectiveness and efficiency of the proposed DichotomyIR, which strongly supports the importance of uncertainty elimination in IR tasks.
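
The abstract does not spell out the dichotomy method, so the following is only a minimal sketch of one plausible reading: each integer pixel value in [0, 255] is split into 8 binary labels (most significant bit first), and a per-bit cross-entropy is computed against predicted bit probabilities. The function names pixel_to_bits, bits_to_pixel, and binary_cross_entropy are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def pixel_to_bits(value: int, n_bits: int = 8) -> List[int]:
    """Split an integer pixel value into n_bits binary labels, MSB first."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError(f"value {value} does not fit in {n_bits} bits")
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def bits_to_pixel(bits: Sequence[int]) -> int:
    """Inverse of pixel_to_bits: rebuild the integer from MSB-first bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-7) -> float:
    """Mean per-bit cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


if __name__ == "__main__":
    labels = pixel_to_bits(200)
    print(labels)                 # 200 = 0b11001000
    print(bits_to_pixel(labels))  # round trip back to 200
    # Illustrative (made-up) bit probabilities from a hypothetical CE branch.
    print(binary_cross_entropy([0.9, 0.8, 0.1, 0.2, 0.7, 0.1, 0.1, 0.2], labels))
```

Under this reading, an error in a high-order bit changes the pixel value far more than an error in a low-order bit, which is one reason a bitwise classification target behaves differently from an L1 regression target.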

Abstract:
Leveraging Vision-Language Models (VLMs) like CLIP for various downstream tasks has emerged as a significant research trend. Recently, researchers have introduced Test-Time Adaptation (TTA) as a technique for models to learn online from unlabeled samples at test time, improving the generalization performance of VLMs to target domains. However, existing TTA methods either require expensive backpropagating gradient computations for each test sample or only extract knowledge from a limited number of historical test samples in the cache model, resulting in suboptimal adaptation performance. To address these limitations, we propose a Prototype Adaptive Fusion (PAF) framework, a novel TTA approach that makes full use of historical knowledge from test samples. Unlike traditional cache-based methods, which store only a few low-entropy samples per class, PAF introduces a prototype fusion mechanism that constructs class prototype representations through cumulatively merging features from qualified test samples. Furthermore, we propose an enhanced version, Easy-Hard PAF (EH-PAF), which adaptively applies a category-specific strategy based on CLIP prediction to improve performance. Extensive experiments across 15 diverse datasets demonstrate that our method consistently outperforms previous state-of-the-art approaches.
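
As a rough illustration of cumulatively merging test features into class prototypes, the sketch below keeps a running mean per predicted class and only accepts samples whose prediction entropy is below a threshold. The entropy criterion, the running-mean update, and the names PrototypeBank and max_entropy are assumptions made for this example; the abstract does not state how PAF defines a qualified sample or how it fuses features.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class PrototypeBank:
    """Running-mean class prototypes built from confident test samples."""

    def __init__(self, max_entropy: float) -> None:
        self.max_entropy = max_entropy           # assumed qualification rule
        self.prototypes: Dict[int, List[float]] = {}
        self.counts: Dict[int, int] = {}

    def update(self, feature: Sequence[float], probs: Sequence[float]) -> bool:
        """Merge the feature into its predicted class prototype if qualified."""
        if entropy(probs) > self.max_entropy:
            return False                         # too uncertain, skip
        label = max(range(len(probs)), key=probs.__getitem__)
        n = self.counts.get(label, 0)
        if n == 0:
            self.prototypes[label] = list(feature)
        else:
            old = self.prototypes[label]
            # Incremental mean: every accepted sample contributes equally.
            self.prototypes[label] = [
                (o * n + f) / (n + 1) for o, f in zip(old, feature)
            ]
        self.counts[label] = n + 1
        return True


if __name__ == "__main__":
    bank = PrototypeBank(max_entropy=0.5)
    bank.update([1.0, 0.0], [0.95, 0.05])
    bank.update([0.0, 1.0], [0.90, 0.10])
    bank.update([9.0, 9.0], [0.50, 0.50])  # rejected: entropy too high
    print(bank.prototypes, bank.counts)
```

Unlike a fixed-size cache that stores a few samples per class, a running mean uses constant memory per class while still reflecting every accepted sample.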

Abstract:
In response to the growing demands of real-world applications, models must be capable of learning continuously under inconsistent data distributions. However, existing Class-Incremental (CI) methods fail to alleviate domain shifts, while traditional Unsupervised Domain Adaptation (UDA) techniques suffer from catastrophic forgetting and privacy concerns. To address these limitations, we explore Source-Free Class Incremental Domain Adaptation (SFCIDA) and propose a novel approach, Quantifying Samples with Invariance (QSI), for this scenario. Our proposed method involves two main strategies: (1) Semantic Restructuring. We identify confusing source category pairs and restructure images to create a negative dataset that is semantically similar to the source features, refining the decision boundaries among source categories. (2) Invariance Quantification. The sample's confidence is then quantified by its spatial location under the special data distribution, reflecting the trade-off between invariant features and domain shifts. Guided by this strategy, samples' confidence is accumulated for the target model to prioritize reliable categories, not only mitigating the poor performance of experience replay in unsupervised scenarios but also alleviating distribution discrepancies. Experiments demonstrate that our approach outperforms previous methods, establishing new state-of-the-art performance on the Office-31, Office-Home and DomainNet-126 datasets, with average accuracy improvements of over 7.3%, 4.9% and 10.2% respectively.

Abstract:
Although existing diffusion-based image super-resolution methods have achieved remarkable visual quality, they often struggle with fidelity issues, particularly in preserving consistency with the original input image. This issue arises because using low-quality images as conditional inputs introduces substantial errors in the diffusion backward denoising process, making the restored features deviate from target features and thus degrading image fidelity. To improve the accuracy of noise estimation, we propose a dual-memory module to reinforce the input low-quality conditional features, which consists of a pre-trained high-quality memory bank to enrich the structural information and a degradation memory to remove the degradation components. Furthermore, we develop an uncertainty-aware noise estimation framework, utilizing an extra branch in the denoising network to predict pixel-wise uncertainty values, thereby dynamically adjusting the optimization weights for high-uncertainty regions. This adaptive strategy effectively improves the accuracy of noise estimation in challenging reconstruction areas. Experimental results demonstrate that our method significantly enhances the fidelity while preserving high visual quality of diffusion-based super-resolution, improving the reliability of diffusion applications.
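
The abstract does not give the form of its uncertainty-aware objective. As a hedged illustration only, the sketch below uses a common heteroscedastic formulation in which an extra branch predicts a per-pixel log-variance s, and the squared noise-estimation error is weighted by exp(-s) with s added as a regularizer. This is a swapped-in, widely used technique rather than the paper's loss, and the name uncertainty_weighted_loss is hypothetical.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(pred_noise: Sequence[float],
                              true_noise: Sequence[float],
                              log_var: Sequence[float]) -> float:
    """Mean of exp(-s) * (pred - true)^2 + s over all pixels.

    log_var holds the per-pixel log-variance s from a hypothetical
    uncertainty branch. A large s down-weights the squared error at that
    pixel, while the additive s term penalizes predicting high
    uncertainty everywhere.
    """
    if not (len(pred_noise) == len(true_noise) == len(log_var)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for p, t, s in zip(pred_noise, true_noise, log_var):
        total += math.exp(-s) * (p - t) ** 2 + s
    return total / len(pred_noise)


if __name__ == "__main__":
    pred = [1.0, 2.0]
    target = [0.0, 0.0]
    # With s = 0 everywhere the loss reduces to plain mean squared error.
    print(uncertainty_weighted_loss(pred, target, [0.0, 0.0]))
    # Raising s on the pixel with the larger error lowers its contribution.
    print(uncertainty_weighted_loss(pred, target, [0.0, math.log(4.0)]))
```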

Abstract:
Vision-Language Models (VLMs) such as GPT-4V and LLaVA have demonstrated impressive capabilities in multimodal understanding and generation. Unfortunately, their ability to infer sensitive information from visual content raises serious privacy concerns, especially when the images contain personal information. Existing solutions either rely on static alignment mechanisms, such as task-specific prompt tuning, which are vulnerable to adversarial prompts, or irreversible redaction methods that permanently destroy content utility for legitimate users. To address these limitations, we propose a reversible privacy-preserving framework on VLMs via Adversarial Multimodal Key (AMK). Specifically, AMK embeds a learnable adversarial image key into mosaic-obscured images and generates a corresponding text key through a multimodal contrastive learning model. The image key is optimized by gradient-based supervision from white-box VLMs, and the text key is implicitly derived from the image content to avoid exposure during transmission. These keys enable authorized users to restore sensitive information through VLMs, while unauthorized queries are explicitly rejected through a refusal response mechanism. Our experiments across various privacy scenarios show that the proposed method effectively restores redacted content with correct keys and prevents unauthorized disclosure, offering a practical solution for privacy protection in multimodal systems.

Abstract:
Visible watermark removal is crucial for evaluating watermark robustness and advancing more resilient protection techniques. Current methods face challenges in real-world scenarios due to architectural constraints in multi-task frameworks and limited dataset diversity. To address these challenges, we first propose a novel two-stage framework, PatchWiper, consisting of an independent watermark segmentation network and a highly dynamic patch-wise restoration network. This framework decouples watermark localization from background restoration, allowing each network to focus on its designated task. Our restoration network dynamically generates unique parameters for each image patch, enabling fine-grained adaptation to different watermark distortions. Second, we construct the Pixabay Real-world Watermark Dataset (PRWD), which incorporates diverse background images and over 1,000 distinct watermark types, providing a more comprehensive benchmark for evaluating watermark removal methods. Extensive experiments on PRWD, ILAW, and real-world testing images demonstrate our method's superior performance over existing approaches, particularly in handling complex real-world cases.

Abstract:
Text-based Person Retrieval (TBPR) is a challenging task that aims to retrieve pedestrian images according to natural language descriptions. Existing works mainly focus on discriminative feature learning via exploring cross-modal matching methods, while the overfitting issues caused by insufficient labeled data and the absence of well-designed auxiliary tasks are often overlooked. Motivated by the recent progress of large language models (LLMs), we propose a novel method named GPT-ReID for TBPR, which aims to leverage the strong comprehension of LLMs to alleviate the overfitting risk. Specifically, building on GPT, GPT-ReID first introduces an adversarial text generation scheme called GPTGAN, which aims to generate comprehensive strong positive captions and deceptive hard negative captions from the original captions for a single image. Furthermore, a joint auxiliary learning strategy is proposed, which contains Multi-Relation Aware (MRA), Keywords Masked Language Model (KMLM), and Keywords Replacement Detection (KRD), to facilitate global- and token-level optimization, enhancing cross-modal granular representation alignment. Extensive experiments on a set of highly competitive benchmark datasets validate the merits of the proposed GPT-ReID against a range of state-of-the-art methods, with Rank-1 accuracy reaching 78.42%, 69.43%, and 70.06% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.

Abstract:
Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference speed, significant accumulative errors, and limited generative diversity. However, due to excessive reliance on textual data and constrained training objective, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately resulting in poor quality of generated captions. To address this issue, we propose a novel diffusion-based semantics aligned image captioning framework, namely DSACap. Specifically, DSACap deviates from existing methods which treat text as the target of noise-adding and denoising, instead directly applying these processes to the image, thus reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy to maximize the semantic alignment between image and text. We feed the generated textual descriptions into an image generation model to reconstruct the original image and use the cosine similarity between the generated image and the original image as the reward to train the image captioning model. Extensive experimental results on the MS COCO dataset demonstrate that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly open soon.
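
The reward described above is a cosine similarity between the original image and the image reconstructed from the generated caption. A minimal sketch of that computation on feature vectors follows; the abstract does not say which image encoder or image generation model is used, so the embeddings here are made-up placeholders and the names cosine_similarity and caption_reward are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for a zero vector
    return dot / (norm_a * norm_b)


def caption_reward(original_emb: Sequence[float],
                   reconstructed_emb: Sequence[float]) -> float:
    """Reward for a caption: similarity between the original image embedding
    and the embedding of the image regenerated from that caption."""
    return cosine_similarity(original_emb, reconstructed_emb)


if __name__ == "__main__":
    original = [0.2, 0.9, 0.1]            # placeholder embedding
    good_reconstruction = [0.25, 0.85, 0.05]
    poor_reconstruction = [0.9, 0.1, 0.3]
    print(caption_reward(original, good_reconstruction))
    print(caption_reward(original, poor_reconstruction))
```

A caption that preserves more of the visual content should lead to a reconstruction whose embedding is closer to the original, and therefore to a higher reward.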

Abstract:
The uproar over two-view correspondence pruning stems from the advent of the ConvNet-style paradigm, which showcases intrinsic proficiency in local context aggregation, tackling the context-agnostic deficiency of MLP-based methods fundamentally and delivering impressive pruning capability. To further unlock the potential of such a paradigm, this perspective study revisits its design decisions and introduces CorrNeXt, a cutting-edge ConvNet-style pruner that incorporates multiple simple but effective improvements. Firstly, we explicitly integrate 2D relative spatial knowledge into motion field modeling, arming the interconversion between unordered sparse motion vectors and ordered image-structured ones with positional awareness. Secondly, considering that existing methods struggle with perceiving global context due to limited receptive field of small-kernel convolution, we devise a context-orthogonal aggregation module that decomposes computationally expensive large-kernel depthwise convolution along channel dimension into a small square kernel, two orthogonal band kernels, and an identity mapping, enjoying large receptive field while maintaining efficiency. Thirdly, we deploy a motion field pyramid architecture that obtains and fuses multi-level motion fields, thereby facilitating the handling of the motion field's discontinuities in case of large scene disparity. Ultimately, we propose an elastic inference strategy that allows the model to introspect the confidence of its predictions at each layer, through which CorrNeXt is endowed with the flexibility of adaptively determining inference termination according to the difficulty of each image pair. Thorough experimentation affirms CorrNeXt's remarkable capabilities.
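
The elastic inference strategy can be pictured as confidence-based early exit. The sketch below is a toy illustration under stated assumptions: each layer returns an updated state together with a confidence score, and inference stops at the first layer whose confidence reaches a threshold. How CorrNeXt actually estimates confidence is not described in the abstract; elastic_inference, make_toy_layer, and the threshold value are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A layer maps a state to (new_state, confidence in [0, 1]).
Layer = Callable[[List[float]], Tuple[List[float], float]]


def elastic_inference(x: List[float], layers: Sequence[Layer],
                      threshold: float) -> Tuple[List[float], int]:
    """Run layers in order and stop once a layer's confidence reaches the
    threshold. Returns the final state and the number of layers executed."""
    state = x
    used = 0
    for layer in layers:
        state, confidence = layer(state)
        used += 1
        if confidence >= threshold:
            break  # easy input: terminate early
    return state, used


def make_toy_layer(confidence: float) -> Layer:
    """Toy stand-in for a network layer: adds 1 to every entry and reports
    a fixed confidence."""
    def layer(state: List[float]) -> Tuple[List[float], float]:
        return [v + 1.0 for v in state], confidence
    return layer


if __name__ == "__main__":
    layers = [make_toy_layer(0.3), make_toy_layer(0.8), make_toy_layer(0.99)]
    print(elastic_inference([0.0], layers, threshold=0.75))  # exits after 2 layers
    print(elastic_inference([0.0], layers, threshold=1.10))  # runs all 3 layers
```

With this pattern, the compute spent on an image pair depends on how quickly the model becomes confident, rather than being fixed by the network depth.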

Abstract:
Current anomaly detection paradigms face inherent limitations in simultaneously addressing structural anomalies (e.g., geometric distortions) and logical anomalies (e.g., semantic inconsistencies), due to conflicting feature representation requirements between these two anomaly categories. We propose UniAD, a novel dual-branch teacher-student framework that achieves unified anomaly detection through synergistic integration of complementary expertise from heterogeneous vision models without requiring extra manual annotations. In particular, our framework integrates two frozen expert models as teachers: (1) a structural teacher specializing in geometric-sensitive patterns, and (2) a logical teacher focusing on semantic-aware representations via component relationship modeling. To resolve feature conflicts while preserving complementary information, the student network is equipped with one shared backbone and two independent branches. One branch employs multi-scale feature alignment with the structural teacher, while the other branch establishes semantic correspondence with the logical teacher through component-aware attention mechanisms. Furthermore, we introduce a text-guided semantic enhancement module as a form of logical guidance to facilitate the anomaly indicator. Extensive experiments on the challenging MVTec LOCO benchmark validate the scalability of our model in localizing both geometric distortions and semantic inconsistencies. The proposed method outperforms existing single-purpose detectors, yielding 93.7% AUROC for logical anomalies and 93.2% AUROC for structural anomalies.

Abstract:
Grounded Multimodal Named Entity Recognition (GMNER) extends Multimodal Named Entity Recognition (MNER) by identifying named entities, their types, and corresponding image regions. Fine-grained MNER and Grounding (FMNERG) further refines entity categorization. However, existing methods struggle with scarce annotated data, particularly in low-resource scenarios, and often fail to generalize to unseen entities. While vision-language pre-training (VLP) leverages unlabeled image-caption pairs, it primarily learns generic visual-linguistic representations, overlooking fine-grained entity-region alignment crucial for entity-related tasks. To address these challenges, we propose a unified VLP framework for GMNER and FMNERG, introducing two task-specific pre-training objectives: Entity-to-Region Alignment (ETRA) for entity grounding and Region-to-Entity Alignment (RTEA) for entity reconstruction. These tasks jointly optimize fine-grained entity-region alignment. To compensate for the lack of fine-grained multimodal pre-training data, we develop an automatic labeling method that distills entity-oriented knowledge from large-scale unlabeled image-text pairs, enhancing generalization to unseen entities. Extensive experiments on GMNER and FMNERG benchmarks demonstrate that our framework outperforms existing low-resource learning approaches and achieves competitive performance in full-supervision, underscoring its effectiveness across diverse data conditions.

Abstract:
Image captioning aims to create natural language descriptions of images. Recent advancements in image captioning have explored text-only training methods that eliminate the need for image annotations. However, these methods are prone to generating descriptions that include objects that do not actually appear in the image but are instead drawn from the retrieved texts or hard prompts, resulting in object hallucinations. To address this issue, we propose a synergistic prompting mechanism called NASCap (Noise-Aware Decoding with Salient Region Enhancing for Zero-Shot Image Captioning). Our method improves the accuracy of generated captions by designing a fusion model with mask attention that isolates and integrates retrieved captions with input features. A hard prompt mixed with negative entities is designed to further improve the model's robustness to wrong information. Additionally, we introduce a training-free multi-granularity fusion strategy that dynamically perceives salient regions and enhances them in the global representation. Extensive experiments demonstrate that NASCap sets a new state of the art in cross-domain (transferable) captioning, and that our straightforward yet powerful approach outperforms state-of-the-art zero-shot captioning methods based on text-only training by a significant margin.

Abstract:
Generalizable neural implicit surface reconstruction aims to recover accurate surfaces with sparse views from unseen scenes. Most existing methods suffer from severe incompleteness and inaccuracies in the case of reconstruction with large viewpoint variations, as significant perspective distortions across views lead to unreliable feature correspondence and geometry representations. In this paper, we propose a cross-view geometric collaboration framework for generalizable neural surface reconstruction, which exploits cross-view complementary geometric information to improve the accuracy and robustness of reconstruction from sparse views. Specifically, we propose a cross-view geometry complement module that utilizes the reliable geometric information of different views to refine geometric representations. In addition, we construct a distortion-robust patch-based consistency volume to provide supplementary geometric cues for uncertain regions. For the rendering process, we develop a cross-view geometry transformer to adaptively aggregate reliable cross-view point features by considering geometric context along the ray. Finally, we render per-view depth maps and fuse them to reconstruct the final surface. Extensive experimental results on the DTU, BlendedMVS, and Tanks and Temples datasets demonstrate the superior reconstruction quality and view-combination generalizability of our solution.

Abstract:
Federated learning (FL) facilitates collaborative model training without requiring participants to share their raw training data directly. Fairly evaluating client contributions is essential to ensure equitable benefit allocation and sustained participation. However, most existing methods cannot assess contribution fairness before FL training completes, leading to inefficient resource utilization. While before-training evaluation schemes exist, they either violate privacy requirements or impose prohibitive computational costs. To address these limitations, we propose PriCAF, an efficient and privacy-preserving scheme for assessing contributions in FL before model training. Its key innovation involves generating compact, privacy-preserving reduced datasets that encode class distribution, replacing clients' local datasets for assessment. These reduced datasets are aggregated to approximate the global data distribution, eliminating the need for external reference datasets. Extensive experiments demonstrate that PriCAF achieves higher accuracy than state-of-the-art reference-free baselines (before-training assessment) across diverse settings while achieving a 41× speedup in large-scale FL.

Abstract:
Image retargeting aims to adjust and reorganize the content of original images to fit different display sizes and visual requirements. Text elements frequently appear in real-world images and play a crucial role in conveying information. Existing algorithms often treat the image as a whole during retargeting, neglecting the unique features of textual content. This oversight results in missing textual information or distorted character structures, failing to preserve the integrity of text regions and thereby affecting both the efficiency of information transmission and the visual quality of the final image. To address these issues, we start from the perception of textual content, which guides retargeted image generation through the fusion of attention features. Specifically, a Transformer-based model is employed for the image retargeting task in this study. Text and image features are extracted separately, accompanied by a dual-modal feature fusion strategy, which integrates text and image features through the generated attention maps. The training process adopts a cyclic training strategy, where the retargeted results are fed back into the model in reverse. This approach is applicable to retargeting images of various sizes, ensuring that detailed information from both text and image content is accurately preserved. Extensive evaluations on benchmark datasets demonstrate that our method significantly outperforms existing techniques in maintaining both textual clarity and overall visual quality, making it a promising solution for advanced multimedia applications.

Abstract:
Embellishing slides with illustrations is a well-established practice for improving engagement and storytelling. However, this process is challenging, requiring careful consideration of both visual appearance and semantics of illustrations while ensuring they complement rather than overwhelm the slide content. In this paper, we take a pioneering step toward automating this process by introducing the task of Illustration Layout Generation: given a slide and a set of illustrations, automatically determining their optimal sizes and positions to enrich the slide. Existing layout generation approaches struggle with this task as they rely on large-scale layout datasets for training and have limited support for multiple visual inputs. To address these challenges, we propose SlideILG, a method that iteratively optimizes illustration placement using a diffusion-based text-to-image prior. We introduce three key techniques to enhance efficiency and quality: (1) leveraging cross-attention maps from the text-to-image model to initialize illustration placement; (2) employing an over-parameterization strategy to stabilize optimization; and (3) fine-tuning the text-to-image model on high-quality slide thumbnails for more precise guidance. To evaluate SlideILG, we construct IllustrationBench, a benchmark comprising 128 real-world slides, each paired with a set of illustrations for embellishment. Quantitative, qualitative and human-study results demonstrate the effectiveness of our approach. Furthermore, we showcase a real-world application scenario to highlight the significance and practical utility of this task and our method.

Abstract:
Recent advancements in Large Multimodal Models have demonstrated impressive performance in various tasks. However, their capabilities in error detection and resolution for Optical Character Recognition (OCR) remain underexplored. To address this gap, we construct the first visual instruction tuning dataset specifically for detailed OCR error analysis. Building on this foundation, we develop a universal, plug-and-play OCR-Critic model that incorporates three novel dynamic alignment strategies. These strategies systematically mitigate LMMs' weaknesses in OCR tasks by providing coarse-to-fine error feedback. To comprehensively evaluate these capabilities, we introduce OCR-ERROR, a benchmark designed to assess LMMs' ability to detect and categorize OCR errors, covering two task types, diverse error categories, and 2,400 rigorously validated samples. Experimental results show that OCR-Critic effectively identifies fine-grained OCR errors across multiple domains. With the integration of our dynamic alignment strategies, the LMM further achieves substantial performance gains on four prominent benchmarks, demonstrating both versatility and effectiveness.

Abstract:
Infrared (IR) search and track systems are widely applied in aerospace and defense fields. Infrared small target detection (IRSTD) in heavy clouds and chaotic terrestrial environments remains a challenging task. The semantic features of IR small targets are highly prone to vanishing as network layers are added. Transformers, with their quadratic computational complexity, struggle with local feature refinement. To tackle these issues, we introduce a Mamba-driven approach dubbed Spatial-Frequency Mamba Collaborative Learning Network (SMCLNet). Specifically, the perspective transformation structures heterogeneous backgrounds. The reconstructed data couples Mamba's flattened multidirectional scanning mechanism. Given that small targets possess sparse and high-frequency properties, spatial Mamba and frequency Mamba collaboratively enrich the semantic features of small targets. The Texture Enhancement Module (TEM) effectively fuses spatial and frequency features to enhance the contrast information of small targets. To refine the features, the Fine-Grained Reinforcement Module (FRM) integrates multiple gradient operators to inscribe the intact small target profile. Both qualitative and quantitative experiments demonstrate that our proposed SMCLNet outperforms 14 recent benchmark algorithms on multiple public datasets.

Abstract:
Backchannel signals play a critical role in social interaction, expressing attentiveness, agreement, and emotion in both human and human-agent conversations. However, few multi-modal databases exist in this area due to the complexity of categorisation and the high cost of precise timing, especially in naturalistic dyadic conversations. To address these challenges, we introduce CCDb+ (Cardiff Conversation Database +), an enhanced version of CCDb, with 25 newly annotated conversations and corrections to 14 previously annotated conversations, along with thorough consistency checks to ensure annotation reliability. Additionally, we propose a multi-modal process for backchannel detection as a baseline, showing that both visual and acoustic cues contribute significantly to understanding backchannel behaviour. Recognising that backchannel signals often intersect with other social cues, we introduce several detection sub-tasks, such as smile, nodding, and agreement, with baseline results for each. Finally, we demonstrate multi-modal paradigms for nuanced signals like nodding and thinking. The database and associated annotations are publicly available at https://huggingface.co/datasets/CardiffVisualComputing/CCDb.

Abstract:
Micro-expression analysis (MEA) is crucial for detecting subtle emotional cues, with applications in lie detection and psychological assessment. Existing methods struggle with three main challenges: 1) noise sensitivity arising from the inherent subtlety of micro-expressions; 2) reliance on fixed priors and apex annotations; and 3) information redundancy, with static features often dominating over dynamic emotional cues. To address these challenges, we propose AC4AU, a framework inspired by Regulatory Focus Theory (RFT) that utilizes structured representation learning to decompose dynamic emotional patterns from redundant features. Specifically, AC4AU first leverages a face recognition backbone to extract robust yet redundant static representations. Second, a Frequency-aware Redundancy Decomposer (FRD) is introduced to eliminate the Direct Current component and retain the dynamic and process-sensitive features. Finally, a dynamic expert allocation mechanism, embodied by the AU-specific Expert Router (AUsER), is adopted to learn localized facial motion patterns and capture long-term relationships, enabling AU-targeted supervision and enhancing generalization across diverse datasets. Rigorous experiments demonstrate that the apex-free AC4AU achieves performance comparable to state-of-the-art apex-dependent methods. Additionally, we conduct a statistical analysis that provides insights into the AU dependencies. Code will be made available upon request.
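
To make the idea of eliminating the Direct Current (DC) component concrete: for a temporal sequence of feature values, the DC component is the zero-frequency coefficient of the discrete Fourier transform, which equals the sum of the sequence, so removing it amounts to subtracting the temporal mean. The sketch below only illustrates this identity on a one-dimensional toy signal. It is not the FRD module, whose design the abstract does not describe, and the function names dft and remove_dc are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def remove_dc(signal: Sequence[float]) -> List[float]:
    """Subtract the temporal mean, which zeroes the k = 0 (DC) coefficient
    and leaves every other frequency bin unchanged."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


if __name__ == "__main__":
    # Toy per-frame feature: a large static offset plus a small oscillation.
    frames = [5.0, 5.5, 5.0, 4.5]
    dynamic = remove_dc(frames)
    print(dynamic)               # the static offset of 5.0 is gone
    print(abs(dft(frames)[0]))   # DC magnitude before removal (sum of frames)
    print(abs(dft(dynamic)[0]))  # DC magnitude after removal (about 0)
```

In this toy signal the static offset dominates the raw values, and removing it leaves only the small frame-to-frame variation, which mirrors the abstract's motivation of separating dynamic cues from redundant static features.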

Abstract:
Next Point-of-Interest (POI) recommendation aims to predict users' subsequent destinations based on historical check-in sequences, thereby enhancing travel experiences. While traditional methods primarily rely on unique identifiers (IDs) to represent POIs, they face data scarcity challenges. Recent multi-modal approaches offer alternatives but struggle with two key issues: inadequate handling of heterogeneity between ID and multi-modal features, and difficulties in unified framework integration, limiting their potential benefits. To address these limitations, we propose IM-POI, a novel framework that leverages the complementary strengths of both ID embeddings and multi-modal representations for next POI recommendation. In our framework, a global POI weighted transition graph inspired by TF-IDF captures sequential dependencies and enhances memorization capabilities, while a geographical graph incorporates spatial information into multi-modal features to be consistent with real-world visitation patterns. To address representation integration, we introduce an IM-Aligner module to prevent representation collapse during distribution matching. Extensive experiments on three real-world datasets demonstrate that IM-POI significantly outperforms state-of-the-art baselines.
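
One possible reading of a TF-IDF-inspired transition graph is sketched below. The weighting scheme (normalized transition count times a log inverse source frequency), the function name `tfidf_transition_weights`, and the toy check-in data are all assumptions made for illustration; the abstract does not specify how IM-POI computes its edge weights.

```python
import math
from collections import defaultdict

def tfidf_transition_weights(sequences):
    """Weight each POI transition (src -> dst) in a TF-IDF-like way.

    Assumed scheme (hypothetical, not taken from the paper):
      tf  = count(src -> dst) / total outgoing transitions of src
      idf = log(num_sources / number of distinct sources that reach dst)
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1

    # For each destination, collect the distinct sources that lead to it.
    sources_of = defaultdict(set)
    for src, dsts in counts.items():
        for dst in dsts:
            sources_of[dst].add(src)

    num_sources = len(counts)
    weights = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        for dst, c in dsts.items():
            tf = c / total
            idf = math.log(num_sources / len(sources_of[dst]))
            weights[(src, dst)] = tf * idf
    return weights

# Toy check-in sequences (made up for this example).
checkins = [
    ["cafe", "office", "gym"],
    ["cafe", "office", "bar"],
    ["home", "office", "gym"],
]
weights = tfidf_transition_weights(checkins)
```

Under this assumed scheme, a transition that a user makes often is weighted up, while a destination reachable from almost every POI is weighted down, mirroring how TF-IDF discounts terms that appear in most documents.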

Abstract:
Multi-modal recommender systems (MRSs) have emerged as critical multi-modal technologies, but do we truly leverage multi-modal content properly? Through an empirical study of four diverse, real-world datasets spanning various recommendation scenarios, we observe that MRSs exhibit a stronger tendency to recommend items with high similarity to users' past interactions in terms of multi-modal content than conventional RSs. While this tendency improves the recommendation accuracy, it introduces a previously unexplored bias that significantly impacts user experience. We define this bias as User-side Content Bias: users who prefer items similar to their historical choices receive higher quality recommendations than those seeking diverse options. We show that User-side Content Bias is unrelated to the activity of users, indicating a fundamental limitation in current MRSs. We propose ISOLATOR: utIlizing uSer-side cOntent simiLarity via a model-AgnosTic framewORk to leverage multi-modal content more properly. ISOLATOR estimates the impact of User-side Content Similarity and proposes two intervention strategies to meet the needs for more accurate and unbiased recommendations. Extensive evaluations on several widely used datasets demonstrate that ISOLATOR consistently improves various state-of-the-art MRSs and effectively addresses the User-side Content Bias.

Abstract:
With the rapid growth of 3D content, there is an increasing need for intelligent systems that can search for complex 3D shapes using simple natural language queries. However, existing approaches face significant limitations. They rely heavily on manually labeled datasets and use fixed similarity thresholds to determine matches, which restricts their ability to generalize and accurately retrieve novel or diverse 3D shapes. To bridge these gaps, this paper introduces Open3DSearch, the first attempt to address the challenge of open-domain text-to-shape precise retrieval. Our core idea is to transform 3D shapes into semantically representative 2D views, thereby enabling the task to be handled by mature large vision-language models (LVLMs) and allowing for explicit cross-modal matching judgments. To realize this concept, we design a view rendering strategy to mitigate potential information degradation during 3D-to-2D conversion while capturing the maximal amount of query-relevant information. To evaluate Open3DSearch and advance research in this field, we present the Uni3D-R benchmark dataset, designed to simulate precise associations between user queries and 3D shapes in open-domain contexts. Extensive quantitative and qualitative experiments demonstrate that Open3DSearch achieves state-of-the-art results.

Abstract:
Unsupervised cross-domain image retrieval (UCDIR) aims to retrieve images across different domains without the guidance of labels. However, existing UCDIR methods assume that data can be shared across domains in plaintext, which is often impractical due to strict data privacy protection policies. In this paper, we propose ShieldIR, a novel privacy-preserving unsupervised cross-domain image retrieval framework that enhances the retrieval performance across two domains while safeguarding data privacy. ShieldIR unifies intra-domain and cross-domain representation learning through a Dual Protection Transformation (DPT) module, which introduces a structured feature space via orthogonal projection and ensures data privacy by adding calibrated differential privacy noise. This transformation allows ShieldIR to preserve semantic structure while formally protecting private data. For intra-domain representation learning, ShieldIR enhances discriminability by using DPT to map prototypes into an independent feature space and subsequently aligning the resulting dual-protected prototypes with instance-level features. For cross-domain alignment, ShieldIR maps intra-domain features and cross-domain prototypes into a shared structured space using DPT, achieving semantic alignment under privacy constraints. Extensive experiments on real-world datasets demonstrate that our ShieldIR outperforms state-of-the-art methods while effectively protecting data privacy.
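
The combination of an orthogonal projection with calibrated differential privacy noise can be illustrated with a small sketch. It assumes the classical Gaussian mechanism and a random orthonormal basis built by Gram-Schmidt; the function names, the clipping step, and all parameter values are hypothetical and are not taken from the ShieldIR DPT module.

```python
import math
import random

def random_orthonormal_rows(k, d, rng):
    """Build k orthonormal row vectors in R^d with Gram-Schmidt (k <= d)."""
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip (numerically) dependent draws
            rows.append([a / norm for a in v])
    return rows

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def protect(feature, basis, clip_norm, epsilon, delta, rng):
    """Clip a feature, project it orthogonally, then add calibrated noise."""
    norm = math.sqrt(sum(x * x for x in feature))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in feature]
    projected = [sum(b * x for b, x in zip(row, clipped)) for row in basis]
    # Orthonormal rows never increase the L2 norm, so replacing one
    # clipped feature by another changes the output by at most 2 * clip_norm.
    sigma = gaussian_sigma(2.0 * clip_norm, epsilon, delta)
    return [p + rng.gauss(0.0, sigma) for p in projected]

rng = random.Random(0)
basis = random_orthonormal_rows(3, 5, rng)
noisy = protect([3.0, 4.0, 0.0, 0.0, 0.0], basis, 1.0, 0.5, 1e-5, rng)
```

Clipping before projecting is what makes the noise scale well defined here: without a bound on the feature norm, no finite sensitivity exists and the noise could not be calibrated.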

Abstract:
To mitigate data sparsity in Sequential Recommendation, Cross-Domain Sequential Recommendation (CDSR) exploits dynamic knowledge transfer across domains. Traditional CDSR approaches merge specific-domain sequences into mixed-domain sequences to reconnect users' dispersed interests. However, most methods rely on unidirectional transfer between mixed and specific domains on each domain task, overlooking the complex interplay between mixed-domain and domain-specific dynamics. Moreover, token-level transfer between coinciding domain sequences fails to consider inherent sequential dynamics. To address these limitations, we propose Multi-Domain Enhancement via Residual Interwoven Transfer (MERIT). Specifically, MERIT enhances domain representations along multiple domain-to-domain paths, leveraging the proposed extended cross-attention fusion compatible with partially overlapping sequences. To facilitate such transfers, MERIT further employs MoE networks in encoders to generate both intra-domain and inter-domain representations. In addition, by integrating stopped-gradient mixed-domain representations into specific-domain representations, MERIT enables the model to learn the residual signal of the mixed-domain information, better aligning with downstream specific-domain tasks. Extensive experiments on three real-world datasets demonstrate that MERIT consistently outperforms state-of-the-art CDSR counterparts with statistical significance.

Abstract:
Personalized news recommendation aims to deliver content aligned with user interests. However, most existing methods rely on the objective textual content of news, overlooking the subjective social review that reflects how the news is socially perceived. Inspired by social constructionism, we propose Social Review-aware Recommendation (SRec), a novel framework that integrates both objective content and the social review. The latter is constructed through group deliberation modeled by an agent-based social simulator, providing structured representations of collective understandings toward news. In addition, SRec incorporates a reasoning-guided explanation module that produces interpretable rationales by aligning user preferences with the social review of news. Experimental results on the MIND-small and MIND-large datasets demonstrate that SRec improves AUC by at least 2.45% over competitive baselines. Further analysis confirms the value of the social review generated by the simulator, and shows the flexibility of SRec as a lightweight enhancement to existing recommendation systems.

Abstract:
The exponential growth of video content necessitates efficient summarization techniques that balance local redundancy reduction and global dependency modeling. In this work, we introduce VSumMamba, an innovative video summarization approach that leverages Selective State Space Models to address the quadratic complexity limitations of Transformer-based approaches while overcoming the restricted long-range modeling capabilities of CNNs. The proposed framework comprises three core components: 1) a Multi-Scale Aggregator, 2) a Cascaded Temporal Modeling Module with bi-directional Mamba blocks for temporal representation enhancement, and 3) a Parallel Spatial Modeling Module employing spatial Mamba blocks, operating in concert to effectively refine spatiotemporal video representations. Through three specialized multi-scale spatial-temporal modeling schemes, VSumMamba demonstrates the ability to balance computational efficiency and summarization performance. Comprehensive evaluations on benchmark datasets demonstrate VSumMamba's superior performance, achieving 67.5% and 56.0% F1-scores on TVSum and SumMe respectively, while maintaining lower computational cost compared to existing state-of-the-art methods.

Abstract:
Recent advances in generative modeling have enabled the synthesis of high-quality artistic images. Nevertheless, systematic evaluation of generative models from an aesthetic standpoint is still lacking, which hinders progress in artistic image synthesis. Existing evaluation metrics, such as Fréchet Inception Distance (FID) and CMMD, struggle with aesthetic assessment: they rely on pretrained visual features that overlook nuanced artistic attributes and employ distance functions ill-suited for modeling the diverse, multi-modal distribution of artistic styles. To address these limitations, we propose ArtFRD, a metric specifically designed for generative aesthetic evaluation. Grounded in aesthetic theory, ArtFRD extracts visual features along four key aesthetic dimensions (brushstroke, composition, lighting, and color) to capture fine-grained artistic properties. To model the multi-modal nature of artistic styles, we adopt a Gaussian Mixture Model assumption and derive an efficient approximation of the Fisher-Rao distance, which serves as the final evaluation score. Extensive experiments demonstrate that ArtFRD aligns significantly better with human aesthetic judgments than existing metrics, even across a wide range of artistic styles. These results highlight its potential as a robust and interpretable foundation for future research in generative aesthetic evaluation.
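
For intuition about the Fisher-Rao distance, the sketch below evaluates its known closed form for two univariate Gaussians. This is only the simplest special case, chosen here for illustration: Gaussian mixtures have no general closed form, and the mixture-level approximation used by ArtFRD is not reproduced. The function name is hypothetical.

```python
import math

def fisher_rao_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    d = 2*sqrt(2) * atanh( sqrt( ((mu1-mu2)^2 + 2(sigma1-sigma2)^2)
                               / ((mu1-mu2)^2 + 2(sigma1+sigma2)^2) ) )
    """
    num = (mu1 - mu2) ** 2 + 2.0 * (sigma1 - sigma2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (sigma1 + sigma2) ** 2
    return 2.0 * math.sqrt(2.0) * math.atanh(math.sqrt(num / den))

# With equal means the distance reduces to sqrt(2) * |ln(sigma2 / sigma1)|.
same_mean = fisher_rao_gaussian_1d(0.0, 1.0, 0.0, math.e)
shifted = fisher_rao_gaussian_1d(0.0, 1.0, 3.0, 1.0)
```

Unlike the Fréchet distance used by FID, this distance depends on the ratio of the standard deviations rather than their difference, so it is unchanged when both distributions are rescaled by the same factor.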

Abstract:
This paper presents a novel culture-specific LoRA framework that enhances AI-generated art with authentic cultural representation, aiming to both preserve and promote cultural heritage. Grounded in the Cultural Iceberg Model, the proposed approach redefines the traditional LoRA pipeline by introducing an image preprocessing step and leveraging Large Language Models for image tagging, enhancing the overall quality of the training dataset compared to generic LoRA. It also enhances the prompting process for image generation to produce more culturally authentic outputs. Additionally, artistic style transfer is applied to the resulting photorealistic imagery to enrich the visual narrative. Extensive experiments across three distinct cultural contexts (Tanka, Yi, and Inuit), supported by both quantitative and qualitative evaluations, demonstrate that our approach significantly improves cultural authenticity. This work underscores the potential of AI to safeguard and revitalize cultural heritage through generative art.

Abstract:
Super-resolution (SR) images, generated by advanced algorithms to enhance resolution under hardware constraints, are increasingly applied across various multimedia tasks. However, the absence of a paired high-resolution (HR) reference image and the inherent ill-posedness of SR reconstruction present key challenges for SR image quality assessment (SR-IQA). Full-reference methods become inapplicable, while reduced-reference methods relying on one low-resolution (LR) image offer limited reliability. To address these issues, I propose the SQer, a no-reference SR-IQA method based on a graph perceptron with semantic fidelity. The SQer first extracts perceptual and hierarchical SR image features using a superposition nonlinear feature pooling. These features are transformed into graph vector representations, allowing semantic information learning via a graph-structured attention perceptron. Finally, the resulting graphs are globally average-pooled into a semantic embedding, which is then processed by a multilayer perceptron to predict the SR image quality score. Extensive experiments on multiple SR-IQA benchmarks demonstrate that my proposed SQer significantly outperforms existing state-of-the-art reference-based methods, exhibiting superior accuracy and a stronger ability to capture fine-grained perceptual cues and SR-specific artifacts. The SQer method provides a promising direction for guiding the optimization and application of image super-resolution models.

Abstract:
With the rise of Text-to-Image (T2I) models, generating face images from text prompts has emerged as a prominent research area. However, evaluating the quality of these generated face images, particularly with respect to fine-grained facial attributes, remains a significant challenge. To address this, we introduce the Fine-grained Text-to-Face Image Quality Assessment (FineTFIQA) database, which is designed to evaluate the ability of T2I models to generate fine-grained face images. To the best of our knowledge, this database is the largest of its kind, containing 7,218 face images generated from 1,000 text prompts that cover 111 distinct facial attributes. A large group of subjects was invited to assess the quality of text-to-face images on four evaluation dimensions: perceptual quality, human likeness, attractiveness, and consistency. Additionally, we develop the Multi-Dimensional Text-to-Face Image Quality Assessment (MDTFIQA) method based on the Large Language Model (LLM), which combines both face image features and text features to evaluate generated images on all evaluation dimensions. Extensive experimental results demonstrate that traditional face image assessment methods and general image quality assessment methods are inadequate for accurately evaluating generated text-to-face images. Our method significantly outperforms these existing methods on all evaluation dimensions, proving to be an effective method for assessing the quality of generated text-to-face images.

Abstract:
Dense multi-label action detection in untrimmed long videos is a formidable task. End-to-end training is particularly challenging due to computational constraints, so pipelines typically involve separate stages of off-the-shelf feature extraction and subsequent global modeling for action prediction. Existing methods fail to optimize all modules jointly for better performance. We introduce FreETAD, a Frequency-based End-to-end Temporal Action Detection approach, which shifts the focus from local actionness scores to frequency component estimation. Using the short-term Fourier Transform, FreETAD reconstructs the global action curve seamlessly. With a DETR-like decoder and frequency-encoded vectors for queries, it enhances multi-scale time-frequency interactions. FreETAD leverages end-to-end training effectively, boosting the mAP by 1.5% on Charades and 2.7% on MultiTHUMOS.

Abstract:
The core issue in novel view synthesis lies in how to handle dynamic scenes that are common in the real world, such as dynamic objects, occlusions, and varying luminance. Current 3D Gaussian Splatting-based methods perform excellently in static scenes but often rely on fixed camera parameters, precise semantic prior segmentation, or specially designed rendering loss functions when dealing with dynamic scenes. These additional pieces of information limit the method's generalization ability, real-time performance, and application in AR, VR, and multimedia. To address these issues, we propose Wild3A, a comprehensive end-to-end integrated framework: it directly regresses 3D point positions and initial point confidence via the MASt3R's Transformer and integrates a Bayesian estimation module based on multimodal information fusion. We adopt a self-supervised joint optimization approach for scene representation and camera parameters. Extensive experiments on both public and private datasets show that Wild3A effectively eliminates visual artifacts in various dynamic scenes, achieves state-of-the-art results across multiple tasks, and achieves real-time rendering at 1000+ FPS.

Abstract:
Physical adversarial attacks on person detectors reveal critical vulnerabilities in safety-critical vision systems such as autonomous driving and surveillance. While recent methods enhance attack efficacy and robustness, they often neglect realistic garment deformations and motion blur, limiting real-world performance. In this work, we propose FAB-Attack (Fabric-aware and Blur-resistant Attack), a new adversarial attack that simulates realistic garment deformation during training and targets both person detectors and image deblurring models. To enhance attack effectiveness under varying clothing deformations, we introduce a Fabric-aware Texture Appliance (FTA) module, which applies adversarial textures to clothing regions and simulates realistic fabric dynamics via physics-inspired TPS. To better emulate real-world conditions, we develop a differentiable pipeline incorporating motion blur and deblurring processes. Moreover, we demonstrate the stability of low-frequency information during the generation and removal of motion blur. Based on this insight, we design a frequency band separation mechanism that suppresses high-frequency components in adversarial patterns to further enhance robustness against motion blur. Experimental results demonstrate that our approach achieves SOTA performance, reducing AP to 25.2% on the COCO dataset and achieving a 94.4% ASR in the real world under severe motion blur.

Abstract:
Dynamic novel view synthesis (NVS) aims to render time-varying scenes from arbitrary viewpoints, balancing rendering quality and computational efficiency. While recent 4D Gaussian Splatting approaches offer promising real-time performance, they fundamentally overlook critical interdependence between Gaussians by modeling deformations independently. Our information-theoretic analysis reveals substantial mutual information across the Gaussian field, manifesting as appearance-preserving radiance coherence and motion-consistent deformation propagation. This finding establishes that rendering quality emerges from coordinated transformation rather than independent processing. We propose Correlation-aware Dynamic Gaussian Splatting (CaDGS) with our novel Gaussian Correlation Tensor Projection (GCTP) method, which efficiently transforms the complex O(n^3) mutual information tensor into a dual-channel O(n^2) spatial matrix, preserving the critical topological structure of Gaussian interactions. Combined with our Spatio-Temporal Deformation Consistency (STDC) learning, which enforces volumetric coherence through tensor-guided regularization across multiple scales, CaDGS prevents geometric distortions and texture inconsistencies common in previous approaches. Experimental results demonstrate state-of-the-art performance, achieving 32.4 PSNR on the Neu3D dataset with fewer Gaussians while maintaining rendering speeds of 323 FPS at 1353 × 1014 resolution.

Abstract:
The detection of hand contact states, which involves identifying interactions between hands and objects or other entities, is essential for the development of human-computer interaction systems and the comprehension of social dynamics. Previous approaches have made progress in modeling hand-object interactions. Nonetheless, they neglect critical cues between their hands and bodies, as well as those of others, thus constraining their ability to accurately detect interpersonal contact. The task remains challenging due to frequent occlusions, especially in crowded multi-person scenarios with complex contexts. In this paper, a novel hand-object-person interaction network, called HOPNet, is proposed to model contextual information between hands and objects, as well as between hands and bodies. Specifically, HOPNet consists of two components: (i) the Hand-Object Relation (HOR) module analyzes interaction patterns between hands and objects, capturing spatial and semantic relationships; (ii) the Contrastive Spatial Refinement (CSR) module learns hand-body interactions through contrastive geometric embedding and relative spatial enhancement, improving interpersonal contact recognition in crowded scenarios. Experiments on ContactHands and 100DOH datasets demonstrate that HOPNet outperforms state-of-the-art methods.

Abstract:
Remote Sensing Image-Text Retrieval (RSITR) is a fundamental task in the remote sensing (RS) field and has seen significant progress in recent years. However, existing methods often overlook explicit attention to semantic entities in RS scenes, limiting their capabilities in fine-grained semantic modeling and cross-modal matching, thereby hindering retrieval performance. To address these limitations, we propose a novel framework, Entity-level Alignment with Prompt-guided Adapter (EAPA), which enhances retrieval performance by explicitly perceiving, embedding, and aligning semantic entities in RS images and texts. Built upon the Contrastive Language-Image Pretraining (CLIP) model, EAPA comprises three key modules: the Prompt-guided Attention Adapter (PAA) module, the Pseudo-label-supervised Entity Embedding (PEE) module, and the Cross-modal Entity-level Semantic Alignment (CESA) module. Specifically, PAA freezes the CLIP backbone and introduces learnable prompt vectors to capture RS-specific entity-level semantic knowledge, guiding attention distribution and enhancing semantic representations. To obtain cross-modal consistent entity-level representations, PEE employs an entity query-based encoder to extract entity embeddings of both images and texts, and uses pseudo semantic labels as supervision to ensure that each embedding corresponds to a unique and well-defined semantic category. Based on this, CESA performs one-to-one alignment of cross-modal entity embeddings that correspond to the same semantic category, effectively avoiding mismatches and enhancing fine-grained alignment. Extensive experiments on the RSICD and RSITMD datasets demonstrate that EAPA outperforms state-of-the-art methods across multiple metrics, validating the effectiveness of each module in enhancing fine-grained semantic modeling and cross-modal matching.

Abstract:
Event cameras, with their microsecond-level temporal resolution and sparse visual encoding, provide a transformative paradigm for automatic lip reading (ALR). However, event data inherently lack explicit spatial structure and exhibit a pronounced frequency-domain bias. The low-frequency components fail to capture crucial lip structural information, which fundamentally impedes the modeling of intra-frame topological dependencies and inter-frame semantic evolution, both of which are critical for robust lip reading. To this end, we propose FAST-HG, a Frequency-Aware SpatioTemporal HyperGraph framework specifically designed for event-based lip reading. First, we apply low-frequency perturbation to improve the model's robustness for capturing discriminative features, and integrate adaptive high-frequency filtering to enhance edge-aware representations. Then, we construct a Spatial Region Hypergraph (SRH) and a Temporal Semantic Hypergraph (TSH). The former captures intra-frame topological dependencies among lip regions, while the latter explicitly models inter-frame structural associations throughout the lip movement process, enabling the model to capture discriminative patterns in lip dynamics. Furthermore, we propose a viseme-aware label smoothing strategy, where a novel viseme-level edit distance is designed to quantify visual similarities between classes and guide the construction of soft labels. FAST-HG achieves 79.85% and 84.03% accuracy on the DVS-Lip and DVS-LRW100 datasets, respectively, significantly outperforming prior methods and establishing a new benchmark for event-based lip reading.
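
The idea of viseme-aware label smoothing can be sketched as follows. The sketch swaps in the standard Levenshtein distance over viseme sequences and an exponential weighting; the phoneme-to-viseme table `VISEME_OF`, the function names, and the `smoothing` and `temperature` parameters are all hypothetical, and the paper's own viseme-level edit distance may be defined differently.

```python
import math

# Hypothetical phoneme-to-viseme grouping, for illustration only.
VISEME_OF = {"p": "P", "b": "P", "m": "P", "f": "F", "v": "F",
             "t": "T", "d": "T", "n": "T", "ae": "A", "eh": "A"}

def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def soft_labels(target, phonemes_by_class, smoothing=0.1, temperature=1.0):
    """Spread `smoothing` mass over other classes; visually closer ones get more."""
    visemes = {c: [VISEME_OF[p] for p in ps] for c, ps in phonemes_by_class.items()}
    others = [c for c in visemes if c != target]
    scores = {c: math.exp(-edit_distance(visemes[target], visemes[c]) / temperature)
              for c in others}
    total = sum(scores.values())
    labels = {c: smoothing * s / total for c, s in scores.items()}
    labels[target] = 1.0 - smoothing
    return labels

# Toy vocabulary: "bat" and "pat" look identical on the lips, "fan" does not.
classes = {"bat": ["b", "ae", "t"], "pat": ["p", "ae", "t"], "fan": ["f", "ae", "n"]}
labels = soft_labels("bat", classes)
```

Compared with uniform label smoothing, this assigns more of the smoothing mass to words that are visually confusable with the target, so the classifier is penalized less for mistakes that the lips alone cannot disambiguate.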

Abstract:
In recent years, adversarial attacks on video recognition models have attracted increasing attention. However, most existing strategies are extensions of image-based methods, where adversarial perturbations are computed independently and embedded into individual frames. This independent per-frame perturbation process wastes computational resources and leads to excessive query consumption. To address this problem, we introduce Frequency Domain Distributed Perturbations (FDP), a straightforward yet effective black-box video attack method using temporal correlations between video frames. Specifically, FDP first converts the input video into the frequency domain and calculates globally coordinated adversarial perturbations in the spectral space. By conducting global optimization in the frequency domain, FDP improves the effectiveness of each query, significantly decreasing the total number of queries needed. The resulting perturbations are temporally distributed across frames to preserve the spatiotemporal structure. Furthermore, we introduce a frequency-sensitive mask to identify the spectral regions most critical to the model's predictions. By applying perturbations only to these key frequency bands, FDP further reduces the perturbation search space and improves query efficiency. Extensive experiments demonstrate that our method significantly reduces query consumption while achieving higher attack success rates than state-of-the-art approaches.

Abstract:
Accurately identifying correct correspondences in two images is a crucial task in computer vision. Current methods predominantly use PointCN blocks as feature extraction backbones and learn local-global consensus through a progressive learning strategy. However, such methods have two main drawbacks: First, PointCN blocks, composed of multilayer perceptrons and normalization layers, process spatial positions independently, leading to limited interaction between channel-wise and spatial-wise dimensions. Second, the progressive learning strategy primarily focuses on unidirectional transfer from local to global consensus, yet neglects the bidirectional interaction between local and global consensus. To address these issues, we propose the Channel-Spatial interaction and Bidirectional Consensus interaction-Based Network (CSBCNet), which contains three innovative blocks: Channel-Spatial Interaction (CSI), Local Consensus Mining (LCM), and Global Consensus-Aware Attention (GCAA). Specifically, CSI enhances interaction between channel-wise and spatial-wise dimensions through a dual-path attention mechanism, addressing the limited interaction caused by the independent processing of spatial positions in PointCN blocks. LCM extracts reliable local consensus by modeling geometric structures and spatial continuity within correspondences. GCAA captures global consensus by aggregating correspondences that are highly likely to be correct ones, and achieves bidirectional interaction between local and global consensus through cross attention. Experiments demonstrate our CSBCNet's superior performance in camera pose estimation and correspondence pruning. Notably, when the CSI block is applied to the existing OANet and MS2DGNet networks, it achieves significant performance improvements of 10.27% and 7.5%, respectively, on the mAP5° metric on the camera pose estimation task.

Abstract:
Transfer-based adversarial attacks have endowed adversarial examples with the ability to transfer from a source model to an unknown target model, which poses a more realistic threat to security-critical applications. Existing transferable adversarial attacks generally suffer from overfitting to the source model, i.e., the perturbations are locally optimal in the source model and focus on the model-specific information. To improve transferability, we require the adversarial perturbation to contain more generalized knowledge, which reveals the intrinsic general properties and can introduce model-general optimum into adversarial examples. To this end, we devise a Bi-level Bias Mitigated Attack (BBMA), which empowers the transferability of adversarial examples by exploring generalization at two levels: 1) Progressive filtering of high-frequency sample components. We first propose to remove the sample-specific high-frequency components of samples to explore model-level generation. To simulate how a model evaluates feature importance at different stages, we devise a stride-wise step-tuning strategy to progressively produce multiple samples for aggregating the gradients. 2) Accumulated gradient-guided model attention shift. To facilitate the sample-level bias mitigation, we employ an accumulated gradient-guided attention map to distort the more generalized features during perturbation generation. Comprehensive experiments on several benchmarks demonstrate the superiority of our method in attack transferability over state-of-the-art attacks.

Abstract:
Egocentric action recognition holds critical value in augmented reality, embodied AI, and human behavior analysis. While transformer-based masked autoencoders show potential in general video representation learning, their direct application to egocentric vision faces fundamental limitations: random masking strategies disrupt crucial spatiotemporal features like hand-object interaction hierarchies and viewpoint dynamics by neglecting task-specific semantic priors. Through systematic analysis, this paper reveals three complementary semantic priors for egocentric video understanding: verb-centric motion patterns characterizing hand trajectories, noun-aware attention regions highlighting object contact points, and action-oriented global context integrating holistic semantics. These hierarchical cues address egocentric visual specificity through motion granularity, interaction locality, and semantic integrity. Building on this discovery, we propose EgoHierMask: a hierarchical semantic prior-guided masked autoencoder framework coordinating vision-language knowledge through differentiated masking strategies. The framework employs frozen vision-language teacher models to generate multi-level semantic attention maps, systematically guiding three specialized masking branches: a) dynamic motion masking preserves hand movement continuity through temporal verb attention, b) interaction-sensitive masking maintains object manipulation coherence via spatial noun saliency, and c) spatiotemporal joint masking encodes complete action semantics through global context alignment. Additionally, to enhance learning efficacy, we curate a distribution-balanced pretraining corpus and devise a unified architecture with dual-granularity supervision, combining pixel-level reconstruction with semantic-level distillation within the VideoMAE paradigm. Extensive experiments demonstrate state-of-the-art performance across major benchmarks, validating the crucial value of hierarchical prior injection for egocentric representation learning.

Abstract:
Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge acquired from a source domain with abundant data to the target domain with limited labeled samples. Recent advancements have enhanced model generalization through Perturbation Augmentation (PA), facilitating more effective knowledge transfer. However, PA-based CD-FSL methods still suffer from two critical challenges, i.e., (1) limited diversity of augmented samples, making it difficult to cover the true distribution of unseen domains, and (2) conflicting gradients during model optimization, where augmented and original samples drive the model's optimization in opposing directions. To address these issues, we propose a novel PA-based framework with Style-Decoupled Augmentation (SDA) and Gradient-Conflict Adjustment (GCA) for Cross-Domain Few-Shot Learning, which is termed "SG-FSL". Specifically, SDA decouples the source domain style into style weights and basis styles, generating diverse unseen styles by perturbing the style weights to reweight the basis styles. Meanwhile, GCA leverages the angular relationships between the domain-specific gradient directions of augmented and original features, adaptively adjusting the gradient directions of original features to ensure that the model acquires diverse domain knowledge without interference, guiding it toward conflict-free optimization. Comprehensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and superiority of our method over state-of-the-art baselines.
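
Adjusting a gradient based on its angle to another gradient can be illustrated with a projection in the style of PCGrad, which is swapped in here as a well-known stand-in: when the two gradients point in opposing directions (negative inner product), the conflicting component of the original-sample gradient is removed. The function names are hypothetical, and the actual adjustment rule used by GCA is not specified in the abstract.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def adjust_conflicting_gradient(g_orig, g_aug):
    """Return g_orig unchanged if its angle to g_aug is at most 90 degrees;
    otherwise remove from g_orig its component along g_aug."""
    inner = dot(g_orig, g_aug)
    aug_sq = dot(g_aug, g_aug)
    if inner >= 0.0 or aug_sq == 0.0:
        return list(g_orig)
    coef = inner / aug_sq
    return [a - coef * b for a, b in zip(g_orig, g_aug)]

g_orig = [1.0, 0.0]   # gradient from original features (toy values)
g_aug = [-1.0, 1.0]   # gradient from augmented features (toy values)
adjusted = adjust_conflicting_gradient(g_orig, g_aug)
```

After the projection the adjusted gradient is orthogonal to the augmented-sample gradient, so a step along it no longer undoes progress on the augmented samples to first order.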

Abstract:
Existing prototype learning-based Multiple Instance Learning (MIL) methods mainly focus on learning a single set of prototypes for each class or generating a generic prototype from the overall data distribution. This design forces the model to compress general and heterogeneous features into identical prototype embeddings, prioritizing general features over subtle but discriminative features. Additionally, these methods often guide prototype updates by jointly optimizing attention score distributions and the distances between instances and prototypes, resulting in prototype biases due to over-concentration. To address these issues, we propose a dual prototype learning MIL (DP-MIL) framework that introduces two distinct sets of prototypes: primary prototypes, which capture general WSI features, and boundary prototypes, which capture discriminative features near the decision boundary. The DP-MIL framework employs three prototype-tailored losses: an alienation loss to encourage primary prototypes to be distant from decision boundaries, an affinity loss to anchor boundary prototypes near these boundaries, and a distance loss to enforce separation between the two prototype sets. To mitigate prototype semantic drift during training, we introduce a prototype joint updating and refinement strategy: for each prototype, we use its corresponding global token to filter out the most similar instances to momentum update the corresponding prototype set, while the boundary prototype set is refined with the mean pooled feature of hard samples. Extensive experiments on four datasets demonstrate the effectiveness of our DP-MIL framework and prototype updating strategy.

Abstract:
Graph classification is a fundamental machine learning problem with extensive applications in multimedia and biochemical analysis. Contemporary graph classification models usually require precise graph labels for supervision, even after self-supervised pre-training. However, in practical applications, the extensive precise annotation of graphs could be expensive or impractical. To exploit data efficiently, this work studies partial label graph learning, in which each graph is linked to a set of candidate labels but only one of them is accurate. Label ambiguity would bring difficulties in extracting graph semantics and the risk of overfitting noisy partial labels. Here, we present a novel approach called Coupled Dual Separation (CODE). To improve graph semantics mining under label ambiguity, our CODE contains a message passing branch and a graph kernel branch, which explore graph semantics implicitly and explicitly, respectively. To facilitate information exchange, we utilize one branch to separate partially labeled graphs into an informative set and an uninformative set, which provides guidance for the optimization of the other branch. Furthermore, to mitigate the risk of overfitting, parameters in coupled branches are partitioned into critical and non-critical ones for separated optimization procedures. Extensive experiments on several benchmark datasets validate the effectiveness of the proposed CODE.

Abstract:
Immersive telepresence aims to authentically reproduce remote physical scenes, enabling the experience of real-world places, objects and people over large geographic distances. This requires the ability to generate realistic novel views of the scene with low latency. Existing methods either depend on depth data from specialized hardware setups or precomputed templates such as human models, which severely restrict their practicality and generalization to diverse scenes. To address these challenges, we introduce RIFTCast, a real-time template-free volumetric reconstruction framework that synthesizes high-fidelity dynamic scenes from a multi-view RGB-only capture setup. The framework is specifically targeted at the efficient reconstruction, transmission and visualization of complex scenes, including extensive human-human and human-object interactions. For this purpose, our method leverages a GPU-accelerated client-server pipeline that computes a visual hull representation to select a suitable subset of images for novel view synthesis, substantially reducing bandwidth and computation demands. This lightweight architecture enables deployment from small-scale configurations to sophisticated multi-camera capture stages, achieving low-latency telepresence even on resource-constrained devices. For evaluation, we provide a comprehensive high-quality multi-view video data benchmark as well as our reconstruction and rendering code, including tools for loading and processing a variety of data input formats, to facilitate future telepresence research.

Abstract:
Optimal Transport (OT) has emerged as a principled framework for learning mappings between probability distributions by minimizing transportation costs. While neural OT methods have achieved remarkable success in dual-domain (N=2) image-to-image (I2I) translation, their extension to multi-domain settings (N>2) remains challenging due to the quadratic complexity (O(N^2)), leading to computational inefficiency and poor scalability. In this work, we propose Diffusion-Cascaded Neural Optimal Transport (DCNOT), a novel approach that reduces the complexity of multi-domain I2I translation to linear (O(N)) by leveraging the contracting properties of the forward diffusion process and a cascaded OT strategy. We first prove theoretically that the Wasserstein-2 distance between domains contracts progressively under diffusion noise injection, enabling the alignment of all domains to a shared approximate domain. The remaining distributional shifts are then decomposed into smaller, more tractable gaps bridged via cascaded neural OT mappings, ensuring both efficiency and fidelity. Extensive experiments on synthetic and real-world benchmarks demonstrate that DCNOT achieves state-of-the-art scalability in multi-domain translation while preserving or surpassing the quality of prior OT-based methods. Our work establishes a new paradigm for scalable multi-domain learning with optimal transport.
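The contraction claim above can be checked numerically in a simple special case. The sketch below assumes two one-dimensional Gaussian "domains" and a variance-preserving forward process x_t = sqrt(a) x_0 + sqrt(1 - a) eps; neither assumption comes from the abstract, which does not specify the noise schedule. For 1D Gaussians the Wasserstein-2 distance has the closed form sqrt((m1 - m2)^2 + (s1 - s2)^2), so no OT solver is needed.

```python
import math

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form Wasserstein-2 distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def diffuse(mean, std, alpha_bar):
    """Mean and std of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps,
    with x_0 ~ N(mean, std^2) and eps ~ N(0, 1) (assumed schedule)."""
    new_mean = math.sqrt(alpha_bar) * mean
    new_std = math.sqrt(alpha_bar * std ** 2 + (1.0 - alpha_bar))
    return new_mean, new_std

domain_a = (-2.0, 0.5)   # (mean, std) of a toy source domain
domain_b = (3.0, 1.5)    # (mean, std) of a toy target domain

# alpha_bar = 1 means no noise; smaller values mean more noise injected.
alpha_bars = [1.0, 0.8, 0.5, 0.2, 0.05]
distances = []
for ab in alpha_bars:
    ma, sa = diffuse(*domain_a, ab)
    mb, sb = diffuse(*domain_b, ab)
    distances.append(w2_gaussian_1d(ma, sa, mb, sb))

for ab, d in zip(alpha_bars, distances):
    print(f"alpha_bar={ab:.2f}  W2={d:.4f}")
```

In this toy setting the distance shrinks monotonically as more noise is injected, which is the behaviour that lets all domains be aligned to a shared approximate domain.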

Abstract:
With the advancement of autonomous driving technology, there is an increasing demand for high-quality and diverse images of road traffic scenes. Style transfer techniques can be employed to synthesize large-scale datasets. However, existing image style transfer methods often exhibit suboptimal performance in transferring styles for road scenes, frequently struggling to maintain structural consistency. In this paper, we propose a novel network architecture for unsupervised image style transfer named SVDGNet. This architecture dynamically adjusts the weights of different image regions during model training by calculating the Shapley values for the source and target domain images. We also employ a pre-trained diffusion model to generate better-stylized images. The experimental results demonstrate that the proposed method outperforms existing methods, preserving the structural consistency of the source domain images while providing impressive style transfer results.

Abstract:
Generating 3D cities from satellite imagery opens up new avenues for gaming, urban planning, and cinematic production. However, the limited information from satellite views presents significant challenges, hindering existing methods from generating high-quality cities that meet application standards. To address these challenges, we propose CitySculpt, a UV diffusion-based framework for generating 3D cities with high-fidelity geometry and photorealistic textures. Specifically, we first generate the detailed 3D geometries by refining coarse structures using a UV normal diffusion network. Building on these refined geometries, we introduce a texture generation approach that produces photorealistic textures despite the limited satellite information. To ensure style consistency across multiple objects, we design a cross-attention mechanism that enables feature sharing among them. Additionally, we contribute the CitySculpt dataset, a collection of high-quality 3D urban assets with multi-view renderings and comprehensive annotations to advance research in 3D city generation. Experiments demonstrate that CitySculpt outperforms state-of-the-art approaches in both generating detailed individual buildings and creating cities with high visual quality and rich architectural details.

Abstract:
Image outpainting is in growing demand across many real-world applications. The core capability this task calls for is generating image content beyond the boundaries that is semantically aligned with the source image. Compared to other image generation tasks, image outpainting remains very challenging since we need to identify the scene of the source image and generate new yet consistent boundaries with little local context. However, one common tendency of outpainting techniques is to generate irregular high-frequency patterns. Furthermore, the dominant data-driven learning paradigm used by existing state-of-the-art methods requires sophisticated model design, incurs significant computation cost, and introduces potential bias as well.

Abstract:
Vision-language models (VLMs) have demonstrated impressive cross-modal alignment. However, their internal mechanisms of associating text concepts with visual patterns remain opaque. This opacity raises a critical question: What visual patterns do VLMs inherently associate with text concepts? Current methods for decoding representations of VLMs often produce suboptimal outputs, making it difficult to probe clear visual patterns. To address this, we introduce Generative Semantic Probing (GSP), a novel training-free framework that synthesizes images to probe the implicit semantic preferences of VLMs. Our method generates visual patterns that maximize the similarity to the target text embeddings, through three core components: (1) Hierarchical Feature Decomposition, which decomposes the image generation across multi-scale feature levels; (2) Feature Space Constraint, which constrains the optimization within a semantically meaningful feature subspace; (3) Quality Assessment Module, which ensures the generation of visually plausible outputs. Experiments validate our method's strengths in high-fidelity image generation and interpretable model analysis. Beyond text-to-image generation, style transfer and image editing applications, our framework enables unprecedented visualization of VLMs' decision boundaries. By exposing implicit preferences and systematic biases in the cross-modal association, our work provides valuable insight for both understanding and improving vision-language alignment.

Abstract:
Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a compressed-to-fine language modeling approach to address the challenge of long sequence speech tokens within neural codec language models: (1) Fine-grained Initial and Short-range Information: Our approach retains the prompt and local tokens during prediction to ensure text alignment and the integrity of paralinguistic information; (2) Compressed Long-range Context: Our approach compresses long-range token spans into compact representations to reduce redundant information while preserving essential semantics. Extensive experiments on various neural audio codecs and downstream language models validate the effectiveness and generalizability of the proposed approach, highlighting the importance of token compression in improving speech generation within neural codec language models. The demo of audio samples will be available at https://anonymous.4open.science/r/SpeechTokenPredictionViaCompressedToFinedLM.

Abstract:
Generating highly realistic 4D interaction in real time is significant for visual content generation. Although existing works have been shown to produce impressive dynamics by employing physical simulation and learned material mainly from pre-trained video diffusion models, it is still challenging to generate real-time 4D interaction with high-quality motion due to the heavy time consumption of the simulation solver and the indirect material learning strategy. This paper proposes a novel physics-based 4D generation method, Phys4DRT, for arbitrary realistic real-time interaction on 3D Gaussian Splatting (3DGS) objects with direct motion supervision in the time-frequency domain. Specifically, we devise a fast and differentiable eXtended Position Based Dynamics (XPBD) simulator as the light-weight controller for efficient physical evolution on a quasi-regular tetrahedral proxy mesh, into which we immerse the static 3DGS for efficient and stable deformation simulation. In addition, to learn the heterogeneous material for realistic motion, we directly supervise the generated dynamic 3D behavior by the motion representation of the optical flow and spectral volume extracted from the generated reference video, rather than indirect supervision in the color space used in previous approaches. We conduct thorough experiments on public benchmarks to demonstrate the efficiency and effectiveness of our method. Our model accelerates real-time 4D interaction generation by approximately 20× compared with current Material Point Method (MPM) based approaches while achieving competitive visual quality compared with the state-of-the-art baselines.

Abstract:
While the widespread adoption of diffusion models in image generation has showcased remarkable capabilities, it has also inadvertently opened the door to malicious exploitation. Recent research has primarily concentrated on protecting images from the misuse of diffusion-based customized generation (CG). However, these approaches often overlook that image details can still be enhanced through diffusion-based super-resolution (SR) techniques, significantly increasing the risks of personal image leakage and abuse. To combat these multifaceted risks, we propose the Zero Matrix-guided Adaptive Image Vaccine (ZMAIV) framework. Specifically, we introduce the Self-attention Removal strategy, tailored for CG, which disrupts the model's core mechanism of focusing on sensitive spaces. Concurrently, the High-frequency Removal strategy is proposed to impede the reconstruction of high-frequency details by SR. These defense strategies effectively dismantle the underlying mechanisms that facilitate unauthorized data extrapolation. Moreover, the proposed Adaptive Space Search Attack precisely targets critical spaces within images for vaccine injection, optimizing perturbation placement to minimize perturbation conflict while maintaining defense performance. Extensive experiments demonstrate that the proposed ZMAIV outperforms the state of the art in simultaneously defending against diffusion-based CG and SR, affirming its superiority in safeguarding visual content against these dual threats.

Abstract:
The rapid evolution of deepfake techniques presents dual challenges for detection models: adapting to continuously shifting attack distributions while retaining previously learned knowledge. Although recent continual deepfake detection methods have made progress, they often rely on replay-based training, which limits scalability and deployment. Meanwhile, the task structure of deepfake detection offers a unique opportunity that remains under-explored: it is inherently a binary classification problem with a fixed label space, where the main difficulty lies in distributional drift rather than class expansion. This insight enables the modeling of each incremental distribution shift as a dedicated expert, focusing on specific forgery patterns. To this end, we propose a novel analytically driven, replay-free continual detection framework that eliminates the need for iterative gradient updates. In this framework, task-specific experts are constructed via closed-form ridge regression, requiring only a single forward pass and ensuring non-interference with previous tasks. To enhance the model's capacity for fine-grained forgery recognition, we introduce a lightweight Forgery-Aware Residual Enhancer (FARE). At inference, an Uncertainty-Guided Expert Selection module (UGES) dynamically routes each sample to the most confident expert, which does not require prior knowledge of the attack type. The proposed framework achieves a favorable trade-off between efficiency, privacy, and generalization. It achieves state-of-the-art performance across four benchmark datasets, with an average accuracy of 91.82% and only 1.78% forgetting. Notably, it improves cross-forgery generalization by 9.28% on unseen forgery types, demonstrating strong generalization.
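The closed-form ridge regression step mentioned above can be sketched as follows. This is a minimal illustration that assumes each expert is a linear head fitted on fixed backbone features, with the standard solution W = (XᵀX + λI)⁻¹XᵀY; the feature dimension, regularization value, and synthetic data are placeholders, and the FARE and UGES modules are not modeled.

```python
import numpy as np

def fit_ridge_expert(features, labels, reg=1.0):
    """Closed-form ridge regression: solve (X^T X + reg * I) W = X^T Y.
    No gradient steps are involved; one pass over the features suffices."""
    dim = features.shape[1]
    gram = features.T @ features + reg * np.eye(dim)
    return np.linalg.solve(gram, features.T @ labels)

# Synthetic stand-in for frozen-backbone features of one incremental task.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))      # 200 samples, 16-dim features
true_w = rng.normal(size=(16, 2))          # hypothetical real/fake linear head
labels = features @ true_w                 # noiseless targets for the toy check

weights = fit_ridge_expert(features, labels, reg=1e-3)
```

Because each expert is solved independently from its own task data, fitting a new expert leaves earlier experts untouched, which is consistent with the non-interference property the abstract describes.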

Abstract:
The application of deep learning in voice cloning has significantly enhanced the quality of cloned voices. While advanced voice cloning technologies are widely applied across various domains, they also pose serious security challenges such as producing natural Deepfakes. In response, numerous studies have focused on detecting fake voices, with many reporting outstanding performance. However, is the issue truly resolved? This paper introduces Adversarial Neural Mimicry Attack (ANMA) which leverages a specialized model to predict the behavior of other similar models, transforming black-box attacks into white-box scenarios indirectly. Based on ANMA and Speaker-irrelative Features (SiFs), we propose a novel black-box attack framework called SiFMimicEvader, designed to evade fake voice detectors with high success rates and minimal query requirements. The framework utilizes speech representation models as the breakthrough to predict the behaviors of fake voice detectors and employs a series of SiFs editing operations as perturbations to deceive these detectors. Experimental results demonstrate the effectiveness of SiFMimicEvader, achieving an average attack success rate exceeding 50% across various detectors, significantly outperforming other attack methods, while also showing great performance in audio quality and query scale, indicating its high availability in real-world scenarios.

Abstract:
Multimodal deepfakes pose growing security threats across diverse domains, driven by rapid advancements in generative models. This demands effective Multimodal Deepfake Continual Detection (MDCD) methods capable of adapting to evolving and heterogeneous deepfake techniques. However, MDCD remains underexplored, facing two major challenges: (1) modality-specific feature disparities limit the effectiveness of simple feature fusion, exacerbating the forgetting of previous forgery-relevant knowledge; and (2) newly introduced deepfake videos are initially limited in scale and gradually expand, causing a class imbalance dominated by forged samples that undermines authentic content understanding in upcoming tasks. To address these issues, we propose the Analytic Synaptic Dynamic Scaling Balancer (ADanser) that adapts to modality-specific biases and class imbalance while employing a closed-form update to preserve prior multimodal deepfake knowledge in an evolving data stream. Inspired by synaptic scaling in neuroscience, ADanser introduces a modality synaptic scaling mechanism that applies modality-aware attention to extract discriminative and complementary forgery patterns, improving cross-modal knowledge retention. Additionally, a class-wise contribution balancer dynamically reweights learning signals to reduce class bias and enhance authentic video representation. Extensive experiments on benchmark multimodal deepfake datasets demonstrate that ADanser significantly outperforms state-of-the-art continual learning methods, effectively coordinating adaptation and retention in imbalanced, cross-modal scenarios.

Abstract:
Video analytics pipelines (VAPs) have been a paradigm for large-scale video analytics. Due to temporal redundancy in video, frame filtering is widely used in VAPs to reduce analysis workload. However, existing works overlook a limitation: while inference operates only on selected frames, decoders must still process many redundant frames due to codec dependencies, leading to an over-decoding trap. This limitation stems from the reference-based design in modern codecs, which requires decoding preceding frames to reconstruct any selected one. As a result, over-decoding has become the practical bottleneck in VAPs using modern decoders, highlighting a critical but under-explored problem. To address this issue, we propose ParaDeco, a high-throughput video analytics framework featuring a novel frame-level parallel generative decoder. Unlike traditional decoders, ParaDeco adopts a decode-what-matters approach with decoupled frame dependencies. To decode arbitrary frames independently, ParaDeco generates frame-wise features as standalone skeletons using compressed video metadata, then predicts pseudo frames maintaining semantic consistency with original frames. Moreover, ParaDeco identifies which frames truly matter for analysis via delicate contribution-based frame filtering. We implement ParaDeco on a cloud server and evaluate it on large-scale real-world video datasets. Our experimental results show that ParaDeco achieves a 2.76× speedup on average compared to state-of-the-art VAPs.
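The over-decoding effect described above can be made concrete with a small counting sketch. It assumes a simplified codec layout that the abstract does not specify: fixed-size groups of pictures (GOPs), each starting with a keyframe and followed only by P-frames that reference the immediately preceding frame. Under that assumption, reconstructing a selected frame requires decoding every frame from the start of its GOP. The GOP size and selected indices are hypothetical.

```python
def frames_to_decode(selected, gop_size):
    """Return the set of frame indices a reference-based decoder must
    process to reconstruct the selected frames (simplified I-P-P-... GOPs)."""
    needed = set()
    for idx in selected:
        gop_start = (idx // gop_size) * gop_size   # keyframe of this GOP
        needed.update(range(gop_start, idx + 1))   # all frames up to idx
    return needed

selected = [7, 25, 58]          # frames kept by a hypothetical frame filter
needed = frames_to_decode(selected, gop_size=30)

overhead = len(needed) / len(selected)
print(f"selected={len(selected)} decoded={len(needed)} overhead={overhead:.1f}x")
```

In this toy case three selected frames force 55 frames through the decoder, which illustrates why filtering frames for inference alone does not remove the decoding bottleneck.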

Abstract:
Federated learning (FL), an emerging data-secure distributed training paradigm, unites massive isolated Internet of Things (IoT) device nodes to collaboratively train a global neural network (NN) model without the exposure of their local multimedia data. However, constrained by the synchronous NN model integration nature of FL, there is a training latency inconsistency among heterogeneous devices, which significantly deteriorates FL training efficiency. Meanwhile, frequent local NN training and transmission impose high energy consumption pressure on users. To tackle these issues, this paper proposes a premium multi-width NN-assisted hierarchical FL (HFL) framework in heterogeneous cloud-edge-device computing to achieve remarkable training speedup and energy conservation. Specifically, a heterogeneity-aware NN width coefficient determination algorithm, which flexibly assigns a subnet with a suitable width to each user device based on its computing ability, is first applied to shorten the HFL training latency. Subsequently, to integrate subnets with different width topologies, we design a width-aware adaptive NN model integration approach to effectively ensure the accuracy of the integrated global NN model. Finally, a latency-aware energy saving strategy is introduced to reduce energy consumption. Experimental results demonstrate that our proposed framework outperforms state-of-the-art benchmarks, and attains up to 42.42% enhancement in accuracy, 81.5% reduction in training latency, and 40.9% optimization in energy cost.

Abstract:
Existing quality of experience (QoE)-driven adaptive bitrate (ABR) algorithms either fail to consider personalized QoE or rely on over-simplified QoE models, all resulting in unsatisfactory streaming experiences. Recognizing the wide existence of user feedback schemes in existing streaming applications, we introduce Q+, a framework leveraging progressively gathered personal user opinion scores from multiple interaction sessions for enhanced user-system alignment. Q+ first innovates QoE modeling by incorporating both pairwise ordinal and cardinal preferences constructed from scores. The capturing of both preferences ensures reliable and robust preference representation. Moreover, we design a monotonic neural network as the QoE model to capture the inherent monotonicity property in ABR services, improving model expressivity and generalization ability even with limited human feedback. To align the policy with the progressively updated QoE, we then develop a value-based reinforcement learning (RL) algorithm for bitrate control that integrates reward relabeling and calibrated prioritized experience replay. Extensive experiments reveal that Q+ consistently surpasses state-of-the-art rule-based, control-based, and RL-based baselines within only three sessions, improving QoE by 5.69% to 29.39% across diverse network conditions.

Abstract:
Video streaming dominates global internet traffic, yet conventional pipelines remain inefficient for structured, human-centric content such as sports, performance, or interactive media. Standard codecs re-encode entire frames, foreground and background alike, treating all pixels uniformly and ignoring the semantic structure of the scene. This leads to significant bandwidth waste, particularly in scenarios where backgrounds are static and motion is constrained to a few salient actors. We introduce GenStream, a semantic streaming framework that replaces dense video frames with compact, structured metadata. Instead of transmitting pixels, GenStream encodes each scene as a combination of skeletal keypoints, camera viewpoint parameters, and a static 3D background model. These elements are transmitted to the client, where a generative model reconstructs photorealistic human figures and composites them into the 3D scene from the original viewpoint. This paradigm enables extreme compression, achieving over 99.9% bandwidth reduction compared to HEVC for the continuous data stream. We partially validate GenStream on Olympic figure skating footage and demonstrate potential for high perceptual fidelity under minimal data. While acknowledging the significant computational costs shifted to the client and challenges in generalization, GenStream opens new directions in volumetric avatar synthesis, canonical 3D actor fusion across views, and personalized viewing experiences, laying the groundwork for scalable, intelligent streaming in the post-codec era.

Abstract:
Understanding and measuring Quality of Experience (QoE) is crucial for optimized but still user-centered multimedia systems. However, current assessment methods rely largely on one-dimensional subjective ratings collected post hoc and therefore fail to capture how users actually experience quality in real time. Inspired by advances in neuroimaging, we investigate whether QoE can be assessed directly from brain activity. We propose a novel approach using functional near-infrared spectroscopy (fNIRS) to objectively measure perceptual quality during multimedia service interaction. In a preliminary study with 8 participants, we recorded fNIRS signals while viewers watched videos of varying quality. Results show a statistically significant increase in oxygenated hemoglobin in the prefrontal cortex in low-quality conditions, suggesting elevated cognitive effort or reduced perceptual fluency. These findings establish a neural signature of degraded quality perception and demonstrate the usefulness of fNIRS for neuro-based objective QoE estimation. Unlike traditional techniques, our method provides continuous, real-time, implicit quality measurement without interrupting the user. This work calls for a rethinking of QoE as a neuroperceptual phenomenon rather than a subjective judgment and proposes a neuro-based QoE framework.

Abstract:
Deep learning models have emerged as a promising alternative to conventional approaches for plant disease identification, a critical challenge in agricultural production. However, the existing plant disease datasets are insufficient to address the complexities of real-world agricultural scenarios, such as multi-crop disease, unseen, few-shot, and domain-shift adaptation. Additionally, the lack of standardized evaluation protocols and benchmark datasets hinders the fair evaluation of models against these challenges. To bridge this gap, we introduce Deep-Plant-Disease, the largest and most diverse dataset with novel text data designed to enhance model generalization in multi-crop disease identification. We revisit and reformulate the task by establishing a standardized evaluation framework that supports consistent benchmarking and guides future research. Through experiments, we further validate the robustness and adaptability of models trained on our dataset, highlighting their effective transferability to real-world agricultural challenges.

Abstract:
In smartphone image signal processing (ISP), different parameter settings can yield diverse color renditions, even when images have similar color accuracy and aesthetic quality. A key yet underexplored question is: which rendition does a specific user or demographic prefer? This is difficult to answer due to the subjective nature of preference. Existing assessments focus on color fidelity or aesthetics using visibly degraded images, limiting their ability to capture subtle color preferences in similar image sets. Averaged metric predictions further obscure individual perceptual differences. To address these gaps, we present the Smartphone Photography Color Preference (SPCP) dataset, the largest of its kind, designed to evaluate color preferences arising from ISP-induced variations. The SPCP dataset comprises 12,000 images derived from 1,000 diverse scenes, with each scene rendered into 12 distinct variants. These variants include (i) real-world captures from six flagship smartphones and (ii) synthetic images generated through systematic variation of key ISP parameters. To obtain reliable ground-truth annotations, we conduct a large-scale psychophysical study involving 20 subjects under controlled laboratory conditions. Subjects perform exhaustive pairwise comparisons among the 12 variants for each scene, yielding fine-grained human preference data. Using this dataset, we identify three key challenges in modeling color preferences and outline the corresponding desiderata for the development of effective computational color preference models. The dataset is publicly available at: https://huggingface.co/datasets/zwx8981/SPCP_dataset.
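The scale of the exhaustive pairwise protocol above follows from simple counting, sketched below. The per-scene pair count and the image total follow directly from the figures in the abstract; the final total is only an upper bound, since it assumes every subject rated every scene, which the abstract does not state.

```python
from math import comb

variants_per_scene = 12
scenes = 1000
subjects = 20

# Total images: consistent with the 12,000 reported in the abstract.
total_images = variants_per_scene * scenes

# Exhaustive pairwise comparison among 12 variants: C(12, 2) unordered pairs.
pairs_per_scene = comb(variants_per_scene, 2)

# Pairs covering all scenes once.
pairs_all_scenes = pairs_per_scene * scenes

# Upper bound on judgments, ASSUMING each subject rates every scene.
max_judgments = pairs_all_scenes * subjects

print(total_images, pairs_per_scene, pairs_all_scenes, max_judgments)
```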

Abstract:
Rainy weather typically leads to significantly reduced ambient illumination due to overcast skies. However, most existing image deraining datasets overlook this critical physical condition. They are usually constructed by linearly superimposing rain layers onto clean background images, without accounting for illumination degradation. This simplification introduces a clear domain gap between synthetic and real-world rainy images, thus limiting the generalization capability of current deraining algorithms. Moreover, existing methods predominantly focus on removing rain streaks while ignoring the simultaneous degradation caused by low-light conditions. To address these limitations, we introduce a new joint task: image deraining and low-light enhancement. Specifically, we construct a physically plausible dataset that simulates rainy scenes under low-light conditions, incorporating both rain streaks and raindrops with illumination-aware degradation modeling. In addition, we propose a baseline deraining network based on a multi-scale Mamba architecture, which jointly restores rain-free and well-lit images by effectively modeling both global illumination and local rain interference. Extensive experiments demonstrate that our method outperforms existing deraining approaches. The proposed dataset is released at https://drive.google.com/file/d/1QXxHqpYL7Q1TR5tdvAHm8tV2BdZgpGOc/view?usp=sharing.

Abstract:
We present ICS-MR, a dataset containing three conversational scenarios designed for the evaluation of communication quality in Mixed Reality (MR) systems. Along with detailed descriptions of the conversation tasks, we provide all the materials required to incorporate the tasks into MR user studies. The materials also support application of the scenarios in real-world and video-conferencing contexts for studies that, for example, call for comparison of immersive systems against reference communication media. Open-source Unity implementations of the scenarios are also made available, supporting direct usage of the scenarios in distributed, multi-user experiments. The conversation tasks have all been administered in recent scientific works that address the evaluation of user experiences in immersive communication systems, allowing analysis and comparison of each scenario's evoked behavioral properties. The ICS-MR dataset therefore contributes valuable resources for further research on communication in immersive systems.

Abstract:
As immersive 360° video experiences through head-mounted displays (HMDs) gain widespread adoption, the need for real-time, fine-grained assessment of Quality of Experience (QoE) becomes increasingly critical for optimising user engagement and system performance. This paper introduces RCQoEA-360VR, a novel multi-modal dataset designed for continuous QoE evaluation in virtual reality (VR) environments. In a controlled study (N=32), participants watched five selected 360° video sequences across eight different video quality configurations (from the VQEG database) using a Vive Pro Eye while providing continuous QoE annotations via a touchpad-based input method, enhanced by the DotMorph peripheral visualisation technique. The dataset also includes synchronised physiological signals (electrocardiogram and galvanic skin response), behavioural data (eye and head movements) and post-viewing QoE ratings gathered through a within-VR interface. RCQoEA-360VR addresses a critical gap in existing public datasets by providing fine-grained, synchronised multimodal data for immersive QoE analysis. It offers a unique and valuable resource for the research community, supporting a wide range of research applications, including QoE prediction, behavioural modelling, adaptive streaming, and implicit perceptual analysis.

Abstract:
Human-centric videos play a significant role in the pervasive video content of modern life. However, the capabilities of text-to-video (T2V) generation models and video-to-text (V2T) understanding models for human-centric videos remain largely unexplored. To this end, we present HVEval, the first comprehensive evaluation dataset focusing on human-centric videos, which consists of 20,000 videos, 60k MOS annotations across 3 dimensions (i.e., spatial quality, temporal quality, and text-video correspondence), and 20k category-specific Q&A pairs. Based on the HVEval dataset, this paper aims to answer three questions: (1) can today's T2V models effectively generate human-centric videos following the given prompts? (2) how effective are today's V2T LMMs in understanding and evaluating human-centric videos? (3) are current VQA metrics good enough for evaluating human-centric videos? Comprehensive evaluations of 24 T2V models, 20 LMMs, and 18 VQA metrics reveal their limitations in fine-grained text-controlled generation and human-aligned perception and understanding, highlighting the significant potential of our dataset and benchmarks to advance research in human-centric video generation and understanding.

Abstract:
Multimodal hate speech detection targets offensive content expressed through combinations of modalities such as text and images, which often evade detection when analyzed separately. We introduce MAXplain, an interactive framework that addresses both issues via a configurable LLM-based multi-agent architecture. Specialized agents handle distinct subtasks and exchange information through structured dialogues, enabling intrinsic explainability and improved accuracy. The web interface supports human-in-the-loop interaction, including real-time adjustment of agent behaviors and evaluation rules. A browser plugin enables direct inspection of online content. While demonstrated for hate speech detection, MAXplain also supports rapid prototyping for other multimodal tasks.

Abstract:
Despite recent advances, deepfake detectors remain vulnerable to adversarial examples, particularly in diverse, real-world settings. We propose MIG-COW, a novel adversarial attack framework that generates highly generalizable and visually imperceptible adversarial examples. By combining momentum-integrated gradients with a consensus-orthogonal decomposition, MIG-COW captures both shared and model-specific vulnerabilities across heterogeneous CNN and ViT detectors. On the AADD-2025 Challenge benchmarks, MIG-COW achieves a 99.96% white-box attack success rate (ASR) with high perceptual similarity (SSIM), significantly outperforming existing baselines. However, its limited 7.16% ASR against official black-box targets, despite achieving the best overall score, highlights the ongoing challenge of transferability. We also demonstrate that incorporating low-performing but diverse models in the ensemble can degrade attack effectiveness, underscoring the need for careful surrogate model selection in real-world adversarial settings.

Abstract:
Multimodal Emotion Recognition (MER) has advanced significantly with the advent of Multimodal Large Language Models (MLLMs), which enable generative, descriptive understanding of complex human affect. However, the inherent "black-box" nature of these end-to-end models limits their trustworthiness and applicability in high-stakes domains, particularly due to their opacity in handling conflicting cross-modal cues (e.g., sarcasm). To address this critical gap, we propose Affective-CoT, a novel hierarchical framework that explicitly decouples perception from reasoning to achieve interpretable and faithful emotion analysis. Our framework utilizes specialized perception models to extract structured semantic evidence from raw audiovisual streams, which is then integrated and arbitrated by a central reasoning LLM executing a meticulously designed Cognitive Workflow. Critically, Affective-CoT generates a nuanced emotion description grounded in a transparent, human-interpretable reasoning trace. The efficacy of our framework was decisively validated by securing first place in the official MER-2025 Descriptive Emotion Understanding (DES) challenge. This result not only highlights the superiority of our method but also champions a new paradigm for building scrutable and trustworthy affective computing systems.

Abstract:
Asynchronous Video Interviews (AVIs) allow candidates to record responses to predefined questions using digital devices, offering both flexibility and remote accessibility. Assessing personality traits and interview performance via AVIs provides organizations with valuable insights into candidate profiles and facilitates the prediction of future job performance. However, prior benchmark challenges, whose datasets were predominantly sourced from social media, suffer from suboptimal construct and methodological validity, limiting their utility for model development and real-world applications. To address these limitations, we introduce the AVI Grand Challenge at ACM Multimedia 2025, featuring a novel dataset of mock AVIs comprising 3,876 videos from 646 participants in a simulated job application procedure. Interview questions were carefully designed to reflect real-world selection contexts and elicit personality expressions grounded in Trait Activation Theory. Personality traits and job competencies were annotated by trained evaluators and professional recruiters, ensuring both methodological rigor and ecological validity. The solutions and algorithms developed in this challenge are analyzed and summarized in this paper to foster the development of fair, reliable, and AI-driven hiring assessments.

Abstract:
Depression detection remains challenged by generalized modeling approaches that fail to account for individual heterogeneity. To address this, the Multimodal Personality-aware Depression Detection (MPDD) Challenge introduced personalized features into the modeling process, aiming to better capture individual variability. However, the baseline models still exhibit two critical limitations: the neglect of textual semantics embedded in audio, and inconsistent predictions for the same subject across tasks and samples. Motivated by these limitations, we introduce HOPE (Hierarchical fusion for Optimized and Personality-aware Estimation of Depression), a unified framework for consistent, subject-level depression estimation. HOPE first employs a Latent Semantic Projection (LSP) module to reconstruct textual semantics from audio features when transcripts are unavailable. It then introduces a consistency-aware integration mechanism that hierarchically fuses multi-branch predictions to resolve inter-task and inter-sample contradictions. HOPE achieved first place in the MPDD Challenge Young Track, demonstrating strong cross-modal learning capabilities and consistent, subject-level depression prediction.

Abstract:
Generating diverse and contextually appropriate facial reactions remains a significant challenge due to variability in individual responses, limited explainability, and insufficient modeling of contextual cues. In this study, we propose a multimodal framework that integrates behavioral memory, dynamic attention control, and cognitive style modeling to generate personalized and psychologically grounded facial reactions in dyadic interactions. Our method models the causal link between speaker behavior and listener response by incorporating frame-level behavioral cues, personality traits, and cognitive processing styles. The proposed system consists of three core components: a behavioral memory module that captures temporal context across conversation turns; a Personalized Personality Recognition Style (PPRS) module that infers cognitive tendencies via dual-path learning based on the Big Five personality traits; and a transformer-based generative module equipped with diffusion modeling and context-aware attention gating. This design enables the generation of expressive, individualized responses even during silence or scene transitions. We conduct extensive evaluations on the REACT2025 benchmark using the MARS dataset. Results show that our method outperforms state-of-the-art models in appropriateness (FRCorr ↑0.71), diversity (FRDiv ↑0.1405), and synchrony (FRSyn ↑47.77), ranking 1st in the offline track and 3rd in the online setting. These findings highlight the framework's effectiveness in simulating human-like, emotionally congruent reactions while offering interpretability grounded in personality psychology.

Abstract:
The 2025 Grand Challenge on Multimedia Verification addresses the challenges of verifying the authenticity and context of online multimedia content. Participants analyzed real-world cases of images and videos, assessed their sources, detected any potential manipulations, and submitted detailed verification reports. The main competition was structured into three stages: Training, Validation, and Real-World Verification, which included live test cases and welcomed solutions ranging from manual methods and OSINT practices to automated tools and novel AI techniques. A total of 32 teams from 11 countries registered. The submitted solutions presented diverse pipelines that integrated forensic analysis, multimodal reasoning, and large language models (LLMs). The results demonstrate significant progress toward semi-automated verification workflows, while also exposing challenges in scalability, consistency, and reliability. In this paper, we present the challenge design, datasets, evaluation criteria, and participant outcomes, providing insights to guide future research and practice in multimedia verification.

Abstract:
In recent years, Multimodal Sentiment Analysis (MSA) has attracted growing attention for its ability to interpret human emotions by integrating information across multiple modalities. Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) extends this research frontier by incorporating multi-party conversational contexts and requiring comprehensive extraction of sentiment elements. MCABSA presents substantial challenges, including the need to understand complex conversational contexts, integrate heterogeneous multimodal signals, and identify causal reasoning at the cognitive level. To address these challenges, we propose a two-stage Full Fine-tuning and LLM Post-processing (FLP) framework. In the first stage, we develop a multimodal caption-enhanced full fine-tuning pipeline that performs structured extraction of sextuples and sentiment flip tuples. The second stage introduces paraphrase-based sextuple verification to identify and filter low-quality sextuples for Panoptic Sentiment Sextuple Extraction (Task-1), while implementing trigger classification with a distribution alignment mechanism to determine trigger types for sentiment flipping and enhance output consistency for Sentiment Flipping Analysis (Task-2). Comprehensive experiments on both MCABSA challenge subtasks demonstrate the effectiveness of our approach, achieving 1st place on Task-1 and 3rd place on Task-2.

Abstract:
Recent advances in video action recognition have achieved remarkable performance in coarse-grained macro-action classification by leveraging large-scale visual backbones and transformer architectures. However, extending these successes to fine-grained micro-action recognition remains a fundamental challenge due to the subtlety, brevity, and low motion intensity of micro-actions. In this paper, we propose a high-capacity framework for micro-action recognition, enhancing both representation learning and decision robustness. We scale to large-scale backbones using the VideoMAEv2 Giant model, enabling the extraction of finer spatial-temporal features. A Temporal-Spatial Connector (TSC) is introduced to dynamically highlight discriminative temporal frames and spatial regions, strengthening the model's focus on subtle motion cues critical for micro-action identification. To stabilize optimization and fully exploit the capacity of large models, we design a four-phase progressive training strategy, encompassing linear probing, full fine-tuning, connector-specific optimization, and classifier head refinement. Furthermore, we propose a novel ensemble decision mechanism that integrates Top-K predictions from diverse models via a Large Language Model (LLM), enhancing prediction consistency and robustness through multimodel consensus. Our method achieves an F1mean of 76.54% on the MA-52 dataset, ranking 3rd in the 2025 Micro-Action Analysis Grand Challenge and advancing the state of the art in fine-grained video understanding.

Abstract:
The goal of this workshop is to showcase the latest advancements in generative AI (GAI) for creating, editing, restoring, and compressing rich media data, including images, videos, and 3D content. GAI models such as VAEs, GANs, and diffusion models have demonstrated remarkable impact in both academic research and industrial applications. For example, GAI enables users to design and generate synthetic yet realistic content without requiring professional artistic or technical expertise, driving significant market growth in gaming and entertainment. Beyond creative applications, GAI also provides crucial simulated data for training embodied AI agents. When applied to media restoration and synthesis, GAI techniques can further alleviate transmission challenges by offloading computation to client devices. To advance this field, the workshop will host four competition tracks using novel industry-level data, solicit high-quality paper submissions, and invite leading speakers from academia and industry to foster collaboration and innovation. In particular, the competition focuses on media generation and transmission with GAI. The first three tracks address reducing computation and transmission costs for efficient media delivery, while the fourth track focuses on controlled novel content creation. To support these challenges, a large-scale multi-modality, multi-view dataset named M3VIR is provided. This dataset comprises a diverse collection of videos simulated using the UE5 Unreal Engine, with carefully matched content serving as ground truth for the competition tasks.

Abstract:
Today, truly immersive multimedia systems demand the integration of emerging multi-sensorial media, which go beyond traditional audiovisual signals to include haptics, olfaction, motion capture, electroencephalograms, and other novel media forms. To effectively incorporate these modalities into cutting-edge multimedia systems, advances are needed across the entire pipeline, from processing and encoding to seamless integration. In addition, human-centric factors such as ergonomics and user experience must be considered to ensure practical implementation. Our workshop, the International Workshop on Multi-Sensorial Media and Applications (MSMA'2025), seeks to attract contributions related to multi-sensorial media systems, including system design, evaluation, coding, delivery, media analysis, multi-modal interaction, human factors, ergonomics, and related areas. By fostering collaboration among researchers, MSMA aims to bridge existing work in the field, spark innovation, and push the boundaries of multimedia technology.

Abstract:
Intelligent Document Processing (IDP) is critical for unlocking actionable insights from the vast volume of unstructured documents like invoices and medical reports, yet its promise is often unfulfilled as its implementation is typically hindered by significant technical barriers. Traditional IDP systems require deep expertise in programming, machine learning, and intricate model fine-tuning, creating a dependency on specialized data science teams. This effectively sidelines domain experts, the very individuals who possess the critical contextual understanding of the documents, thereby limiting the agility and accuracy of workflow automation. This paper introduces IDPFlow, a novel framework that unifies a no-code, user-centric interface with a sophisticated, tool-augmented agentic architecture for end-to-end multimodal document processing, empowering experts such as business analysts and legal professionals to independently build and deploy sophisticated workflows without writing any code. IDPFlow is built upon a powerful agentic architecture, which intelligently utilizes a versatile toolkit to execute a range of sophisticated IDP tasks. This toolkit enables a spectrum of high-precision IDP tasks such as multi-class document classification, document visual question answering (Doc-VQA), key information extraction from text, tables, and checkboxes, and long-document summarization. The core of IDPFlow is its dynamic agentic workflow, which redefines user interaction. Upon document upload, the agentic system instantly analyzes the content, classifying sub-documents and proactively suggesting a comprehensive data schema relevant to the use case, shifting the user's role from workflow builder to supervisor. This initial workflow is not static: it can be refined in real time through simple, conversational instructions, enabling true business agility.
Furthermore, the agentic intelligence extends to reusability, allowing existing workflows to be intelligently adapted for new, related tasks, dramatically reducing development time for subsequent use cases. For particularly complex tasks involving long or dense documents, the agentic system can leverage a specialized Multimodal Retrieval-Augmented Generation (MMRAG) pipeline to overcome the context window limitations of standard LLMs. This pipeline utilizes the ColPali model, which excels at generating unified multimodal embeddings, ensuring robust and accurate information retrieval from both textual content and embedded images or diagrams. To foster user trust and ensure verifiability, IDPFlow incorporates a grounded traceback citation mechanism that automatically highlights the precise document segments from which the agent derived its responses, making all outputs transparent and easily auditable.

Abstract:
This talk presents research on estimating health status from facial video analysis, focusing on non-invasive approaches to support well-being. The talk introduces robust estimation methods from facial videos for health- and well-being-related indicators and real-time systems capable of simultaneously estimating vital signs, including heart rate, respiration, blood oxygen level, blood pressure, and pulse variability. It also explains the feasibility of estimating states such as drowsiness and stress from facial videos. These techniques utilize widely available devices such as smartphones and PCs, enabling continuous and low-burden monitoring in daily life. By demonstrating the feasibility of vision-based health estimation across various contexts, this work highlights its potential for workplace well-being, preventive healthcare, and telemedicine applications. Ultimately, this research aims to develop solutions that leverage cutting-edge technology to promote well-being, contributing to a future where people can lead healthier and more fulfilling lives, both physically and mentally.

Abstract:
Data has become the foundation of knowledge, and many companies are growing interested in harnessing AI-based data analysis to unlock its value. The volume of digital data is increasing at an unprecedented pace: market research reports estimate that global data volume, approximately 12.5 zettabytes in 2014, will reach around 180 zettabytes by 2025. Extracting patterns and trends from such big data is crucial for enabling data-driven decision-making. However, a key challenge lies in the enormous computational costs required for large-scale analysis, due to the inherent complexities of the task. Approximate methods are often employed to reduce these costs, but they inevitably trade exactness for efficiency. To overcome this limitation, our research aims to develop a machine learning platform that delivers both speed and accuracy. The core of our platform is computational pruning.

Abstract:
Document Image Tampering Localization (DITL) has advanced considerably, yet achieving robust cross-dataset generalization remains a formidable challenge for practical applications. Expanding existing document datasets for training is labor-intensive, making it appealing to incorporate data from non-document domains such as natural scene images. However, domain-specific variations, including differences in color distribution and texture, compromise the performance of joint training. To address this issue, we propose DITL2, a Dual-stage Invariance Transfer Learning framework for Document Image Tampering Localization that consists of Cross-Domain Invariance Pre-training (CDIP) and Frequency Decoupling Parameter Adaptation (FDPA). In the pre-training stage, CDIP employs style transfer and texture consistency learning to suppress domain-specific influences from tampered natural scene images, and tampering trace commonality learning to acquire domain-invariant features. In the fine-tuning stage, FDPA adapts the parameters of the pre-trained model, leveraging the general knowledge from the pre-trained model to address DITL tasks while reducing the risk of overfitting. Experiments show that this approach effectively leverages external data resources to boost model performance, achieving state-of-the-art results across a variety of cross-dataset settings.

Abstract:
Scene flow estimation using 4D millimeter-wave radar has emerged as a prominent research focus for 3D dynamic perception. However, compared to LiDAR point clouds, the drastic sparsity of radar point clouds poses challenges in enforcing local rigidity constraints, which are crucial for accurate 3D motion estimation. To address this issue, we propose a novel Gaussian-based pseudo-point generation method that fully leverages two distinct yet complementary data modalities, 3D coordinates and Doppler velocity, to support multi-body rigidity assumptions, effectively capturing fine-grained and structured motion patterns from highly sparse radar point clouds. Furthermore, a velocity calibration mechanism is designed to improve the reliability of fine-grained rigid motion velocity estimation. In addition, a progressive fusion strategy is introduced to systematically integrate fine-grained rigid motion priors at multiple levels, enhancing the robustness of matching costs and motion features while effectively compensating for coarse flows. Experimental results on real-world radar scans from the View-of-Delft (VoD) dataset demonstrate the promising performance of our FGRFlow compared to other leading 4D radar-based approaches, validating the advantages of our design choices.

Abstract:
Dynamic point cloud-based human action recognition has garnered increasing attention due to its inherent advantages in privacy preservation and structural completeness. Current methods typically rely on nested point spatio-temporal convolutions to understand motion semantics in a bottom-up manner, which is intractable for capturing high-fidelity human dynamics disentangled from spatio-temporal interference. Motivated by this, designing a practical spatio-temporal factorization backbone is essential. However, the repeated coarsening of aggregated features along the spatial dimension often leads to the degradation of intrinsic geometric texture relations within point cloud data. Moreover, discretizing continuous visual data into isolated temporal hyperpoints significantly diminishes temporal continuity, resulting in the fragmentation of human action. To circumvent the above limitations, we propose a novel Geometry-Prior Cross-Frequency Interactive Fusion Network (Geo-CF2Net). Specifically, we investigate a Spatial-Geometry Pose Prior (SGPP) module, which compensates for pose information loss during spatial downsampling by explicitly modeling geometric constraints among neighboring points. In addition, we elaborate on a Temporal Motion Unit Interactive Coordination (TMIC) module to track the interactive composite semantics of low-frequency steady-state venations and high-frequency transient-state details within a high-dimensional pose evolution flow. Extensive experiments on three public benchmarks substantiate the superiority of Geo-CF2Net over state-of-the-art methods.

Abstract:
Camouflaged Object Segmentation (COS) seeks to accurately identify and segment objects that are intricately blended with their surroundings, making them challenging to distinguish at the pixel level. Existing COS methods often struggle to capture the subtle distinctions between targets and backgrounds, despite their improved adaptability to camouflaged objects. To address this challenge, we propose a novel Adaptive Camouflage Discrimination Network (ACDNet) that focuses more attention on object-relevant features while suppressing attached camouflage features. The proposed ACDNet has several merits. First, we design a gradient-based feature modulator that injects gradient information into channel-wise attention layers, thereby enhancing the discriminability between camouflaged objects and background features. Second, a hierarchical prompting strategy is introduced to endow the prototype-based classifier with target awareness and multi-level perception, mitigating the impact of camouflage diversity. Extensive experimental results on four benchmarks demonstrate that our ACDNet performs favorably against state-of-the-art COS methods.

Abstract:
Given that action evolution follows temporal progression, recent studies for Online Action Detection (OAD) and Online Action Anticipation (OAA) generally adopt forward temporal modeling to capture dependencies in observable video sequences. However, the strictly sequential nature of forward temporal modeling prevents subsequent frames from being used to enhance the earlier modeling process. In particular, the current frame, the last observable frame in the online video stream, serves as the direct visual cue for ongoing action recognition and the informative context for future action anticipation. As modeling errors accumulate over time, the resulting representations may progressively deviate from the actual semantics. Findings in cognitive neuroscience show that the hippocampus performs backward replay after observation to reinforce and correct the interpretation of previous observations. Inspired by this, we propose to incorporate backward temporal modeling following forward temporal modeling, enabling the model to leverage backward temporal modeling to enhance forward temporal modeling. Based on this idea, we propose a unified model for OAD and OAA, named Bidirectional Online Mamba (BiOMamba). Specifically, to address the excessive length and relevance imbalance in observable sequences, BiOMamba compresses distant long-term memory and preserves recent short-term memory. Then, BiOMamba sequentially models both forward and backward temporal dependencies in the whole memory. Finally, according to the temporal modeling result, BiOMamba generates representations for current and future actions. BiOMamba achieves state-of-the-art performance on THUMOS'14 (OAD: 73.3% mAP, OAA: 59.7% mAP) and TVSeries (OAD: 89.9% mcAP, OAA: 83.7% mcAP).
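
The idea of compressing distant memory while preserving recent memory can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how BiOMamba compresses memory, so average pooling, the function name `compress_memory`, and the parameters `short_len` and `pool` are hypothetical. Each frame feature is reduced to a single float to keep the example short.

```python
def compress_memory(frames, short_len=4, pool=3):
    """Keep the most recent `short_len` features unchanged and
    average-pool the older ones in groups of `pool` (assumed scheme)."""
    split = max(0, len(frames) - short_len)
    long_term, short_term = frames[:split], frames[split:]

    compressed = []
    for start in range(0, len(long_term), pool):
        chunk = long_term[start:start + pool]
        compressed.append(sum(chunk) / len(chunk))

    # Compressed long-term memory comes first, followed by the intact recent frames.
    return compressed + short_term


frames = [float(i) for i in range(10)]   # ten toy frame features, oldest first
memory = compress_memory(frames)
```

Here the six oldest features are pooled into two values and the four most recent ones are kept, so the sequence handed to the temporal model shrinks from ten entries to six.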

Abstract:
Domain adaptation, which bridges the domain gap between heterogeneous agents, has emerged as an effective solution to improve the perception capabilities of multi-agent systems. However, it may introduce backdoor vulnerabilities, as adversaries could exploit the collaborative process to propagate malicious features across agents, yet these threats remain largely unexplored. In this paper, we take the first step to study the backdoor attacks in this safety-critical scenario, with the 3D object detection task as the representative case. To this end, we propose BadMDA, the first backdoor attack tailored for the domain adaptation process to collapse multi-agent perception. Specifically, we first propose a gradient-suppression trigger optimization module to mitigate trigger distortion during the domain adaptation. By utilizing the optimizable additive triggers and minimizing gradient variations of triggered features induced by the domain adaptation, we reduce the transformation magnitude of triggered features, thereby maintaining the trigger effectiveness. Then, we propose a dual-gradient guided poisoning module to achieve clean-label poisoning in 3D object detection tasks. This module aligns training gradients with poisoned ones to learn malicious features, while enforcing the orthogonality between training and benign gradients. Consequently, the learned malicious features mislead the victim's finetuning updates, causing detection failures upon receiving triggered features while only slightly affecting the victim agent's model utility. Extensive experiments on various dominant domain adaptation methods show the superior attacking effectiveness and universality of BadMDA, underscoring the need for a more advanced defense.

Abstract:
Medical image synthesis is crucial in clinical workflows, enabling the generation of missing modalities from available imaging data. While recent diffusion-based models show promise in medical image synthesis, they face two key limitations: progressive distribution drift from coarse intermediate samples and structural granularity loss due to missing high-frequency constraints. To address these challenges, we propose Dual Diffusion Bridge (DualDB), a framework integrating implicit distribution alignment and explicit structural constraints within a unified diffusion bridge paradigm. First, implicit distribution alignment employs optimal transport-guided adversarial learning to minimize statistical discrepancies between intermediate and target distributions, mitigating global distribution drift. Second, explicit structural alignment applies gradient-driven constraints to preserve high-frequency anatomical features, preventing structural degradation during reverse diffusion. This complementary design ensures both global statistical consistency and local anatomical precision in the synthesized results. Extensive experiments on multi-contrast MRI and MRI-CT translation show that DualDB outperforms state-of-the-art methods in quantitative performance and visual fidelity, maintaining superior anatomical accuracy even under noisy conditions.

Abstract:
Multimodal models integrate visual, textual, and other data to achieve human-like understanding, but this fusion creates a conflict between cross-modal alignment and modality-specific expertise. The pursuit of unified feature spaces often undermines specialized knowledge in individual modalities, as shown by performance drops in unimodal tasks. To resolve this contradiction, we propose VAMP (Variational Alignment with Modality Preservation), a novel multimodal framework featuring a Dynamic Feature Diversion mechanism that partitions modal representations into two components: one preserving modality-specific expertise and the other enabling cross-modal alignment. Inspired by Variational Canonical Correlation Analysis, we introduce a shared space projection layer that maps features into a common representational space while preserving modality-specific characteristics. We further implement a Progressive Training Strategy that sequentially freezes different components before full fine-tuning, preventing mode collapse and enhancing generalization capabilities. Experimental results demonstrate VAMP's significant performance improvements across zero-shot image classification, cross-modal retrieval, and visual question answering, while simultaneously outperforming baseline models on unimodal tasks. This research provides an engineered solution to the "knowledge dilution" problem in cross-modal alignment.

Abstract:
Multimodal feature fusion, by integrating the complementary information from each modality, can effectively capture complex features in real-world data. However, in many use cases, such as boiler combustion monitoring, factors including equipment failure, inconsistent sensor sampling frequencies, and network delays often cause data collected from different modalities to suffer from missing modality and temporal asynchrony. This leads to the incompleteness and disorderliness of multimodal data. To address these issues, previous studies have proposed several data fusion methods that align the cluster centers before fusion. However, these approaches have two key limitations: 1) they do not guarantee a high alignment accuracy of data pairs at the sample level, and 2) they do not address the issue of significant discrepancies in data sizes across different classes, which impacts the subsequent data fusion performance.

Abstract:
Cross-domain graph anomaly detection (GAD) aims to identify nodes that significantly deviate from normal patterns in unseen target domains, showing great potential in applications such as multimedia content security and financial risk control. However, existing methods often rely on semantic information trained on individual datasets, which makes it difficult to capture node commonalities across domains and limits generalization in complex multimedia environments. To address these challenges, we propose Zero-GAD, a universal Zero-shot Graph Anomaly Detection framework tailored for cross-domain scenarios. Zero-GAD leverages a novel de-semanticized strategy to train a unified detection model that can be directly applied to unseen domains without retraining or fine-tuning. The framework is built upon two key components: (1) a Global Information Unification Module, which projects graph data into the spectral domain and performs normalization to align the energy distribution in the frequency space; and (2) a Node-Neutralized Discrepancy Scoring Module that leverages the discrepancy between the original and reconstructed node representations to produce effective anomaly scores. Extensive experiments show that Zero-GAD achieves superior accuracy compared to existing models under a GAD setting.

Abstract:
In recent years, deep multi-view clustering methods based on contrastive learning have gained significant attention. Most existing approaches treat the anchor and its cross-view representations as positive pairs, while the anchor and other samples are considered negative pairs. However, this pairwise assignment does not account for the higher similarity between the anchor and samples that are close, which should also be treated as positive pairs. To address this issue, we introduce topology-aware positive sampling: for each anchor, both its intra-view neighbors and cross-view consistent neighbors are selected as additional positive samples, which aligns contrastive learning with the homophily principle of clustering. Additionally, to obtain reliable neighbor relationships, most existing methods construct graphs from the original data or extracted features and average them to form a consensus graph. However, this approach overlooks the fact that views of varying quality should be assigned different weights, and unreliable connections within a view should be discarded. To overcome this, we design a global-guided weak connections suppression mechanism to weaken unreliable connections in the initial graph of each view, then apply weighted graph fusion to obtain a more accurate consensus graph. We also combine the view weights from graph fusion with the corresponding view's neighbor contrastive loss to enhance consistency between the two processes. Extensive experimental results demonstrate the superiority of our proposed method.
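
The topology-aware positive sampling described above can be illustrated with a small sketch. It is not the authors' implementation: the cosine k-nearest-neighbor graph, the function names `knn_mask` and `positive_masks`, and the reading of "cross-view consistent neighbors" as samples that are neighbors of the anchor in both views are all assumptions made for illustration.

```python
import numpy as np

def knn_mask(features, k):
    """Boolean mask: mask[i, j] is True if j is among the k nearest
    neighbors of i under cosine similarity (self excluded)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # never pick the anchor itself
    idx = np.argsort(-sim, axis=1)[:, :k]     # top-k most similar samples
    mask = np.zeros(sim.shape, dtype=bool)
    rows = np.arange(sim.shape[0])[:, None]
    mask[rows, idx] = True
    return mask

def positive_masks(view_a, view_b, k):
    """Return (intra, cross) positive masks for anchors taken from view A.

    intra[i, j]: sample j in view A is an intra-view neighbor of anchor i.
    cross[i, j]: sample j in view B is a positive for anchor i, either
                 because j == i (the usual cross-view pair) or because j
                 is a neighbor of i in both views (assumed definition of
                 a cross-view consistent neighbor).
    """
    nn_a = knn_mask(view_a, k)
    nn_b = knn_mask(view_b, k)
    n = view_a.shape[0]
    cross = np.eye(n, dtype=bool) | (nn_a & nn_b)
    return nn_a, cross
```

These masks would then select the numerator terms of a contrastive loss; every pair left unmarked is treated as a negative.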

Abstract:
Model Diagram-to-Code Generation aims to translate model diagrams from research papers into implementation code that reconstructs the model's architecture. This task plays a crucial role in accelerating scientific workflows and enhancing the efficiency of industrial model deployment. While recent studies have explored various Image-to-Code Generation tasks using Multimodal Large Language Models (MLLMs), these efforts have primarily focused on reconstructing the visual appearance depicted in input images, leaving this task largely underexplored. The complex structural elements and implicit relationships in model diagrams present greater challenges for MLLMs, particularly in terms of visual reasoning and semantic interpretation. To support this task, we introduce MDCDataset, a dataset designed to evaluate the ability of MLLMs to generate code from model diagrams. It comprises 1,008 instances spanning 16 research domains, each with a model diagram, structured textual content, and the ground-truth code implementation. Furthermore, to address the inherent challenges of this task, we propose MDCAgent, a collaborative multi-agent framework composed of Parsing, Generation, and Check Agents. These agents work in coordination to analyze, extract, and verify complex elements and implicit relationships within model diagrams, thereby enhancing the visual architecture-aware reasoning capabilities of MLLMs. Our extensive experiments confirm the effectiveness of the framework.

Abstract:
Multimodal federated learning (MFL) focuses on integrating distributed multimodal data from different clients to improve feature representation while preserving user privacy, and has gained popularity in medical image analysis. Existing research treats communication efficiency and generalization ability as independent optimization objectives, and therefore fails to balance model generalization against client resource constraints. To address this challenge, we propose a lightweight Federated learning method driven by Mid-Frequency consensus (FedMFD), aiming to efficiently generalize multimodal medical image segmentation tasks while reducing communication costs. For communication efficiency, client-side images are converted via the Discrete Cosine Transform (DCT) from the spatial domain into frequency-domain coefficients, with only the mid-frequency components selected for transmission. For generalization capability, we adopt the Earth Mover's Distance (EMD) to quantify the maximum similarity of mid-frequency features between clients, generating a global frequency consensus with the optimal transport plan. Guided by the global frequency consensus, client-side structural and detailed representations are reconstructed to improve segmentation generalization in the presence of modality shifts and background noise. Extensive experiments across Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) modality datasets have verified the superiority of FedMFD against its competitors.
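
The mid-frequency selection step can be sketched as follows. This is an illustrative example rather than the FedMFD code: the band is defined here by the index sum u + v of each 2D DCT coefficient, and the names `dct_matrix`, `mid_frequency_coeffs`, `reconstruct`, `low`, and `high` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]                 # frequency index
    i = np.arange(n)[None, :]                 # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)                # DC row has its own scaling
    return m

def mid_frequency_coeffs(image, low, high):
    """2D DCT of `image`, keeping only coefficients with low <= u + v < high.

    Returns the masked coefficient array and the boolean band mask.
    """
    n_rows, n_cols = image.shape
    coeffs = dct_matrix(n_rows) @ image @ dct_matrix(n_cols).T
    u = np.arange(n_rows)[:, None]
    v = np.arange(n_cols)[None, :]
    band = (u + v >= low) & (u + v < high)    # assumed band definition
    return coeffs * band, band

def reconstruct(coeffs):
    """Inverse 2D DCT (the basis is orthonormal, so the inverse is the transpose)."""
    n_rows, n_cols = coeffs.shape
    return dct_matrix(n_rows).T @ coeffs @ dct_matrix(n_cols)
```

Only the entries selected by `band` would need to be transmitted, which is where the communication saving would come from in this simplified picture.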

Abstract:
In recent years, graph-based multi-view learning has received widespread attention for its ability to utilize data dependencies to capture more comprehensive information from ubiquitous multi-view data. However, with the increase in data size and complexity, Euclidean space struggles to capture the hierarchical and exponentially expanding relationships of multi-view data in limited dimensions, leading to embedding distortion and insufficient cross-view alignment and interaction. To this end, we propose a hyperbolic multi-view heat diffusion method. Firstly, we utilize the negative curvature advantage of the hyperbolic space to construct a graph representation for each view separately, so that each view can still retain its hierarchical structure in relatively low dimensions. Then we construct a graph heat diffusion process on hyperbolic manifolds to ensure that each view is locally smoothed and globally aggregated while achieving semantic consistency through virtual views. We show that the method can be interpreted as a Riemannian gradient descent process for collaborative learning on hyperbolic manifolds, which not only effectively fuses multimodal information, but also significantly enhances the interaction and unified representation among different views. Experimental results show that the proposed framework achieves excellent performance in a variety of multi-view scenarios.

Abstract:
The rapid proliferation of multi-view data has necessitated robust and scalable clustering techniques capable of capturing complex, high-dimensional patterns. While Multi-view Bipartite Graph Clustering (MVBGC) has shown promising results, existing approaches often overlook that the generated bipartite graph is susceptible to disturbances from complex structures and noise. To address these challenges, we propose RTGD-MVC, a novel framework for Robust Tensor Learning with Graph Diffusion tailored for efficient and scalable multi-view graph clustering. RTGD-MVC integrates a graph diffusion mechanism to suppress noise propagation and employs cross-view diffusion to enhance global consistency while capturing complementary information across views. Additionally, a non-convex Tensor Exponential Norm (TEN) is introduced as a tighter surrogate for the tensor rank, enabling the learning of more discriminative and noise-robust representations. By embedding these components into a unified optimization model with linear computational complexity, RTGD-MVC achieves both theoretical efficiency and practical scalability. Extensive experiments on diverse benchmark datasets demonstrate that RTGD-MVC significantly outperforms state-of-the-art methods, highlighting its superior ability to capture intricate multi-view correlations and structural patterns.

Abstract:
Multi-view clustering plays a pivotal role in remote sensing image analysis, where graph neural network-based methods have demonstrated remarkable potential by modeling data as graphs. However, existing efforts, which construct remote sensing graphs using fixed rules (e.g., K-nearest neighbors), inevitably introduce noisy edges and increase the risk of heterogeneous information diffusion, leading to inferior clustering performance. Although recent works attempt to address this issue by refining the structure, they are designed for single-view data and struggle to extend to multi-view scenarios. To bridge this gap, we propose a dual structure awareness multi-view graph clustering method named DSMVGC, which generates two distinct structures for each view through explicit and implicit perspectives. Specifically, in our method, the learning processes of structure refinement and clustering are alternately optimized to mutually enhance each other. On one hand, the explicit structure updates the topology based on inter-cluster relationships, while the implicit structure captures latent relationships not covered by the explicit structure through adversarial learning. On the other hand, the refined structures not only facilitate homogeneous message passing but also serve as prior knowledge to guide the contrastive loss, thereby enhancing the discriminability of representations for accurate clustering. Extensive experiments on five multi-view remote sensing datasets validate the effectiveness of DSMVGC.

Abstract:
Link prediction serves as a fundamental task in graph-based applications, where graph neural networks (GNNs) are extensively applied to estimate node connectivity likelihood. However, GNN-based methods designed for homogeneous graphs encounter a semantic mixing issue on heterogeneous graphs. Previous works leverage disentanglement-based models to separate the semantic information into different factors and conduct message passing for link prediction. However, these models suffer from information loss and inadequate expression, which harms link prediction performance. To address these limitations, we propose a language-assisted semantic-aware disentangled method for link prediction on heterogeneous graphs. First, we employ a factor-wise attention mechanism to reduce the information loss caused by the disentangled model. Specifically, we design a factor selection strategy to select disentangled factors and combine them to utilize more semantic information. Second, a language-graph learning method is developed to enhance contextual expression by fusing the features of nodes and edge textual information. Extensive experiments show that the proposed method outperforms existing state-of-the-art baselines.

Abstract:
Automated choreography generation, which aims to seamlessly harmonize human movements with music, is a multifaceted challenge demanding both technical precision and artistic expressiveness. We present M2PE-DIFF, a novel framework for generating human dance videos conditioned on a reference image and music sequence using a latent diffusion model. Our approach integrates a Music-to-Pose Encoder (M2PEnc), trained with a novel synthetic dataset generation pipeline (SDGPip), which maps audio features into structured 3D pose and shape parameters that capture human geometry and dynamic motion patterns synchronized with musical input. By combining these encoded parameters with a reference image through a multi-level attention mechanism within the latent diffusion framework, we synthesize visually coherent and rhythmically synchronized dance animations of individuals depicted in the given reference image. Experiments on benchmark datasets demonstrate that M2PE-DIFF achieves state-of-the-art performance, producing high-quality dance videos that accurately reflect pose diversity and temporal consistency. Additionally, our method exhibits robust generalization capabilities, validated by its strong performance on a newly introduced in-the-wild dataset.

Abstract:
Remote physiological measurement enables the capture of vital signals in a non-contact way, which offers significant potential for various applications. These signals are monitored through video cameras or radio frequency (RF) sensors, with a few recent methods attempting to fuse both sources to leverage complementary patterns for enhanced accuracy. However, the two modalities operate on distinct principles: video-based methods detect subtle facial color changes from blood volume variations, while RF-based methods capture subtle body vibrations due to heartbeats. In practical applications, they may encounter interference on different occasions. Treating these modalities as equally reliable in all situations can lead to suboptimal fusion. To address this issue, we propose an evidential video-RF fusion framework for robust remote physiological signal measurement. We design an uncertainty regression head for each modality, which estimates uncertainty features together with the corresponding physiological signal in each branch. Then an evidential multi-modal fusion module is employed to dynamically fuse the two modalities according to their uncertainty. Extensive experiments carried out on public and self-collected datasets show that the proposed method not only achieves superior fusion performance on easy data collected in well-controlled environments, but also generalizes well to unseen data representing challenging practical conditions in which one or both sensors are disturbed.
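
The idea of fusing two branches according to their uncertainty can be illustrated with a much simpler stand-in. The sketch below uses plain inverse-variance weighting, not the evidential fusion module of the paper; the function name `fuse_by_uncertainty` and the assumption that each branch outputs an estimate plus a variance-like uncertainty are hypothetical.

```python
import numpy as np

def fuse_by_uncertainty(video_est, video_var, rf_est, rf_var, eps=1e-8):
    """Inverse-variance weighted fusion of two per-branch estimates.

    A branch with high predicted uncertainty (variance) receives a low
    weight, so a disturbed sensor contributes less to the fused signal.
    Inputs may be scalars or NumPy arrays of matching shape.
    """
    w_video = 1.0 / (np.asarray(video_var, dtype=float) + eps)
    w_rf = 1.0 / (np.asarray(rf_var, dtype=float) + eps)
    fused = (w_video * video_est + w_rf * rf_est) / (w_video + w_rf)
    fused_var = 1.0 / (w_video + w_rf)        # variance of the fused estimate
    return fused, fused_var
```

With equal uncertainties the result is the plain average of the two estimates; when one branch reports a very large variance, the fused value stays close to the other branch.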

Abstract:
The rise of multimodal fake news threatens reliable information dissemination by exploiting multiple modalities to create deceptive, engaging content, significantly impacting societal safety. Existing methods still face challenges in cross-modal alignment (e.g., semantic inconsistencies, complex visual-semantic relations) and are vulnerable to low-quality or noisy samples. To address these issues, we propose Cross-Modal Alignment with Visual Reasoning Prompting (CMA-VRP) for multimodal fake news detection. Specifically, we model text and image entities with graphs to capture fine-grained semantic interactions and enhance cross-modal consistency through graph contrastive learning. Unlike methods relying on shallow image features (e.g., edges, textures), we leverage large language models (LLMs) and large vision-language models (LVLMs) to capture deep visual-semantic attributes related to reasoning (e.g., actions, scenes). Based on graph modeling and visual reasoning features, we perform graph-based cross-modal semantic fusion to unify textual and visual representations and cross-modal cycle alignment to align modality distributions by reducing semantic discrepancies, filtering modality-specific noise, and extracting invariant representations across domains. These steps enable the model to obtain semantically consistent and modality-invariant features. Extensive experiments demonstrate that our model outperforms existing methods in multimodal fake news detection and shows strong robustness against noisy samples.

Abstract:
Existing Text-Image Person Retrieval (TIPR) methods have made substantial progress in modeling cross-modal associations via contrastive learning frameworks, but usually ignore the fine-grained differences in semantic relevance among different samples, which limits retrieval accuracy. To address this problem, we propose a novel Hierarchical Cross-modal Association framework HCA, which leverages the intra-modal fine-grained semantic relations distilled by single-modal pretrained models to constrain hierarchical cross-modal association between image and text modalities, for accurate TIPR. Specifically, to model hierarchical cross-modal semantic relationships, we propose a Hierarchical Relevance Matching (HRM) module. It partitions the matching strength of image-text pairs by jointly considering identity labels and cross-modal similarity, collaborating with unimodal similarity to construct a hierarchical relevance distribution that serves as a soft supervision signal. HRM not only helps the model better capture varying levels of semantic consistency between image-text pairs but also enhances the overall accuracy of cross-modal association learning. To enhance the ability to capture fine-grained cross-modal semantic relationships, we introduce an Image-guided Ambiguous text Token Modeling (IATM) module. It replaces original tokens with semantically ambiguous ones and leverages image guidance to detect and correct these tokens. This process further improves the fine-grained semantic alignment between images and texts. Experimental results demonstrate that HCA achieves new state-of-the-art performance across multiple datasets, thoroughly validating its effectiveness and advancement in cross-modal retrieval tasks.

Abstract:
Spiking Neural Networks (SNNs), as a next-generation neural network technology, use binary spike signals as carriers of information. They offer advantages such as low energy consumption, low computational complexity, and high information transmission rates. However, deep SNNs suffer from the gradient vanishing problem due to issues like the non-differentiability of the step function and neuron dormancy. To address this gradient problem, we first propose the MCLIF neuron, which optimizes the backpropagation mechanism and compensates for gradient information from both temporal and spatial dimensions. Furthermore, we design a spiking attention mechanism tailored to the temporal characteristics of SNNs. By introducing the QK Memory to embed temporal features, we make full use of information from different time steps. Additionally, we propose a gradient correction module to enhance the model's representational power from both temporal and spatial dimensions. The proposed SGM-Transformer achieves state-of-the-art (SOTA) performance in image classification tasks such as CIFAR10, CIFAR100, and CIFAR10-DVS, and also excels in industrial defect classification scenarios. The code will be made available after the paper is accepted.

Abstract:
Multi-label image classification requires simultaneously recognizing multiple objects with complex interdependencies. While existing attention-based methods are prominent, their performance is hampered by two forms of representation entanglement: 1) Spatial entanglement, where contextual interference from backgrounds and co-occurring objects confuses specific object representations; 2) Semantic entanglement, where models overfit label co-occurrence priors, thereby impairing a genuine semantic understanding of the image. To address these challenges, we propose an Object-Purified Representation Learning framework. Concretely, for spatial entanglement, we propose the Spatial-wise Representation Purification Module that employs Spatial-Purified Attention to eliminate object-irrelevant feature activations for contextual interference reduction, combined with Spatial-Aware Supervision to enhance object perception capability. For semantic entanglement, we develop the Semantic-wise Association Purification Module that synergistically integrates our proposed average message with the original co-occurrence-based message. This design effectively models co-occurrence relationships while preventing their overemphasis. Furthermore, we design the Bidirectional Representation Refinement Module to efficiently enhance representations, further boosting classification performance. Extensive experiments on multiple benchmark datasets with different configurations demonstrate that our proposed method achieves state-of-the-art performance.

Abstract:
Hallucination remains a significant challenge that constrains the development of large vision-language models (LVLMs). Therefore, reliable hallucination detection has become a critical step in LVLM evaluation and real-world deployment. Many previous studies have explored hallucination detection in LVLMs, with uncertainty-based approaches being widely adopted due to their independence from external tools and relatively low resource consumption. However, we observe that uncertainty does not always completely correlate with hallucination. Therefore, uncertainty-based methods may fail in certain cases, such as instances exhibiting high uncertainty but no hallucination. To address this issue, we propose a framework called Visual Perception Uncertainty Learning (VisPUL) for hallucination detection in LVLMs. Specifically, VisPUL integrates visual information into uncertainty learning directly, allowing it to capture uncertainty and visual-text consistency simultaneously. VisPUL addresses the insufficiency of uncertainty methods that rely only on text output, providing enhanced generalizability and reliability. Extensive experiments are conducted on the M-HalDetect and POPE datasets, covering both open-ended and yes-or-no tasks. Experimental results demonstrate that VisPUL significantly outperforms several strong baseline methods across different LVLMs.

Abstract:
Visual reasoning, exemplified by Raven's Progressive Matrices (RPM), remains a significant challenge in artificial intelligence. A critical difficulty lies in disentangling abstract relational rules from image-specific features, as these rules operate independently of visual appearances. To address this challenge, we propose the Cognitive Predictive Coding Network (CPCN), inspired by predictive coding theory from cognitive science. CPCN features a three-component architecture: a Relation Disentangler that separates abstract rules from image-specific features through prediction error minimization; Stacked Free Energy Minimizers (FEMs) that leverage energy minimization principles to progressively reduce uncertainty during hierarchical abstraction; and a classifier for solution identification. Unlike previous approaches, our model employs mutual information constraints to explicitly separate relation-relevant and relation-irrelevant features, enabling more robust pattern recognition. Our novel FEMs provide a principled approach to uncertainty reduction through iterative refinement of pattern understanding. Experiments demonstrate CPCN's superior performance across multiple benchmarks (such as 98.9% on RAVEN-Fair) with state-of-the-art 59.7% average accuracy across all PGM subtasks.

Abstract:
Spatio-Temporal Video Grounding (STVG) aims to localize spatio-temporal tubes of specific objects or actions within videos based on textual queries. Despite significant progress, existing methods struggle to generalize effectively to real-world scenarios due to the limited quantity and diversity of annotated data. In this paper, we introduce RealVG, a robust and training-free pipeline that leverages powerful Multimodal Large Language Models (MLLMs) through question-answering to tackle STVG in the wild. To address the challenges posed by complex real-world videos and queries, we propose a spatio-temporal decoupling module and a query-guided visual token filter to decompose intricate scenes and refine target-oriented perception, enhancing the robustness and adaptability of MLLMs. Specifically, the spatio-temporal decoupling module breaks down videos and queries into simpler sub-scenes and sub-queries, reducing complexity and promoting a precise understanding of static visual elements. Meanwhile, the query-guided visual token filter eliminates irrelevant tokens, sharpening focus on the target object and improving short-range action perception. Experimental results demonstrate that RealVG achieves superior performance over state-of-the-art supervised and weakly supervised methods in real-world settings, despite requiring no STVG data for training.

Abstract:
Multi-domain Task Incremental Learning (MTIL) aims to continuously acquire knowledge from diverse domains while maintaining generalization capability. Recent works have demonstrated promising results by leveraging Vision-Language Models (VLMs) for continual learning. However, we identify a critical issue within this paradigm, termed the Evolving Semantic Entanglement. Specifically, VLMs tend to produce highly similar text features for semantically related categories, resulting in hard feature alignment and interference between related categories. This problem becomes increasingly pronounced as the category space expands. In this paper, we present a novel Dual-granularity Prompt Learning (DuPLe) framework to address this challenge. Our approach enhances text feature discriminability by leveraging complementary two-level prompts: category-level global prompts for holistic semantic concepts and attribute-level local prompts for fine-grained visual patterns. We further apply a Task Assignment-free Inference strategy that eliminates explicit task identification, simplifying the inference process and enabling extension to unseen categories. Extensive experiments on 11 diverse domains under MTIL and X-TAIL settings demonstrate that our method significantly mitigates the entanglement issue and outperforms previous state-of-the-art approaches.

Abstract:
Referring Expression Counting (REC) is an emerging task that aims to count specific objects in images based on textual phrases describing their attributes and categories. While current REC baselines inherit architectures from pre-trained open-vocabulary object detectors and demonstrate promising counting and localization capabilities, they overlook critical limitations in the original single-decoder design with shared object queries. This architectural constraint entangles the semantic and localization perception processes, hindering fine-grained understanding of attribute-aware visual features. To address these challenges, we propose DCount, a decoupled counting framework comprising two innovative components: a Decoupled Dual-Decoder (DDD) module and an Attribute Semantic Discriminator (ASD) module. The DDD module separates spatial perception tasks by employing distinct semantic and localization decoders with task-specific object queries, thereby enhancing the capture of discriminative visual features. Building upon the positional and semantic feedback from DDD, the ASD module introduces a two-stage filtering strategy to explicitly mine challenging hard negative attribute samples in the visual domain, while synergistically refining attribute discrimination across both modalities through contrastive learning in the textual domain. Our method achieves state-of-the-art results on both the REC and Zero-Shot Object Counting (ZSOC) benchmarks.

Abstract:
Micro-expressions (MEs) are involuntary facial expressions that reveal genuine emotions and have significant applications in fields such as psychology, security, and human-computer interaction. However, previous ME datasets are mainly collected in controlled laboratory environments, with constraints such as fixed views, single illumination and head movements, limited subjects, and a lack of background. Significant gaps remain between these datasets and the real world. To handle this issue, we introduce a novel Natural Micro-Expression (NaME) dataset, a natural dataset collected under unconstrained real-world conditions. It encompasses (1) diverse subjects, multiple views, and varying head movements; and (2) rich background information, providing a more realistic benchmark for micro-expression recognition (MER) research. Furthermore, we propose a MER benchmark for natural environments, named MixFormer. MixFormer includes an efficient sparse attention mechanism to capture subtle facial motions from various factors, and a face-background mix-of-attention module to model the environmental context to help MER. Extensive experiments are conducted to analyze our NaME dataset and benchmark. We believe that our dataset and benchmark will pave the way for future research in MER beyond controlled settings, facilitating the deployment of MER in practical applications. NaME is available at github.com/real-ljt/NAMEdataset.

Abstract:
Emotion recognition based on electroencephalogram (EEG) aims to recognize emotional states for improving user experience in Human-Computer Interaction, often using subjects' responses as labels. Physiological signals in widely used datasets (i.e., SEED, SEED-IV, and DEAP) are collected while subjects watch different types of movies as video stimuli. As a result, when subjects' emotional states change from one to another during a single video stimulus, two challenges are inevitable for reliable emotion recognition due to dynamic emotional fluctuations: (1) inaccurate annotation of EEG data; (2) feature confusion at classifier boundaries from similar emotional states (e.g., low-intensity happiness and neutral). However, previous studies have not given sufficient attention to the impact of dynamic emotional fluctuations, leading to unreliable emotion recognition, especially in cross-subject emotion recognition scenarios. In this paper, we propose a Prototypes Collaborative Learning with Consistency Awareness (PCLCA) method to improve the reliability of cross-subject emotion recognition by introducing prototype learning. Specifically, a consistency awareness mechanism is designed to compute the consistency between labels and actual emotional states. Furthermore, a prototype collaborative strategy is adopted to adaptively estimate the uncertainty of model predictions by computing the similarity between features and prototypes. Extensive experiments on three benchmark datasets demonstrate that PCLCA effectively alleviates label noise and reduces uncertain predictions, outperforming existing baseline models.

Abstract:
Crowd simulation is crucial for urban planning, traffic management, public safety, and immersive environments. A fundamental challenge is capturing adaptive human behaviors that evolve dynamically with social interactions and task demands. Recently, physics-informed neural networks (PINNs) have integrated interpretable physics-based models with flexible data-driven learning, significantly enhancing simulation realism. However, current PINN-based methods typically rely on rigid representations of pedestrian perceptions and static task priorities in motion planning, limiting their ability to capture real-world social complexities and behavioral adaptability. To this end, we introduce SA-PINN, a novel Self-Adaptive Physics-Informed Neural Network specifically designed for modeling adaptive crowd behaviors. SA-PINN features two innovative adaptive modules: a self-adaptive social perception module, guided by a visual-field physics model to capture context-dependent social interactions dynamically; and a self-adaptive multi-task PINN training module, automatically balancing key motion objectives such as goal-reaching, collision avoidance, and alignment with real data. By jointly enabling perception-level and task-level adaptations within a unified physics-informed framework, SA-PINN generates highly realistic and physically consistent crowd simulations across diverse environmental contexts. Comprehensive evaluations on three real-world datasets (Lane, Cross 90, and GC) reveal that SA-PINN achieves a 29.7% gain in microscopic trajectory accuracy and enhances macroscopic density similarity by 23.5% compared to the best-performing baselines.

Abstract:
Video Moment Retrieval (VMR) aims to localize specific temporal segments in untrimmed videos that correspond to given natural language queries. However, existing proposal-based methods often fail to effectively model inter-proposal relationships and typically involve large parameter overheads. To address these issues, we propose a Lightweight Relational Proposal Network (LRPN) for efficient video moment retrieval. LRPN adopts a dual-branch slow-transfer distillation framework comprising a teacher and a student branch, reflecting the real-world characteristics of both roles. Specifically, we first introduce a semantic relation-aware module that mines relationships between video snippets and queries. In addition, in the teacher branch, we design a knowledge-enhanced relational module to leverage the teacher's knowledge capacity for modeling proposal relationships. In contrast, the student branch incorporates a compact relational modeling module, enabling efficient proposal relationship modeling with fewer parameters to meet the demand for rapid inference. Extensive experiments on TACoS, ActivityNet-Captions, and Charades-STA demonstrate that LRPN achieves state-of-the-art performance while maintaining a highly compact model design.

Abstract:
Recent advancements in generative models have positioned them as one of the principal tools for sequential recommendation due to their exceptional sample diversity and generalization capabilities. Among these, diffusion model-based sequential recommenders have achieved remarkable success. However, most existing approaches still face critical challenges, resulting in suboptimal generation quality: (1) They fail to leverage multimodal knowledge for constructing item representations with well-structured distributional characteristics and semantically enriched information; (2) They predominantly rely on discrete diffusion processes, leading to high error accumulation, reduced time efficiency, and constrained controllability in generative sampling. To mitigate these challenges, we propose LSGM4Rec, a novel framework that integrates Large Language Models (LLMs) with advanced multimodal encoding models to establish multimodal fusion embeddings for items. This design ensures distinct distributional characteristics while enabling the incorporation of semantically rich modal features into the guidance condition. Furthermore, we pioneer the use of stochastic differential equations (SDEs) for recommendation, facilitating smooth transitions between data distributions and enabling an optimal trade-off between sampling efficiency and generation quality. Extensive experiments on three datasets demonstrate that LSGM4Rec outperforms existing state-of-the-art sequential recommendation methods.
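
To make the contrast with discrete diffusion concrete, the sketch below simulates a continuous-time forward noising process with the Euler-Maruyama method. The abstract does not state which SDE LSGM4Rec uses; the variance-preserving form dx = -0.5 β(t) x dt + sqrt(β(t)) dW, the linear β schedule, and the names `beta` and `forward_sde` are assumptions chosen only because they are a common choice in score-based generative modeling.

```python
import math
import random


def beta(t, beta_min=0.1, beta_max=20.0):
    """Linear noise schedule on t in [0, 1] (assumed values)."""
    return beta_min + t * (beta_max - beta_min)


def forward_sde(x0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of a variance-preserving forward SDE.

    x0 is a list of floats standing in for an item embedding. The step
    count can be chosen freely, which is where the trade-off between
    sampling efficiency and quality comes from.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = list(x0)
    for i in range(n_steps):
        b = beta(i * dt)
        # Drift pulls the state toward zero; diffusion adds Gaussian noise.
        x = [
            xi - 0.5 * b * xi * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
            for xi in x
        ]
    return x
```

With this schedule the state at t = 1 is approximately standard Gaussian regardless of the starting embedding, so a learned reverse process can start from pure noise.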

Abstract:
The rapid evolution of the online fashion industry has intensified the demand for interactive fashion retrieval systems capable of precise and flexible searches based on user-specified attribute modifications. However, prevailing fashion retrieval methods often overlook the distinctive distributional properties of fashion images and struggle to preserve semantic consistency during attribute manipulation. To address these limitations, we propose DiSCo, a novel disentangled attribute manipulation retrieval framework via semantic reconstruction and consistency regularization. Our approach comprises three key components: (1) An attribute-aware manipulation network that constructs target fashion embeddings through cross-modal attribute modification deltas, leveraging dedicated fashion attribute encoders; (2) A cross-modal semantic reconstruction network that synthesizes target images directly from modified attribute descriptions, supervised by adversarial and attribute classification losses to ensure interpretable edits; (3) An adaptive fusion mechanism that dynamically integrates attribute-modified embeddings with reconstructed image features. Extensive evaluations on two benchmark datasets (DeepFashion and Shopping100K) demonstrate that DiSCo achieves superior retrieval accuracy over state-of-the-art methods while maintaining high-fidelity editing. Quantitative and qualitative analyses further confirm that DiSCo generates more realistic fashion representations, underscoring its effectiveness in attribute-aware retrieval tasks.

Abstract:
Three-dimensional (3D) mapping is vital in modern remote sensing. Satellites provide map data but are limited by cloud cover, especially during natural disasters (e.g., earthquakes, tsunamis), where rapid response is crucial and damaged infrastructure often renders digital maps unusable. Although Unmanned Aerial Vehicles (UAVs) present a viable alternative, the generation of precise 3D maps using monocular camera systems remains technically challenging. This work introduces an innovative approach for fast 3D mapping with intelligent trajectory planning. The method employs 2D Gaussian Splatting (2DGS) with block-based parallel optimization, integrating a monocular depth prior, depth filter, and a novel dense gradient strategy to reconstruct 3D maps from 2D images. To address operational reliability, we implement a multi-agent planning system which integrates artificial intelligence generated content (AIGC) models as agents. Each UAV's trajectory is managed by the agents to optimize paths dynamically. Experiments demonstrate the method's superiority in speed and effectiveness, offering a robust solution for disaster response and reconstruction.

Abstract:
Video anomaly detection (VAD) is vital for public safety, yet current approaches struggle with limited generalization, low interpretability, and high resource demands. To address these challenges, we propose HoloTrace, an edge-cloud collaborative VAD system that integrates large language models (LLMs) to construct and update a novel bidirectional causal knowledge graph (Bi-CKG). At the edge, HoloTrace leverages LLM-based cross-modal understanding and employs a Hidden Markov Model (HMM) for bidirectional event reasoning, obtaining anomaly boundaries with low computational overhead. On the cloud side, LLMs are leveraged to dynamically update the Bi-CKG with key frames sent from the edge, in order to update causal relationships between events. Additionally, we introduce SVAD, a new large-scale VAD dataset comprising 632 real-world surveillance videos across 10 anomaly types and diverse scenes, with manually labeled frame-level annotations. Experimental results demonstrate that HoloTrace not only achieves the highest accuracy but also enhances interpretability and efficiency, paving the way for more generalizable and explainable video anomaly detection systems.

Abstract:
Current medical multimodal large language models (MLLMs) have demonstrated high accuracy and effectiveness on specific medical visual question answering (Medical VQA) tasks. However, they largely fail to tackle continuously emerging unseen Medical VQA scenarios (e.g., MRI, X-ray) in real-world settings, which significantly hinders their broader adoption in practical clinical environments. Motivated by these gaps, this paper introduces a new task, namely LLM-centric Lifelong Learning for New Medical VQA (L3NMV), which enables Large Language Models (LLMs) to continually learn medical image-text knowledge across various medical VQA tasks. Furthermore, this paper reveals two critical challenges: 1) Efficient medical knowledge retention (Each-task), which aims to retain essential knowledge for each Medical VQA task efficiently with limited data. 2) Efficient medical interference mitigation (Cross-task), which focuses on efficiently mitigating information interference across various Medical VQA tasks with knowledge barriers. To address these challenges, this paper proposes the OmniDoctor model, i.e., an omniscient doctor that simulates how doctors continuously update their knowledge and skills through continuous medical education, with the goal of equipping the model with lifelong learning capabilities via an efficient incremental medical parameter constraining mechanism for L3NMV. This model is designed with two key modules to address the above two challenges, respectively. In addition, this paper constructs an Unseen L3NMV dataset to simulate real-world incremental clinical scenarios. Extensive experiments on this dataset demonstrate that OmniDoctor outperforms several advanced lifelong learning baselines. These results justify the significance of the L3NMV task and the effectiveness of OmniDoctor in continually adapting to new Medical VQA tasks.

Abstract:
In the era of widespread digital art dissemination, visible watermarks provide immediate copyright identification by overlaying visible markers. This addresses the lag of invisible watermarks, which support only post hoc evidence collection and therefore cannot deter infringement in advance, and it meets artists' needs for preemptive prevention and explicit protection. However, existing approaches struggle to balance aesthetics and functionality. To address this challenge, we conducted an exploratory study with watermarking experts, identifying key principles, six common design patterns, and a systematic watermarking workflow. Based on these insights, we developed an end-to-end, perceptual-aware framework for aesthetic-preserving watermark embedding, modeled after expert workflows in five phases. Using the Chain-of-Thought strategy, we optimized prompt instructions to guide the Vision-Language Model in emulating experts' decision-making, generating effective watermarking schemes and conducting objective visual evaluations. Iterative feedback optimization ensures watermarked images adhere to aesthetic principles. Quantitative and qualitative experiments demonstrate the system's superiority over baseline methods in preserving aesthetics and ensuring effective copyright protection.

Abstract:
This study aims to utilise mid-air hand-pose movements to implement various interactive controls, e.g., dial and slider control, through independent low-dimensional embeddings. Towards this, we develop a novel adjustable hand-pose space disentanglement approach for a learnable VAE-based high-to-low dimensional embedding model (HandSolo). It disentangles the latent embeddings into multiple independent one- or two-dimensional embedding spaces, enabling independent control. HandSolo allows multi-dimensional settings and multi-DOF combinations, providing a new paradigm for flexible and extensible hand-pose interaction systems. Additionally, to exploit model potential and make user interaction comfortable, we propose a visual interaction evaluation strategy (VIEs) to help system designers understand model capability and user habits. Finally, we provide an example virtual interaction system that integrates various virtual interaction objects, showing how our innovations improve their interaction capabilities. Experimental user studies demonstrate the effectiveness of our embedding-disentanglement designs, including a discovery experiment (n=4) for VIEs, an inspiration experiment (n=4) for approach extensibility, and an exploration experiment (n=8) for the virtual interaction system.

Abstract:
Federated graph classification has emerged as a promising paradigm for privacy-preserving graph learning across distributed clients. However, real-world federated scenarios often suffer from severe data heterogeneity and label noise, which significantly degrade model performance. To address these challenges, we propose FedRog, a robust and personalized federated graph neural network framework that improves generalization under non-IID and noisy label settings. FedRog introduces a parameter-aware selection and fine-tuning mechanism to align global and local representations, and a neighbor embedding consistency constraint to enhance robustness against noisy supervision. Furthermore, a fine-grained, importance-guided global aggregation strategy based on Fisher information is employed to mitigate unreliable updates from low-quality clients. We conduct extensive experiments on 16 graph classification datasets under five heterogeneous data partition settings. Results show that FedRog consistently achieves competitive or superior performance compared to 14 baselines in terms of both accuracy and robustness under clean and noisy conditions.
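
One common way to realize a fine-grained, importance-guided aggregation is to weight each parameter of each client by its Fisher information before averaging. The sketch below shows that reading only; the abstract does not give FedRog's actual aggregation rule, and the function name `fisher_weighted_average` and the flat-list parameter layout are assumptions.

```python
def fisher_weighted_average(client_params, client_fisher, eps=1e-12):
    """Aggregate client parameters element-wise, weighted by Fisher information.

    client_params: list of per-client parameter lists (same length each).
    client_fisher: matching list of per-parameter Fisher estimates, where a
                   larger value means the client is more confident about
                   that parameter.
    """
    n_params = len(client_params[0])
    aggregated = []
    for k in range(n_params):
        weighted_sum = sum(
            fisher[k] * params[k]
            for params, fisher in zip(client_params, client_fisher)
        )
        total_weight = sum(fisher[k] for fisher in client_fisher)
        # eps guards against division by zero when no client is informative.
        aggregated.append(weighted_sum / (total_weight + eps))
    return aggregated
```

Under this scheme a client with noisy labels, whose Fisher estimates are small for a given parameter, contributes little to that parameter's global value while still influencing the parameters it estimates reliably.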

Abstract:
Recent advancements in visual multi-task learning (MTL) have sparked significant interest. However, existing dense prediction MTL methods predominantly rely on single-modality image data, limiting their performance due to the absence of complementary knowledge from other modalities. Additionally, different dense tasks exhibit heterogeneous preferences during information decoding, posing a critical challenge in effectively allocating multi-scale encoded features. To address these limitations, we propose CLIP-MT, a Multi-Modal Knowledge-Driven Adaptive Scale Feature Allocation for Multi-Task Dense Prediction. Specifically, to enrich task-shared image features with multi-modal knowledge, we introduce a novel CLIP-Guided Global Feature Enhancer (CGGF), which leverages aligned text-image information to augment object-level representations through a dual-path feature fusion architecture. Furthermore, to tackle the task-specific scale preference problem, we design an Adaptive Scale Selection Gate (ASSG), a learnable gating mechanism that dynamically selects high- or low-scale features based on task-specific demands. Finally, we integrate multi-modal and multi-scale information through a Task-Aware Feature Fusion Module (TAFF). Extensive experiments on the NYUDv2 and PASCAL-Context datasets demonstrate that CLIP-MT achieves state-of-the-art performance, outperforming existing methods across multiple dense prediction tasks.

Abstract:
Accessibility of multimedia content for all users, particularly blind and low-vision individuals (BLVIs), remains a significant challenge. While screen readers assist BLVIs by converting text to speech via Alt-Text and image descriptions, these methods are inherently text-based and struggle to convey spatial and graphical information effectively. To address this, we propose a framework that converts graphical components into tactile graphics rendered on a refreshable pin array. Our framework leverages on-device AI models to generate tactile representations without transmitting personal data. It thereby minimizes processing time and mitigates privacy concerns. The benchmark test showed that our on-device AI outperformed GPU servers (RTX 4090) operating in an intranet environment. To optimize the tactile output and evaluate the system's effectiveness on media accessibility, we conducted a series of user studies with three different use case scenarios. First, we derived the optimal threshold value for edge detection in tactile graphics, which was 70 on a 0-255 scale. Then, we compared the proposed system to a vision language model (VLM; GPT-4o). The results indicated that our proposed framework is more effective regarding both information delivery and subjective satisfaction. The proposed framework can be directly applied to several visual media accessibility scenarios, with the benefits of using local AI, such as privacy protection, personalization, and cost-effectiveness.
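
The role of the reported threshold can be sketched as follows. The abstract does not say which edge detector the framework uses or exactly which quantity the value 70 is applied to, so the simple forward-difference gradient below and the function name `to_pin_array` are assumptions; only the threshold of 70 on a 0-255 scale comes from the text.

```python
EDGE_THRESHOLD = 70  # value reported in the abstract, on a 0-255 scale


def to_pin_array(gray, threshold=EDGE_THRESHOLD):
    """Convert a grayscale image (rows of 0-255 ints) to raised/lowered pins.

    A pin is raised (1) where the local edge strength reaches the threshold.
    Edge strength here is the larger of the horizontal and vertical
    forward differences, which is an assumed stand-in for the real detector.
    """
    height, width = len(gray), len(gray[0])
    pins = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = gray[y][min(x + 1, width - 1)]
            down = gray[min(y + 1, height - 1)][x]
            strength = max(abs(right - gray[y][x]), abs(down - gray[y][x]))
            pins[y][x] = 1 if strength >= threshold else 0
    return pins
```

A lower threshold raises more pins and can clutter the tactile surface, while a higher one drops faint contours, which is presumably why the value was tuned through a user study.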

Abstract:
Cross-Embodiment Learning (CEL) aims to train a generalist policy model by integrating large-scale compositional interactions of heterogeneous agents and environments. However, the inherent conflict between the unbounded space of agent-environment combinations and a single unified policy model hinders generalization to unseen combinations. To address this challenge, we propose a novel Mixture of Disentangled Prototypes (MoDP) method to improve the compositional generalization in CEL. The key idea is to introduce a finite prototype space that bridges the gap between unbounded agent-environment combinations and a single policy model. Specifically, we design a dual-headed autoencoder and a compositional reconstruction loss to disentangle agent and environment features from interaction data, and map them into respective prototype spaces. We then introduce a connection-sensitivity-based pruning method to extract sub-networks from the pre-trained policy model, forming policy prototypes associated with specific agent-environment prototype pairs. Finally, a parameter-free routing mechanism adaptively integrates relevant policy prototypes for each input composition. Experiments in both standard and compositional settings demonstrate the effectiveness of our MoDP in enhancing the generalization capability of pre-trained policies.

Abstract:
Augmented Reality (AR) enhances Human-Robot Interaction (HRI) by offering diverse interaction methods. However, existing systems often fail to resolve the conflict between a user's implicit preferences and physical ergonomics, leading to suboptimal experiences. We introduce InteractGuide, a novel framework that, for the first time, uses a Large Language Model (LLM) as a central reasoning engine to dynamically balance these competing factors. Our system translates physiological signals into a symbolic "Preference Memory" that the LLM reasons over, alongside real-time ergonomic and contextual data, to provide personalized interaction recommendations. A 29-participant study confirms that our architecture improves efficiency and experience compared to single-factor approaches. This work presents a validated end-to-end architecture for user-centric interaction adaptation, demonstrating the potential of LLMs as reasoning engines in complex AR-HRI systems.

Abstract:
Outdoor vision systems are frequently degraded by snow particles, which obscure scene content and impair the performance of downstream vision tasks. While previous methods rely on physical priors, their performance often deteriorates under real-world conditions. Recently, semantic priors have proven effective in guiding image restoration, especially with the advent of the Segment Anything Model (SAM), which provides robust segmentation masks under adverse weather. However, leveraging SAM in video restoration remains underexplored due to the temporal inconsistency of inter-frame segmentation. In this work, we construct the first framework to incorporate SAM-derived semantic priors into video snow removal, called SAMVSR. Specifically, to address temporal SAM label misalignment, we introduce an Entropy-wise Zone Propagation technique, which selects a reliable reference mask and semantically aligns instances across different frames via an entropy-guided label matching mechanism. Based on the aligned SAM semantic priors, we propose a Zone-Focused Mamba module, a novel Mamba-based architecture that restricts its scanning scope to semantically coherent zones, effectively mitigating irrelevant interactions and enhancing temporal-spatial consistency. Extensive experiments on both synthetic and real-world benchmarks validate the superiority of our proposed SAMVSR over existing state-of-the-art video desnowing techniques.

Abstract:
The domain of time series forecasting has gained significant attention due to its critical applications in multimedia-rich web traffic (including video streaming workloads and dynamic content delivery) and cross-platform advertisement click predictions, which are essential for web operations planning. While models like TimeSieve have demonstrated strong capabilities in predicting web visitation metrics, they suffer from critical unfaithfulness issues, including sensitivity to random seeds, input noise, layer noise, and parametric perturbations. To address these limitations, we propose Faithful TimeSieve (FTS), an enhanced framework designed to improve prediction reliability and robustness. Our approach systematically detects and mitigates unfaithfulness in TimeSieve, significantly enhancing its stability and consistency. Experimental results demonstrate that FTS substantially improves the model's faithfulness, setting a new standard for temporal forecasting methods. This advancement not only increases TimeSieve's reliability but also contributes to more robust temporal modeling, particularly crucial for web traffic forecasting where prediction accuracy directly impacts operational decisions. Our work thus represents a significant step toward more dependable time series predictions in web-related applications.

Abstract:
Source-Free Domain Adaptive Object Detection addresses cross-domain detection on an unlabeled target domain without accessing source data. Existing methods implement self-training with Mean Teacher but are bottlenecked by error accumulation from noisy pseudo-labels generated via recursive teacher-student updates. This issue is handled through the proposed dual enhancements: (1) External Guidance via Multimodal Foundation Models (FMs); (2) Internal Regulation through Fast-Slow Teacher. First, despite FMs' multimodal comprehension, their semantic misalignment with a specific task introduces noise during adaptation. Bidirectional Distillation mitigates this by calibrating the FM using task-specific knowledge transferred from the source detector. The aligned cross-modal knowledge then propagates through high-quality pseudo-label generation. Second, the conventional Mean Teacher suffers from a plasticity-stability dilemma, where rapid adaptation corrupts historical knowledge. Fast-Slow Teacher introduces dual-velocity knowledge consolidation: the Fast Teacher dynamically captures emerging domain features, while the Slow Teacher preserves stable historical knowledge and periodically resets the Fast Teacher, establishing an error-correcting dynamic equilibrium. Experiments show our method achieves significant improvements over SOTA.
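
The dual-velocity idea can be sketched with two exponential moving averages (EMA) of the student weights that differ only in momentum, plus a periodic reset. This is an illustration under stated assumptions, not the authors' procedure: the momentum values, the reset interval, the choice to update both teachers from the student, and the names `ema_update` and `fast_slow_step` are all hypothetical.

```python
def ema_update(teacher, student, momentum):
    """Standard Mean Teacher update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def fast_slow_step(fast, slow, student, step,
                   fast_momentum=0.9, slow_momentum=0.999, reset_every=100):
    """One training-step update of both teachers (weights as flat float lists).

    The fast teacher tracks the student closely (plasticity); the slow teacher
    changes little per step (stability). Every `reset_every` steps the fast
    teacher is reset to the slow teacher so accumulated errors are discarded.
    """
    fast = ema_update(fast, student, fast_momentum)
    slow = ema_update(slow, student, slow_momentum)
    if step % reset_every == 0:
        fast = list(slow)  # copy, so the two teachers stay independent
    return fast, slow
```

Between resets the fast teacher can drift toward the target domain and supply fresh pseudo-labels, while the reset bounds how far noisy pseudo-labels can pull it away from the stable weights.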

Abstract:
Recent advances in deep neural networks and generative AI have enabled the creation of highly realistic synthetic speech, raising significant concerns about the misuse of deepfakes in deception and fraud. This paper addresses the generalization challenge of active defense against cross-task audio deepfakes under black-box conditions. We systematically analyze the common characteristics of state-of-the-art acoustic synthesis models and propose PhonoFence, an active defense framework that introduces fine-grained phoneme-level adversarial perturbations to prevent unauthorized synthesis. PhonoFence employs a dual-domain strategy to perturb both the time domain and frequency domain via an iterative cross-training framework, leveraging complementary acoustic features to enhance the generalization of perturbations. To further improve the transferability of perturbations, we ensemble speaker encoders with a novel Multi-Middle-Layer loss. Additionally, a psychoacoustic masking algorithm is employed to enhance the perceptual quality of protected speech and conceal perturbations. Extensive experiments on leading acoustic synthesis models demonstrate that PhonoFence reduces identity similarity and word error rate to 17.91% and 49.77%, respectively, achieving relative improvements of 7.97% and 9.67% over the best existing methods. To assess the effectiveness of our method in real-world settings, we test PhonoFence on a commercial speaker recognition system, where it reduces deepfake attack success rates by 73.49%. Moreover, PhonoFence shows strong robustness against adaptive attacks involving compression, denoising, and re-recording.

Abstract:
Facial image steganography is crucial for privacy-preserving media transmission. Traditional embedding methods degrade image quality and are vulnerable to steganalysis, while GAN-based non-embedding approaches lack controllability and realism. Diffusion-based methods using textual prompts face two key issues: (1) security risks from interpretable prompts and (2) poor preservation of facial details. This paper presents Featurized Denoising Diffusion Implicit Models (F-DDIM), a novel non-embedding steganography framework. First, F-DDIM replaces explicit textual prompts with implicit image-based encoding, enhancing security. Second, it selectively refines facial regions for natural and high-quality recovery through iterative reconstruction. Third, it enables indistinguishable encryption without secret key sharing via a novel sub-code embedding algorithm. Fourth, a refinement step post-decoding improves the clarity and accuracy of recovered facial image details. Experimental results demonstrate that F-DDIM achieves superior image fidelity and robustness against transmission interference.

Abstract:
Effectively communicating uncertainty in ensemble hurricane forecasts poses a significant multimedia challenge, requiring the integration of spatial, temporal, and perceptual dimensions. We introduce SUVIS, a stereoscopic visualization system that encodes forecast ensembles into an immersive, layered media experience. SUVIS transforms multidimensional ensemble data into animated stereoscopic representations, mapping time to vertical depth, intensity to texture color, and forward speed to motion flow, while semi-transparent glyphs represent evolving impact areas. A progressive sampling strategy ensures spatial clarity across depth layers. Rendered on a glasses-free stereoscopic display, SUVIS frames uncertainty visualization as a media encoding problem, synthesizing motion, depth, and spatial abstraction to align with human perception. A user study with 51 participants demonstrates that SUVIS supports high accuracy in spatial tasks and enables interpretation of dynamic storm attributes. These results highlight the system's potential to advance perceptual uncertainty communication through multimedia representation and immersive visual encoding.

Abstract:
This paper studies the problem of graph out-of-distribution generalization, which aims to enhance the performance of graph neural networks (GNNs) under distribution shifts. Existing approaches usually learn graph representations from a causal graph, which may not explicitly utilize environment information. Furthermore, they could suffer from performance degradation when confusing semantics related to target labels and environments. In this paper, we propose a novel approach named Dual Prompt Learning with Information Bottleneck (DATE) for graph out-of-distribution generalization. The core of our DATE is to utilize dual prompts to extract task-oriented semantics and model distribution shifts, respectively. In particular, we first pre-train a GNN using contrastive learning with pretext tokens introduced. More importantly, we not only introduce a task-oriented prompt based on LLMs to generate environment-invariant representations, but also learn the environment-oriented prompts to simulate subgraphs in different environments. To optimize our prompts, we introduce a graph information bottleneck framework, which minimizes the mutual information between environment-invariant representations and environment semantics with the most semantics preserved. Extensive experiments on various benchmark datasets validate the effectiveness of our DATE against various state-of-the-art approaches.

Abstract:
Unsupervised remote sensing dehazing remains a challenging and ill-posed task due to the absence of reliable supervision signals. Existing dehazing methods with unpaired data often oversimplify haze removal as style transfer, limiting generalization in complex scenarios. Moreover, current unimodal frameworks neglect cross-modal cues that could improve contextual reasoning. To address these issues, we propose a novel cross-modal guided self-supervised dehazing framework called CLIP-HNet, which achieves multi-model feature extraction, boundary-focused reconstruction, and adaptive sample filtering. Specifically, to capture global-local contextual features, a hybrid feature interaction network is designed, which bridges the feature representations of multiple models with a global context-aware module (GCAM) and a hybrid feature fusion module (HF2M). Then, based on the hybrid features, a boundary-aware feature reconstruction (BFRec) module is proposed to further refine edge details. Furthermore, a CLIP-guided progressive information distillation scheme is presented to dynamically prioritize training samples and distill useful signals, which predicts haze concentration by CLIP and progressively increases sample difficulty during the training stage. Finally, a frequency-domain texture matching (FTM) strategy refines texture and spectral details, enhancing the model's ability to recover fine details. Experiments on synthetic and real remote sensing images (RSIs) demonstrate that the proposed CLIP-HNet surpasses state-of-the-art approaches, achieving superior visual quality and quantitative performance.

Abstract:
Remote sensing image classification with noisy labels is receiving increasing attention. However, existing methods ignore the context information of the training sample and judge whether a label is noisy only by monitoring the loss value of a single sample, which may lead to misjudgment of the sample label. Additionally, these algorithms do not consider constructing pairs of confident instances to obtain robust latent representations after identifying confident instances. In this paper, a Multi-view Collaborative Representation Learning (MCRL) approach for learning from noisy labels is proposed to improve the classification performance of very high resolution (VHR) remote sensing images. Specifically, we design a correction strategy based on spatial consistency and confidence-aware mechanisms. This strategy quantitatively measures label reliability by mining the contextual information of labelled samples within the adaptive region. It leverages the spatial consistency principle and the confidence-aware mechanism to progressively correct and smooth the noisy labels. Moreover, we construct confidence sample pairs by establishing relationships between samples within and between views to obtain robust latent representations, which improves the model's tolerance to noisy labels. Experiments show that MCRL can significantly reduce the impact of noisy labels on the model and is more competitive than comparable algorithms.

Abstract:
Text-to-Visualization (Text2Vis) generates data visualizations directly from natural language queries, democratizing access to data insights. Early Text2Vis efforts, primarily relying on rule-based systems and machine learning models, struggled to handle semantically intricate queries. The advent of large language models (LLMs) allows for better generalization in generating visualization code. However, LLM-based approaches have mainly focused on textual or code-level optimizations, neglecting the potential benefits of assessing and improving visualized charts. Hence, we propose Visualization Refinement (VisRef), a novel framework based on vision-language models (VLMs) to enhance Text2Vis outputs. (1) Knowledge Extraction -- VisRef extracts visualization assessment knowledge through a hierarchical contrastive prompt and multi-granularity quality assessment framework by comparing superior ground-truth charts with inferior Text2Vis outputs; and (2) VLM Fine-Tuning -- This knowledge is used to fine-tune a VLM through a two-stage approach, including warm-up and iterative preference alignment phases, to judge visualization quality and provide code-level refinement suggestions. Experimental results demonstrate that VisRef significantly outperforms state-of-the-art approaches, including LLM-based and VLM-prompted ones, and exhibits strong orthogonal compatibility with existing approaches.

Abstract:
Instruction-based image editing enables intuitive modifications of images through natural language descriptions. However, existing models often struggle to accurately identify the target region, which refers to the area that should be modified. As a result, unintended changes may occur in non-target areas, where the original image should remain unchanged. To address this issue, we propose FoRE, an MLLM-guided framework that identifies the target region based on the given edit instruction and performs image editing using region-aware embeddings. Within FoRE, the Region-guided Edit Adapter projects these embeddings from the MLLM domain to the diffusion condition space. Subsequently, the Region-guided Refinement Module refines the projected features to enhance spatial accuracy prior to guiding the diffusion process. Through comprehensive evaluations, we demonstrate that FoRE significantly improves localization accuracy and instruction fidelity compared to existing approaches. By explicitly incorporating region-aware conditioning, our framework effectively bridges the gap between instruction comprehension and spatially precise image modifications, advancing the capabilities of instruction-based image editing.

Abstract:
Current 3D generation methods struggle to balance quality, efficiency, and controllability. This work introduces a collaborative 3D representation framework that leverages a proxy mesh as an intermediate representation. On the one hand, the proxy mesh establishes structural associations with the skeleton, guiding the generation of skeleton bindings that better align with target shape characteristics. On the other hand, it enables adaptive Gaussian sampling in the shape space for efficient rendering. Through the multi-level dependencies and collaboration among skeleton, proxy mesh, and 2DGS, image gradients obtained from diffusion models via the SDS method can synchronously and differentiably update the parameters of Gaussians, mesh shapes, and skeletons. This enables efficient generation of high-quality, editable, and driveable assets (with skeleton binding) under user-specified instructions. The proposed framework demonstrates its efficiency, high fidelity, and precise binding results in 3D rigged asset generation tasks.

Abstract:
Diffusion models have been used in indoor scene synthesis and have made significant progress. Current works encode an indoor scene as a top-down view of the room, a list of objects, and their world coordinates and orientations. In this paper, we develop a diffusion-based training and synthesis method that incorporates indoor scene "characteristics". First, we calculate the relative transformations among objects to capture the local characteristics of the scene, and feed these relative transformations into the self-attention layer of the denoising network as a "relative positional encoding". Second, we use room guidance to guide the objects to fit the room's geometry. This uses the room's characteristics to solve the physical collision problem that occurs in earlier diffusion-based works, while preserving plausibility. Experiments show that these changes improve scene variety and quality.

Abstract:
Refining user-provided natural language prompts allows users to more easily obtain their desired outputs in text-to-image generation. Existing automatic prompt refinement methods predominantly take discrete, human-engineered high-quality prompts as the final optimization target. However, human-engineered prompts are based on human intuition and derived through limited interaction with generative models, and therefore fail to bridge the gap between human preferences and model preferences. Additionally, this discrete optimization target limits the information capacity of the conditional inputs fed to the generative model, leading to suboptimal outcomes. Therefore, we propose an end-to-end prompt optimization method that interacts directly with generative models, eliminating the need for human involvement. The optimization process takes high-quality images as the target and uses the internal states of generative models as optimization signals. This optimizes prompts in a way that aligns more naturally with the model's generation process and produces continuous representations as the final refined prompt. We also introduce a memory module that stores common features of high-quality prompts as prior knowledge to guide optimization in continuous space, making it more efficient. This memory-prior-based prompt refinement in continuous space (MPPR) not only bridges the gap between human preferences and model preferences, but also resolves the issue of insufficient information in the inputs provided to the generative model. Extensive experiments show that our method achieves better performance than state-of-the-art baselines.

Abstract:
Synthetic images serve as a promising alternative to real images in 3D hand pose estimation, providing accurate annotations at a lower cost. However, the domain gap between real and synthetic images constrains the generalization ability of hand pose estimators trained on synthetic data. Previous methods rely on Generative Adversarial Networks (GANs) for domain translation; however, they fail to achieve realistic depth synthesis due to instability and limited image quality. Diffusion models provide high-quality synthesis due to their stability and controllability. However, existing methods often ignore 3D structure awareness in hand image generation. In this paper, we propose a Dual-Branch 3D Spatial-Aware Latent Diffusion (DSW-LD) for realistic depth image generation. The Global Structure Module (GSM) and the Local Geometry Module (LGM) complement each other, with GSM capturing global spatial structure through coarse-grained 3D joint features and LGM focusing on local geometric details using fine-grained 3D mesh representations. To maintain global structure consistency, we adopt a layer-aware injection mechanism that enables the model to adaptively learn the optimal representation from fused 2D latent representations and 3D joint features. To explicitly align 3D and 2D features of local regions and enhance the flexibility of feature matching, we design a dynamic depth-aware interpolation to project 3D mesh features into 2D image space. Both quantitative and qualitative experimental results demonstrate the superiority of our method over state-of-the-art methods for realistic depth synthesis. Compared to training only on real depth images, our method enables the hand pose estimator to achieve significantly better performance with our synthetic data and less real data (10%).

Abstract:
3D Gaussian Splatting (3DGS) is a recently popular technique that can efficiently reconstruct the radiance field representation of a scene. However, the naive 3DGS algorithm is easily affected by noisy pixels from transient and dynamic objects. To resolve this and enable robust learning of 3DGS, previous work proposed generating a binary mask from the per-pixel training loss or an image segmentation result. However, such modeling of the distractors does not adapt to the 3DGS learning process and may misidentify noisy pixels, degrading reconstruction performance. Instead, we propose to learn a soft mask of the likelihood of the distractors. Moreover, we develop techniques to model the spatial pattern of distractors and learn them from a designed curriculum, avoiding confusion between clean and noisy pixels. Our method demonstrates state-of-the-art performance on robust novel view synthesis from distractor images, evaluated on major benchmarks of this task.

Abstract:
Generalizable 3D Gaussian Splatting (G-3DGS) has recently emerged as a promising solution for efficient 3D scene representation and novel view synthesis. However, sparse-view scenarios pose a critical challenge for accurate depth estimation. In such cases, viewpoint overlaps are minimal, and many regions are visible from only a single view. As a result, reliable multi-view matching is unavailable in these areas, leading to significant reconstruction quality degradation. To tackle this bottleneck, we propose GraphSplat, a feed-forward framework for novel view synthesis that dynamically incorporates both cross-view and monocular cues through a graph-based feature aggregation strategy. Central to our approach is a Multi-view Aggregate Graph Attention (MAGA) mechanism, which adaptively reweights intra-view and inter-view node connections to compensate for unreliable multi-view correspondences with robust single-view depth priors. In addition, we design a Hierarchical Depth Fusion Estimator (HDFE) module to integrate monocular and multi-view depth cues, effectively reducing ghosting artifacts and improving geometric consistency. Extensive evaluations on RealEstate10K and ACID benchmarks show that GraphSplat achieves competitive performance against prior SOTA methods, with improvements in appearance fidelity and cross-dataset generalization particularly under challenging sparse-view conditions.

Abstract:
Consistently stylizing and editing 3D objects from multiple viewpoints is crucial for immersive applications such as virtual reality, augmented reality, and digital entertainment. Nevertheless, existing methods frequently face significant challenges, including inconsistent textures, pronounced drifting artifacts, and compromised geometric integrity when rendered from various perspectives. To effectively address these limitations, we introduce S2-Edit3DV, a novel diffusion-guided framework that reframes multi-view 3D object editing as a temporally coherent video editing problem. By exploiting the robust single-view generative capabilities of SV3D, our approach reliably propagates initial style edits across different viewpoints, substantially mitigating drifting artifacts prevalent in current video-based editing methods. To further enhance semantic precision and structural preservation, we propose two innovative techniques: Attention-based Differential Style Injection (ADSI) and Adaptive Structural-aware Plug-and-Play (AS-PnP). ADSI utilizes attention-driven semantic embeddings for adaptive and precise style injection, effectively reducing semantic hallucinations. AS-PnP strategically modulates stylized latent features, balancing artistic expression with strict structural coherence. Comprehensive evaluations and ablation studies demonstrate that our proposed framework significantly enhances multi-view consistency, preserves fine-grained geometric details, and ensures accurate semantic alignment, showcasing superior performance and practical value for generating high-quality, creatively stylized, and structurally robust objects.

Abstract:
High-quality three-dimensional (3D) reconstruction from sparse views is critical for applications such as virtual and augmented reality, robotics, and digital content creation. While methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown strong performance in novel view synthesis, they struggle in few-shot settings, especially when scenes contain large occluded or unseen regions. The lack of explicit supervision for hidden content limits reconstruction completeness and realism. We propose See-Through-the-Occlusion Gaussian Splatting (STO-GS), a novel framework that rethinks occlusion modeling in static scenes. Drawing inspiration from four-dimensional Gaussian Splatting (4DGS), we reinterpret time as a proxy for occlusion depth and apply deformation-based opacity modulation to recover hidden layers. To provide supervision, we generate amodal views via diffusion-based inpainting, exposing occluded structures for training. A two-stage layered training pipeline further refines the reconstruction, with a multi-layer perceptron (MLP) adjusting Gaussian opacity across occlusion layers. STO-GS improves occlusion-aware reconstruction and achieves superior performance over existing few-shot 3DGS baselines, including a 0.51 dB gain on challenging datasets.

Abstract:
The demand for identity-preserving Facial Aesthetic Enhancement (FAE) has surged in social media and digital entertainment. However, existing methods based on deep generative models encounter difficulties in striking a balance between fine-grained detail enhancement and preserving the unique identities of individuals from diverse ethnic and gender backgrounds. To tackle this issue, this paper proposes a novel tuning-based framework that integrates prototype-based hierarchical prompt learning within a CLIP model and a StyleGAN-based inversion model. Our approach first adapts a pre-trained StyleGAN to the input face via pivotal tuning, optimizing around pivotal latent codes to minimize reconstruction distortion while retaining editability. Then, a prototype-based hierarchical prompt learning module is designed for learning multigrained facial features to achieve comprehensive and fine-grained facial descriptions for FAE. Specifically, we propose a prototypical similarity measure based on a multi-ethnic dataset to select geometrically similar faces with high aesthetic scores as reference faces. This selection is guided by ArcFace regularization within categorized gender and ethnic groups to minimize identity loss. Additionally, we design a novel aesthetic attribute selection algorithm to generate generic fine-grained aesthetic attributes from these reference faces for detailed facial descriptions. These components work synergistically through dynamic weight modulation, prioritizing features with high aesthetic contributions (such as enhancing lip fullness) while ensuring semantic consistency through CLIP-driven optimization for pivotal latent codes. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques in both aesthetic quality and identity preservation, especially for out-of-domain faces.

Abstract:
Multivariate time-series (MTS) classification tasks play a key role in data-driven applications spanning healthcare, finance, and mobile communication. As MTS data are typically collected from multiple interdependent sensors, the resulting temporal patterns inherently reflect the characteristics of the underlying sensing systems. Despite this connection, conventional MTS classification models predominantly focus on raw time-series data while disregarding valuable sensor-specific prior knowledge, which fundamentally constrains their classification accuracy. The emergence of large language models (LLMs) has encoded extensive sensor-related knowledge within their parameter spaces. However, effectively harnessing such knowledge to enhance MTS classification networks remains an open challenge. To address this, we propose Foresail, a status-guided neural framework that bridges this gap through systematic integration of LLM-derived sensor knowledge via the status relationship matrix and fine-grained status labels. Foresail can be seamlessly integrated with existing MTS networks to optimize performance and generate interpretable intermediate results. Experiments on irregularly and regularly sampled MTS data demonstrate that Foresail outperforms state-of-the-art approaches, achieving a notable improvement in F1-score of up to 10.9% compared to the basic MTS network.

Abstract:
The rapid advancement of deepfake technology has led to an increasing frequency of crisis incidents stemming from its misuse. However, existing forgery detection methods often suffer from poor cross-domain generalization due to overfitting to specific forgery cues inherent in their training datasets. Through an in-depth analysis, we identify that low-loss, overfitted features hinder models from capturing broadly applicable patterns necessary for effective generalization. To overcome this limitation, we introduce Knowledge Negative Distillation (KND), a simple yet powerful teacher-student framework, designed to encourage the student model to acquire knowledge beyond the teacher's existing scope. Specifically, we guide the student model to avoid the teacher's overfitted features by maximizing a cross-entropy loss computed from the teacher's probability distributions during the training for the target task. Additionally, we propose an adaptive fusion mechanism that integrates the extensible student features with the teacher's features, weighted and guided by their respective probability distributions. Extensive experimental results validate the superior performance of KND, demonstrating state-of-the-art capabilities across multiple benchmarks. Moreover, the extensibility and universality of KND underscore its potential applicability to a broader range of cross-domain problems characterized by significant overfitting challenges.
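
The negative-distillation objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact loss, so the additive form `task - weight * negative`, the function names, and the `weight` parameter are all assumptions made for this example.

```python
import math


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def knd_style_loss(student_probs, teacher_probs, label, weight=0.1):
    """Hypothetical combined loss (assumed form, not taken from the paper).

    task:     ordinary cross-entropy of the student against the true label.
    negative: cross-entropy of the student against the teacher's distribution.
    Subtracting the negative term means that minimizing the total loss
    maximizes the student's divergence from the teacher's predictions.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_probs))]
    task = cross_entropy(one_hot, student_probs)
    negative = cross_entropy(teacher_probs, student_probs)
    return task - weight * negative
```

With `weight=0` the loss reduces to the plain task loss; a larger `weight` rewards the student more strongly for moving away from the teacher's distribution while it still fits the labels.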

Abstract:
Subspace-based DPSGD has emerged as a robust solution for alleviating excessive noise in high-dimensional, privacy-preserving visual learning. It achieves a favorable privacy-utility balance by projecting privatized gradients onto a low-rank subspace derived from in-distribution public data. However, relying solely on limited public data can narrow the diversity of the anchored subspace and induce over-memorization in model training, leading to suboptimal private optimization. To overcome these limitations, we propose a synthesis-augmented subspace-based DPSGD framework, SAS-DPSGD, which integrates synthetic data into the subspace construction process. We quantitatively analyze the private optimization performance using the subspace derived from the mixed public and synthetic data, revealing the benefits as well as the saturation effects of incorporating synthetic data in private visual learning. To the best of our knowledge, this is the first work to provide a unified theoretical guarantee for synthesis-augmented subspace-based DPSGD. Moreover, we design an early projection mechanism within our framework that projects the gradient onto the subspace before performing gradient clipping. This mechanism effectively reduces the gradient clipping bias and lowers the synthetic data requirement, resulting in a faster convergence rate. Extensive experiments on two real-world datasets validate that SAS-DPSGD outperforms nine baselines by up to 9.78% in accuracy and can reduce the amount of synthetic data required by 66.7%.
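
The effect of the ordering in the early projection mechanism can be illustrated with a small sketch. It assumes an orthonormal basis for the subspace and omits the noise addition of DP-SGD entirely; the function names are hypothetical and only the project-then-clip versus clip-then-project contrast is shown.

```python
import math


def project(grad, basis):
    """Project grad onto the span of an orthonormal basis (list of vectors)."""
    coeffs = [sum(g * b for g, b in zip(grad, vec)) for vec in basis]
    return [sum(c * vec[i] for c, vec in zip(coeffs, basis))
            for i in range(len(grad))]


def clip(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def project_then_clip(grad, basis, max_norm):
    """Early projection: clip only the in-subspace part of the gradient."""
    return clip(project(grad, basis), max_norm)


def clip_then_project(grad, basis, max_norm):
    """Conventional order: clip the full gradient, then project."""
    return project(clip(grad, max_norm), basis)


if __name__ == "__main__":
    grad, basis = [3.0, 4.0], [[1.0, 0.0]]
    print(project_then_clip(grad, basis, 3.0))
    print(clip_then_project(grad, basis, 3.0))
```

In this toy case the out-of-subspace component inflates the norm of the full gradient, so clipping first shrinks the in-subspace component as well, whereas projecting first leaves it untouched.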

Abstract:
Face-swapping deepfake poses significant risks, including privacy violations, misinformation, and defamation, amplified by the availability of pretrained models on open-source platforms. Proactive defense strategies aim to disrupt deepfake generation by modifying the original images to protect identity features. However, existing methods often introduce artifacts in facial images or rely on specific deepfake models, limiting their usability. To address these problems, we propose a style code orchestration in latent space (SCOL) method that obfuscates identity by fusing different identities in the latent space without requiring face recognition models. This study optimizes the generator to follow the original appearance while retaining the obfuscated identity via identity-preserving constraints. Further, appearance-dominant components in the latent code are aligned for visual consistency. An identity inversion attack is introduced using opposite style codes to improve the effectiveness of the defense. Experimental results demonstrate that SCOL robustly defends against various face-swapping deepfake methods, maintaining visual consistency.

Abstract:
The rapid advancement of Artificial Intelligence Generated Content (AIGC) technology has enabled deepfake videos to evolve from unimodal generation to audio-visual forgeries. Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio-visual modalities to improve detection performance. However, in real-world scenarios, network jitter often leads to audio-visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection method specifically designed for audio-visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio-visual synchrony and asynchrony scenarios, revealing the impact of audio-visual asynchrony on detection performance. Second, we design a multimodal subspace representation module to mitigate inconsistencies in feature distributions and representation heterogeneity between modalities. We then formulate audio-visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to reconstruct missing features and construct the joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Extensive experiments demonstrate that our method outperforms baselines in audio-visual asynchrony scenarios and exhibits robustness against unknown disturbances.
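
The alignment step can be pictured as a one-to-one assignment between audio and video segments that minimizes a total matching cost. The sketch below is an illustration under stated assumptions: it uses cosine distance as the cost (the abstract does not specify one) and, to stay dependency-free, solves the assignment by exhaustive search over permutations rather than by the Hungarian algorithm. Both yield the same optimum, but exhaustive search is only practical for a handful of segments.

```python
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_segments(audio_feats, video_feats):
    """Return match[i] = index of the video segment assigned to audio segment i.

    Assumes both lists have the same length. The cost of pairing (i, j) is
    1 - cosine similarity (an assumed cost, chosen for illustration).
    """
    n = len(audio_feats)
    cost = [[1.0 - cosine(a, v) for v in video_feats] for a in audio_feats]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    video = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # same segments, reordered
    print(align_segments(audio, video))
```

For realistic sequence lengths, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation search.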

Abstract:
Proactive defense against face forgery seeks to disrupt the output of forgery models by embedding imperceptible adversarial perturbations into the face images to be protected. However, existing methods predominantly focus on deepfakes, often neglecting traditional image manipulations. This limits their practical applicability, as attackers may resort to traditional manipulations when deepfake attempts fail. To bridge this gap, a Dual-Forgery Proactive Defense (DFPD) method is proposed for combating both deepfakes and traditional image manipulations. For deepfake resistance, DFPD designs a gradient-based ensemble adversarial attack that effectively disrupts outputs from multiple deepfake models. To defeat traditional manipulations, it also designs a fragile watermarking algorithm based on an Invertible Neural Network (INN), enabling accurate localization of tampered regions. Furthermore, to mitigate the mutual interference between perturbation injection and watermark embedding, DFPD takes two measures. First, it adopts a serial pipeline that starts with watermark embedding and then applies perturbation injection, which ensures that the injected perturbations are not displaced into the residual image during INN-based embedding. Second, a morphological post-processing module is introduced to eliminate adversarial noise in the tampering localization results. Extensive experiments validate the effectiveness of DFPD, demonstrating a 20.25% improvement in deepfake disruption over the best baseline in terms of PSNR and a 9.67% increase in traditional tampering localization in terms of ACC, while preserving high perceptual quality (32.75 dB PSNR).

Abstract:
Compressing attributes of 3D point clouds remains challenging due to their inherent sparsity and irregular distribution. To address this, we propose an efficient framework based on sparse hierarchical Implicit Neural Representations (INRs). Specifically, we introduce a novel vertex-based INR framework, which integrates interpolation to enable accurate and compact implicit representations of point cloud attributes. To effectively capture the varying importance of latent features, we design an adaptive quantization scheme. Furthermore, we develop efficient level-wise entropy models to exploit dependencies within and across hierarchical levels. Finally, point cloud attributes are reconstructed from concatenated multi-resolution latent representations via a sparse convolution-based reconstruction module. Experimental results demonstrate that our approach significantly outperforms previous INR-based methods, achieving superior performance compared to the latest G-PCC (TMC13v28) standard and state-of-the-art learning-based methods.

Abstract:
Building Information Model (BIM) has become a significant digital platform for representing buildings in the Architecture, Engineering, and Construction (AEC) industry. However, the absence of extensive, class-diverse, and balanced datasets at the BIM component level has limited the development of AI-driven BIM analysis. In this study, BIMCompNet is proposed as a large-scale multimodal dataset built from Industry Foundation Classes (IFC) models, which supports learning BIM component geometry features from multiple representations, including rendered views, point clouds, mesh structures, voxel grids, and semantic graphs. BIMCompNet is constructed by a standardized two-stage processing pipeline: (1) At the model level, geometry units are normalized to SI units, models are converted to the IFC format, metadata is anonymized, and components are automatically extracted into individual IFC files. (2) At the component level, semantic labels are corrected, geometry and positioning are aligned, duplicates at model and project levels are removed, and five synchronized modalities (OBJ meshes, multi-view images, point clouds, voxel grids, and heterogeneous IFC graphs) are generated. BIMCompNet comprises 1,304,206 cleaned and labeled components across 87 IFC classes, collected from 1,607 real-world BIM models spanning 14 building types. To mitigate class imbalance, underrepresented classes are merged, and dominant classes are down-sampled to create balanced subsets suitable for robust AI model training and benchmarking. Benchmarking is performed on classification tasks with different models and multiple data modalities. Both the dataset and the processing pipeline will be publicly released to support reproducibility and private dataset extension.
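
The class-balancing step (merging underrepresented classes and down-sampling dominant ones) can be sketched as below. The thresholds `min_count` and `max_count`, the merged label `"Other"`, and the function name are assumptions for illustration; the abstract does not state how classes are merged or which cut-offs are used.

```python
import random
from collections import defaultdict


def balance_classes(samples, min_count, max_count, rng, merged_label="Other"):
    """Merge rare classes and down-sample dominant ones.

    samples: iterable of (item, label) pairs.
    Classes with fewer than min_count items are pooled under merged_label
    (hypothetical policy); any class with more than max_count items is
    randomly down-sampled to exactly max_count.
    """
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    merged = defaultdict(list)
    for label, items in by_class.items():
        target = label if len(items) >= min_count else merged_label
        merged[target].extend(items)

    return {
        label: rng.sample(items, max_count) if len(items) > max_count else items
        for label, items in merged.items()
    }


if __name__ == "__main__":
    data = ([(f"wall_{i}", "IfcWall") for i in range(10)]
            + [(f"door_{i}", "IfcDoor") for i in range(3)]
            + [("stair_0", "IfcStair"), ("ramp_0", "IfcRamp")])
    subset = balance_classes(data, min_count=2, max_count=5, rng=random.Random(0))
    print({label: len(items) for label, items in subset.items()})
```

Passing an explicit `random.Random` instance keeps the down-sampling reproducible, which matters when a balanced subset is meant to serve as a fixed benchmark split.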

Abstract:
The increasing demand for large-scale, high-quality datasets in dynamic point cloud compression (PCC) and human visual perception research underscores the limitations of existing datasets, which are often constrained by limited scale and insufficient dynamism, hindering algorithm validation and perceptual analysis in complex scenarios. To address this gap, we present DPCSet, a comprehensive dynamic point cloud dataset designed to support advanced research in PCC, human perception, and related domains. Comprising 100 dynamic object point clouds, the largest collection of its kind, DPCSet includes 200-frame sequences with geometry and attribute information, capturing diverse object types across real and virtual environments. Organized into seven superclasses, the dataset ensures broad scenario coverage. Through rigorous selection, format conversion, and quantization, DPCSet delivers standardized, high-precision point cloud data. Evaluation of multiple compression algorithms on a curated subset demonstrates DPCSet's efficacy in assessing trade-offs between compression efficiency and quality loss, positioning it as a potential benchmark for PCC. Furthermore, just noticeable distortion (JND) experiments on a compression-distorted subset reveal distinct perceptual characteristics of dynamic point clouds, offering valuable insights for perception-driven compression algorithms. The dataset is released at https://openi.pcl.ac.cn/gaowx/DPCSet.

Abstract:
Recently, MGTV organized the Image-to-Video Model Acceleration Challenge, calling for participants to propose optimization solutions for the Wan 2.1-14B model. The challenge emphasizes techniques such as quantization and GPU acceleration to improve the model's inference efficiency. As AIGC technology advances rapidly, video generation large models exhibit great potential in content creation, yet they face critical challenges of high computing power consumption, long inference time, and excessive VRAM usage during inference, which severely hinder content production efficiency. This challenge aims to explore approaches for efficient video generation under limited computing resources, requiring participants to reduce the model's computing power and VRAM demands while improving inference speed, all without compromising generation quality. To support participants' development and evaluation, the challenge provides a baseline framework and test dataset. For further details, please refer to the official challenge website (https://challenge.ai.mgtv.com/#/track/53).

Abstract:
The automated comprehension of complex, multi-modal documents is fundamentally hampered by a disconnect between information extraction and reasoning. Existing systems suffer from inherent limitations. Monolithic models embed reasoning as a black-box process, sacrificing transparency and depth. Meanwhile, current agent-based frameworks follow a passive, non-interactive paradigm; they handle static, global inputs rather than information derived from active exploration, which fundamentally restricts their ability to achieve structural understanding and complex reasoning. To bridge this critical gap, we introduce HEAR, a framework for Holistic Extraction and Agentic Reasoning. This framework establishes a synergistic closed loop between a holistic parsing engine driven by a deep Vision-Language Model (VLM) and a collaborative multi-agent reasoning system. HEAR first transforms unstructured documents into a semantically rich, structured representation, preserving complex layouts and reconstituting multi-page tables. Subsequently, a multi-agent system performs cross-modal analysis, governed by a verification protocol that forces agents to validate findings across textual and visual modalities. A conflict-driven re-evaluation mechanism enables the system to dynamically re-engage the document to resolve ambiguities, thereby unifying the perception-cognition cycle. HEAR achieved first place in the ACM MM 2025 Grand Challenge on Large Vision–Language Model Learning and Applications.

Abstract:
Multimodal interleaved reasoning, which requires models to understand interleaved image-text sequences and multiple images, is a critical challenge in contemporary AI. This paper proposes a parameter-efficient fine-tuning framework based on Large Vision-Language Models, with Qwen2.5-VL as the backbone and Low-Rank Adaptation for task-specific adaptation. The framework integrates four stages: multimodal input preprocessing to align with pre-training distributions, visual feature extraction via a modified Vision Transformer, cross-modal fusion via attention mechanisms, and response generation via an autoregressive decoder. By freezing pre-trained weights and fine-tuning low-rank adapters in both visual and language modules, it balances preserving general multimodal knowledge with optimizing target tasks, achieving high performance with low computational overhead. On the MIRAGE Challenge Track A Dataset, it performs strongly across subtasks, achieving an aggregate score of 0.7857 and securing second place in the challenge. Ablation studies confirm that joint LoRA fine-tuning of visual and language modules yields optimal results; limitations in fine-grained visual difference tasks indicate future directions in enhancing subtle feature capture and adaptive cross-modal alignment.
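
The Low-Rank Adaptation idea used here (frozen pre-trained weights plus small trainable low-rank adapters) can be sketched for a single linear layer. This follows the standard LoRA formulation, y = W x + (alpha / r) · B (A x); it is a generic illustration in plain Python, not the authors' code, and the function and argument names are chosen for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, frozen_w, lora_a, lora_b, alpha):
    """Forward pass of a linear layer with a LoRA adapter.

    frozen_w: d_out x d_in pre-trained weight, never updated.
    lora_a:   r x d_in trainable down-projection.
    lora_b:   d_out x r trainable up-projection.
    alpha:    scaling hyperparameter; the update is scaled by alpha / r.
    """
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]                      # rank r = 1
    b_zero = [[0.0], [0.0]]               # zero-initialized up-projection
    print(lora_forward([1.0, 2.0], w, a, b_zero, alpha=2.0))
    print(lora_forward([1.0, 2.0], w, a, [[1.0], [2.0]], alpha=2.0))
```

With the up-projection initialized to zero, as is standard for LoRA, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behavior and only the `r * (d_in + d_out)` adapter parameters are trained.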

Abstract:
Traditional video captioning methods often produce generic descriptions that fail to align with specific user intentions, limiting their applicability in scenarios requiring customized information extraction. This paper proposes a novel intention-oriented controllable video captioning approach, which leverages large-scale vision-language models (InternViT and InternLM) and achieves parameter-efficient fine-tuning through Low-Rank Adaptation (LoRA). The proposed framework processes both video content and user-specified intentions via a unified cross-modal pipeline, dynamically aligning visual features with intent semantics to generate focused and contextually accurate captions. Experiments on the IntentVC dataset validate the effectiveness of the proposed method in generating intention-aligned captions, with the following performance metrics: BLEU@4 scores 44.38 on the public test set and 40.21 on the private test set; METEOR scores 63.79 and 60.07 respectively; CIDEr scores 230.33 and 208.15 respectively; ROUGE-L scores 61.75 and 57.14 respectively. Ablation studies confirm the significant effectiveness of joint LoRA adaptation on vision and language modules, as well as the sensitivity of performance to LoRA parameters. This work advances the field by enabling precise control over caption generation, enhancing the practical utility of video understanding systems in applications such as accessibility services and targeted video retrieval.

Abstract:
Depression is an increasingly prevalent mental health issue worldwide, especially among the elderly, for whom effective early detection is crucial for timely intervention. While recent multimodal approaches demonstrate promise in leveraging visual and auditory cues for automatic depression recognition, existing methods often fail to extract fine-grained, depression-specific patterns from pre-extracted features and underutilize cross-modal interactions. To address these challenges, we propose DepFormer, a unified framework that incorporates a Bimodal Collaborative Transformer (BCT) for cross-modal representation learning and a personalized fusion module to enhance individual-specific modeling. The architecture comprises: (1) unimodal feature extraction, (2) bimodal collaborative representation learning via the BCT, (3) personalized feature fusion, and (4) final depression classification. Critically, the BCT employs symmetric bidirectional branches comprising an Audio-to-Video Transformer and a Video-to-Audio Transformer to enable mutual enhancement and complementary learning across modalities. Extensive experiments validate DepFormer's effectiveness, securing first place in the MPDD Challenge 2025 (Elderly Track), underscoring its strong practical potential.

Abstract:
The rise of Multimodal Large Language Models (MLLMs) offers new opportunities for Micro-Expression (ME) analysis. This paper introduces Micro-Expression Visual Question Answering (ME-VQA), a novel task reformulating ME annotations (e.g., emotion categories, action units) into QA pairs. To address three key challenges (hardware limitations, context inconsistency, and compositional reasoning gaps), we propose a Relationship-Aware Hierarchical VQA Framework. Our approach leverages mined emotion correlations (e.g., coarse-to-fine label dependencies) and employs a two-stage process: 1) coarse-grained anchoring for broad emotion categories, and 2) fine-grained reasoning constrained by coarse outputs and statistical rules. We further optimize efficiency via a dual-phase video sampling strategy: during training, keyframes (onset/apex/offset) and random non-expression frames are used; uniform sampling is applied at inference. Experiments demonstrate significant improvements in answer consistency and accuracy.
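The dual-phase sampling strategy can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that "non-expression frames" are the frames lying outside the onset-to-offset interval, and the function names and parameters are hypothetical.

```python
import random


def sample_training_frames(num_frames, onset, apex, offset, num_extra, seed=None):
    """Training phase: the three keyframes plus random non-expression frames.

    Assumption: a frame counts as non-expression if its index lies
    outside the closed interval [onset, offset].
    """
    rng = random.Random(seed)
    outside = [i for i in range(num_frames) if i < onset or i > offset]
    extra = rng.sample(outside, min(num_extra, len(outside)))
    return sorted({onset, apex, offset, *extra})


def sample_inference_frames(num_frames, num_samples):
    """Inference phase: uniformly spaced frames (centre of each equal segment)."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


train_idx = sample_training_frames(100, onset=40, apex=50, offset=60, num_extra=5, seed=0)
infer_idx = sample_inference_frames(100, 4)
```

At inference no onset, apex, or offset annotation is needed, which is why the second function depends only on the clip length.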

Abstract:
Emotional support chatbots can provide scalable, low-cost, and personal emotional support, overcoming critical accessibility barriers inherent in traditional counseling. However, current text-based chatbots fall short in conveying the multimodal empathy crucial in counseling. Humans naturally prefer face-to-face communication with peers to share feelings, in which spoken tone, micro-expressions, and body language convey empathy. To bridge this gap, we propose EMO-Avatar, an LLM-agent-orchestrated framework that integrates emotional reasoning capabilities and multimodal expression in counseling. Our approach introduces two innovations: (1) a Multimodal Emotional Support Agent: EMO-Avatar can follow adaptive instructions across TTS, pose, micro-expressions, and body actions, leading to the generation of highly expressive human animations; and (2) a Comforting-Exploration-Action support strategy: EMO-Avatar systematically integrates Hill's three-stage counseling theory into its emotional reasoning capability. Guided by the LLM's reasoning, this strategy informs response generation and displays stage-specific preferences for speech, body language, and expressions. EMO-Avatar can provide deeper emotional support and therapeutic human-like interactions. Experimental validation on the AvaMERG Challenge demonstrates EMO-Avatar's superior performance, achieving top-2 ranking among 20 participants across response appropriateness, multimodal consistency, naturalness, and emotional expressiveness metrics. Our demo is available at https://ai4ai.anonymous-demo.fun/.

Abstract:
Social media popularity prediction is essential for content optimization and platform management. Existing approaches often struggle to capture the intricate semantic relationships among heterogeneous content modalities. To solve this problem, we propose a hierarchical attention fusion framework with cross-modal semantic alignment, which integrates text, visual, and user behavior features for enhanced popularity prediction. This design enables the model to adaptively emphasize the most informative features across modalities. We systematically evaluate various regression models and their ensemble strategies on the SMPD dataset, which contains 486,000 social media posts. Experimental results demonstrate that our hierarchical attention fusion consistently outperforms existing fusion methods. These findings highlight the effectiveness of cross-modal semantic alignment and provide valuable insights for advancing multi-modal social media popularity prediction.

Abstract:
Bodily Behaviour Recognition (BBR) and Eye Contact Detection (ECD) in multi-person group conversations are critical for understanding social dynamics, but traditional methods often rely solely on visual cues, lacking integration with semantic context. To address this, we propose a novel framework based on Large Vision-Language Models (LVLMs), leveraging their cross-modal alignment capability to fuse visual features (e.g., body postures, gaze directions) and linguistic semantics (e.g., behavioral category descriptions). A parameter-efficient tuning strategy using Low-Rank Adaptation (LoRA) is adopted, adapting only a subset of parameters in both the Language Model (LM) and Vision Transformer (ViT) modules, thus retaining pre-trained knowledge while reducing computational costs. The framework incorporates multi-task output heads to simultaneously predict BBR and ECD results. Experiments on the MPIIGroupInteraction dataset demonstrate superior performance: our method achieves 0.65 accuracy on BBR and 0.82 accuracy on ECD, outperforming state-of-the-art approaches by 0.02-0.03 in absolute terms. Ablation studies validate that applying LoRA to both LM and ViT with optimal hyperparameters yields the best results, confirming the importance of cross-modal synergy. This work highlights the potential of LVLMs in social behavior analysis, providing a lightweight and effective solution for understanding complex group interactions.

Abstract:
The ACM Multimedia 2025 Industry Program presents a comprehensive overview of how multimodal AI is revolutionizing real-world applications. The program features contributions from over twenty industry leaders, covering a spectrum of domains from content creation and distribution to healthcare, manufacturing, and foundational technology. Keynotes by leaders from NEC and Google DeepMind address critical challenges in business transformation and media integrity, respectively. A dedicated seminar from Google explores advancements in video codecs like AV2 and the development of next-generation quality metrics using Large Language Models (LLMs). The program also includes twelve expert talks and seven demonstrations that showcase practical innovations, such as LLM-driven recommendation systems at Meta, generative AI for industrial optimization by Mitsubishi Electric, and a Siemens-developed protocol for semantic interoperability in manufacturing. These components collectively highlight the profound societal and industrial impact of multimedia research and bridge the gap between academic theory and real-world deployment.

Abstract:
The rise of generative AI has democratized media creation, bringing huge promise but also possible perils. While this may seem like a new problem, the generation and manipulation of media has a long history that predates the current AI boom. I'll discuss key insights from our multi-year analysis of content that people shared online. Looking at manipulations such as deepfakes and cheapfakes, as well as misleading contextual manipulations, I'll reveal surprising statistics that challenge common assumptions about the most prevalent types of problematic media. I'll then explore mitigation strategies, including ways to improve information literacy tools, the opportunities and limitations of using AI to detect manipulated content, and how provenance methods paired with AI can help address out-of-context manipulations. Finally, I'll introduce an AI-based tool that can provide additional context for the media we encounter online every day.

Abstract:
The 2024 Nobel Prizes in Chemistry and Physics have once again drawn global attention to AI for Science. The rise of foundation models has further accelerated AI for Science across multiple disciplines. Scientific research is a touchstone for advancing the intelligence of these models, and the models in turn are accelerators that empower scientific research. As a high-tech enterprise deeply engaged in artificial intelligence, iFLYTEK has in recent years made AI for Science (AI4S) a strategic priority and undertaken a series of initiatives. In this talk, Dr. Xin Li will provide a comprehensive overview of the iFLYTEK Spark Large Language Model and highlight its recent advances. He will outline two principal pathways through which AI accelerates scientific research, namely deep neural networks and LLM-based approaches, and present iFLYTEK's work along both lines. In addition, he will discuss key challenges and the outlook for the future development of AI for Science. Attendees will gain insights into how AI empowers scientific research and find inspiration for their own scientific fields.

Abstract:
Video processing and compression are being re-envisioned in the age of AI. Traditional video codecs, which rely on rigid, pre-defined rules, are being augmented and, in some cases, replaced by AI-driven approaches. These new methods leverage machine learning to intelligently analyze video content, allowing for more adaptive and efficient compression. We are going to discuss AOM's new codec AV2, its low- and high-level features that enable significantly smaller file sizes with no perceptible loss in quality, a crucial development for streaming and storage. The shift to AI has also transformed how we evaluate video quality. Traditional metrics, while directionally useful, don't always align with human perception, especially for user generated content (UGC). They fail to capture what's most important for machine vision tasks. We will talk about new AI-based quality metrics that are being developed. They correlate better with a human's subjective experience and a machine's ability to perform tasks like object recognition. Along the way, we'll cover large scale industrial infrastructure challenges and the ways to achieve high reliability and accuracy.

Abstract:
Monitoring Earth's surface evolution is critical for understanding environmental processes and anthropogenic impacts, necessitating precise analysis of multi-temporal remote sensing data via integrated change detection and semantic interpretation. While existing methods achieve either pixel-level localization or semantic-level description, their conventional isolated frameworks exhibit two critical limitations: 1) inability to establish multimodal semantic consistency, and 2) feature confusion under complex environmental noise. To address these challenges, we propose Change-UP, a unified learning framework that synergizes Multilevel Change Interpretation (MCI) with Large Language Models (LLMs), enabling comprehensive analysis through dual visualization-inference perspectives. The architecture comprises three key components: 1) Adaptive Adjustment Change-Aware (A2CA) captures multi-scale inter-temporal differences through pyramidal feature fusion, significantly enhancing signal-to-noise ratio of differential features. 2) Query Semantic Consistency (QSC) establishes cross-modal alignment through learnable semantic queries, improving feature discrimination on challenging samples. 3) Style Transform Diversity (STD) integrates text-driven style transform with style synergy optimization to enhance inter-modal diversity while preventing feature space collapse. Extensive experiments demonstrate state-of-the-art performance, achieving 87.02% MIoU and CIDEr-D 143.40, surpassing previous methods by 0.59% and 3.11%, respectively. The framework's novel integration of visual-linguistic modalities opens new possibilities for intelligent Earth observation systems.

Abstract:
In a world of generative models and large-scale data, it's insightful to revisit the intimate nature and unique value of personal multimedia through the lens of SenseCam - a wearable camera that automatically captures still images to create a visual diary for the wearer. Despite a variety of wearable camera products that have come to market in the decades since SenseCam was conceived, the unique properties of the original prototypes and the benefits these unlocked remain elusive today. Having explored why it has proven so challenging to move from a hardware prototype like SenseCam to a fully-fledged product with the same properties, I present a new phase of the prototyping-to-production journey I call isotyping. Isotyping shines a light on the critical, but often overlooked, steps necessary to scale from a promising prototype towards viable low-volume production. I call on the research community to help refine the concept and process of isotyping as we collectively continue to explore the potential of new forms of hardware in our research.

Abstract:
In the video anomaly detection task, to mitigate the interference of background noise on learning the appearance and motion features of foreground objects, existing object-centric methods often directly disregard scene information, making it challenging for them to detect scene-dependent anomalies. Moreover, we observe that most methods focus on reconstructing or predicting complete frame-level or object-level RGB information, which limits their inference speed. In this work, we propose a novel inter-frame RGB difference reconstruction network for efficient video anomaly detection. Specifically, we construct separate scene-dependent memory banks (SDMBs) for different scenes to store exclusive normal patterns, thus enabling sensitive detection of scene-dependent anomalies. Meanwhile, we design sparse aggregation and similarity-driven updating mechanisms for the memory items in the SDMBs, which effectively increase the reconstruction error of anomalies by adequately learning the diverse normal patterns in both the training and testing data, thus making the distinction between normal and abnormal frames easier. Extensive experiments on three public datasets demonstrate that our method outperforms state-of-the-art approaches in terms of detection accuracy, false negative rate, and inference speed, particularly in situations with more complex scene and event types.
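The idea of scoring anomalies by how poorly an inter-frame RGB difference is reconstructed from a scene-specific memory bank can be sketched as below. This is a loose illustration under several assumptions that are not from the paper: frames are flattened vectors, "sparse aggregation" is approximated by a softmax over the top-k most similar memory items, and all names and numbers are hypothetical.

```python
import math


def frame_difference(prev_frame, next_frame):
    """Inter-frame RGB difference of two flattened frames."""
    return [b - a for a, b in zip(prev_frame, next_frame)]


def read_memory(query, memory_items, top_k=2):
    """Reconstruct the query from its top-k most similar memory items.

    Assumption: dot-product similarity with softmax weights over the
    selected items stands in for the paper's sparse aggregation.
    """
    sims = [sum(q * m for q, m in zip(query, item)) for item in memory_items]
    top = sorted(range(len(memory_items)), key=lambda i: sims[i], reverse=True)[:top_k]
    peak = max(sims[i] for i in top)
    weights = [math.exp(sims[i] - peak) for i in top]
    norm = sum(weights)
    return [
        sum(w / norm * memory_items[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]


def anomaly_score(query, scene_id, banks, top_k=2):
    """Mean squared reconstruction error against the bank of the given scene."""
    recon = read_memory(query, banks[scene_id], top_k)
    return sum((q - r) ** 2 for q, r in zip(query, recon)) / len(query)


# Hypothetical bank holding two normal difference patterns for one scene.
banks = {"street": [[1.0, 0.0], [0.9, 0.1]]}
normal = frame_difference([0.0, 0.0], [1.0, 0.0])
unusual = frame_difference([1.0, 0.0], [0.0, 1.0])
normal_score = anomaly_score(normal, "street", banks)
unusual_score = anomaly_score(unusual, "street", banks)
```

Keeping one bank per scene means that a pattern stored as normal for one scene does not lower the reconstruction error in another, which is the intuition behind detecting scene-dependent anomalies.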

Abstract:
Traditional Camouflaged Object Detection (COD) methods heavily depend on labor-intensive annotated datasets which require extensive manual effort, resulting in limited generalization. While recent studies have combined Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) to achieve zero-shot COD, their performance is hindered by the modality gap between linguistic semantics and fine-grained visual cues, especially in complex camouflage scenarios. In this paper, we propose Language-to-instance generative visual Prompting (LiP), a novel framework that addresses this limitation by transforming text prompts generated by MLLMs into instance-level visual prompts through a text-to-image generative process. Specifically, we introduce a Diffusion-driven Visual Prompt Generation (DVPG) module that leverages a Stable Diffusion model to synthesize visual references, enabling robust homogeneous modality matching for COD. Additionally, we introduce an Instruction Contrastive Reasoning (ICR) module to enhance the semantic reliability of prompts by suppressing hallucinated concepts during MLLM inference. To the best of our knowledge, LiP is the first framework that utilizes a text-to-image generative model to construct instance-level visual prompts in the COD task. Extensive experiments on four benchmark datasets demonstrate the effectiveness and strong generalization ability of our approach.

Abstract:
Knowledge graphs (KGs) are widely used to store multi-source and heterogeneous structural knowledge, making federated knowledge graph completion (FedKGC) a crucial research topic. FedKGC aims to complete distributed KGs while maintaining privacy and security. Existing FedKGC methods primarily rely on uni-modal structural embedding aggregation for global knowledge sharing, which suffers from the demanding assumption that intersecting entities exist across different clients and are known by the omniscient server. Meanwhile, these uni-modal structure-only methods neglect the exploitation of client-side multi-modal information. In this paper, we propose a new framework MuCo2 to kill two birds with one stone and facilitate client-server co-design through multi-modal codebooks (MuCo). Moving beyond the traditional structure-only paradigm, we introduce multi-modal information of entities as the foundation for KGC modeling and communication. We design a MuCo-based fine-grained KGC model on the client and a MuCo-based communication mechanism on the server, which does not require entity mapping in global aggregation anymore. Comprehensive experiments demonstrate the effectiveness, generalization, reasonability, efficiency, and explainability of MuCo2.

Abstract:
In the performing arts, the interplay between visual and auditory stimuli is of substantial importance. This also applies to stage lighting and its correspondence with the music audio in live music performances. In this paper, we aim to measure cross-modal correlations between music audio signals and stage lighting signals and formalize these correlations by introducing specific metrics. Our metrics capture the temporal concurrence of musical beats and lighting changes, the correlation of loudness and brightness, and the correspondence of structural parts in music and lighting, respectively. We achieve this by relating features extracted from the audio signal with music information retrieval techniques to features extracted from the lighting control parameters. We demonstrate that the proposed metrics effectively differentiate real-world lighting signals from randomly generated signals in relation to the real-world audio signal. Moreover, we show that the metrics capture cross-modal correspondences of artistic relevance and serve to provide interesting analyses of four distinct lighting show styles. In addition, we show that our metrics are a valuable tool for evaluating generative systems that produce lighting signals from audio input and, therefore, can be crucial for developing such systems.
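Two of the correspondences named above, loudness versus brightness and beats versus lighting changes, can be illustrated with simple stand-in measures. The sketch below uses the Pearson correlation and a tolerance-based hit rate; these are assumptions chosen for illustration and not necessarily the metrics defined in the paper, and the function names and the 50 ms tolerance are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def beat_hit_rate(beat_times, change_times, tolerance=0.05):
    """Fraction of beats that have a lighting change within `tolerance` seconds."""
    if not beat_times:
        return 0.0
    hits = sum(
        1 for b in beat_times if any(abs(b - c) <= tolerance for c in change_times)
    )
    return hits / len(beat_times)


# Hypothetical per-frame loudness and brightness curves, and event times in seconds.
loudness = [0.1, 0.4, 0.8, 0.5, 0.2]
brightness = [0.2, 0.5, 0.9, 0.6, 0.3]
corr = pearson(loudness, brightness)
rate = beat_hit_rate([0.5, 1.0, 1.5, 2.0], [0.52, 1.4, 2.01])
```

Evaluating the same measures on a randomly generated lighting signal gives a baseline against which a real show can be compared, mirroring the comparison described in the abstract.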

Abstract:
To achieve audio and visual action detection in a given video accompanied by only video-level labels, a group of weakly-supervised audio-visual video parsing methods has been explored. Throughout their training processes, the action categories are typically assumed to be static, an assumption that does not always hold. Consequently, these methods cannot be employed to handle dynamic scenarios involving continuously growing novel classes. To alleviate this issue, we introduce a novel Continual Weakly-Supervised Audio-Visual Video Parsing (C-WSAVVP) task, in which maintaining the knowledge of historic categories remains the central concern. Owing to the weakly-supervised and multi-modal characteristics, two core challenges are especially pronounced in C-WSAVVP: (1) Compared with the continual audio-visual video classification task, where distilling video-level coarse action semantics of trimmed videos is sufficient for mitigating catastrophic forgetting, C-WSAVVP has to retain more fine-grained temporal semantic information of untrimmed videos containing both actions and backgrounds. (2) The semantics of different actions generally exhibit a certain degree of correlation, which is beneficial for understanding related actions, but how to maintain these semantic correlations during continual learning is unclear. To address these challenges, we explore the Semantic Prototype-based Action Refinement (SPAR) and Inter-Class Relation Topology Preservation (IRTP) modules, where the former uses various semantic prototypes to refine more reliable temporal action intervals for distillation and the latter focuses on retaining semantic correlations between different actions in both modalities during continual learning. Comprehensive experiments on our reconstructed C-LLP dataset demonstrate the effectiveness and generalization capability of our proposed method.

Abstract:
Transformer-based models have achieved dominance in point cloud analysis, yet their quadratic computational complexity remains a fundamental limitation for practical applications. Recently, RWKV has emerged as a promising alternative for sequence modeling due to its linear computational complexity. However, it has yet to be effectively adapted to handle the unordered and sparse nature of point cloud data. In this paper, we propose RWKV3D, an innovative and computationally efficient framework tailored for point cloud analysis, which is adaptable to three training strategies: training from scratch, single-modal pre-training, and cross-modal pre-training. First, we replace the MLP layer with an advanced Local Feature Mixer (LFM), which not only enhances fine-grained feature extraction but also reduces the number of parameters. Second, we introduce a Bidirectional Multi-head Shift (BMS) mechanism to expand the receptive field, effectively capturing richer contextual information. Additionally, to enhance high-level feature processing, we strategically incorporate a Multi-head Self-Attention (MSA) block before the first RWKV3D block. Experimental results demonstrate that RWKV3D outperforms Transformer-based and Mamba-based methods while maintaining lower parameter counts and computational costs. Notably, it achieves several state-of-the-art results, including overall accuracies of 95.3% (training from scratch) and 95.9% (cross-modal pre-training) on the ModelNet40 dataset, as well as 95.28% (single-modal pre-training) on the ScanObjectNN (PB_T50_RS) dataset. These results underscore the superior efficacy of the RWKV architecture in 3D vision tasks and highlight its potential for broader multimodal learning scenarios.

Abstract:
Unsupervised pre-training on large-scale datasets has demonstrated significant potential for improving the sample efficiency and performance of Reinforcement Learning (RL). Given the large-scale action-free internet videos, existing methods utilize single-step transition prediction and image reconstruction to learn representations. However, these methods prefer to preserve large-proportion stationary information in the pixel space, neglecting small but crucial information. To preserve enough information in the representation, it is essential to pay equal attention to each element in videos. Specifically, we propose a temporal correlation space to distinguish each element. For implementation, we introduce the Multi-scale Temporal Contrastive Learning (MTCL) method to model multi-scale temporal correlations separately. This approach can balance the attention of different elements and yield more informative representations, effectively supporting policy learning in various downstream tasks. Experimental results demonstrate that our method improves sample efficiency and asymptotic performance across various downstream tasks.

Abstract:
Multimodal Large Language Models (MLLMs) possess extensive knowledge and strong reasoning capabilities, achieving remarkable performance in knowledge-based visual question answering, significantly surpassing traditional small-scale Vision-Language Models (VLMs). However, the distinct training paradigms of MLLMs and small-scale VLMs result in misaligned feature representation spaces and divergent answer prediction distributions. To bridge this gap, we propose a novel end-to-end large-small model synergy framework, where small VLMs and MLLMs collaborate via synergistic optimization of shared objectives while maintaining their co-evolving complementary specializations. Specifically, multimodal fine-grained heuristics are extracted from well-tuned small VLMs and subsequently projected into the textual space of MLLMs through dedicated visual and textual collaboration modules. This enables cross-modal guidance for both visual and textual inputs. Finally, a dual-objective synergy loss promotes alignment toward shared goals, while a visual discrepancy loss preserves specialization diversity. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on both the OK-VQA and A-OKVQA benchmarks.

Abstract:
Fetal MRI is often acquired with thick slices to mitigate motion artifacts, but this leads to partial volume effects and reduced through-plane spatial resolution, limiting precise anatomical analysis. To address this, various super-resolution methods have been proposed to reconstruct high-resolution volumes from thick-slice scans. Current methods face several major challenges: 1) relying on multi-stack paired data makes arbitrary super-resolution ratios difficult to achieve; 2) lacking robustness against voxel coordinate misalignment caused by partial volume effects; 3) failing to fully utilize the high in-plane resolution of MRI images. To address these issues, we propose a dual-consistency guided curriculum learning method based on implicit neural representation, which uses single-stack inputs to achieve arbitrary super-resolution. We introduce progressive consistency and volumetric consistency to mitigate voxel misalignment caused by partial volume effects and ensure smooth transitions during the model's curriculum-based training. Additionally, we design a curriculum-aware multi-scale feature interaction block to fully leverage thick-slice MRI's high in-plane resolution. Comprehensive evaluations on three fetal MRI datasets demonstrate SOTA performance, with particularly outstanding results in high-ratio super-resolution tasks.

Abstract:
Self-supervised 3D hand pose estimation methods can leverage labeled synthetic data along with unlabeled real-world data for model training, thereby alleviating the reliance on large-scale annotated datasets. Multi-view information fusion is a key factor in the success of these methods. Rule-based fixed fusion methods are simple, efficient, and generalizable, but they neglect the rich visual information in each view. Neural network-based learnable fusion methods can effectively model both intra- and inter-view semantic context, but they tend to overfit to the domain-specific features of synthetic data and are susceptible to interference from domain gaps. In this paper, we decompose multi-view fusion into two components: a learnable confidence estimation stage and a fixed confidence fusion stage. This design not only enables effective use of multi-view semantic cues but also ensures strong cross-domain generalization. To achieve accurate and robust confidence estimation, our method jointly exploits both multi-view pose consistency and pose-to-data consistency. Experiments on three public datasets demonstrate that our approach significantly outperforms existing state-of-the-art self-supervised 3D hand pose estimation methods.
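The fixed confidence fusion stage can be pictured as a rule with no trainable parameters that combines per-view estimates once confidences are available. The sketch below assumes a confidence-weighted average of per-view 3D joint positions that are already expressed in a common coordinate frame; the fusion rule, the function name, and the values are illustrative assumptions rather than the paper's exact formulation.

```python
def fuse_joint(predictions, confidences):
    """Confidence-weighted average of per-view 3D estimates of one joint.

    predictions: list of (x, y, z) tuples, one per view, in a shared frame.
    confidences: list of non-negative weights, one per view.
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one view needs a positive confidence")
    return tuple(
        sum(c * p[axis] for p, c in zip(predictions, confidences)) / total
        for axis in range(3)
    )


# Hypothetical example: the second view is trusted three times as much as the first.
fused = fuse_joint([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [1.0, 3.0])
```

Since the fusion rule itself is fixed, any domain-specific overfitting can only enter through the confidence estimates, which is consistent with the generalization argument made in the abstract.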

Abstract:
Although object counting based on two-dimensional (2D) RGB images offers an effective solution in certain scenarios, it is significantly challenged by complex environments characterized by background noise, occlusion, depth variations, illumination changes, and other factors, often resulting in miscounting or missed detections. This issue is particularly pronounced in applications such as agriculture or retail, where fruits or products are frequently stacked in layers on shelves, severely compromising counting accuracy. To address these limitations, we propose TrueCount, a novel method that integrates segmented images, depth information, and other multi-modal data to enhance counting accuracy and robustness of pretrained large vision-language models (VLMs). TrueCount introduces a flexible framework capable of simultaneously processing multiple modal signals, including 2D RGB images, segmentation, and depth maps, while supporting both textual and visual prompts. During training and inference, TrueCount performs cross-attention and self-attention across all inputs and prompts. These features are then decoded to localize prompt features within the input, thereby jointly optimizing the model's counting capability. Additionally, TrueCount dynamically assesses the confidence of each modality for accurate counting in a given context, enabling effective fusion and complementary utilization of multi-modal information. Extensive experiments on multiple benchmarks including FSC-147 and CountBench demonstrate that TrueCount surpasses the previous state-of-the-art, for example achieving a new lowest mean absolute error (MAE) of 4.64 on FSC-147.

Abstract:
Generating interactive motion from text has garnered significant attention in recent years. While text inputs offer greater flexibility, many practical applications need to impose strict, controllable constraints on the motion range or trajectory of virtual characters. However, existing trajectory-based methods are designed for single-actor scenarios and do not support interaction between characters. Moreover, text-only methods struggle to accurately convey user-intended trajectories. The distribution shift between training and inference often leads to trajectory deviation and physical interpenetration. To address these issues, we introduce two key concepts: (1) Lead-Follow Paradigm: Inspired by role allocation in partner dancing, we decompose complex interactive motion tasks into a Lead-Follow paradigm. The leader's path is optimized first, and the follower's motion is subsequently adjusted for coherence and alignment. (2) Trajectory Guidance: We highlight the pivotal role of 3D trajectory guidance in interactive motion generation and in accurately reflecting user intentions. Through 3D trajectory control, we can generate the desired motion more controllably while avoiding physical interpenetration. In addition, we investigate the refinement of motion scopes for interactive agents and propose an effective optimization strategy to enhance motion coherence and controllability. Experimental results show that the proposed approach, by using trajectories more effectively, outperforms existing methods in both realism and accuracy.

Abstract:
The development of computer-aided design (CAD) generation models has significantly enhanced design efficiency, facilitating innovation and transformation in the design industry. Existing methods typically require users to input prompts in a specific format, such as text descriptions or images, limiting their broader application in diverse scenarios. To address this limitation, we introduce FreeCAD, a user-friendly CAD generation framework that supports free-form inputs, including text descriptions and/or images, enabling users to express their design intentions more flexibly. Specifically, we propose a Large Language Model (LLM)-based Text Translator, which effectively increases the success rate of generating CAD models by converting users' diversified requests for the same object into a unified expression. Additionally, the Multi-View Representation Fusion (MVRF) module enables the network to capture richer interaction information across views, facilitating the generation of more fine-grained CAD models. To support the training of FreeCAD, we construct a multimodal dataset RealCAD, comprising text, image, and CAD triplets, where the images are derived from the 3D printed products of CAD models. Extensive experiments demonstrate that FreeCAD consistently outperforms the existing state-of-the-art (SOTA) methods in multiple tasks.

Abstract:
Salient object detection (SOD) in light field data presents unique challenges due to dynamic semantic inconsistencies across focal slices and representation heterogeneity between focal slices and the all-focus image. Existing methods often treat focal slices uniformly or rely on simple fusion strategies, which fail to address focus-induced semantic drift and cross-modal feature misalignment. To tackle these issues, we propose LFMamba, a unified network that jointly models dynamic semantic consistency and adaptive cross-modal fusion. We design the Focal-aware State Space Module (FSSM), which generates focal-aware semantic prompts through low-rank decomposition and adaptively routes them according to focal plane indices, thereby enabling bidirectional semantic propagation across slices through non-causal state transitions. Furthermore, we introduce the Focal-guided Cross-modal Fusion Module (FCFM), which mitigates cross-modal heterogeneity by a two-stage hierarchical strategy, combining structure-aware low-level alignment and gated high-level semantic fusion. Extensive experiments on four public light field SOD benchmarks demonstrate that LFMamba achieves superior performance compared to state-of-the-art methods, with improved robustness and consistency under complex focal variation scenarios.

Abstract:
Traditional Mongolian script recognition poses unique challenges due to its vertical layout, complex morphology, and the coexistence of visually distinct writing styles, such as standard printed (White) and cursive calligraphic (Hawang) forms. Existing approaches typically rely on style-specific models, leading to limited generalization and increased computational cost. In this paper, we propose UniMTR, a unified and lightweight framework for dual-style Traditional Mongolian word recognition. UniMTR distills knowledge from two expert teacher networks into a compact student model through a novel contrastive distillation strategy. This strategy leverages cross-style positive pair construction, hard negative mining, and uncertainty-aware loss weighting to bridge the style gap. We also introduce a new glyph-code encoding scheme that captures context-dependent visual variants beyond Unicode representation. Experiments on the newly constructed benchmark MTR-Mix demonstrate that UniMTR outperforms state-of-the-art baselines, achieving 13.6% CER and 14.2% average style-wise CER, while reducing model size by over 75% and enabling real-time inference at 320 FPS. Our approach offers a practical and scalable solution for style-robust script recognition in resource-constrained scenarios.
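
The character error rate (CER) quoted above is conventionally the edit distance between the predicted and reference strings divided by the reference length. The following is a minimal sketch of that standard definition; the function names and example strings are illustrative and are not taken from the UniMTR evaluation code, which may normalize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only (not Mongolian glyph codes).
example_cer = cer("kitten", "sitting")
```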

Abstract:
Partially view-aligned clustering (PVC) has emerged as a critical area in multi-view clustering, addressing the inherent instance misalignment across views during data collection. The primary challenge of PVC is accurately establishing correspondences between cross-view samples. The Banzhaf index in cooperative game theory serves as an effective tool for modeling complex relationships between multi-view samples by quantifying the marginal contributions of coalition members to collaborative benefits. To this end, we propose a Banzhaf Index-driven cross-view aligNment method, dubbed BIN, which systematically evaluates each view sample's contribution to joint decision-making within a game-theoretic framework. This approach overcomes the limitations of existing PVC methods reliant on prior alignment information and enhances the robustness of multi-view matching. Specifically, we model multi-view samples as players in a cooperative game and quantify their interactions using a payoff model. Simultaneously, we propose a dual-loss constraint: (1) Banzhaf gain loss, which dynamically captures the marginal contribution of key cross-view sample pairs to reinforce associations; (2) contrast loss, which applies exclusion constraints in the feature space to suppress interference from weakly correlated samples. Together, these losses form an effective optimization mechanism. This game-theoretic approach adaptively learns sample correspondences without pre-alignment and ensures robust matching in complex misalignment scenarios. Extensive experiments demonstrate that our method achieves competitive performance against eight state-of-the-art PVC algorithms.
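
For readers unfamiliar with the Banzhaf index, the sketch below computes the standard (non-normalized) Banzhaf value, that is, each player's marginal contribution averaged over all coalitions of the other players. The toy payoff function and player names are hypothetical; the abstract does not specify the payoff model that BIN uses.

```python
from itertools import combinations


def banzhaf_values(players, payoff):
    """Banzhaf value per player: mean marginal contribution over all coalitions of the others."""
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += payoff(s | {p}) - payoff(s)
        values[p] = total / (2 ** len(others))
    return values


# Hypothetical toy game: a coalition pays off only if it contains both "a" and "b".
def toy_payoff(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0


values = banzhaf_values(["a", "b", "c"], toy_payoff)
```

In this toy game, "a" and "b" are each pivotal in half of the coalitions of the others, while "c" never changes the payoff. Exact enumeration is exponential in the number of players, so practical systems typically rely on sampling or restricted coalitions.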

Abstract:
Open-vocabulary object detection aims to detect and recognize novel categories that are not seen in the training set. Most existing methods rely on the Region Proposal Network (RPN) to extract regions of interest and align them with textual descriptions through one-to-one region-word alignment, such as bipartite matching. However, these methods encounter three major challenges: 1) Insufficient novel proposals: RPN tends to generate high-confidence proposals for base categories, but low-confidence ones for novel categories. 2) Missing matches for duplicate instances: Bipartite matching struggles to handle multiple instances of the same category in an image. 3) Inference bias: During inference, classifiers are often biased toward seen categories with lower prediction scores for novel categories. To address these challenges, we propose a SAM-based region-word clustering and inference score adjusting model for open-vocabulary object detection (coined CADet). Specifically, our method consists of three components: 1) Enhanced proposal generation: To ensure sufficient proposals for novel categories, we incorporate an unsupervised localization SAM, which generates more comprehensive proposals covering both base and novel categories. 2) Region-word clustering: To mine more matching samples, we cluster similar proposals derived from bipartite matching and assign them the same pseudo-labels. 3) Score adjusting: We introduce a similarity-guided score adjustment strategy to effectively mitigate classifier bias against novel categories during inference. Extensive experiments on two datasets demonstrate the superior performance of our approach, achieving 36.4% mAP on COCO and 29.6% mask mAP on LVIS, outperforming existing methods on novel categories.

Abstract:
The domain gap between pretraining data (e.g., ImageNet, LUPerson) and downstream ReID datasets often leads to suboptimal performance when directly fine-tuning pretrained models. While existing methods attempt to bridge this gap by incorporating additional modalities (e.g., text, 3D data) or visual cues (e.g., pose, body masks), these approaches introduce two key limitations: (1) they may distract the model with irrelevant factors like background clutter or clothing variations, and (2) they inevitably increase computational overhead during inference. To address these issues, we propose the Weak Saliency Feedback Transformer (WSFFormer), inspired by the feedback mechanisms in biological visual systems. Unlike traditional one-way feature propagation, WSFFormer employs an adaptive feedback loop during training to enhance low-response regions, enabling the model to capture richer and more discriminative features. The WSFFormer introduces three key components: (1) The Lateral Feedback Module (LFM) mimics retinal lateral inhibition by adaptively suppressing high-response regions and amplifying weak discriminative features, forcing attention on subtle details; (2) The Progressive Feedback Module (PFM) refines feedback through deep-to-shallow closed-loop propagation, blending high-level semantics with spatial details; (3) The Feedback Sensitive Entropy Loss (FSE Loss) optimizes target-domain adaptation by quantifying divergence between forward and feedback-corrected features. Experiments on holistic/occluded ReID benchmarks show WSFFormer outperforms ViT/Swin-based SOTA methods without extra inference cost.

Abstract:
Semi-supervised Video Object Segmentation (VOS) aims to segment a user-specified object across all frames of a video using only the first frame's annotated mask. A key challenge in VOS is simultaneously preserving object identity and maintaining precise segmentation in dynamic scenes, especially during rapid motion. Many existing methods use previous frame masks as positional constraints, hindering segmentation of newly exposed regions, an issue known as over-suppression. To address these challenges, we propose FlowTrack, integrating two main components: an Adjacent-frame Motion Tracker (AMT) and an Adaptive Motion Predictor (AMP). AMT explicitly captures motion and positional information between adjacent frames and fuses it with historical target cues, improving segmentation constraint robustness and helping maintain stable object identity during rapid motion or significant deformation. However, relying primarily on historical masks and frame-level constraints may fail to accurately predict sudden changes in motion states, which still leads to over-suppression. To overcome this limitation, AMP predicts future states from historical motion data. Employing a learned state predictor and a Kalman-inspired recursive measurement fusion, AMP adapts to complex and abrupt motion changes. This dynamic prediction-update scheme refines segmentation boundaries, compensates for historical constraints, and effectively mitigates over-suppression. Experimental results on standard VOS benchmarks validate the effectiveness of the proposed FlowTrack framework in handling challenging dynamic scenes involving rapid motion and addressing the over-suppression issue.
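
As background for the "Kalman-inspired recursive measurement fusion" mentioned above, the sketch below shows the measurement-update step of a textbook scalar Kalman filter. It is not the AMP module: in FlowTrack the predictor is learned, whereas here the predicted state, its variance, and the measurement noise are plain numbers chosen for illustration.

```python
def kalman_update(x_pred, p_pred, z, r):
    """Fuse a predicted state with a measurement (scalar Kalman update).

    x_pred: predicted state, p_pred: its variance,
    z: measurement, r: measurement-noise variance.
    """
    k = p_pred / (p_pred + r)           # Kalman gain: how much to trust the measurement
    x_new = x_pred + k * (z - x_pred)   # corrected state
    p_new = (1.0 - k) * p_pred          # reduced uncertainty after fusion
    return x_new, p_new, k


# Illustrative numbers: equal trust in prediction and measurement.
x, p, k = kalman_update(x_pred=0.0, p_pred=1.0, z=2.0, r=1.0)
```

With equal variances the gain is 0.5, so the fused estimate lands halfway between prediction and measurement. Applying the update recursively, frame after frame, is what makes the fusion "recursive".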

Abstract:
Medical Visual Question Answering (Medical VQA) plays an important role in medical informatics. However, the robustness of existing medical VQA models is severely challenged by adversarial attacks. Current methods (e.g. adversarial training and noise-based reasoning) heavily rely on additional data or complex procedures and often ignore model-level robustness. To address these issues, we propose Multimodal Variational Masked Autoencoder (MVMAE), a novel pre-training framework designed to enhance the robustness of the medical VQA task. MVMAE leverages masked modeling and variational inference to extract robust multimodal features. The framework introduces a low-cost multimodal bottleneck fusion module and employs reparameterization to sample robust latent representations, ensuring effective feature fusion and reconstruction. Extensive experiments on public medical VQA datasets demonstrate that MVMAE significantly improves resistance to various adversarial attacks and outperforms other medical multimodal pre-training methods.

Abstract:
Vision-Language Models (VLMs) have achieved significant advances across various downstream tasks. However, as their performance improves, the increasing number of parameters results in slower prefilling speeds and longer inference times. To overcome these limitations, and observing that most VLMs do not require a large number of image tokens for inference, we propose BOLT (Basis-Oriented Lightweight Token-Trimming), a training-free and cross-attention-free token compression method. Unlike existing approaches, BOLT addresses the challenge of insufficient visual cues in textual prompts by leveraging token internal data distributions. We categorize tokens into three types: key tokens, proxy tokens, and remaining tokens. Then, by applying basis space similarity, we merge and filter the remaining tokens with the proxy tokens to retain the most informative ones. To account for the differences in VLM architectures and model sizes, we evaluate BOLT on LLaVA-Next-Llama3 and LLaVA-1.5 (7B and 13B). Our results show that BOLT achieves state-of-the-art performance, with a 90% token compression ratio leading to a 3.3× increase in prefilling speed and a 1.5× improvement in inference speed, outperforming other methods.

Abstract:
Federated Prompt Learning (FPL) efficiently alleviates data heterogeneity and reduces communication costs by introducing pre-trained models and prompt tuning. However, local prompts tend to favor diverse features and ignore public features captured under extreme data heterogeneity, which compromises the generalization ability of the global prompt by only aggregating local prompts. To address this challenge, we present Federated Prompt Learning with Gradient Rectification (FedGR), which modifies the gradient directions of local and global prompts to enhance the generalization of the global prompt. Specifically, we first introduce a zero-shot prompt as public knowledge and constrain the gradient of local prompts to consistently deviate from the public feature space to capture diverse features adequately. Then, we compute the angular bisector of local and zero-shot prompt gradients and replace the gradient of the global prompt with the gradient of the angular bisector to capture both diverse features and public features. Finally, the server-side global prompt can enhance generalization by aggregating all client-side global prompts. Extensive experiments with various types of heterogeneities have demonstrated that our FedGR outperforms the state-of-the-art methods.
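
The angular bisector of two gradient vectors can be obtained by normalizing each vector and summing the results. The sketch below illustrates only that geometric step under stated assumptions: the gradients are plain Python lists, the function name is hypothetical, and how FedGR scales or applies the bisector is not specified in the abstract.

```python
import math


def angular_bisector(g1, g2, eps=1e-12):
    """Direction bisecting the angle between g1 and g2 (sum of their unit vectors)."""
    n1 = math.sqrt(sum(x * x for x in g1))
    n2 = math.sqrt(sum(y * y for y in g2))
    if n1 < eps or n2 < eps:
        raise ValueError("gradients must be non-zero")
    return [x / n1 + y / n2 for x, y in zip(g1, g2)]


def cosine(u, v):
    """Cosine similarity, used here only to check the bisector property."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Illustrative gradients with different magnitudes.
local_grad = [2.0, 0.0]
zero_shot_grad = [0.0, 5.0]
bisector = angular_bisector(local_grad, zero_shot_grad)
```

The bisector makes equal angles with both inputs regardless of their magnitudes. When the two gradients point in exactly opposite directions the sum degenerates to the zero vector, a case that a real implementation would need to handle.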

Abstract:
Text-Based Person Search, which aims to retrieve target pedestrian images using natural language descriptions, has garnered significant attention in multimedia research due to its potential in suspect retrieval and missing person identification. While supervised and weakly supervised methods rely on costly annotated training data, unsupervised TBPS eliminates the need for textual descriptions or identity annotations, presenting a more practical paradigm. Current unsupervised TBPS approaches face two primary challenges: 1) Predefined attribute templates for caption generation limit linguistic diversity and real-world adaptability, and 2) Threshold-based sample selection using pre-trained vision-language models (VLMs) introduces noisy pairs due to inadequate pedestrian-specific representation. To address these limitations, we propose FACE, a unified framework featuring Dual-template Caption Generation (DCG) and Adaptive Curriculum Training (ACT). The DCG module generates high-quality captions through complementary flexible-style (natural language) and fixed-style (attribute-enumerated) templates, enhanced by LLM-based noise filtering. The ACT framework progressively refines training through a self-improving loop: initial high-confidence sample selection using VLMs bootstraps the model, while evolving feature representations enable dynamic incorporation of harder samples through curriculum learning. This dual strategy achieves mutual reinforcement between caption quality and model discriminability. Extensive experiments on CUHK-PEDES, ICFG-PEDES and RSTPReid datasets under unsupervised settings demonstrate that our framework achieves the state-of-the-art performance.

Abstract:
To fully leverage diverse scene representations for visual relocalization, we propose a novel localization framework that systematically establishes inter-frame relationships and integrates multiple feature modalities. Our localization pipeline comprises three key stages: initial pose estimation using local point cloud structure, pose refinement by hand-crafted features and 3D Gaussians, and pose confidence estimation through a learned global representation. Specifically, the initial stage begins with aligning a known source point cloud to a predicted local Target Point Cloud (TPC) using a registration algorithm. For pose refinement, we introduce the Hybrid Feature Grid (HFG), which fuses hand-crafted points and 3D Gaussians to enrich texture cues. To assess pose reliability, we propose the learned Weighted Global Point Cloud (WGPC), aggregating multi-frame information to enhance confidence estimation. To jointly learn TPC, HFG, and WGPC, we design a Siamese Localization Network (SiaLocNet) featuring three core innovations: trajectory-based feature learning to address the limitation of single-view inputs, a feature fusion module to facilitate the construction of the three core structures, and an inverse self Chamfer Distance along with a shape-aware term to improve the robustness of WGPC. Extensive experiments on the 7 Scenes and Cambridge Landmarks datasets demonstrate that our method achieves state-of-the-art performance across both indoor and outdoor environments.

Abstract:
Radiology report generation (RRG), intended to automatically generate a coherent free-text report describing the clinical observations of a radiograph, has been attracting increasing attention from researchers. In recent years, the Transformer-based encoder-decoder architecture has been adopted by most existing methods. However, they neglect the structural rationality issue when applying this single-modal architecture to the multi-modal RRG task, where information can only flow from visual features to textual features, but not in the opposite direction. This information asymmetry results in visual features having no knowledge of the textual features, sending out all visual information, including a large amount of heterogeneous noise. Consequently, this introduces significant resistance to the downstream decoder, which substantially limits or even harms the generation process. To tackle this problem, we present a method where a cross-counter-repeat attention is developed to integrate useful information from two separate modalities, and a memory-driven visual semantics enhancing module is designed to reinforce the visual features with strong time-ordered semantic information. Experimental results on the widely-used IU-Xray dataset show that our approach achieves the state-of-the-art performance, with a remarkable 6.9% improvement in BLEU-4 score. Further analyses also demonstrate that our method can generate sufficiently comprehensive reports to assist radiologists in their clinical decision-making.

Abstract:
In recent years, despite substantial advancements in large vision-language models (LVLMs), they still encounter the issue of "hallucinations", where generated results appear reasonable but often deviate from the visual input or actual facts. In contrast, the human cognitive system, when processing visual input, initially relies on visual perception to distinguish between the salient region and non-salient region, integrating relevant information. Subsequently, it recalls pertinent memory details, ultimately generating a comprehensive cognitive outcome. Inspired by this process, we propose a novel, training-free decoding approach, dubbed Multi-Path Information Contrastive Decoding (MPI-CD). Specifically, to simulate the human information integration process, we design a three-branch structure called the Tri-Branch Integrator (TBI), which contrasts the original, salient region, and non-salient region images to effectively improve the reliability of the LVLMs' output. Furthermore, to mimic the human memory recall mechanism, we investigate the importance of hidden layer features and propose the Memory Recall Module (MRM). This module adaptively extracts meaningful memory information from the hidden layers and incorporates it into the decoding process, thereby effectively alleviating the hallucination issue. We conduct extensive experiments on three widely used benchmarks (POPE, AMBER, and MME) using two classic LVLMs. The experimental results demonstrate that our MPI-CD significantly mitigates hallucinations in LVLMs without requiring additional training.

Abstract:
Monocular depth estimation stands as a fundamental pursuit in computer vision. Recently, some methods have attempted to introduce the text-to-image diffusion model into the domain of monocular depth estimation and achieved impressive results. However, these methods typically employ pre-defined templates as text prompts to guide the learning of denoising networks, resulting in limited flexibility and scalability. In this paper, we propose OGDepth, a diffusion-based monocular depth estimation network with object prompts generated by taking advantage of the object detection information from the scene. Specifically, we design an Object Prompt Module (OPM) to encode the object detection information into prompts that are more closely aligned with the image content, offering richer contextual information while circumventing the monotony and redundancy inherent in template-generated prompts. Moreover, we employ bounding box information for each object to filter and localize objects, enabling the model to grasp relative positional information within the scene. This facilitates the creation of a more precise depth map. Additionally, we design a Global-Local Interaction Decoder (GLID) to facilitate the mutual exchange of features at different scales, enabling efficient feature fusion. Our approach underwent rigorous experiments across multiple datasets, with results showcasing its state-of-the-art performance. Notably, on the KITTI dataset, our model achieves an RMSE of 1.967 and a REL of 0.047, and both metrics are the best among all compared methods. On the NYU Depth V2 dataset, our method achieves an RMSE score of 0.221, representing a notable 12.9% enhancement compared to the baseline method (VPD).
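
RMSE and REL (mean absolute relative error) are standard depth-estimation metrics. The sketch below shows their usual definitions over flattened depth values; the input numbers are made up for illustration and are unrelated to the KITTI or NYU Depth V2 results reported above.

```python
import math


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def abs_rel(pred, gt):
    """Mean absolute relative error: |pred - gt| / gt, averaged over valid pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


# Made-up depths in metres; ground-truth values must be positive.
pred_depth = [2.0, 4.0]
gt_depth = [1.0, 5.0]
example_rmse = rmse(pred_depth, gt_depth)
example_rel = abs_rel(pred_depth, gt_depth)
```

REL weights errors by the true depth, so the same absolute error counts more for nearby pixels than for distant ones; RMSE treats all pixels equally and is dominated by large errors.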

Abstract:
Continual Generalized Category Discovery (C-GCD) aims to incrementally identify both known and novel classes from unlabeled data streams while preserving previously acquired knowledge. However, current approaches face a critical limitation we term unstructured knowledge interference, which arises when unconstrained parameter updates entangle discriminative representations across classes, severely contaminating the feature space and introducing significant transfer and bias risks. To address these challenges, we propose the Tree of Prompts (ToP), a novel hierarchical prompting framework that facilitates structured knowledge adaptation through multi-granular parameter regulation. ToP hierarchically integrates three synergistic components: (1) Stage-level prompts preserve historical knowledge by isolating task-specific parameters, thereby mitigating conflicts between incremental tasks; (2) Centroid-level prompts disentangle category semantics through learnable prototype calibration, sharpening decision boundaries in the feature space; and (3) Context-level prompts dynamically capture discriminative local features to suppress contamination from superficial similarities. Experimental results demonstrate that ToP markedly outperforms existing methods and provides a comprehensive and efficient solution for C-GCD.

Abstract:
Accurate assessment of food nutrition is essential for promoting healthy eating habits. While recent deep learning approaches have enhanced vision-based nutritional estimation through RGB-D multi-modal fusion, they often overlook fine-grained surface components (e.g., oil and sugar) that significantly influence nutritional values. Some recent approaches have improved accuracy by incorporating ingredient data, but their reliance on such input during inference limits practical applicability, as ingredient details are often unavailable in real-world settings. To address this limitation, we propose DSDGF-Nutri, a novel Decoupled Self-Distillation network with Gating Fusion for food Nutritional assessment. Our method leverages ingredient knowledge during training but relies solely on RGB-D inputs at inference. Specifically, DSDGF-Nutri introduces: (1) a self-distillation mechanism with gating fusion that transfers ingredient-aware features to the RGB-D network, enabling robust prediction without test-time ingredient input, and (2) a multi-task decoupling architecture with task-specific decoders to minimize cross-task interference. Extensive evaluations on two benchmark datasets demonstrate that DSDGF-Nutri outperforms existing methods, achieving state-of-the-art results. This work establishes a new paradigm of multimodal fusion in nutritional assessment by unifying scientific measurements with scalable computer vision applications.

Abstract:
CLIP has been widely adopted in affective computing for its strong vision-language representation capabilities. However, it fails to accurately distinguish visually similar yet label-distinct facial expressions. This limitation is rooted in CLIP's encoding paradigm and large-scale contrastive pretraining, which bias the model toward focusing primarily on globally salient visual features and aligning them with broad semantic concepts. Such alignment overlooks subtle facial variations and induces representational shortcuts, where emotionally distinct categories are projected into overlapping regions of the shared semantic space. This semantic entanglement severely compromises the model's ability to preserve emotional separability. We propose LES-CLIP, a Lightweight and Emotion-Sensitive framework that adapts CLIP for precise discrimination of similar emotions. LES-CLIP achieves fine-grained emotional sensitivity using only simple text prompts and facial images. It introduces three novel components: 1) an Emotion-Sensitive Adaptive Mixture-of-Experts, which pre-adapts representations for subtle expression discrimination; 2) a Prompt-Guided Emotion Discrimination module that activates CLIP's visual sensitivity to fine-grained facial cues; and 3) a LES hybrid loss that guides contrastive learning toward accurate emotion-label alignment. Extensive experiments demonstrate that LES-CLIP achieves state-of-the-art performance, reaching 70.18% on the 8-class AffectNet dataset. Moreover, it converges faster and requires significantly fewer parameters.

Abstract:
Previous multimodal emotion-cause analysis in conversations (MEC-AC) has predominantly focused on English, overlooking the applicability of existing methods in multilingual contexts. To bridge this gap, we construct a Chinese contextual dataset (MEC4) to investigate how linguistic and cultural diversity influences existing MEC-AC approaches. Moreover, prior studies often rely on average pooling or frame sampling to extract visual and acoustic features from video and audio of long dialogues, which inevitably results in the loss of temporal dynamics and emotionally salient cues. To overcome these limitations, we propose a memory-inspired multilingual multimodal framework (M3F) based on a large language model (LLM), which can effectively capture the temporal and global informative features of non-linguistic modalities through a memory bank module. This module simulates the way memory is stored in human cognitive processes and incrementally aggregates past visual and acoustic features in an autoregressive manner, enabling effective reference during future sequence modeling. Through rigorous experiments and insightful analyses, we find that cultural differences cause variations in how emotional expressions in English and Chinese rely on modalities.

Abstract:
Visual emotion recognition (VER), which aims to understand human emotional reactions to different visual stimuli, has garnered increasing attention. However, the inherent ambiguity of emotional features presents significant challenges for data annotation in supervised learning paradigms. To address this limitation, emotion domain adaptation (EDA) facilitates knowledge transfer from labeled source domains to unlabeled target domains. Recently, large visual-language models such as CLIP have demonstrated impressive transfer performance on traditional UDA tasks. However, when generalizing to more abstract concepts such as emotion, the misalignment between CLIP and emotion spaces greatly affects the model performance. To address these challenges, we propose a CLIP-based emotion disentanglement (EmoD) framework designed for EDA. Leveraging perspectives from information bottleneck theory, EmoD implements a disentangler network that extracts emotion-specific features while removing redundant emotion-agnostic information. It also incorporates cross-domain feature alignment to reduce the affective gap between domains. Experimental evaluations in six EDA settings demonstrate that EmoD achieves state-of-the-art performance, surpassing traditional CLIP-based UDA methods by an average of 2.53%.

Abstract:
Multimodal hashing stands as an efficient approach for multimodal retrieval, yet it frequently grapples with the challenge of misaligned representation spaces across different modalities. This misalignment can degrade the consistency and discrimination of multimodal representations, complicating the learning of effective representations for image and text pairs. Particularly, the task becomes arduous when the system must handle incomplete data while ensuring accurate and relevant retrieval outcomes. To address these challenges, we propose the Asymmetric Pre-aligned Anchor Contrastive Enhanced Diffusion Hashing Model (AADH) for Incomplete Multimodal Retrieval. Our model is specifically tailored to robustly manage multimodal incomplete data scenarios. Initially, we develop an Asymmetric Pre-alignment Strategy that utilizes asymmetric contrastive learning to preliminarily align the semantic disparities between various modalities. Subsequently, we propose an innovative Anchor Contrastive Reinforcement Diffusion Hashing Model, which integrates image and text modalities to varying extents during the reverse diffusion process. It constructs an anchor space that not only facilitates the learning of incomplete multimodal hashing representations through anchor contrastive learning but also leverages inter-modal and intra-modal contrastive learning to enhance the representations. Moreover, we effectively bridge the modality gap between different modal hash codes by employing the anchor space to constrain the representations of different modal hashes. By adjusting the initial noise of the diffusion model, we indirectly expand the data volume, which in turn bolsters the model's robustness. Our extensive experimental results across multiple datasets demonstrate that the proposed AADH model achieves state-of-the-art (SOTA) results.

Abstract:
Due to the high cost and small scale of Image Quality Assessment (IQA) datasets, achieving robust generalization remains challenging for prevalent Blind IQA (BIQA) methods. Traditional deep learning-based methods emphasize visual information to capture quality features, while recent developments in Vision-Language Models (VLMs) demonstrate strong potential in learning generalizable representations through textual information. However, applying VLMs to BIQA poses three major challenges: (1) How to make full use of the multi-modal information. (2) The prompt engineering for appropriate quality description is extremely time-consuming. (3) How to use mixed data for joint training to enhance the generalization of a VLM-based BIQA model. To this end, we propose a Multi-modal BIQA method with prompt learning, named MMP-IQA. For (1), we propose a conditional fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, our model can capture quality information with a stronger representation ability. For (2), we model the quality prompt's context words with learnable vectors during the training process, which can be adaptively updated for superior performance. For (3), we jointly train a linearity-induced quality evaluator, a relative quality evaluator, and a dataset-specific absolute quality evaluator. In addition, we propose a dual automatic weight adjustment strategy to adaptively balance the loss weights between different datasets and among various losses within the same dataset. Extensive experiments illustrate the superior effectiveness of MMP-IQA.

Abstract:
Satellite-related commonsense data is proliferating on the internet. Traditional analytical methods struggle to integrate this data and to effectively uncover the implicit knowledge within it. Moreover, current deep learning and LLM-based approaches often struggle with errors and hallucinations when performing multi-hop reasoning in domain-specific contexts. To address these limitations, we propose a novel multi-hop reasoning framework for implicit commonsense mining. This framework aims to uncover the underlying meta-knowledge behind reasoning problems, thereby providing enhanced interpretability of the reasoning process. Specifically, we design an extraction-retrieval-principle multi-step reasoning method that generates different levels of meta-knowledge in stages to support the reasoning process effectively. We further design a mixture-of-experts knowledge graph construction method to build a satellite knowledge graph that supports multi-hop reasoning. Experimental results demonstrate that our approach outperforms baselines on satellite knowledge graph reasoning.

Abstract:
Single-source domain generalization (SDG) in medical image segmentation is a challenging yet practical task that efficiently enhances generalization ability while avoiding high annotation costs and privacy concerns. In this paper, we propose EIR-SDG, a novel SDG approach that explores domain-invariant representation for medical image segmentation. The core of EIR-SDG lies in mitigating the effect of style in the encoder while facilitating robust segmentation in the decoder. Concretely, we design a training-free texture and style diversity module that transforms images into diverse random appearances without requiring optimization or gradient updates, which simulates unseen target distributions while mitigating overfitting to regular patterns in synthetic data. Building on this, we devise a feature adaptive whitening module, which disentangles and whitens the style-sensitive feature correlations between original and augmented pairs, encouraging the encoder to learn invariant representations. Moreover, to facilitate robust segmentation in the decoder, a semantic representation optimization strategy is devised to enhance invariant representations by constraining the correlation between class prototypes to be consistent while improving segmentation boundary distinction by separating different class prototypes. Experiments on cross-modality abdominal, cross-sequence cardiac and cross-center prostate segmentation tasks demonstrate that our method achieves promising generalization capacity and outperforms the SOTA methods.

Abstract:
Understanding 3D affordance is essential for agents to effectively interact with real-world environments, encompassing tasks such as manipulation and navigation. Existing methods typically support open-vocabulary queries through label-based language descriptions but often suffer from limited generalization and weak discriminative ability in their representations. However, affordance understanding requires constructing a coherent semantic landscape from fragmented linguistic expressions, one that preserves intra-class diversity while minimizing inter-class overlap. To address these challenges, we introduce Aff3DFunc, a framework designed to enhance the alignment between affordance and 3D geometry. It begins with a functional text enhancement module grounded in the Information Bottleneck (IB) principle, which strategically enriches affordance semantics by maximizing both relevance and diversity. A dual-encoder architecture is then employed to extract embeddings from both point clouds and text. To bridge the modality gap, we further propose a multilevel representation alignment strategy that incorporates supervised contrastive learning, reinforcing semantic-geometric correspondence in a part-to-whole manner. Extensive experiments demonstrate that our approach significantly enhances the understanding of affordance complexity. The learned representations exhibit high adaptability to diverse text queries, particularly in zero-shot settings. Furthermore, the real-world robot validation confirms that our method improves affordance understanding, enabling more fine-grained manipulation tasks.

Abstract:
Evolving multimedia systems are increasingly being adopted in virtual reality and gaming applications. Such systems emphasize immersion to engage users by bridging the gap between real and virtual content. In this context, visual and acoustic stimuli are the two key media that dictate such immersion. While visual 3D rendering is advancing rapidly, the same is not true for audio, where most research is limited to the reconstruction of the room impulse response (RIR) using omnidirectional audio or, at best, binaural. Such methods do not adequately account for the directions and orientations of the acoustic signals with respect to either the source or the listener, thereby compromising immersion quality. In this work, we explore the effect of adding such "directionality" to the training data to improve the estimation of the room's acoustic parameters. A more accurate set of such parameters in fact implies a more realistic predicted RIR, leading to a more immersive experience of the acoustic scene. Specifically, we propose a novel framework driven by a suitable loss function to account for directionality in ambisonic microphones, and novel variants of loss functions for both omnidirectional and ambisonic cases. We also propose to account for microphone characteristics and their contribution to the predicted RIRs. Experiments were performed using two datasets of real recordings and the results established the efficacy of the proposed methods.

Abstract:
Model quantization is a promising method for accelerating and compressing diffusion models. Nevertheless, since post-training quantization (PTQ) fails catastrophically in low-bit cases, quantization-aware training (QAT) is essential. Unfortunately, the wide range and time-varying nature of activations in diffusion models sharply increase the complexity of quantization, making existing QAT methods inefficient. Equivalent scaling can effectively reduce the activation range, but previous methods leave the overall quantization error unchanged. More critically, these methods significantly disrupt the original weight distribution, resulting in poor weight initialization and difficult convergence during QAT. In this paper, we propose a novel QAT framework for diffusion models, called DilateQuant. Specifically, we propose Weight Dilation (WD), which maximally dilates the unsaturated in-channel weights to a constrained range through equivalent scaling. WD decreases the activation range while preserving the original weight range, which steadily reduces the quantization error and ensures model convergence. To further enhance accuracy and efficiency, we design a Temporal Parallel Quantizer (TPQ) to address the time-varying activations and introduce Block-wise Knowledge Distillation (BKD) to reduce resource consumption in training. Extensive experiments demonstrate that DilateQuant significantly outperforms existing methods in terms of accuracy and efficiency.
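
The equivalent scaling mentioned above can be illustrated with a minimal sketch. It shows only the generic identity (X / s)(s W) = X W, which shrinks the activation range without changing the layer output; it is not the paper's Weight Dilation procedure, and the matrices and the per-channel scales `s` are hypothetical values chosen for illustration.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (n x c) and w (c x m)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def equivalent_scale(x, w, s):
    """Divide activation channel k by s[k] and multiply weight row k by s[k]."""
    x_scaled = [[v / s[k] for k, v in enumerate(row)] for row in x]
    w_scaled = [[v * s[k] for v in row] for k, row in enumerate(w)]
    return x_scaled, w_scaled


# Hypothetical activations: channel 0 has a much wider range than channel 1.
x = [[8.0, 0.5], [-6.0, 0.25]]
w = [[0.1, -0.2], [0.4, 0.3]]
s = [4.0, 1.0]  # assumed per-channel scales, not derived as in the paper

x_s, w_s = equivalent_scale(x, w, s)
y, y_s = matmul(x, w), matmul(x_s, w_s)  # the two outputs agree up to rounding
```

In this toy case the largest activation magnitude in channel 0 drops from 8.0 to 2.0, while the corresponding weight row grows by the same factor; how to choose `s` so that the weight range is preserved is exactly what the abstract attributes to WD.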

Abstract:
Dynamic fluid scene reconstruction remains challenging in multimedia applications and digital content creation due to complex motions and changing topology. While Neural Radiance Fields (NeRF) methods are computationally expensive and 3D Gaussian Splatting (3DGS) approaches struggle with fluid phenomena, we propose Fluid-GS, a flexible, efficient end-to-end framework for sparse-view fluid reconstruction that tightly couples density field modeling with velocity estimation via differentiable advection. Our key innovation is a hybrid Lagrangian-Eulerian Gaussian primitive representation that combines the rendering efficiency of 3DGS with physically accurate fluid motion tracking on an Eulerian grid. This representation enables us to formulate physics-informed constraints derived from the Navier-Stokes equations, enforcing temporal coherence and fluid incompressibility. Moreover, to address the inherent challenges of sparse-view reconstruction, we introduce a fluid-specific Gaussian kernel constraint that preserves the spatial characteristics of fluid phenomena and dynamically adjusts the anisotropic kernel of Gaussian primitives based on local velocity fields, preventing non-physical artifacts. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both reconstruction quality and computational efficiency.

Abstract:
Out-of-distribution (OOD) detection for graph-structured data remains a challenging problem, particularly when test-time OOD samples deviate significantly from the training outliers. Existing methods are typically optimized to capture the features within the in-distribution (ID) training data, but often fail to model the transitional region near the boundary between ID and OOD samples. Moreover, since data distributions are usually governed by multiple latent factors, pre-trained models constrained by the scope and diversity of training data struggle to represent the full spectrum of sample characteristics and distributional boundaries. To address this dilemma, we propose a novel test-time graph OOD detection method, termed D2GO, that constructs and dynamically updates ID and OOD graphon dictionaries for OOD score calibration, without requiring fine-tuning. Specifically, D2GO estimates graphons from test graphs and employs a mix-up strategy to generate boundary samples, eliminating the need for exposing auxiliary datasets or training graphs. Priority queues are utilized to expand the ID and OOD dictionaries by incorporating diverse graphons based on pseudo-labels at test-time, and the OOD scores are calibrated by computing the similarity between test samples and both graphon dictionaries. Extensive experiments on real-world datasets show that D2GO significantly outperforms existing state-of-the-art methods in OOD detection.

Abstract:
Visual-audio Deepfake has become increasingly prevalent in today's online environment. Passive detection methods, lacking preventive measures, struggle with detecting unknown forgery techniques, limiting their effectiveness. While proactive detection methods offer greater robustness, unimodal watermarking approaches remain vulnerable in visual-audio Deepfake scenarios, posing challenges to reliable forensics. To address these challenges, we propose a novel Separable Visual-Audio waterMark framework, called SepVAMark, for proactive Deepfake detection. SepVAMark incorporates a multi-layer perceptron-based mixer layer to fuse intra-modality and inter-modality features from both audio and visual data. We introduce the concept of separable visual-audio watermark, along with a bimodal robust extractor for traceability and two unimodal semi-robust extractors for Deepfake detection. This design ensures reliable copyright protection for source audio-video content while enabling authenticity verification for redistributed content. Experimental results on the FakeAVCeleb dataset demonstrate that SepVAMark effectively detects a wide range of advanced Deepfake manipulations, outperforming existing single-modal and multi-modal watermarking methods with superior robustness.

Abstract:
Conventional RGB cameras struggle in high-speed vision due to motion blur (above 60Hz sampling) and limited dynamic range (<60dB). To address these limitations, we propose a multimodal framework integrating event cameras, leveraging their microsecond temporal resolution (1μs) and 140dB dynamic range. Our key innovations include: (1) DSF-Net: An innovative spike-triggered dynamic sparse fusion network that effectively and efficiently fuses discriminative features from Event-RGB, enabling high-speed object detection; (2) HS-Multi: The first large-scale Event-RGB dataset specifically designed for high-speed objects, featuring 73k annotated samples across 11 object categories, with dedicated high-speed settings (HS-CAR and HS-FAN). Extensive evaluations on three benchmarks (HS-CAR, HS-FAN, PKU-DDD17-Car) demonstrate consistent advantages: (a) Superiority in high-speed detection: DSF-Net significantly surpasses both unimodal (RGB/Event) and existing multimodal fusion methods, most notably on HS-CAR, where it achieves 87.3% mAP (9.5%↑ vs. RGB-only); (b) Generalization capability: DSF-Net achieves 50.1% mAP on PKU-DDD17-Car, surpassing prior multimodal frameworks in both accuracy (+4.1%) and speed (+8.3fps).

Abstract:
Reversible Adversarial Examples (RAEs) can be used to protect the privacy and copyright of images on social networks (SONs) by exploiting adversarial examples to disrupt access by malicious AI models while ensuring recoverability for authorized users. Existing RAE methods add adversarial perturbations to spatial-domain images and do not apply to JPEG images, the most widely adopted format for image storage and transmission. To tackle this issue, we propose the first Reversible Adversarial Example generation framework for JPEG images (JPEG-RAE), which consists of two primary components, i.e., JPEG-AE and G-RDH. JPEG-AE crafts the adversarial perturbations in the JPEG domain of images by leveraging the chain rule of gradient propagation, so that they can effectively mislead AI models in the spatial domain once the images are JPEG decompressed. G-RDH adopts a gradient-directed bi-directional histogram shifting scheme for efficient reversible hiding of adversarial perturbations and location data in the JPEG domain, where the histogram shifting is in sync with the sign of the back-propagated gradients to further boost the performance of adversarial attacks. Experimental validation demonstrates that, although constrained by the JPEG format (e.g., in the number and intensity of alterable DCT coefficients), the proposed JPEG-RAE still shows superior or comparable performance, in terms of attack ability and recoverability, to its counterparts in the spatial domain.
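
For readers unfamiliar with histogram shifting, the following is a minimal sketch of the classic one-directional variant on a list of integers standing in for quantized DCT coefficients. It is not the gradient-directed bi-directional G-RDH scheme described above; the function names, the sample coefficients, and the choice of `peak` are assumptions made for illustration.

```python
def hs_embed(coeffs, bits, peak):
    """Hide bits in values equal to `peak`; values above `peak` shift up by one."""
    out, remaining = [], iter(bits)
    for c in coeffs:
        if c > peak:
            out.append(c + 1)  # vacate the bin at peak + 1
        elif c == peak:
            out.append(c + next(remaining, 0))  # unused capacity carries a 0
        else:
            out.append(c)
    return out


def hs_extract(marked, peak):
    """Recover the hidden bits and restore the original values exactly."""
    bits, restored = [], []
    for c in marked:
        if c == peak:
            bits.append(0)
            restored.append(peak)
        elif c == peak + 1:
            bits.append(1)
            restored.append(peak)
        elif c > peak + 1:
            restored.append(c - 1)  # undo the shift
        else:
            restored.append(c)
    return bits, restored


coeffs = [0, 1, 0, 3, -2, 0, 1, 0]  # hypothetical coefficients; 0 is the peak bin
payload = [1, 0, 1, 1]
marked = hs_embed(coeffs, payload, peak=0)
bits, restored = hs_extract(marked, peak=0)
```

The capacity equals the number of coefficients in the peak bin (four here), and extraction restores the cover values exactly, which is the reversibility property the abstract relies on.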

Abstract:
Recent advances in image editing systems reveal critical limitations in handling complex real-world scenarios requiring multimodal condition controls. While text instructions enable broad semantic guidance and visual examples provide precise visual reference in specific scenarios, existing unimodal approaches fail to synergize these complementary modalities effectively. We propose EditMaster, a unified framework that integrates text and visual controls through multimodal instruction learning, enabling precise image manipulation with bidirectional consistency. Our framework introduces three core innovations: (1) a multimodal large language model enhanced with extended visual tokens replaces CLIP text encoders, generating pre-edited visual guidance that aligns textual commands with visual examples to guide the diffusion model toward high-quality outputs; (2) the Mask-Based Decoupled Residual Exemplar-Attention module preserves unedited regions through spatial masking while integrating visual details via residual pathways; (3) a systematic data construction method converts unimodal editing datasets into a task-specific multimodal dataset, eliminating the need for de novo data construction. Experiments show that our approach outperforms unimodal baselines and excels in complex multimodal instruction editing, setting a new benchmark for this field.

Abstract:
The rapid development of music diffusion models has provided diverse paths for music creation and transformation. However, existing methods still lack continuous strength regulation over stylistic attributes. Specifically, they cannot achieve scalable adjustment of intensity (e.g., smooth transitions between "gentle" and "intense" jazz) while preserving spectral-temporal coherence. To address this, we propose RLScale-LoRA, a two-stage finetuning framework built on a structurally modified low-rank adaptation (LoRA) architecture with scale layers. In Stage 1, we finetune the modified LoRA to specialize in capturing attribute-aware latent spaces on unseen/seen music data. Stage 2 trains lightweight scale layers via proximal policy optimization (PPO), where reward functions enforce intermediate spectral-temporal state stability. As a result, RLScale-LoRA achieves precise, continuous music attribute transformations. Extensive experiments on the Mtg-Jamendo and MedleyMD-Prompts datasets demonstrate RLScale-LoRA's superiority in granularity and coherence.

Abstract:
Recent advancements in one-shot head avatar generation and animation have garnered significant attention. However, previous works primarily focus on maintaining consistency in expression and pose between the output and driving images, with limited exploration of two crucial factors: emotion and style. In this paper, we introduce GOES, a 3D Gaussian-based One-shot head animation framework for any Emotion and any Style. To achieve low rendering cost and high reenactment speed, we incorporate 3D Gaussian techniques into our method. Compared to controlling facial emotions with a single label, using an image as the emotion source enables more precise and fine-grained emotional expression modeling. To accurately extract emotion features from any given image, we design an efficient emotion encoder. Based on this module, we employ a deformation predictor to achieve the emotion-driven deformation of facial 3D points. Regarding stylization, directly using style features to control the deformation of 3D Gaussian parameters results in global color changes. However, facial stylization requires region-specific color transformations. To address this, we propose a Global-to-Point mapping network, which maps the global style feature to each 3D Gaussian point. This module enables precise local style adaptation across different regions of the head avatar. Experimental results demonstrate that our approach outperforms existing methods in terms of facial reconstruction quality and expression accuracy, while also supporting customization of arbitrary emotions and styles.

Abstract:
In this paper, we propose CP3, a multi-modal model based 3D pop-out video generation framework, to address the inability of existing video generation technology to accurately control 3D pop-out effects. 3D pop-out effects create an immersive visual experience by changing the disparity of a particular object so that it appears to extend beyond the screen. Although software has made some progress in this area, there is currently no effective way to accurately control 3D pop-out effects and generate high-quality video. In addition, the lack of high-quality 3D pop-out effect datasets is one of the bottlenecks in the field. The CP3 framework therefore utilizes multi-modal models to help 3D video creators make 3D pop-out effects and enhance the audience's sense of immersion and visual comfort, thus promoting the development of 3D effect generation technology. To support the training and evaluation of this framework, we construct a new dataset containing 37,000 frames of pop-out effects, with annotations such as text guidance, segmentation results, depth maps, optical flow, and the trajectory of the pop-out target. Built on a 3D UNet model based on the latent denoising diffusion mechanism, combined with the 3D-try module and the Mask Encoder in the CP3 framework, our method achieves remarkable results in generating 3D pop-out effect videos. Experimental results show the advantages of the CP3 framework over existing technologies in generating immersive 3D pop-out effects.

Abstract:
NeRF-based talking head generation has made great progress, but existing methods still fall short in detail fidelity, which manifests mainly as detail loss and intermittent blur. We attribute this to the limitations of the training video data in terms of viewpoint and lighting, which lead to the inability to fully model the global depth and brightness information of spatial points. Specifically, a fixed viewpoint may fail to provide sufficient depth information for high-frequency details, leading to inaccurate volume density estimation and the loss of details such as hair. Furthermore, constant lighting often fails to adapt to the drastic brightness changes of continuous video frames, resulting in color accumulation errors and blurring artifacts. To address these issues, we propose a novel talking head generation method that combines layered viewpoint simulation (LVS) and continuous lighting simulation (CLS). LVS simulates multiple viewpoints through the multi-scale features of the video frame to construct the global depth representation, which can improve the accuracy of volume density estimation and enhance detail description. CLS simulates multiple lighting conditions through brightness changes of continuous video frames to construct the global brightness representation, thereby alleviating color accumulation errors and eliminating blur. Extensive experiments demonstrate that our method significantly improves the detail quality compared to the state-of-the-art methods.

Abstract:
Recent diffusion model advancements aim to handle conditional generative tasks without extra training. Existing training-free methods add a correction term at each denoising step, but they often face computational instability and lack controllability, especially with limited samples and large noise. We propose a new approach using the von Mises-Fisher (vMF) distribution to model the denoised result, turning the conditional generation task into an estimation problem for vMF parameters. We formulate the conditional diffusion model as a mean vector estimation problem for the Gaussian distribution, noting that this can be seen as an estimation problem from noisy observations. When the sampling number is small, the estimation is unstable. To address this, we optimize the mean vector of the vMF distribution by minimizing the KL divergence between the prior and posterior distributions. This approach not only addresses the computational instability but also improves the controllability and quality of the generated results. Once these parameters are determined, the denoised result can be sampled directly from the vMF distribution. Estimating the parameters requires minimal additional code and incurs negligible computational overhead while significantly improving performance. Extensive experiments across various conditional generation tasks, including depth maps, edge detection, segmentation, and style guidance, demonstrate the superiority and versatility of our method. Our approach consistently outperforms existing training-free methods and even surpasses some training-required methods in terms of visual quality and controllability.

Abstract:
High-fidelity 3D human reconstruction is essential for numerous applications. Existing reconstruction methods still suffer from several limitations. Implicit-function-based methods often produce artifacts, particularly when handling complex poses and loose-fitting clothing. Existing deformation-based methods require the entire body mesh to be input to the network for deformation, which yields reconstructions that lack fine detail. We propose HumanPrinter, a novel method for reconstructing high-fidelity 3D clothed human models from a single RGB image. Drawing inspiration from 3D printing, HumanPrinter reconstructs the human mesh layer by layer. HumanPrinter slices the coarse mesh resulting from deformation based on the estimated SMPL-X mesh into multiple vertically stacked polygons. The network then regresses vertex offsets from visual cues extracted from the input image to deform these polygons. Finally, these deformed polygons are stitched together and further refined to achieve a complete and detailed 3D human mesh. Polygon-based deformation associates the deformation of each vertex with those of its adjacent vertices, so HumanPrinter produces fewer artifacts. By reducing the number of input polygons and increasing the number of deformable vertices, the layered reconstruction method can make the network more focused on local details. Through experiments on three datasets and visual results on in-the-wild data, we demonstrate that HumanPrinter achieves competitive reconstruction quality compared to current state-of-the-art methods.

Abstract:
Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating single-modality content, including images, videos, and audio. However, the potential of DiTs to enable superb multimodal content creation remains underexplored. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with synchronized audio tracks. To minimize model complexity and computational costs, our AV-DiT utilizes a modality-shared DiT backbone pre-trained on image-only data, with only newly inserted adapters being trainable. This shared backbone facilitates the generation of both audio and video. Specifically, the video branch incorporates a trainable temporal attention layer into a pre-trained DiT block for capturing the temporal consistency for video generation. In addition, a small number of trainable parameters adapt the image-based DiT block to learn the acoustic characteristics for audio generation. An extra shared self-attention block reused from the DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities for alignment. Extensive experiments on the datasets demonstrate that our AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator.

Abstract:
Recent advances in Large Vision-Language Models (LVLMs) have boosted the performance of multi-modal understanding. In this paper, however, we uncover for the first time a critically under-explored challenge persisting in this trend: LVLMs exhibit cross-modal knowledge inconsistencies. Cross-modal knowledge inconsistency refers to the tendency to provide semantically inconsistent responses to contexts that are semantically equivalent but expressed in different modalities. In real-world applications, users can rely on either text or images to express their ideas. Inconsistent responses across modalities can confuse users, undermining the reliability of LVLMs in practice. Therefore, we argue that evaluating performance on either multi-modal or text-only tasks alone is insufficient, and that addressing this cross-modal knowledge inconsistency is crucial. The paper proposes PRISM, the first benchmark for measuring the inconsistency, and the corresponding evaluation metric Know-Inc. PRISM covers commonsense, encyclopedia, and mathematics knowledge, with manually screened samples of semantic alignment. From the evaluation results of up to 27 LVLMs with diverse structures, we conclude that: 1) LVLMs show a preference for textual input, 2) there is a correlation between inconsistency and accuracy, and 3) the inconsistency is more prominent in encyclopedia knowledge. These findings can shed light on further optimization and development of LVLMs.

Abstract:
Social biases in text-to-image models have drawn increasing attention, yet existing debiasing efforts often focus solely on either the textual (e.g., CLIP) or visual (e.g., U-Net) space. This unimodal perspective introduces two major challenges: (i) Debiasing only the textual space fails to control visual outputs, often leading to pseudo- or over-corrections due to unaddressed visual biases during denoising; (ii) Debiasing only the visual space can cause modality conflicts when biases in the textual and visual spaces are misaligned, degrading the quality and consistency of generated images. To address these issues, we propose a Bimodal ADaptive Guidance DEbiasing within Textual and Visual Spaces (BADGE). First, BADGE quantifies attribute-level bias inclination in both modalities, providing precise guidance for subsequent mitigation. Second, to avoid pseudo/over-correction and modality conflicts, the quantified bias degree is used as the debiasing strength for adaptive guidance, enabling fine-grained correction tailored to discrete attribute concepts. Extensive experiments demonstrate that BADGE significantly enhances fairness across intra- and inter-category attributes (e.g., gender, skin tone, age, and their interaction) while preserving high image fidelity. Our project page is at https://badgediffusion.github.io/

Abstract:
Text-to-Image (T2I) diffusion models exhibit concerning tendencies to generate harmful imagery that perpetuates social biases and stereotypes, posing significant ethical risks in real-world applications. While existing mitigation approaches predominantly employ black-box methodologies through dataset augmentation or constrained fine-tuning, they face critical limitations, including high data acquisition costs and potential exacerbation of stereotypes during model retraining. Inspired by neuroscience principles where neurological dysfunction often stems from aberrant neural activation patterns, we propose a novel framework, StereoClinic, targeting the root cause of stereotype generation through direct neural intervention. Our solution introduces two synergistic components: Diffusion Deep Taylor Decomposition (DDTD) for precisely localizing stereotype-related neurons via Layer-wise Relevance Propagation (LRP) attribution analysis, and Stereotype Neuron Suppression (SNS) implementing targeted activation damping to neutralize bias propagation. Through extensive empirical evaluations across multiple bias dimensions, we demonstrate that our method achieves significant stereotype mitigation without compromising image quality or requiring additional training data. This neuro-inspired approach establishes a new paradigm for model interpretability and ethical alignment in generative AI systems.

Abstract:
Recent text-to-image generative models facilitate creating vivid images with arbitrary content that are indistinguishable from authentic ones to the naked eye. Despite progress in synthetic image detection, detecting images from new generators remains challenging because advanced generators leave fewer visible forgery traces, while different generative frameworks produce varied forgery patterns. We notice that generative models consistently struggle with fine-detailed content generation, creating abnormal spatial dependencies among neighboring pixels in complex texture regions. In this paper, we propose a methodology of gazing local detail of forgery (GLDF) for generator-agnostic synthetic image detection, which identifies prominent spatial dependencies to capture subtle forgery. Concretely, we design a frequency-aware correlation discovering (FACD) module that learns dynamic filters through an instance-adaptive frequency masking block to identify prominent spatial deficiencies, which are distributed across different spatial positions with various patterns. Furthermore, we introduce the spatial forgery clue distilling (SFCD) module to iteratively aggregate and refine spatial dependencies from different positions through spatial aggregating and prototype global interacting blocks. Extensive experiments demonstrate that GLDF outperforms state-of-the-art methods in detecting synthetic images from different generators.

Abstract:
Serverless computing has become a promising paradigm for video processing workflows, offering simplified deployment and flexible management of business logic. However, the dynamic, multi-stage nature of video processing pipelines poses significant challenges for traditional serverless resource management, particularly in efficiently modeling optimal configurations and adapting to rapidly evolving pipeline structures. To address this challenge, we propose ConfigNavigator, a video pipeline resource tuning framework capable of adapting to dynamic inputs and pipeline structures with minimal overhead. In the offline phase, ConfigNavigator models function execution time distributions at the fundamental operation level and leverages graph theory to decompose complex video processing pipelines, thereby obtaining optimal configurations with minimal overhead. In the online phase, it dynamically adjusts function configurations on critical paths through real-time performance feedback, ensuring pipeline performance stability across varying workloads. We evaluate ConfigNavigator using real video streams on the commercial serverless platform AWS Lambda. Compared to state-of-the-art baselines, ConfigNavigator reduces configuration search time by 94.11% while decreasing end-to-end pipeline processing time by 13.97%.
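
The notion of a critical path used above can be made concrete with a small sketch: given per-stage execution times and stage dependencies, the critical path is the longest-duration chain through the pipeline DAG, and its stages are the ones whose configuration bounds end-to-end latency. The stage names and timings below are hypothetical, and this is a generic longest-path computation rather than ConfigNavigator's actual algorithm.

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (total_time, path) for the longest-duration path in a DAG.

    durations: {stage: seconds}; deps: {stage: set of predecessor stages}.
    """
    finish, prev = {}, {}
    # static_order() yields every stage after all of its predecessors.
    for node in TopologicalSorter(deps).static_order():
        preds = deps.get(node, ())
        best = max(preds, key=lambda p: finish[p], default=None)
        finish[node] = durations[node] + (finish[best] if best is not None else 0.0)
        prev[node] = best
    end = max(finish, key=finish.get)
    path, node = [], end
    while node is not None:  # walk back along the slowest predecessors
        path.append(node)
        node = prev[node]
    return finish[end], path[::-1]


# Hypothetical video pipeline: decode feeds two parallel stages, then merge.
durations = {"decode": 2.0, "detect": 5.0, "transcode": 3.0, "merge": 1.0}
deps = {"detect": {"decode"}, "transcode": {"decode"}, "merge": {"detect", "transcode"}}
total, path = critical_path(durations, deps)
```

Here the path is decode, detect, merge with a total of 8.0 seconds, so speeding up `transcode` alone would not shorten the pipeline, which is why tuning effort is best spent on critical-path functions.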

Abstract:
With the increasing popularity of image and video analysis on mobile devices, high-throughput image inference has become essential. However, current mobile deep learning frameworks face key bottlenecks: high computational load in JPEG image recognition and low processor efficiency, which limit overall image processing throughput. To address these issues, this paper proposes the FCG framework (Frequency Domain model for CPU and GPU), a mobile JPEG inference framework based on frequency domain data and a hybrid parallel architecture that enables high-throughput inference for JPEG-encoded images on mobile devices. FCG decouples JPEG decoding from model inference by discarding the traditional RGB decoding process and retaining only the Huffman decoding. This decoding step is further accelerated through multi-core processing, significantly reducing the computational burden and latency during preprocessing. In light of the characteristics of frequency domain data and the heterogeneous CPU/GPU processors on mobile devices, FCG reconstructs the deep learning model to ensure recognition accuracy while optimizing resource utilization. By effectively allocating tasks and combining parallel and sequential execution, FCG optimizes processor resource utilization to achieve high throughput and low latency. FCG outperforms the state-of-the-art NN-Stretch by reducing latency by 36%. It also achieves significant throughput improvements—3.6x, 3.3x, and 2.8x—on CPU, GPU, and CPU+GPU configurations, respectively, compared to sequential inference systems. Additionally, FCG reduces power consumption by 56%, 35%, and 43% in these configurations.

Abstract:
Fine-grained cross-view localization seeks to predict ground-level camera positions within GPS-tagged aerial images by matching ground and aerial views. Existing methods often rely on large-scale ground truth annotations from specific regions, but performance degrades due to domain shifts when models trained in one area are applied to another. However, collecting region-specific annotations for each area is costly or infeasible. To address this, we propose a self-distillation curriculum learning framework that generalizes pretrained localization models to unseen areas. Our approach introduces a Dirichlet-based quality assessment strategy to evaluate teacher-generated pseudo labels, where high uncertainty signals noisy predictions and low uncertainty indicates clean samples. This uncertainty guides an easy-to-hard curriculum learning strategy, in which easy samples are prioritized initially and more challenging samples are progressively incorporated, enabling effective student training. Furthermore, we develop a joint optimization scheme that updates both the student model and the pseudo labels, applying adaptive label smoothing to mitigate label noise and take full advantage of new-area data. Extensive experimental results on the VIGOR and KITTI benchmarks demonstrate that our method outperforms state-of-the-art approaches in new-area localization, achieving superior accuracy without additional supervision.
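
The uncertainty-guided, easy-to-hard schedule described in this abstract can be illustrated with a small sketch. The abstract does not state how the Dirichlet uncertainty is computed, so the code assumes the common evidential formulation (alpha = evidence + 1, uncertainty = K / sum(alpha)). The function names, the sample format, and the linear stage schedule are also hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Uncertainty u = K / sum(alpha), with alpha = evidence + 1.

    This is an assumed formulation borrowed from evidential deep learning;
    the abstract does not specify the exact definition.
    """
    alpha = [e + 1.0 for e in evidence]
    return len(alpha) / sum(alpha)


def curriculum_schedule(samples, num_stages):
    """Rank pseudo-labelled samples from low to high uncertainty and
    return the growing subset of sample ids used at each stage."""
    ranked = sorted(samples, key=lambda s: dirichlet_uncertainty(s["evidence"]))
    stages = []
    for k in range(1, num_stages + 1):
        # Linear schedule (an assumption): stage k uses the easiest k/num_stages share.
        cutoff = max(1, round(len(ranked) * k / num_stages))
        stages.append([s["id"] for s in ranked[:cutoff]])
    return stages


# Hypothetical teacher outputs: non-negative evidence per class.
samples = [
    {"id": "a", "evidence": [9.0, 0.0, 0.0]},  # confident, low uncertainty
    {"id": "b", "evidence": [0.0, 0.0, 0.0]},  # no evidence, maximal uncertainty
    {"id": "c", "evidence": [3.0, 0.0, 0.0]},
    {"id": "d", "evidence": [1.0, 0.0, 0.0]},
]
stages = curriculum_schedule(samples, num_stages=2)
```

Here the first stage trains on the two most confident samples and the second stage adds the remaining, noisier ones.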

Abstract:
Longitudinal studies of animal vocalizations provide crucial insights into developmental patterns and communicative evolution. To aid such investigations in canines, this paper introduces the Canine Age Transition Vocalization Dataset, a large-scale collection of dog vocalizations featuring meticulously verified metadata (including precise birthdate, breed, and individual dog ID) for 125 dogs across 6 common breeds. Our in-depth longitudinal analysis of this dataset then reveals novel findings on how key vocal parameters, encompassing defined bark types and finer-grained acoustic components (Elemental Dog Bark Units, or EDBUs), change as dogs mature. This work, therefore, offers both a significant new resource and foundational data that enable deeper, more nuanced investigations into the lifelong vocal development of dogs and other animal communication.

Abstract:
Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks. Beyond its immediate applications, RDMF opens avenues for exploring biologically inspired architectures in multimedia, with implications for real-time video analysis, interactive media systems, and cross-disciplinary collaboration between multimedia and systems biology communities.
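
For readers unfamiliar with the Gray-Scott model named in this abstract, it is commonly written as du/dt = D_u ∇²u - u v² + F (1 - u) and dv/dt = D_v ∇²v + u v² - (F + k) v. The sketch below performs one explicit Euler step of these equations on a 1D periodic grid, which could be read as a temporal axis. How RDMF maps video and text features onto u and v is not described in the abstract, so the function names and parameter values here are illustrative assumptions only.

```python
def laplacian_1d(x):
    """Discrete Laplacian with periodic boundaries and unit grid spacing."""
    n = len(x)
    return [x[(i - 1) % n] - 2.0 * x[i] + x[(i + 1) % n] for i in range(n)]


def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.060, dt=1.0):
    """One explicit Euler step of the Gray-Scott reaction-diffusion system.

    The default coefficients are illustrative values, not taken from the paper.
    """
    lap_u, lap_v = laplacian_1d(u), laplacian_1d(v)
    new_u, new_v = [], []
    for i in range(len(u)):
        reaction = u[i] * v[i] * v[i]  # non-linear term u * v^2
        new_u.append(u[i] + dt * (du * lap_u[i] - reaction + feed * (1.0 - u[i])))
        new_v.append(v[i] + dt * (dv * lap_v[i] + reaction - (feed + kill) * v[i]))
    return new_u, new_v


# A localized perturbation in v spreads to its neighbours through diffusion.
u0 = [1.0] * 5
v0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1, v1 = gray_scott_step(u0, v0)
```

The uniform state u = 1, v = 0 is a fixed point of these equations, so patterns only emerge once that state is perturbed, as in the example.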

Abstract:
Hate speech poses a persistent threat to society, causing profound harm to both individuals and communities. Detecting such content is essential for promoting safer and more inclusive environments. While previous research has primarily focused on text-based or image-based hate speech detection, video-based hate detection remains relatively underexplored. A key barrier is the limited availability of high-quality video datasets. Existing hateful video datasets are typically limited in scale, diversity, and annotation depth, often labeling hateful content without further distinguishing between explicit and implicit forms. In this work, we present DeHate, which, to the best of our knowledge, is the largest hateful video dataset to date. DeHate comprises 6689 videos collected from two platforms and spanning six social groups. Each video is annotated with fine-grained labels that differentiate explicit, implicit, and non-hateful content, along with segment-level localization of hate, identification of contributing modalities, and specification of the targeted groups. Through detailed analysis of annotated videos across platforms, we reveal distinct patterns in how hateful content is conveyed, offering a comprehensive comparison between explicit and implicit hate in terms of their prevalence and characteristics. Furthermore, we benchmark state-of-the-art models, including both uni-modal and multi-modal architectures, and identify persistent challenges in detecting subtle and context-dependent forms of hate. Our findings highlight the importance of holistic and fine-grained hateful video datasets for advancing research in hate speech detection. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Abstract:
This study centers on the creation of a novel dog bark emotion dataset, EmotionalCanines, capturing the emotional spectrum of canine vocalizations. In the current literature on animal communication and its intersection with machine learning, there is a limited amount of open-sourced data available to facilitate research, mainly due to constraints in animal subjects and recording conditions. To address this gap, we propose a framework that enables the collection of reliable arousal and valence labels in animal emotional state at scale. Through its application, we built a dataset of 1,400 dog bark sequences with corresponding arousal and valence labels, the largest of its kind, for the Husky and Shiba Inu dog breeds. By constructing this dataset, we provide a foundation for decoding dog bark patterns and advancing animal communication research.

Abstract:
Emerging text-to-3D generative models based on large diffusion backbones have markedly lowered the barrier to high-quality 3D asset creation, yet rigorous quantitative evaluation remains elusive. We introduce T23D-QA, a novel open benchmark that couples diverse prompts, multiple generation paradigms, and fine-grained human judgements for text-conditioned 3D synthesis. The dataset comprises 1,710 textured meshes produced by nine state-of-the-art pipelines spanning feed-forward, optimization-based, and view-reconstruction families. Assets are driven by a tri-categorical prompt suite: single-object, multi-object, and primitive-anchored, covering 160 ShapeNet-level object classes. Each mesh is rated by 20 participants along three orthogonal dimensions: geometry, texture, and alignment. Building upon this corpus, we propose an evaluator that decouples multi-modal features via cross-attention. On T23D-QA, our baseline surpasses the strongest published metric by 8.1% (geometry), 6.1% (texture), and 1.9% (alignment) in Spearman rank correlation. Dataset and code are publicly available at https://t23d-qa.github.io to foster reproducible research.
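
Since the results above are reported as Spearman rank correlation, the following minimal sketch shows how that metric can be computed between predicted quality scores and human ratings. It is a plain-Python illustration, not the benchmark's evaluation code. The function names and the example scores are hypothetical.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical metric outputs versus mean human ratings for four meshes.
predicted = [0.1, 0.4, 0.5, 0.9]
human = [1.0, 4.0, 9.0, 16.0]
rho = spearman(predicted, human)
```

Because the metric depends only on rank order, any monotonic rescaling of the predicted scores leaves it unchanged.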

Abstract:
Effective video retrieval in large-scale datasets presents a significant challenge, with existing tools often being too complex, lacking sufficient retrieval capabilities, or being too slow for rapid search tasks. This paper introduces diveXplore, open-source software designed for interactive video retrieval. Owing to its success in competitions such as the Video Browser Showdown (VBS) and the Interactive Video Retrieval 4 Beginners (IVR4B), as well as its continued development since 2017, diveXplore is a solid foundation for various kinds of retrieval tasks. The system is built on a three-layer architecture, comprising a backend for offline preprocessing, a middleware with a Node.js and Python server for query handling and a MongoDB database for metadata storage, and an Angular-based frontend for user interaction. Key functionalities include free-text search using natural language, temporal queries, similarity search, and other specialized search strategies. By open-sourcing diveXplore, we aim to establish a solid baseline for future research and development in the video retrieval community, encouraging contributions and adaptations for a wide range of use cases, even beyond competitive settings.

Abstract:
Large Vision-Language Models (LVLMs) have demonstrated significant capabilities in multimodal understanding. However, their application to complex, multi-page documents, particularly in non-English languages like Japanese, remains a significant challenge due to the scarcity of suitable benchmarks. To address this gap, we organized the "Large Vision-Language Model Learning and Applications (LAVA) Grand Challenge" at ACM Multimedia 2025. This paper presents an overview of the competition. We designed a novel, challenging task: a 10-way multiple-choice Visual Question Answering (VQA) task on multi-page Japanese PDF documents. The task demands that models integrate information across multiple pages, text, and figures. We detail the dataset construction, including an annotation and filtering process designed to ensure questions are visually grounded and non-trivial. We also present the competition results, including an analysis of the leaderboard, and discuss the baseline performance of representative models. The LAVA Grand Challenge highlighted both the current capabilities and limitations of LVLMs in practical document understanding scenarios, thereby stimulating future research and providing a robust benchmark in this important domain.

Abstract:
The IntentVC Challenge, held in conjunction with ACM Multimedia 2025, introduces a novel benchmark for intention-oriented controllable video captioning. Unlike conventional captioning methods that generate generic, scene-level summaries, IntentVC focuses on intention-specific generation. Participants are required to produce captions explicitly conditioned on user-defined intentions, such as emphasizing a specific object tracked within a video. To support this task, the challenge provides an extended version of the LaSOT dataset annotated with intention-focused captions across 70 object categories. A standardized evaluation protocol and public leaderboard enable fair and reproducible comparison among submitted methods. By advancing research in personalized and adaptive video understanding, IntentVC offers a platform for exploring controllable vision-language modeling with practical relevance for accessibility, retrieval, and human-AI interaction. In total, 23 teams with 58 active participants took part and submitted 1,443 entries. More information and resources are available at https://sites.google.com/view/intentvc/.

Abstract:
Traditional emotion recognition methods struggle with complex emotional dynamics including multi-emotion states, transitions, and contextual reasoning. While multimodal large language models demonstrate great potential for understanding such complex scene dynamics, they still face challenges in adapting to emotion recognition tasks. We propose UniEmotion, a unified framework that simultaneously addresses conventional categorical emotion recognition, open-vocabulary fine-grained emotion recognition, and descriptive emotion understanding. Our approach leverages an iterative consensus-based training pipeline where pseudo-labels and model parameters co-evolve, maximizing large models' utility while mitigating downstream limitations. The framework integrates a selector module that identifies high-quality samples through prediction variance analysis, coupled with a pseudo-labeling module employing consistency regularization and class-wise adaptive mapping. This dual mechanism reduces error accumulation during self-training while aligning open-vocabulary output with task-specific labels. Experimental results demonstrate the effectiveness of our framework, achieving state-of-the-art performance across all three tracks, including 1st place on the MER-SEMI track with a significant improvement of 11.97% over the best baseline, and 2nd place on the MER-DES track.

Abstract:
This paper focuses on Open-Vocabulary Multimodal Emotion Recognition (OV-MER) and is dedicated to solving the two challenges it faces: concept semantic misalignment and incomplete coverage of fine-grained emotion categories. To address this, we propose a novel cognitive agent framework (Agent-MER), which reframes the OV-MER task as a problem to be solved by an agent that mimics the human cognitive process through knowledge-guided deliberation. We first construct a hierarchical Emotion Tree to serve as the agent's knowledge base. Building on this, we design a Knowledge-Guided Hierarchical Deliberation reasoning process. This process systematically explores the entire emotional landscape through a three-level, coarse-to-fine iterative reasoning process, enabling the identification of a richer and deeper range of emotions. Finally, a Self-Consistent Voting mechanism is employed to aggregate the results from multiple reasoning runs, ensuring the robustness of the final output. Experiments conducted in the MER2025 Challenge demonstrate that our proposed method achieved a top-ranking score of 61.04%, securing first place and significantly outperforming existing baselines. This work not only provides an effective solution for OV-MER but also opens up new avenues for developing more human-like affective intelligence systems.
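
The aggregation step described above can be pictured with a small sketch. The abstract does not specify the voting rule, so the code assumes a simple support threshold over the label sets returned by independent reasoning runs. The function name, the `min_support` parameter, and the example labels are hypothetical.

```python
from collections import Counter


def self_consistent_vote(runs, min_support=0.5):
    """Keep emotion labels predicted by at least `min_support` of the runs.

    runs: list of label lists, one per reasoning run.
    The threshold rule is an assumption, not the paper's documented mechanism.
    """
    # Count each label at most once per run.
    counts = Counter(label for run in runs for label in set(run))
    threshold = min_support * len(runs)
    kept = [label for label, count in counts.items() if count >= threshold]
    # Most frequently predicted labels first, ties broken alphabetically.
    return sorted(kept, key=lambda label: (-counts[label], label))


# Three hypothetical reasoning runs over the same video clip.
runs = [
    ["joy", "surprise"],
    ["joy"],
    ["joy", "anxiety", "surprise"],
]
final_labels = self_consistent_vote(runs)
```

With these inputs, "anxiety" appears in only one of three runs and is dropped, while "joy" and "surprise" are retained.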

Abstract:
We present the DCU team's system for the CASTLE Challenge at ACM Multimedia 2025, which explores video retrieval and question answering in egocentric, multi-user environments. Our system adapts techniques developed for lifelogging, particularly event-based semantic retrieval and QA pipelines, to the CASTLE dataset with minimal architectural changes. It combines vision-language embeddings, transcript-based retrieval, and person tracking to support both automatic and interactive search workflows. In the interactive track, we introduce a modular interface for narrative reconstruction and exploratory search. Qualitative results show that the system can generate plausible, evidence-based answers to complex multimodal queries. These findings suggest that lifelog retrieval systems offer a viable foundation for broader egocentric video analysis.

Abstract:
The 8th ACM International Workshop on Multimedia Content Analysis in Sports is held in Dublin, Ireland on October 28th, 2025. It is co-located with ACM Multimedia 2025. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding, and visualizing the multimodal data in sports. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation as well as understanding, statistical analysis, and evaluation in amateur and professional sports. There is a lack of research communities focusing on the fusion of multiple modalities. Thus, this workshop series on multimedia content analysis in sports aims to contribute to the closure of this research gap by bringing together the breadth and depth of these diverse approaches to stimulate each other with new ideas and foster research progress.

Abstract:
The Second ACM Workshop on AI-Powered Question & Answering Systems for Multimedia (AIQAM'25) was held on 27 October 2025 in Dublin, Ireland, co-located with ACM Multimedia 2025. The workshop's main objective is to create a collaborative and inclusive space for researchers at the intersection of Artificial Intelligence (AI), large language models (LLMs), multimodal information retrieval, and question answering systems. Building on the success of its first edition (AIQAM'24) at ICMR 2024, AIQAM'25 provided a forum for presenting novel methods, applications, and surveys that address the challenges of integrating text, image, audio, and video data into QA systems. The programme featured contributions ranging from methodological advances in reasoning and evaluation frameworks, through domain-specific applications in education and sustainability, to a survey reviewing the state of multimedia retrieval-augmented QA. This summary paper outlines the objectives and scope of the workshop, describes its format, highlights the keynote and accepted contributions, and acknowledges the efforts of the organising committee.

Abstract:
Multimodal, generative, and responsible affective computing aims to enhance people's lives. In recent years, the AI revolution has already begun to impact daily life, with virtual assistants being deployed across various sectors such as healthcare, banking, transportation, and education. It is clear that, in the near future, humans may interact with AI-powered systems as much as, or even more than, they interact with each other. Affective computing has numerous applications, including innovative approaches to forecasting and preventing anxiety, stress, and mental health issues; enhancing robotic empathy; assisting individuals with communication, behavior, and emotion regulation challenges; and promoting awareness of health and well-being. Many of these applications require enhanced control and protection of sensitive, private, and personal data. Therefore, it is crucial to further advance the creation, evaluation, and deployment of emotionally intelligent systems that are both responsive and responsible. Additionally, improving the accuracy and interpretability of emotion prediction results can significantly enhance the application of this technology in the downstream tasks mentioned above. MRAC'25 is the continuation of MRAC'23 and MRAC'24. Through this workshop, we aim to bring together researchers to discuss the potential and development of affective computing.

Abstract:
Balancing energy efficiency and occupant comfort in building HVAC systems is a critical challenge. While generative AI shows promise, its application has been hindered by a reliance on simulations and the inherent instability of its numerical predictions. This paper presents "Office-in-the-Loop," a cyber-physical system leveraging generative AI in a real-world office. Our real-world experiments resolve the energy-comfort trade-off, achieving up to 47.92% energy savings with a 26.36% comfort improvement. We introduce a novel prompting technique, "Data-Driven Reasoning," which compels the AI to justify its predictions with data. This simple addition improves prediction accuracy within ±0.5°C from 50% to 92.31%, paving the way for reliable, AI-driven building automation.
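
The "accuracy within ±0.5°C" figure above is a tolerance-based metric. A minimal sketch of how such a metric could be computed is shown below. The function name and the temperature readings are hypothetical, and the paper's exact evaluation protocol is not given in the abstract.

```python
def accuracy_within(predicted, measured, tolerance=0.5):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, m in zip(predicted, measured) if abs(p - m) <= tolerance)
    return hits / len(measured)


# Hypothetical predicted and measured room temperatures in degrees Celsius.
predicted = [24.0, 25.5, 23.0, 26.0]
measured = [24.5, 25.0, 24.0, 26.25]
score = accuracy_within(predicted, measured)
```

In this toy example three of the four predictions fall inside the tolerance band, so the score is 0.75.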

Abstract:
The spread of tampered text poses a critical challenge to information security. Previous methods for tampered text detection (TTD) primarily relied on visual artifacts as clues, while overlooking potential semantic inconsistencies introduced during text manipulation. To address this limitation, we propose TVSIP (Tampered text Visual-Semantic InterPreter), a novel framework leveraging Multimodal Large Language Models (MLLMs) to integrate both visual and semantic clues for comprehensive tampered text analysis and verification. TVSIP consists of a Locator and an Interpreter. The Locator combines the visual detection ability of existing expert models with the semantic comprehension capabilities of MLLMs to create precise tampering masks. Subsequently, the Interpreter provides comprehensive descriptions and explanations based on identified tampered regions. To train and evaluate TVSIP, we construct the TextDDLE benchmark using GPT-4o. Extensive experiments demonstrate that TVSIP outperforms expert models in pixel-level localization and advanced MLLMs in interpretability. Furthermore, it maintains robustness against image degradation and exhibits strong generalization ability on out-of-domain datasets. Our work highlights the crucial role of semantic inconsistencies in TTD and establishes a more reliable verification system for ensuring document authenticity in the digital age.

Abstract:
Due to the frequent occurrence of missing views in real-world multi-view data, incomplete multi-view clustering (IMVC) has attracted significant attention. However, most existing IMVC methods overlook the fact that incomplete data in practical applications often exhibits varying missing rates across different views, rendering their mechanisms ineffective under such conditions. Although several works based on conventional learning methods have been proposed to solve unbalanced incomplete multi-view clustering (UIMVC), their performance is limited by shallow feature representations and overly sophisticated optimization procedures. In this paper, we propose Deep Unbalanced Incomplete Multi-view Clustering via Graph Constrained Imputation and Contrastive Learning (DUIMC) to address UIMVC with a deep learning paradigm. Specifically, DUIMC introduces a novel differentiable imputation layer for dynamically handling unbalanced incompleteness and integrates it with multi-view contrastive clustering into a unified deep representation learning framework. Furthermore, bi-level graph constraints are imposed on imputation and representation learning to preserve local consistency at both the feature and instance levels. In addition, we develop adaptive fusion mechanisms to restrain the impact of information imbalance among views. Extensive experimental results on five benchmark datasets demonstrate DUIMC's superior clustering performance over several traditional state-of-the-art approaches.

Abstract:
Multimodal Named Entity Recognition (MNER) integrates visual information to resolve textual ambiguities but struggles with generalizing to unseen (out-of-vocabulary, OOV) entities, particularly in social media. To bridge this gap, we leverage internal label knowledge and visual information and propose a Label-Enhanced Information Bottleneck Distillation (LIBD) framework, which transfers label-aware generalization capabilities via a teacher-student architecture. Our method introduces Dual-level Label Augmentation (DLA), enhancing the teacher model by integrating word-level entity replacement with labels and embedding-level learnable label vectors. This is paired with Information Bottleneck Distillation (IBD), which selectively distills critical knowledge from the teacher while suppressing irrelevant noise. Experiments on benchmark datasets demonstrate that LIBD outperforms state-of-the-art methods, especially in identifying OOV entities.

Abstract:
Spatial transcriptomics technologies enable the integration of gene expression profiles with spatial context, facilitating a deeper understanding of tissue architecture through downstream tasks such as clustering. However, existing approaches predominantly focus on highly variable genes (HVGs), while the informative structural and contextual signals embedded in low variability genes (LVGs) remain largely underutilized. To bridge this gap, we propose SALVG (Spatial Augmentation via Latent Variable Genes), a novel and plug-and-play framework that leverages LVG-derived structural priors to enhance HVG representation learning for spatial clustering. Specifically, SALVG constructs spatial, feature, and combined graphs for both HVGs and LVGs, and introduces two graph-based augmentation strategies to inject LVG information into HVG graphs. The first strategy enhances the HVG combined graph directly using the LVG combined graph, while the other individually augments HVG spatial and feature graphs with their LVG counterparts before fusing them into a new combined representation. These enhanced graph structures are subsequently employed for downstream clustering. To the best of our knowledge, SALVG is the first framework to exploit LVG signals for assisting HVG-centric spatial transcriptomics clustering, effectively capturing complementary structural and contextual cues. Experiments on multiple benchmarks demonstrate its effectiveness, robustness, and transferability. Case studies further confirm that LVG-derived structure enhances biological interpretability by revealing coherent spatial and cellular patterns.

Abstract:
Multi-modal test-time adaptation (TTA) for 3D semantic segmentation has increasingly become a research hotspot due to its ability to address label dependency and enable rapid adaptation. Existing methods rely on extra learnable components to mitigate reliability bias; however, learning-based approaches in TTA scenarios often lack sufficient training. Moreover, most existing approaches update only normalization layers in the teacher-student framework, which limits their ability to model domain shifts. To overcome these limitations, we propose PLATO-TTA, a novel multi-modal TTA method for 3D semantic segmentation that leverages the native stability of robust prototypes and adaptive tuning of critical teacher-student parameters. The approach contains three key components: Prototype-Guided Pseudo-Labeling (PGPL), Consistency Based Backtracking (CBB), and Domain Specific Updating (DSU). PGPL reduces reliability bias by constructing pseudo-source domain prototypes and computing modality fusion weights based on domain discrepancies. CBB updates all student model parameters while preventing catastrophic forgetting through a parameter backtracking mechanism. DSU selectively updates the teacher model using only domain-specific parameters from the student model, ensuring rapid adaptation and stable guidance. Extensive experiments demonstrate the effectiveness of PLATO-TTA, which brings a 6.3% gain in the Synthia-to-SemanticKITTI scenario with severe reliability bias and significant domain discrepancy, and achieves state-of-the-art performance across various domain adaptation scenarios.

Abstract:
Puzzle solving has recently become a popular research topic. Existing solvers often overlook puzzles with missing pieces. The missing pieces, together with gaps between pieces, pose significant challenges, amplified by a large solution space. To tackle these challenges, we propose Co-Evolutionary Agents for Reassembling and Inpainting (CEARI), with one agent to inpaint missing contents and another to reassemble the puzzle, sharing a perception network that perceives the puzzle status. The reassembly agent utilizes an evolutionary algorithm to explore the large solution space and discover a sequence of fragment-swapping actions that efficiently reassembles the puzzle, while the inpainting agent evolves from using a local outpainting network at the early stage to using a global inpainting network at the later stage. Furthermore, a co-evolutionary training paradigm is designed to iteratively evolve the two agents in a coherent and collaborative manner, improving reassembly accuracy and inpainting quality simultaneously. Experimental results on three datasets show that CEARI substantially outperforms state-of-the-art methods in terms of both reassembly accuracy and inpainting quality.

Abstract:
The performance of current Vision-Language Tracking (VLT) models is constrained by the limited diversity and quantity of labeled data. Compared to constructing large-scale datasets, data augmentation offers a more cost-saving strategy for VLT by synthesizing new samples from existing data, rather than generating them from scratch. However, conventional techniques like rotation and flipping may disrupt scene composition, causing conflicts between visual layouts and textual annotations. Recent advances in generative models have inspired the use of synthetic videos for data augmentation. Yet, existing approaches fail to address the core concerns of data augmentation in VLT (shown in Fig. 1): target location accuracy, text-video consistency, and video content coherency. To bridge the gap, we propose Gen4Track, a tuning-free data augmentation framework that leverages a self-correcting mechanism to dynamically generate high-quality video data with annotations. Our approach involves (1) optimizing the attention calculations in a frozen text-to-image diffusion model to synthesize coherent videos that satisfy specific conditions (e.g., spatial location, category, color, and style), and (2) implementing a self-correcting mechanism based on a Large Language Model (LLM) to improve text-video consistency. During video augmentation, we propose content-coherent self-attention and location-enhanced cross-attention mechanisms, ensuring that image-level edits are accurately and coherently propagated throughout the video. Then, with the goal of maximizing text-video consistency, we iteratively refine the augmentation instruction with our self-correcting mechanism to obtain a better-aligned video. Extensive experiments validate that Gen4Track significantly boosts the performance of SOTA VLT models (achieving improvements of up to 3.2% in SUC and 3.5% in PRE), opening a new chapter of training Vision-Language trackers with synthetic videos rather than manually annotated data.

Abstract:
In few-shot medical image segmentation, most existing methods focus heavily on learning explicit correlations between support and query sets, often overlooking the core demands of the segmentation task itself. In this work, we identify three overlooked yet critical issues that limit current performance: the diversity of background distributions, the degradation of support prototypes, and the over-activation of irrelevant regions. To address these challenges, we propose a novel framework with three lightweight and adaptive modules. First, a background self-distillation module acts as a self-attention-driven agent to cluster and aggregate diverse background features, generating multiple sub-prototypes that enhance foreground-background separation. Second, we introduce a prototype self-anchoring mechanism that leverages a dual-branch correlation mapping and reverse supervision to stabilize support prototype learning and prevent feature degradation. Third, an activation self-calibration module identifies over-activated residuals and applies test-time channel manipulation to suppress noisy activations without additional training. Extensive experiments on standard few-shot medical segmentation benchmarks demonstrate the superiority of our approach over state-of-the-art methods. Our findings suggest that performance gains come not only from better support-query alignment, but also from rethinking and addressing the often neglected aspects of few-shot segmentation.

Abstract:
Unsupervised domain adaptation transfers knowledge from a labeled source domain to an unlabeled target domain, under the assumptions that the source domain is labeled correctly and that both source and target data are available during adaptation. However, collecting large-scale datasets with fully precise annotations is expensive and time-consuming. Besides, due to data privacy and security issues, the source data is often inaccessible during domain adaptation and only unlabeled target data is available. Therefore, considering both the source domain data with noisy labels and the unavailability of source domain data during domain adaptation, this paper proposes an Adaptive Neighbors and Uncertainty Estimation (ANUE) method for Source-Free Unsupervised Domain Adaptation with Noisy Labels (SF-UDA-NL). To the best of our knowledge, there has been no prior research on this particular issue. Specifically, since source domain data contains noisy labels, we propose a reweighted small-loss criterion with uncertainty estimation to filter reliable samples for updating the dual-branch network. To prevent noisy knowledge from misleading domain adaptation, we adopt a contrastive learning framework and design an adaptive neighbors module to help target samples generate more reliable pseudo-labels. Then, we design a reweighted contrastive loss at both class level and instance level based on uncertainty estimation to further enhance the network's classification performance. We conduct extensive experiments on three widely used datasets for unsupervised domain adaptation in image classification, and the results demonstrate the effectiveness and robustness of our method.

Abstract:
Vision-language navigation (VLN) poses challenges in guiding agents through unseen environments based on natural language instructions. Existing methods either rely on imitation learning, for which training across various complex scenarios remains challenging, or leverage large vision-language models (LVLMs) for zero-shot object recognition and expert iterative reasoning for improved scene understanding. Although LVLMs enhance target detection generalization, current VLN methods lack robustness in terms of environmental generalization and struggle with multi-step, coarsely directed instructions. Addressing these challenges, we introduce Ali-UI, a novel vision-language navigation approach that enables agents to navigate from random starting points in unvisited scenes and handle complex multi-step instructions. Specifically, we incorporate continuously accumulating global grid maps and local semantic maps as scene memory by employing frontier-based exploration. Multi-step coarsely directed commands are broken down with the assistance of LLaVA and matched with the scene, considering temporal and spatial alignment. Panoramic data are saved in topological form and queried by instruction segments for sequential navigation. Extensive experiments carried out in simulated environments demonstrate that Ali-UI outperforms existing state-of-the-art methods in terms of flexible human instructions and scene generalization, with the success rate improved by 23.37% and the SPL increased by 19.27% on the R2R dataset.
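
For reference, SPL (Success weighted by Path Length) is commonly computed as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. A minimal sketch of that standard definition follows; the function and argument names are illustrative and are not taken from the Ali-UI code.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, agent_len):
      success      -- 1 if the agent reached the goal, else 0
      shortest_len -- shortest-path length from start to goal (assumed > 0)
      agent_len    -- length of the path the agent actually took
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, agent_len in episodes:
        # A successful but inefficient path is discounted by its detour ratio.
        total += success * shortest_len / max(agent_len, shortest_len)
    return total / len(episodes)


# One optimal success, one success with a 2x detour, one failure.
score = spl([(1, 10.0, 10.0), (1, 10.0, 20.0), (0, 5.0, 5.0)])
```

A failed episode contributes zero regardless of path length, so SPL is never higher than the plain success rate.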

Abstract:
Understanding the content of multi-page documents with rich layout information is a challenging task. Recent multimodal large language models (MLLMs) have made remarkable progress in understanding single-page document images. However, the understanding of multi-page documents remains insufficiently explored. This work proposes a Document Retrieval-enhanced, Expert-guided, Attention-aware Multimodal Framework, dubbed DREAM. Specifically, we propose a confidence-based, high-level semantic, multimodal retrieval method. Then, we propose a machine learning algorithm that combines the results of confidence-based retrieval and multimodal embedding similarity retrieval to obtain the most query-relevant set of document images. Subsequently, we design a decoupled cross-page attention-aware multimodal language model for multi-page documents to interpret these retrieved images and produce the final answer. Experimental results demonstrate the effectiveness of the retrieval module within the framework, as well as the robust performance of the multimodal model in multi-page document comprehension. These findings offer a compelling solution for multi-page document comprehension and cross-page document visual question answering.

Abstract:
Prompt learning has emerged as an efficient adaptation paradigm for vision-language models (VLMs), yet it remains highly vulnerable to label noise, which limits its real-world applicability. We propose TrustCLIP, a noise-robust prompt tuning framework that leverages the inherent semantic structure of CLIP through two key components: Semantic Label Verification (SLV) and Trust-aligned Gradient Projection (TGP). SLV defines a semantic trust boundary based on CLIP's zero-shot predictions to identify reliable samples for standard supervised training. For uncertain samples, TGP projects their gradients into a trust-aligned subspace constructed from the gradients of clean samples, thereby preserving semantically aligned learning signals while suppressing noise-induced optimization drift. Unlike prior approaches, TrustCLIP does not require additional parameters, loss reweighting, or uncertainty estimation. Extensive experiments on 7 benchmark datasets with both synthetic and real-world noisy labels demonstrate that TrustCLIP consistently outperforms state-of-the-art methods in terms of both robustness and transferability.
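
To make the idea of projecting a gradient into a subspace spanned by clean-sample gradients concrete, the sketch below uses a plain orthogonal projection with a Gram-Schmidt basis. This is only one plausible reading: the abstract does not say how the trust-aligned subspace is constructed, and the helper names here are hypothetical rather than part of TrustCLIP.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_subspace(grad, clean_grads):
    """Orthogonally project `grad` onto span(clean_grads).

    Assumption: the subspace is simply the span of the clean gradients.
    """
    projected = [0.0] * len(grad)
    for b in orthonormal_basis(clean_grads):
        c = dot(grad, b)
        projected = [p + c * bi for p, bi in zip(projected, b)]
    return projected


# The third coordinate lies outside the clean subspace and is removed.
aligned = project_onto_subspace([2.0, 3.0, 4.0],
                                [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

In this toy case the component of the uncertain gradient that no clean gradient supports is discarded, while the supported components pass through unchanged.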

Abstract:
Recent advances in multi-view 3D multi-person pose estimation have led to significant progress. However, several critical challenges remain, including the limited extraction and integration of multi-domain information, as well as the high annotation costs associated with 3D data in multi-person scenarios. These issues hinder the broader applicability of current methods in complex computer vision tasks. In this paper, we propose a Dense-Sparse Parallel Networks (DSP) framework that jointly leverages spatial, temporal, and frequency-domain information through an adaptive geo-consistency self-supervised strategy. Specifically, we design a multi-view spatial feature extraction module that captures cross-view spatial distributions from dense multi-view feature maps. In parallel, we employ a local-global temporal attention module and a frequency-aware attention module to extract dynamic temporal patterns and localized frequency-domain features from sparse keypoint data. Furthermore, a multi-domain parallel fusion module is introduced to effectively integrate features across all domains, enabling accurate multi-person 3D pose regression. To enhance self-supervised learning, we employ a dynamic view selector guided by reinforcement learning, which reduces the impact of inaccurate pre-trained 2D poses. Experimental results on three benchmark datasets (i.e., CMU Panoptic, Campus, and Shelf) demonstrate that the proposed DSP framework achieves robust and accurate performance, as evidenced by comparisons with other state-of-the-art methods.

Abstract:
Face deepfake detection is a critical technology for verifying the authenticity of facial media content and has long been a focal point in multimedia forensics. However, existing methods face significant challenges, primarily due to their limited ability to generalize across domains. Consequently, the growing variety of forgery techniques, combined with the degradation of visual quality in forged images, makes reliable detection even more difficult. To address these challenges, we propose WKD, a proactive deepfake detection framework based on Watermarking and Knowledge Distillation. The key insights of WKD are twofold: First, we embed watermark information into the Fractional-order Quaternion Radial Harmonic Fourier Moments (FrQRHFMs) space of the host image, achieving a good balance between imperceptibility and robustness. Second, we design a dual-task learning framework consisting of a watermark extractor and a forgery discriminator, where learnable Low-Rank Adaptation (LoRA) layers are used to transfer knowledge from the extractor to the discriminator, thereby providing additional clues for deepfake detection. Specifically, the integrity of the watermark is compromised only when the host image undergoes a deepfake forgery, while it remains unaffected by conventional attacks. Experimental results on benchmark datasets demonstrate that WKD achieves state-of-the-art performance in both intra-domain and cross-domain deepfake detection, particularly when images are subjected to various conventional attacks.

Abstract:
Medical image segmentation is essential for precise anatomical delineation and clinical decision-making. However, fully supervised methods are limited by the substantial cost of acquiring pixel-level annotations, particularly for 3D volumetric data. Semi-supervised learning (SSL) alleviates this challenge by leveraging unlabeled data, yet it remains hindered by severe class imbalance, where dominant structures disproportionately occupy the voxel space, leading to feature degradation and unreliable pseudo-labels. To address this issue, we propose a simple but effective SSL framework, namely Sub-Volume Contrastive Learning (SuVCL), to enhance feature discriminability in imbalanced 3D medical image segmentation. Our approach incorporates localized contrastive learning through sub-volume sampling, which captures small but semantically informative regions to retain fine-grained structural details while mitigating computational overhead. Furthermore, we introduce a balanced memory bank mechanism, which dynamically maintains class-specific feature representations with adaptive updates guided by class-predictive confidence. Extensive experimental evaluations demonstrate that our method substantially enhances segmentation performance for minority classes and achieves clear gains over existing state-of-the-art methods.
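
A class-balanced memory bank with confidence-guided updates could look roughly like the sketch below. The abstract does not describe the actual update rule, so the fixed per-class capacity and the "replace the least confident entry" policy are assumptions, and the class name is hypothetical.

```python
class BalancedMemoryBank:
    """Keeps at most `capacity` features per class, regardless of how
    often each class appears in the data (illustrative sketch only)."""

    def __init__(self, num_classes, capacity):
        self.capacity = capacity
        self.slots = {c: [] for c in range(num_classes)}

    def update(self, cls, feature, confidence):
        """Try to store `feature`; return True if the bank changed."""
        bank = self.slots[cls]
        if len(bank) < self.capacity:
            bank.append((confidence, feature))
            return True
        # Assumed policy: evict the least confident entry if the new
        # feature was predicted with higher confidence.
        weakest = min(range(len(bank)), key=lambda i: bank[i][0])
        if confidence > bank[weakest][0]:
            bank[weakest] = (confidence, feature)
            return True
        return False

    def sizes(self):
        """Number of stored features per class."""
        return {c: len(b) for c, b in self.slots.items()}


bank = BalancedMemoryBank(num_classes=2, capacity=2)
bank.update(0, [1.0], 0.9)
bank.update(0, [2.0], 0.5)
bank.update(0, [3.0], 0.4)  # rejected: less confident than stored entries
bank.update(0, [4.0], 0.7)  # replaces the 0.5-confidence entry
```

Because capacity is fixed per class, a dominant class cannot crowd minority-class features out of the bank.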

Abstract:
The striking proficiency of diffusion models in producing and manipulating images with an unprecedented level of realism has unquestionably elicited concerns. Many methods have been proposed to detect generated images. In particular, recent studies reveal that autoencoder reconstruction error can serve as an effective indicator for distinguishing authentic and synthetic images, since most generative models adopt analogous encoder-decoder operations. However, the reliance on a single autoencoder reconstruction error provides only limited information, which is insufficient for comprehensively capturing discriminative features, resulting in restricted generalization performance. In this paper, we propose Multiple Reconstruction Contrastive Learning (MRCL), which leverages multiple reconstruction residuals to enhance the generalizability of generated image detection. Specifically, MRCL applies DINOv2-ViT with LoRA fine-tuning to extract fine-grained feature representations of original images and their multiple VAE reconstructions. In addition, a Residual Dense Fusion module is designed to effectively combine multiple VAE reconstruction residuals. Further, a contrastive learning strategy is adopted to guide the distance between the representations of original images and those of their VAE reconstructions. Extensive experimental results demonstrate the superior generalization performance of the proposed MRCL.

Abstract:
Current affective computing paradigms often treat emotional understanding and generation as separate tasks, yet they inherently possess symbiotic potential for mutual enhancement. In this paper, we aim to bridge the gap by developing a unified framework. The primary challenge lies in the extraction of precise and semantically rich representations of abstract emotions, which are crucial for both tasks. To address this, we harness Chain-of-Thought reasoning in the latent space of multimodal large language models and propose EmoSym, a unified framework built upon this advanced foundation. Our framework is executed through three key steps: 1) Emotional reasoning knowledge compression. To enable efficient transfer of emotional reasoning priors, we design specialized reasoning tokens to compact emotion-aware contexts from external reasoning knowledge bases into latent representations. 2) Verifiable reinforcement reasoning optimization. To ensure more reliable and consistent emotional reasoning, we develop a verifiable reinforcement learning paradigm to further enhance the reasoning token with emotion-specific verifiable reward signals. Processed through the above two steps, the reasoning token simultaneously enhances emotional understanding while enriching semantic representations, benefiting subsequent emotional generation tasks. 3) Reasoning-augmented generation and online feedback. We then fuse the reasoning token with emotional representations and feed them into a diffusion model to generate emotion-evoking images. Additionally, to create a generation-to-understanding feedback loop, we propose an Online Emotional Memory Bank (OEMB). It leverages newly generated images to progressively update the training dataset during training to reinforce understanding. Extensive experiments demonstrate the superior capabilities of our framework in both emotional understanding and generation tasks.

Abstract:
Learning discriminative micro-expression (ME) features from low-intensity facial movements is a key challenge for micro-expression recognition (MER). Although existing research has demonstrated that appearance, motion, and geometric information are discriminative for MEs, the effectiveness of merging this information is still unclear. Thus, this paper proposes a Multi-information Hierarchical Fusion Transformer (MiHF-Tr) model to fully and effectively aggregate the facial appearance, motion, and geometric information of MEs, exploring a more reasonable way of multi-information fusion. As the different types of information are homologous, MiHF-Tr introduces a local and global hierarchical fusion framework that fuses them by modeling their local and global semantic consistency. Considering that different types of information differ in feature representation ability, a single-core self-attention is proposed to achieve local multi-information fusion, which focuses on strong information and supplements it with weak information. The experimental results demonstrate that the fusion of appearance, motion, and geometric features is discriminative, and that the proposed method can effectively aggregate multiple types of information, achieving competitive performance.

Abstract:
Multimodal Sentiment Analysis (MSA) aims to identify sentiment polarity and intensity in media. Current methods typically employ a two-stage pipeline: extracting features from each modality, then predicting sentiment based on fused representations. However, most fusion strategies align features from different modalities in a single step, leading to conflicts during cross-modal interactions and hindering the modeling of hierarchical sentiment dependencies. Additionally, existing methods often overlook the dominant role of the textual modality in the high-level latent fusion space, causing explicit linguistic sentiment cues to be obscured by redundant information. To address these issues, DDSE (Decoupled Dual-Stream Enhanced framework) is proposed in this work, which decouples features into public and private representations for improved feature enhancement and cross-modal interaction. The proposed TC-Mamba module enables progressive cross-modal interactions within shared state transition matrices under a text-guided fusion paradigm, effectively preserving sentiment cues and minimizing redundancy. Additionally, DDSE adopts a multi-task learning strategy to further enhance overall performance. Extensive experiments on the MOSI and MOSEI datasets demonstrate that DDSE achieves state-of-the-art results, with Acc-5 improvements of 3.06% and 0.1%, respectively, underscoring its effectiveness in MSA. Ablation studies confirm the critical contributions of each component within the framework. Code is available at https://anonymous.4open.science/r/DDSE-76D6.

Abstract:
Supervised cross-modal hashing has achieved remarkable progress in retrieving related items across different modalities. However, in practical applications, a significant portion of data remains unlabeled, such as online data on websites, which must be included for effective retrieval. To address this challenge, while maintaining the high accuracy and efficiency of supervised methods, few works have attempted to adapt existing supervised techniques to handle unsupervised tasks through a general modular approach. To this end, we introduce a novel cross-modal hashing method, termed Label Prediction Inherited Hashing (LPIH). Initially, LPIH leverages labeled data to learn high-quality general label functions using supervised methods. Subsequently, it inherits the existing hash codes from existing supervised methods to further refine the pseudo-label information. Finally, LPIH integrates the refined pseudo-label information with the existing hash functions to learn new hash functions specifically tailored for unsupervised tasks. Extensive experimental results on three public datasets demonstrate the superior performance of LPIH compared to state-of-the-art (SOTA) cross-modal hashing methods. Specifically, LPIH achieves an average precision improvement of 5% over SOTA methods, highlighting its effectiveness in bridging the gap between supervised and unsupervised learning in the context of cross-modal retrieval.

Abstract:
Interactive data illustration in an immersive environment is challenging due to the inherent ambiguities that arise during interaction. These challenges are introduced by visual clutter and 3D occlusions resulting from depth information, as well as the relatively inefficient fine-grained manipulations required by handheld controllers on immersive devices. In this paper, we propose Meta-Illustrator, an illustration transfer tool that generates immersive 3D illustrations for volumetric data by transferring 2D illustrated results from one or more of its 2D slices (images). Initially, users can illustrate the slices expressively, owing to the many mature existing 2D sketching techniques and image processing algorithms. Then the 2D illustrated results on the slices can be intelligently transferred from their 2D image space to 3D volumetric space by Meta-Illustrator. Compared to the state-of-the-art image-to-image style transfer neural networks, which are either computation-intensive or memory-intensive, the proposed 2D-to-3D transferring approach can be built on a desktop PC without training. We demonstrate the usability, expressiveness, and effectiveness of Meta-Illustrator by both quantitative and qualitative evaluations.

Abstract:
Evaluating the visual quality of autostereoscopic 3D displays is crucial for quantifying their stereoscopic viewing experience and optimizing display performance. Existing quality evaluation methods primarily predict the visual quality of autostereoscopic 3D displays by indirectly learning display parameter information from image content. However, these methods fail to explicitly model the relationship between display parameters and visual quality, thereby limiting their prediction accuracy. To address this problem, a Multimodal Parameter Perception Network (MPPNet)-based visual quality assessment method is proposed in this paper, which treats display parameters as a textual modality to explicitly establish their relationship with visual quality. To effectively understand the semantic information of display parameter texts, a Contrastive Language-Image Pretraining (CLIP)-based adaptive text encoder is proposed to generate robust semantic representations by capturing both general and domain-specific semantic embeddings. In parallel, a hierarchical vision encoder is adopted to extract visual representations from display images, which simulates human binocular perception by capturing multi-level visual features from the left and right views. To achieve comprehensive cross-modal interaction, a Mamba-based cross-modal fusion module is proposed to fuse the textual representations of display parameters with the visual representations by capturing both shallow and deep correlations. Extensive experimental results demonstrate that the proposed MPPNet achieves state-of-the-art performance in evaluating the visual quality of autostereoscopic 3D displays.

Abstract:
Teleportation dominates Virtual Reality (VR) locomotion, known for its efficiency and ease of use, particularly in small physical spaces where users have limited room to move. However, is efficiency the only metric that matters? In this study, we challenge this well-established technique by comparing teleportation with an alternative approach for navigating constrained spaces: Walking-with-Portals. We conducted a user study (N = 24) comparing both techniques, collecting data on orientation, cybersickness, presence, and immersion. Our results reveal a distinct trade-off: Teleportation proved significantly faster, more efficient for pathfinding, and induced less cybersickness. Conversely, Walking-with-Portals significantly enhanced users' sense of spatial presence and perceived control over their movement. While a majority preferred Teleportation for the task's efficiency, 91.7% of participants identified Walking-with-Portals as the more immersive technique. Additionally, we establish key design guidelines for intuitive portal placement that allow a large virtual environment to fit within the small play areas (about 2.5 × 2.5 meters) common to home VR users. Our findings suggest that while teleportation remains the default solution in VR locomotion, Walking-with-Portals provides unique benefits that should not be overlooked. This work invites a reflection on locomotion in VR, arguing that the future of VR navigation may go beyond teleport.

Abstract:
As the use of dynamic point clouds (DPCs) expands in immersive media settings including augmented and virtual reality, it has become more important than ever to have precise and scalable methods for quality evaluation. However, most existing objective Point Cloud Quality Assessment (PCQA) methods focus on static content and fail to capture the temporal dynamics and multimodal perceptual cues inherent in dynamic scenarios. In this work, we propose a no-reference dynamic PCQA framework that integrates both geometric and visual modalities with global temporal modeling for perceptually aligned quality prediction. For the 3D modality, we extract localized spatio-temporal features using a time-aware point cloud encoder that incorporates the normalized frame index as an additional input channel. In parallel, we generate two complementary projections per frame and extract visual features using a pre-trained convolutional network. A dynamic gating network adaptively weights the contributions of the two modalities at each time step. These weighted features are fused and passed to a temporal transformer, which captures long-range temporal dependencies to regress the final quality score. Comprehensive tests on benchmark datasets reveal that our approach surpasses existing full-reference and no-reference PCQA techniques, demonstrating its efficacy in assessing the quality of dynamic point clouds.
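
Two of the steps described above lend themselves to a small sketch: appending the normalized frame index as an extra per-point channel, and gating two modality features at a time step. The normalization t / (T - 1), the softmax gate, and the weighted-sum fusion are assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
import math


def add_time_channel(frames):
    """Append a normalized frame index to every point.

    frames: list of frames, each a list of (x, y, z) tuples.
    Returns frames of (x, y, z, tau) with tau in [0, 1].
    """
    n = len(frames)
    out = []
    for t, points in enumerate(frames):
        tau = t / (n - 1) if n > 1 else 0.0
        out.append([(x, y, z, tau) for (x, y, z) in points])
    return out


def gated_fusion(feat_3d, feat_2d, gate_logits):
    """Fuse two equal-length feature vectors with softmax gate weights.

    gate_logits: two scores (3D, 2D) that a gating network would output.
    """
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]  # stable softmax
    w_3d, w_2d = exps[0] / sum(exps), exps[1] / sum(exps)
    return [w_3d * a + w_2d * b for a, b in zip(feat_3d, feat_2d)]


clip = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
timed = add_time_channel(clip)
fused = gated_fusion([1.0, 1.0], [3.0, 3.0], [0.0, 0.0])
```

With equal gate logits the two modalities are averaged; a larger logit for one modality shifts the fused feature toward it.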

Abstract:
Localizing text under low-light conditions has gained attention, with typical approaches relying on two-stage cascaded modules that combine low-light enhancement and text localization. However, these often require additional enhancement modules and make joint optimization inefficient. In this work, we address the challenge by adopting a novel approach: tailoring the detector for low-light conditions through knowledge distillation from normal-light conditions, without relying on any enhancement module. First, we design a Graph Topological Aggregation (GTA) model that utilizes the message passing mechanism of graph neural networks to structurally represent text topology and facilitate structured feature expression in knowledge transfer. We then introduce two specially designed knowledge transfer constraints aimed at enhancing the learning of text's multi-scale features and topological knowledge. Finally, we propose a Stage-wise Bilevel Knowledge Transfer learning strategy that designates the low-light learning process as the upper-level task, while treating normal-light learning as the lower-level task, effectively addressing the coupling issues and sequential dependencies prevalent during the distillation process. Extensive experiments underscore the approach's superiority.

Abstract:
Screenshot tools bring convenience to daily workflows but simultaneously pose risks of screen content leakage. Most existing image watermarking methods struggle to protect dynamic screen content (DSC) due to two key limitations: low generalizability across diverse file types and reliance on cover images, which cause difficulties in handling diverse and dynamic screen content. Recently proposed screen-targeted watermarking methods offer cover-independent, fast-response solutions for DSC protection, but they mainly support large-scale screenshots and struggle to balance robustness and visual quality, limiting real-world applicability. To address these issues, in this work, we propose DynMark, a novel cover-independent watermarking scheme for DSC protection. Our method generates a watermark mask that is directly overlaid onto the screen surface for embedding, without relying on the underlying content. As a result, the approach maintains the same watermark mask even when the screen content changes, ensuring stability without the need for updates. Specifically, we use an invertible neural network (INN) to generate watermark and location blocks, jointly optimized with the decoder and locator. Additionally, edge smoothing is applied to further enhance visual quality. These components are integrated into a three-stage training framework to ensure robust performance. This design ensures stable extraction even from screenshots as small as 256 × 256, overcoming the limitations of existing methods regarding screenshot size. Extensive experiments show that our method achieves superior visual quality, extraction accuracy, and adaptability to different screenshot tools and screen resolutions, offering an efficient and practical solution for protecting screen content.

Abstract:
Visual perception of nighttime images is often compromised by co-existing low-light and blur degradations. While recent methods have made progress in jointly solving these degradations, the diversity of patterns and intensities in degradation has not been properly considered, leading to inconsistent illumination and unintended artifacts. In response, we propose to integrate perceptual cues with mixture-of-experts (IPCMoE) to achieve flexible processing for low-light blurry images. By exploiting the perceptual cues, we strategically combine dedicated experts with a selective collaboration approach for feature enlightening and texture restoration. To this end, we develop perceptual-integrated MoEs by designing customized routers and task-dependent experts. Specifically, the texture memorial MoE is developed to preserve valuable features to restore high-fidelity details, and the enhancement MoE, which adaptively integrates enlightening cues and texture cues, is designed to formulate the relationship between feature enlightening and texture restoration, thereby achieving dynamic image processing. Extensive experiments show that our method achieves state-of-the-art performance on the LOL-Blur and Real-LOL-Blur datasets.

Abstract:
Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn novel concepts from limited training samples without forgetting previously encountered classes. Recent advancements have leveraged Parameter-Efficient Tuning (PET) strategies on pre-trained models to enhance FSCIL performance. However, current PET-based FSCIL approaches still suffer from the catastrophic collapse of the general prompt and the limited adaptability of the specific prompt. To this end, we redefine the function of the PET paradigm with both gradient-aware prompting (GAP) and router-free adapters (RFA) to boost the performance of FSCIL, termed "PET-GPRA". To dynamically balance the retention of previously learned general knowledge and the acquisition of novel class information across sessions, the GAP paradigm adaptively adjusts the updated gradient of the general prompt by leveraging the angular relationship between the general knowledge gradient and the novel knowledge gradient. Meanwhile, the RFA mechanism utilizes the semantic similarity between class attributes to replace the routing network, guiding the integration of adapter information, in which adapters serve as specific prompts to enhance the adaptability. Extensive experiments on multiple benchmark datasets consistently demonstrate the superiority and effectiveness of our proposed PET-GPRA framework over state-of-the-art baselines.

Abstract:
Forged videos are often subjected to double compression. When a forger maliciously or unintentionally increases the video's bitrate during re-encoding, the resulting videos are termed fake bitrate videos. Detecting these videos offers a generalized approach for efficiently identifying potentially forged content within large datasets. However, previous research has largely focused on video-level detection of fully fake bitrate videos, where an entire video is re-encoded at a higher bitrate after content modification or the creation of fake high-definition (HD) footage. In practice, a skilled forger may adjust the bitrate of only specific video segments, generating partial fake bitrate videos, a common manipulation in tampering processes like video splicing. Existing methods face difficulties in detecting such partial modifications at the frame level and in pinpointing the manipulated segments. Our study addresses this gap by introducing a novel frame-level detection approach, which significantly enhances forensic precision. We simultaneously account for two types of abnormal frames arising from re-encoding and bitrate escalation and, for the first time, define fake bitrate video detection as a three-class classification problem. To meet the challenges of this task, we extract anomalous bitrate-compression traces that capture subtle differences among the three frame types. Additionally, we propose the Trident Transformer Network (TTNet), a model designed to effectively integrate and learn high-frequency information within the encoding domain. Our approach achieves substantial improvements in accuracy, surpassing state-of-the-art methods by 3.62% and 11.95% in video-level and frame-level detection scenarios, respectively.

Abstract:
Remote sensing image restoration under cloud and haze occlusions poses a significant challenge due to severe spectral degradation and spatial distortions. While recent generative models have shown promise in image restoration, they struggle with three key issues: (1) lack of precise annotations, making supervised methods unreliable; (2) unintended interference with clear regions, leading to distortion in unaffected areas; (3) spectral and structural inconsistencies in heavily occluded regions, limiting realistic recovery. To address these challenges, we propose Saliency-Guided Adaptive Random Diffusion Strategy (SG-ARD), a novel blind restoration framework that integrates saliency-aware guidance with adaptive diffusion for enhanced reconstruction. First, we introduce a Saliency-Guided Pseudo-label Generation module (SGPG) to identify degraded regions and generate pseudo-labels for blind restoration. Second, we propose an Adaptive Random Diffusion Correction Strategy (ARDC), which employs a Random-Walk-based Diffusion and an Adaptive Enhancement module to refine local and global texture pseudo-labels. Lastly, we design a Spectral-Aware Consistency Loss (SAC) to improve spectral fidelity, ensuring that the generated content aligns with the real spectral distribution. Extensive experiments on three large-scale remote sensing datasets demonstrate that SG-ARD outperforms state-of-the-art generative restoration models, producing high-fidelity, visually coherent remote sensing images.

Abstract:
Skeleton-based human action recognition (HAR) is greatly affected by abnormal situations in real-world scenarios, such as occlusions and performance limitations of motion capture devices. Although recent research has enhanced the robustness of recognition by incorporating occlusion simulation in model training, it is still insufficient to effectively handle the complex and diverse abnormal situations in real-world scenarios. To address this issue, we propose SCCEAP, a novel framework combining fine-grained Skeleton Compression and Complementary Enhanced Adaptive feature fusion with the supervision of branch-stage text Prompts, for robust skeleton-based HAR. Our contributions lie in three aspects. First, the fine-grained skeleton compression is designed to generate multi-granularity skeleton sequences with diverse spatial details by fusing joints in the human skeleton according to their joint reliabilities and correlations. Second, we devise the complementary enhanced adaptive feature fusion, which utilizes motion details and stable semantic descriptions of motion features of the uncompressed and compressed skeleton sequences, respectively, for complementary enhancement and adaptive feature fusion. Third, the branch-stage composite text-prompt supervision is performed to integrate both branch-wise and stage-wise text-prompt supervision for improving the ability to learn fine-grained spatiotemporal relationships of motion features. Experiments on three benchmark datasets (NTU RGB+D, NTU RGB+D 120, and Kinetics-400) demonstrate that SCCEAP achieves state-of-the-art (SOTA) results, excelling on both normal and noisy skeleton data.

Abstract:
Flexible object recognition remains challenging in multimedia scenarios due to inherently diverse shapes and sizes, and subtle inter-class differences. Graph-based vision models show promise in flexible object recognition by capturing variable relationships. However, they suffer from two problems: (1) inter-class ambiguity hinders model discrimination and (2) frequent scale changes degrade model generalization. To address these limitations, we propose a unified graph distillation framework that enhances inter-class discrimination and spatial generalization while maintaining computational efficiency. For the inter-class ambiguity problem, we introduce a virtual prototype module that dynamically generates learnable class prototypes by clustering intermediate features. These prototypes are incorporated into the distillation loss to sharpen decision boundaries. A global-local distillation mechanism further captures both image-level global semantics and patch-level local details, enhancing inter-class discrimination. For the frequent scale change problem, we design a patch-aware distillation strategy that transfers knowledge across multiple patch scales, strengthening the student model's spatial generalization to match the various shapes and sizes of flexible objects, thus alleviating generalization degradation. Extensive experiments on flexible-object datasets (FDA, FSCW, CCSN) and challenging benchmarks (CIFAR-100, Mini-ImageNet) confirm the effectiveness and efficiency of our method.

Abstract:
Recent advances in VR have created a growing demand for immersive content, especially in viewing music performances. However, most existing videos are captured with a narrow field of view, limiting their applicability in VR environments. In this paper, we present a practical system that converts fixed-camera music performance videos into immersive VR experiences via high-resolution video outpainting. Our method leverages pre-trained text-to-image diffusion models with multi-conditioning based on ControlNet to generate spatially and temporally consistent frames at high resolution. While our approach builds on existing components, we introduce a novel orchestration of these tools tailored specifically for immersive video generation, requiring no additional training and running efficiently on consumer GPUs. The system supports videos of arbitrary length via frame-by-frame processing and produces seamless 8K outputs through lightweight post-processing. Extensive experiments and user studies demonstrate that our system outperforms state-of-the-art methods in perceptual quality and viewer immersion, offering a scalable pathway for repurposing conventional footage into high-fidelity VR content.

Abstract:
Listening head generation aims to synthesize realistic and responsive non-verbal listener head motions that respond to speakers in conversational scenarios. Existing methods typically rely on fixed audio-visual input modalities and predefined emotion labels, limiting their adaptability and expressiveness in real-world scenarios. In this paper, we propose a novel real-time framework, REA-Listener, to generate high-fidelity listening head videos with flexible modality adaptation and dynamic emotion modeling. Specifically, we first propose a Modality-Adaptive Mixture of Experts (MA-MoE) module to encode arbitrary combinations of speaker audio and visual signals into a unified embedding space, ensuring robustness under partial modality conditions. To further enhance the temporal consistency of listener emotion, we present a lightweight emotional head dynamics generator with a multi-modal emotion predictor, which infers listener emotions dynamically from speaker context alongside head motion coefficient prediction. Finally, we employ a 3D-aware renderer based on 3D Gaussian Splatting to produce high-quality listener head videos in real time. With these components, our approach achieves efficient head motion generation at 30fps on a single NVIDIA RTX 3090 GPU, supporting real-time interaction. Extensive evaluations and applications demonstrate that our method outperforms state-of-the-art methods in listening head generation.

Abstract:
Classical spatiotemporal sequence prediction tasks are designed to forecast future image sequences based on historical observations. However, the inherent unpredictability of future events often renders this process uncontrollable due to infinite possibilities in nature, limiting the broader applicability of this technology. In this study, we explore the use of text prompts to constrain the probabilistic space of future outcomes, resulting in more controllable future prediction that complies with user intent. We primarily address two critical challenges in this research setting: (i) text-vision misalignment, where embeddings extracted by text pre-trained models are not strictly aligned with visual embeddings, leading to predictions semantically irrelevant to text prompts; and (ii) spatiotemporal modeling distortion, where the fixed observation interval during training causes the model to produce unrealistic results when reasoning over longer time dimensions. To tackle these issues, we propose a text-prompted spatiotemporal sequence prediction (TPS2P) model, leveraging historical observations and textual prompts to predict probabilistic future outcomes. In this model, a text-vision prompt refiner (TV-Refiner) is introduced to provide aligned textual and historical visual embeddings for integrating the denoising diffusion prediction process. Additionally, a spatiotemporal-masked diffusion transformer (StMDiT) is proposed by exploiting masked attention in constituting spatial and temporal self-attention modules within latent diffusion processes, enabling the model to observe more sequences of varying spatiotemporal patterns during training. We conduct extensive experiments on the Something-Something V2 (Sthv2) and BridgeData datasets. The reported results demonstrate that TPS2P predicts more accurate and higher-quality future sequences that comply better with user intent through textual control.

Abstract:
Generating high-quality, user-preferred backgrounds for e-commerce product images poses unique challenges for diffusion models, particularly in aligning outputs with human visual preferences. While Direct Preference Optimization (DPO) has shown promise in aligning generative models with human feedback, its application to diffusion models faces key limitations, including the trade-off between reward sparsity and supervision quality, mode collapse, and training instability. To tackle these issues, we propose Direct Expected Preference Optimization (DEPO), a novel framework that adapts DPO to diffusion models through redesigned training and sampling strategies. Specifically, DEPO introduces a DEPO loss combined with trajectory segmentation to enable more frequent and informative reward feedback, employs Langevin MCMC to broaden the exploration space and mitigate mode collapse, and leverages masks to effectively constrain the search space while incorporating targeted engineering designs to improve training stability. By directly linking image-domain evaluations to expected log probabilities and incorporating adversarial training, DEPO achieves better alignment with user preferences while maintaining high image fidelity. Experimental results demonstrate that DEPO surpasses existing methods in both the diversity and quality of background generation.

Abstract:
Multi-conditional image generation aims to create customized images that align with multiple specified conditions. Existing methods, whether through end-to-end training or by fine-tuning adapters to integrate pre-trained control modules of the same category (e.g., LoRA, IP-Adapter, ControlNet, T2I-Adapter), are restricted to a closed set of predefined input conditions. To overcome this limitation, we propose ModuleTeam, a training-free method for latent mixture of arbitrary control modules, capable of handling open-set conditions by incorporating the corresponding modules. The design of ModuleTeam is rooted in two key findings: (i) modules interfere with each other at the level of model parameters, and (ii) module weights contribute to the generated images by affecting the noise predictions within the diffusion process in an approximately linear manner. The first finding motivates our latent mixture approach, which mixes the control modules by aggregating their latent variables between diffusion model blocks. The second finding enables a multi-inference module reweighting strategy that balances module contributions to generation, requiring no additional training or fine-tuning overhead. Extensive results demonstrate that ModuleTeam not only outperforms existing methods but also provides flexibility in the types of conditions and scalability in their number.
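
A minimal sketch of what a latent mixture with per-module weights could look like is given below. The abstract only states that latent variables are aggregated between diffusion model blocks and that module weights act approximately linearly; the residual-sum rule, the function name `mix_latents`, and the toy shapes are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def mix_latents(base_latent, module_latents, weights):
    """Combine the latents produced with each control module active.

    base_latent:    latent from the block with no control module
    module_latents: one latent per module, same shape as base_latent
    weights:        one scalar per module (assumed to act linearly)
    """
    if len(module_latents) != len(weights):
        raise ValueError("need exactly one weight per module")
    base = np.asarray(base_latent, dtype=float)
    mixed = base.copy()
    for latent, w in zip(module_latents, weights):
        # Each module contributes its weighted offset from the base latent.
        mixed += w * (np.asarray(latent, dtype=float) - base)
    return mixed

base = np.zeros(4)
mixed = mix_latents(base, [np.ones(4), 2 * np.ones(4)], weights=[0.5, 0.25])
```

Because each module is evaluated separately against the same base latent, module parameters are never merged, which is one way to read the first finding about parameter-level interference.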

Abstract:
Medical Foundation Models (MFMs) are revolutionizing radiography image analysis with scalable and generalized diagnostic capabilities. However, their effectiveness in real-world clinical practice is limited due to insufficient interpretability. To address this limitation, we propose RadLAS, a novel MFM for interpretable Radiographic image analysis that introduces Lesion-Aware Self-supervised pre-training. Unlike conventional MFMs that rely on post-hoc explanations, RadLAS directly emulates human diagnostic reasoning by first grounding lesion evidence and then making decisions accordingly. Specifically, RadLAS introduces two self-supervised tasks: (I) Lesion-grounded Reconstruction, which learns structured anatomical representations by restoring lesion-aware image patches into their healthy counterparts, thereby facilitating pixel-level grounding of lesion evidence via input-normal contrast; and (II) Lesion-discrimination Contrastive Learning, which enhances lesion-aware patterns in representations by explicitly decoupling grounded lesion evidence as clinical cues and aligning them with global semantics, thereby enabling direct lesion-oriented diagnosis while preserving global context. RadLAS demonstrates excellent performance across diverse downstream radiographic datasets, offering verifiable explanations by deriving specific diagnoses (Task II) based on grounded lesion evidence (Task I), while preserving generalized representations essential for high diagnostic accuracy. Extensive experiments demonstrate that RadLAS (i) achieves superior interpretability with highly correlated lesion prediction and localization, surpassing 11 interpretable medical models; and (ii) delivers scalable representation learning, outperforming 14 SOTA supervised and self-supervised MFMs.

Abstract:
In UAV applications, dense haze severely obscures small ground-level objects, hindering the recovery of fine details. Existing visible-only dehazing methods struggle with such dense occlusions, while infrared imaging lacks color and fine texture information. To address these limitations, we propose the Haze Distribution-aware Cross-modal Fusion Network (HDCFN). HDCFN features two key components: (i) an infrared-guided multiscale feature enhancement framework that integrates haze-resistant structural cues from the infrared modality with visible features across coarse-to-fine scales, improving the recovery of small objects, and (ii) a haze distribution-aware cross-modal fusion module that adaptively prioritizes relevant information from each modality according to haze density. This framework effectively combines the complementary strengths of visible and infrared imaging for dense haze removal. Extensive experiments on multiple public datasets show that HDCFN outperforms state-of-the-art dehazing and fusion methods, yielding higher-quality and more detailed images.

Abstract:
Chinese Classical Studies (CCS) is a pivotal gateway to ancient Chinese culture. Spanning ancient texts, illustrations, paintings, and calligraphy, CCS presents significant challenges for non-specialists due to its language and visual complexity. While Large Language Models (LLMs) have been explored to facilitate CCS, current methods primarily focus on textual analysis, overlooking the rich visual information intrinsic to classical materials. To bridge this gap, we propose TongGu-VL, a pioneering specialized MLLM designed for CCS applications. Our contributions are threefold. First, we construct CCS358K, a comprehensive multimodal instruction dataset to enhance MLLMs' CCS capabilities. Second, we propose Parameter Sensitivity-Guided Instruction Tuning (PSG-IT), a novel method that mitigates catastrophic forgetting without data replay. It effectively preserves TongGu-VL's general skills, while optimizing its CCS performance. Third, we design a Visual-Text Early Fusion (VTEF) module, which harnesses MLLMs' modality alignment to generate instruction-aware visual representations, thus improving language modeling. Extensive experimental results demonstrate that our model outperforms existing MLLMs on a broad range of CCS tasks, while maintaining general capabilities that benefit other domains beyond CCS. Our model and dataset will be publicly available.

Abstract:
Residential design is a complex and open-ended problem that requires designers to integrate diverse types of input information while adhering to stringent energy consumption standards. However, most current research in this field focuses on generating floor plans from a limited set of input types, often neglecting to incorporate energy-related physical constraints. Existing approaches are limited by: (1) the lack of multimodal datasets in this domain, (2) the absence of comprehensive residential energy consumption data, and (3) the challenges associated with effectively integrating multiple input types into a unified model. To address these challenges, we propose MRED-14, the first large-scale Multimodal Residential Energy Dataset, comprising 14 input types, including energy consumption values, vector drawings, and textual descriptions, paired with 41,280 high-quality residential floor plans that have been scored and annotated by human experts. Based on this dataset, we introduce the LER-net model, which can flexibly adapt to various input types and generate low-energy residential floor plans. Experimental results demonstrate that LER-net outperforms existing models, achieving state-of-the-art performance under the same input conditions. In addition, the energy consumption of the generated floor plans is reduced by 5.1% compared to the actual residential designs. Further expert evaluations confirm the LER-net model's feasibility for use in residential design.

Abstract:
This study systematically uncovers and quantitatively evaluates the pervasive Orientalist biases in text-to-image (T2I) and text-to-video (T2V) generation models through a sociocultural lens grounded in postcolonial Orientalist theoretical frameworks. We identify systematic biases in the visual representations produced by multimodal generative models, including hyper-exoticization and temporal alienation. These biases mirror colonial-era narratives and undermine equitable sociocultural communication. Through empirical analysis of 8 mainstream T2I models and 4 T2V models, we demonstrate that culturally neutral prompts related to China consistently generate visual outputs embedded with Orientalist biases. We develop a novel visual question answering (VQA) framework as an evaluation metric, leveraging a state-of-the-art vision-language model (VLM) to establish the first automated quantitative assessment methodology for such biases. A mitigation framework employing a large language model (LLM) is proposed and experimentally validated. This interdisciplinary work illuminates the societal implications of multimodal generative models while advancing efforts toward fair and inclusive social computing.

Abstract:
The dynamic variations of food quality across spatial and temporal scales pose significant challenges for global food safety and nutrition research, requiring comprehensive analysis of diverse, multi-modal, and distributed data while preserving privacy. Existing centralized approaches suffer from data silos and limited collaboration, and although federated learning and blockchain technologies have shown promise independently, their combined potential for incentivized, privacy-preserving, and heterogeneous model collaboration remains underexplored. In this paper, we propose the concept of a Global Spatial-Temporal Food Memory, a novel research paradigm that envisions secure, decentralized, and incentivized collaboration among multiple stakeholders worldwide, leveraging blockchain-enabled token-based rewards integrated with federated learning of heterogeneous models. We discuss the scientific challenges and opportunities inherent in this vision, including multi-modal data fusion, trustworthy incentive mechanisms, and scalable long-term temporal analysis. This work aims to open new avenues in multimedia research by bridging decentralized AI, blockchain, and spatiotemporal food quality monitoring, providing a foundation for future explorations in privacy-preserving, collaborative, and large-scale multimedia data analysis.

Abstract:
Artificial intelligence will not achieve genuine empathy until models can reason about the causes of human emotions rather than only label them. Current datasets fail to support this objective: existing emotional causality datasets primarily focus on textual modalities, lack non-verbal information such as speech and facial expressions, and feature relatively short dialogues, which limits research on long-term emotional evolution. Existing annotations concentrate on stimulus-response patterns and lack cross-temporal emotional causal chain annotations, failing to reveal how early events accumulate and ultimately trigger emotional changes. In this work, we introduce Genesis, the first multimodal dialogue dataset supporting long-term emotional causality analysis. Genesis contains 1,000 dialogues averaging 208 turns each, spanning debate, family, educational, and social scenarios. Through a two-layer annotation system (proximal cause identification and long-term causal chain tracking), Genesis labels complex emotional phenomena including cross-modal inconsistencies and long-distance causal dependencies. Our evaluation of 20 mainstream multimodal models reveals limitations in current approaches for long-term emotional causality. We propose Empathica as an evaluation baseline, employing a Recognition-Memory-Attribution architecture that integrates dynamic sliding windows and event aggregation mechanisms to address multimodal emotional causality modeling challenges. Empathica outperforms the text-based model GPT-o1 and the multimodal models Gemini 1.5 Pro and GPT-4o across all evaluation metrics.

Abstract:
This paper demonstrates a pioneering unified multimodal agent that transforms complex visual content creation into an intuitive, conversational experience, allowing users to talk, imagine, and evolve their ideas. Overcoming the limitations of fragmented multimodal technique tools, our system seamlessly integrates text-to-image generation, instruction-based image editing, text/image-to-video generation, and interactive understanding within a single AI interface. Users of all skill levels can perform sophisticated visual tasks using natural language and visual inputs. The system's architecture features a central Coordinator module processing multimodal inputs and directing tasks to Generation or Chat pathways. For Generation, a Planner utilizes our state-of-the-art specialized models in image/video generation and image editing, while the Chat function facilitates clarification and collaboration. The interactive demonstration will showcase intuitive multimodal input, seamless real-time content creation/editing, dynamic interactive understanding, and a unified workflow. This agent pioneers a new way for accessible, interactive visual storytelling and collaborative content creation in multimodal generative AI.

Abstract:
In real-world cooking scenarios, users often need to create personalized recipes based on limited ingredients, dietary goals, and various restrictions, such as the availability of equipment, flavor preferences, and health conditions. Existing recipe generation methods lack the flexibility and adaptability to meet these individual needs. We present LetMeCook, an end-to-end interactive system for personalized recipe generation that leverages multimodal perception, hybrid retrieval, and content generation. Given a photo of the user's refrigerator and a dietary profile, LetMeCook detects available ingredients, retrieves relevant recipe candidates, and generatively refines them based on user requirements, such as ingredient substitution and flavor adjustment. The system provides both textual and visual previews of the adapted recipes, offering a highly interactive and user-centric experience.

Abstract:
Nüshu (Jiangyong Nüshu), a script developed and practiced exclusively by women in China, holds recognition as National Intangible Cultural Heritage. This paper presents the design and implementation of an immersive multimodal interaction system centered on The Song of Nüshu, a foundational cultural artifact. By integrating Natural Interaction (NI) and Mixed Reality (MR) technologies, the system constructs a visual environment inspired by classical Chinese landscape painting aesthetics. Within this experiential space, users engage with representations of pivotal female historical figures and events interwoven with Nüshu textual elements. This framework enables participant-driven narrative construction that foregrounds Nüshu's aesthetic dimensions and socio-cultural significance. Our work advances a heritage preservation methodology that innovatively reinterprets tradition while explicitly foregrounding women's historical agency and expressive practices.

Abstract:
This companion paper provides artifacts and instructions on replicating the experiments in the ACM Multimedia 2024 paper entitled "Swarical: An Integrated Hierarchical Approach to Localizing Flying Light Specks." Swarm-based hierarchical, Swarical, is a localization technique that enables miniature drones, Flying Light Specks (FLSs), to accurately and efficiently localize and illuminate complex 2D and 3D shapes. It consists of two components, an offline planner and an online localization technique that executes on an FLS. The offline planner uses the FLS sensor specification for positioning to convert mesh files into swarms of FLSs. Some FLSs are dark and used only for localization. We reported the online localization technique to be fast and highly accurate. We describe how to reproduce this finding using our artifacts.

Abstract:
LAVA Challenge 2025 aims to improve the ability of large visual language models to accurately understand complex visual information such as data flow diagrams and Gantt charts contained in Japanese government and business documents. For this challenge, we adopted a two-stage approach consisting of retrieval and reading comprehension. Specifically, in the retrieval step, we select pages relevant to the question from multi-page PDF documents, and in the reading comprehension step, we perform question answering by referring to the top k images selected in the retrieval step. For the retrieval step, we employ ColQwen2, which performs visual information retrieval using the multilingual Qwen2-VL. For the reading comprehension step, we propose a method that performs question answering, including voting, using the multilingual visual language model Qwen2.5VL under different prompts, model sizes, and image qualities. In LAVA Challenge 2025, we clarify the importance of the two-step process of retrieval and reading comprehension in visual question answering, and verify the effectiveness of a method that selects a final answer from the multiple results of the reading comprehension step through voting.
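
The voting step can be illustrated with a minimal majority-vote sketch over answers produced under different prompts, model sizes, and image qualities. The abstract does not specify the voting rule, so the whitespace and case normalization, the first-occurrence tie-break, and the function names `normalize` and `vote` are assumptions for illustration.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lower-case and collapse whitespace so trivially different answers match."""
    return " ".join(answer.strip().lower().split())

def vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the earliest candidate."""
    if not answers:
        raise ValueError("no answers to vote on")
    counts = Counter(normalize(a) for a in answers)
    best = max(counts.values())
    for a in answers:
        if counts[normalize(a)] == best:
            return a.strip()

# Hypothetical answers from three reading-comprehension runs.
final = vote(["Tokyo", "tokyo ", "Osaka"])
```

Ordering the candidate list by expected reliability (for example, largest model first) makes the tie-break favor the more trusted configuration, which is a design choice of this sketch rather than something stated in the abstract.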

Abstract:
We introduce multiXview, an interactive retrieval framework for synchronized multi-camera video collections. It features a multi-index search engine that supports natural-language queries over visual embeddings, speech transcripts, and scene descriptions. It supports a synchronized multi-stream player offering parallel playback, and a timeline-based navigation view for temporal scoping and faceted exploration. These components address the redundancy and fragmentation of overlapping egocentric and exocentric video feeds and enable users to locate, aggregate, and reconstruct events across partial perspectives. This paper focuses on system design and implementation, with quantitative and qualitative evaluation to take place at the CASTLE 2025 Grand Challenge Interactive Track.

Abstract:
With the advancement of digital technologies and gadgets, online content has become easily accessible. At the same time, harmful content also spreads widely. Different types of harmful content are present on various platforms in multiple languages. The topic of harmful content is broad and covers multiple research directions. Users of platforms are affected by all of them. In research, the different forms, e.g. misinformation, cyber-bullying and hate speech, are mostly analysed separately. Most research has been conducted for only one platform, for a monolingual situation or on a particular issue. Counter-measures like blocking or down-ranking can make harmful content spreaders switch platforms and languages to continuously reach a user base. Harmful content does not only appear on social media but also on news media. Spreaders share harmful content in posts, news articles, comments and hyperlinks. There is a great need to study harmful content across platforms, languages, and topics. We plan to bring the research on harmful content under one umbrella such that different approaches and novel methods can be shared. The workshop will also cover the currently ongoing issues of war and elections. We propose the workshop, DHOW: Diffusion of Harmful Content on Online Web, which brings together the research on different topics of harmful content. We expect to discuss innovative research work and future research directions. The proposed workshop is the next iteration of DHOW 2024 (https://dhow-workshop.github.io), previously organized at ACM WebSci 2024 in Stuttgart, Germany.

Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs), coupled with the progress of reinforcement learning, have substantially enhanced reasoning and decision-making across modalities, including text, vision, audio, and video. This tutorial introduces the fundamental principles, methodologies, and practical applications of MLLM reasoning, with a particular emphasis on strengthening reasoning capabilities in multilingual and cross-domain settings. We further discuss the key challenges and limitations of current multimodal reasoning approaches, as well as future directions for advancing the field. By highlighting how MLLMs support enhanced reasoning and planning in cross-lingual and cross-domain contexts, this session aims to equip researchers and practitioners with the conceptual foundations and practical tools needed to effectively integrate MLLM reasoning into their work.

Abstract:
The demand for 3D spatial information is rapidly increasing across a wide range of industrial fields. For instance, 3D point cloud data is being actively adopted in sectors such as construction, civil engineering, and disaster prevention to enhance work efficiency and safety. However, 3D point cloud data typically involves extremely large data volumes, which presents a significant challenge; as a result, rapid sharing and utilization over public networks has yet to be fully realized. To address these issues, we are engaged in research and development of compression and transmission technologies for 3D point clouds, as well as contributing to international standardization. In this talk, as part of our initiatives aimed at industrial applications, we will present the latest trends in the international standardization of 3D point cloud compression technologies, such as Geometry-based Point Cloud Compression, and introduce case studies from demonstration experiments utilizing these technologies.

Abstract:
Skeleton-based human action recognition has wide applications in video understanding and virtual reality. However, most existing methods focus excessively on spatial location and global movement, while underrepresenting subtle and local actions. To address this limitation, we propose a Kinematic Enhanced Hypergraph Convolutional Network (KEHCN) with LLM training guides. The network mainly consists of LLM Training Guides (LTG), Kinematic Hypergraph Convolution (KHC), and a Kinematic Gating Module (KGM). Specifically, we use the hypergraph convolutional network to extract high-order correlated human skeleton features, the KHC to encode the kinematic features, and the LTG to provide a pre-trained large language model to generate text and kinematic description features during the training phase. Based on the Mixture of Experts (MoE) framework, we simplify the gating network by introducing a kinematic feature threshold, thereby constructing a dual-branch global and local motion expert network (KGM). We integrate kinematic features into KHC, LTG, and KGM to seek improvements from three perspectives, all of which enhance performance. Experiments on three benchmark datasets (NTU RGB+D, NTU RGB+D 120, and NW-UCLA) demonstrate state-of-the-art performance compared to current open-source methods.

Abstract:
Federated Prompt Tuning (FPT) integrates large pre-trained Vision Transformers (ViT) into Federated Learning (FL) by leveraging Visual Prompt Tuning (VPT), achieving state-of-the-art performance with enhanced efficiency across various visual downstream tasks. However, data heterogeneity, such as feature shift and class imbalance, limits prompts' transferability and robustness in FPT. Existing methods primarily focus on generalized FPT (GFPT) or personalized FPT (PFPT), while only a few methods make initial attempts to integrate both approaches. In this paper, we propose a new FPT framework, dubbed DualFPT, which handles data heterogeneity from both generalized and personalized perspectives. Specifically, DualFPT divides the learnable prompts into global and local prompts to jointly capture general and client-specific information, achieving the harmonization of GFPT and PFPT. The generalization of DualFPT is realized by the Feature Sharing (FS) mechanism, which effectively narrows the distribution gap by allowing clients to securely share a portion of sensitive features. The key feature for improving personalization is the Prompt Composition Scheme (PCS), which weights local prompts with distribution similarity to generate composite prompts, thus achieving automatic distribution adaptation. Extensive experiments under feature shift and class imbalance scenarios demonstrate the superior performance of DualFPT. On DomainNet and CIFAR-100 (D (0.1)), DualFPT surpasses SGPT by 4.35% and 5.89% for generalization along with 7.89% and 9.09% for personalization. Ablation studies further validate the effectiveness, efficiency, and security of DualFPT.
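
The idea of weighting local prompts by distribution similarity can be sketched as follows. The abstract does not define the similarity measure or the normalization, so the cosine similarity, the softmax with a `temperature` parameter, the function name `compose_prompts`, and the array shapes are assumptions for illustration only.

```python
import numpy as np

def compose_prompts(local_prompts, client_dists, target_dist, temperature=1.0):
    """Blend local prompts into one composite prompt (illustrative assumption).

    local_prompts: (K, L, D) array, one prompt of L tokens per client
    client_dists:  (K, C) array, e.g. per-client class histograms
    target_dist:   (C,) distribution of the data to adapt to
    """
    prompts = np.asarray(local_prompts, dtype=float)
    dists = np.asarray(client_dists, dtype=float)
    target = np.asarray(target_dist, dtype=float)
    # Cosine similarity between each client distribution and the target.
    norms = np.linalg.norm(dists, axis=1) * np.linalg.norm(target) + 1e-12
    sim = (dists @ target) / norms
    # Softmax turns similarities into weights that sum to one.
    logits = sim / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()
    # Weighted sum over the client axis gives an (L, D) composite prompt.
    composite = np.tensordot(weights, prompts, axes=1)
    return composite, weights

prompts = np.stack([np.zeros((2, 3)), np.ones((2, 3))])
composite, weights = compose_prompts(prompts, [[1, 0], [0, 1]], [0, 1])
```

In this toy call the second client's distribution matches the target, so its prompt receives the larger weight; lowering `temperature` would sharpen that preference.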

Abstract:
Images captured in low-light nighttime scenes suffer from light effects. Existing nighttime visibility enhancement methods predominantly focus on low-light image enhancement (LLIE), neglecting light-effect suppression (LES). Current LES methods mainly rely on unsupervised or zero-shot learning due to the lack of paired nighttime light-effect datasets. We construct a large-scale nighttime dataset containing diverse light effects to enable supervised learning for joint LES and LLIE. We design a two-stage structural prior-guided diffusion model for nighttime visibility enhancement, proposing a Laplacian decomposition physical model and a dual-loop Receptance Weighted Key Value (RWKV) to separate light effects from structural features. Experimental results demonstrate that our method outperforms state-of-the-art (SOTA) methods in LLIE and LES tasks. Through supervised training on our dataset, our method achieves optimal performance in joint LES and LLIE while maintaining effectiveness across various real-world scenarios.

Abstract:
Multimodal Emotion Analysis (MEA) plays a crucial role in extracting and understanding emotional insights from diverse data sources, including text, video, and audio. However, existing methods may overlook two key issues: multimodal components are temporally asynchronous, and fine-grained emotional expressions are insufficiently represented. In light of this, we propose a unified emotion reasoning model, EmoChat, which enhances multimodal emotion analysis by dynamically generating emotion-related tokens and fine-grained expression information through facial action modeling. To incorporate expression semantics, we design the AU Agent, a lightweight facial expression extractor, to provide LLMs with fine-grained facial knowledge for reasoning. In addition, we propose the Correlation Aggregator to alleviate the correlation differences between acoustic features and textual content. As a result, our method decouples the audio and vision modalities, allowing efficient mining of token-level emotion cues in misaligned multimodal input while maintaining semantic consistency across different languages. Experiments on public benchmark datasets demonstrate the superiority of our proposed EmoChat over state-of-the-art methods.

Abstract:
In the literature, prior studies on Video Anomaly Detection (VAD) primarily focus on anomalies that have already occurred (i.e., consequential anomalies), but cannot identify causative anomalies (i.e., the causes of the final anomaly), even though causative anomalies could be highly beneficial for early warning. Meanwhile, existing work mainly focuses on classifying whether each video clip is abnormal and cannot extract structured video information, such as the abnormal type and which people or things are involved, whereas such structured information can potentially contribute to building an efficient system to monitor both causative and consequential anomalies. To this end, this paper proposes a new chat-paradigm Video Abnormal Events' Early Warning (VAE-EW) task, aiming to localize and extract not only the consequential abnormal event quadruples but also the causative abnormal event quadruples (i.e., subject, predicate, object, and event type). Further, this paper argues that this new task faces two key challenges, i.e., the spatial-temporal modeling challenge and the temporal highlighting challenge. On this basis, this paper proposes a new Skynet-V1 model with a spatial-temporal causal-enhanced Mixture-of-Experts (MoE) framework for the VAE-EW task, acting like Skynet in the movie 'The Terminator' to track and give early warning of abnormal events. Specifically, this model designs a Spatial-temporal Aware MoE Block (SAMB) and a Causal-guided Temporal Enhancing Block (CTEB) to address the two challenges, respectively. Extensive experiments on our VAE-EW dataset show the superiority of our model in localizing and extracting abnormal events, especially the causative events, compared to other advanced baseline models, highlighting the importance of the new VAE-EW task and the effectiveness of Skynet-V1 in addressing it.

Abstract:
Live-streamed concerts have become a new cultural phenomenon, yet they struggle to replicate the collective emotional experience of in-person events. Traditional text-based chats often cause information overload and distraction, diminishing the sense of shared experience. To address this challenge, we present VibeOn, a multimodal interaction system designed to foster collective engagement and socio-emotional connection among remote audiences. Developed through a formative study and an iterative design process, VibeOn integrates features such as chat and emoji recommendations, chat highlights, avatar-based cheering, ambient visualization, and concert-specific layout. A user study with 40 participants demonstrated that VibeOn significantly enhanced social connectedness, sense of community, and collective effervescence compared to a conventional chat interface, while maintaining high usability. Our findings indicate that VibeOn enables audiences to feel a shared, extraordinary experience beyond simply watching, highlighting its potential to enrich collective emotions in large-scale online events.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising framework for real-time radiance field rendering due to its high fidelity and explicit scene modeling. However, its practical deployment in the multimedia domain remains limited by excessive memory usage stemming from redundant and memory-inefficient Gaussian primitives. In this paper, we propose SOC-GS, a novel compression framework that enhances the anchor-based 3DGS representation through perceptually guided and structural optimization. Specifically, we begin by introducing the Perceptual Relevance Score (PRS), with a Gumbel noise perturbation applied to facilitate sparse Top-K selection of Gaussians critical for densification, significantly reducing the number of anchors. Further, we stabilize training and prevent premature overfitting to high-frequency noise using a Joint Resolution-Blur Training strategy, with guidance from Total Variation Loss, enabling coarse-to-fine learning with a consistent spatial distribution throughout training. Finally, a Spatial Condition-based Prediction module is employed to further reduce storage while preserving comparable quality. Extensive experiments on three benchmark datasets demonstrate that our method achieves an average 34% reduction in model size compared to the existing state-of-the-art compression method (126× compression on vanilla 3DGS), while maintaining comparable, or even superior, rendering quality.
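
The Gumbel-perturbed Top-K selection mentioned above can be illustrated with a minimal sketch. This is not the SOC-GS implementation: the function name `gumbel_top_k`, the `scale` parameter, and the example scores are assumptions made for illustration, and the scores stand in for whatever the Perceptual Relevance Score actually computes.

```python
import math
import random


def gumbel_top_k(scores, k, rng, scale=1.0):
    """Return the indices of the k largest Gumbel-perturbed scores.

    `scores` are hypothetical relevance scores; `scale` (an assumed knob)
    controls how strongly the noise can reorder them.
    """
    perturbed = []
    for index, score in enumerate(scores):
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + scale * gumbel_noise, index))
    perturbed.sort(reverse=True)
    return sorted(index for _, index in perturbed[:k])


rng = random.Random(0)
scores = [0.1, 2.5, 0.3, 1.9, 0.05, 3.2]  # made-up values
selected = gumbel_top_k(scores, k=3, rng=rng)
```

With `scale=0.0` the selection reduces to a plain Top-K; larger values let lower-scored candidates be selected occasionally, which is the usual motivation for adding Gumbel noise before a hard selection.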

Abstract:
Inverse-Tone-Mapped High Dynamic Range Video Quality Assessment (ITM-HDR VQA) plays a pivotal role in evaluating the visual quality of ITM-enhanced HDR videos. The research community tackles this issue from the dataset and method perspectives. However, current ITM-HDR VQA datasets exhibit three key limitations: narrow scene diversity, partial HDR format representation, and inadequate distortion coverage; existing methods face challenges in HDR and SDR domain discrepancy and insufficient ITM-induced quality feature extraction. To bridge these gaps, we introduce a comprehensive ITM HDR Video Quality Assessment dataset tailored to Broadcast Television (BT-ITM-VQA), along with a novel SDR-Referenced Bidirectional Quality Interaction (SDR-R-BQI) method. The BT-ITM-VQA dataset features rich broadcast scenes, multiple HDR-format support of Hybrid Log-Gamma (HLG) and Perceptual Quantizer (PQ), and real-world distortions induced by super-resolution and deinterlacing, providing a systematic foundation for ITM-HDR VQA model development and validation. The SDR-R-BQI method effectively mitigates HDR and SDR discrepancies through luminance dynamic range alignment and color gamut alignment, and then extracts ITM-induced quality alterations by bidirectional, cross-quality-based computation in a unified feature space. Extensive validation on four datasets demonstrates the effectiveness of our newly constructed dataset and proposed method.

Abstract:
Vietnamese street food has gained global popularity through platforms like YouTube, where creators are incentivized to post videos on specific topics, including street food. Efficient video editing has become essential for YouTubers. This study analyzed 1,507 videos and gathered insights from 213 respondents to identify the factors that shape viewer preferences and highlight key video features for creators. The study's key finding was the strong synergy between visual elements and linguistic richness, emphasizing the connection between storytelling and what appears on screen. Using machine learning, we examined visual, linguistic, and acoustic features, obtaining a predictive model with 70.5% accuracy, compared to a 50% baseline. These insights led to personalized recommendations for street food content creators, offering strategies to enhance viewer engagement. Our automated recommendation system bridges data-driven insights with content creation, elevating Vietnamese street food on the global stage and celebrating the blend of local culture and technology.

Abstract:
Phase distribution plays a critical role in various holographic applications; it affects the randomness within phase holograms, which, in turn, influences the viewing angle, speckle noise, and compression efficiency of the hologram. It also impacts the accuracy of measuring distortion between two phase holograms. Unlike ordinary image signals, uniquely determining the probability distribution of a phase hologram can be challenging due to the inherent periodicity of phase values. Although this ambiguity could impair the performance of many holographic applications, it has received little attention in prior works. In this paper, we introduce the Phase Distribution Alignment (PDA) method, which can transform a phase hologram so that its phase distribution is centered at a predefined reference point, resolving the ambiguity without affecting the numerical reconstruction result. Furthermore, when provided with a source and target phase hologram, PDA can align the source to maximize the overlap with the target distribution. We demonstrate the practical benefits of PDA in a range of holographic applications, including neural phase hologram generation, compression, and phase-domain distortion measurement.
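
The periodicity issue and the idea of centering a phase distribution can be sketched as follows. This is an illustrative assumption, not the PDA algorithm: it uses the circular mean as the notion of "center", and the names `circular_mean` and `align_phase` are hypothetical. Adding a constant offset to every phase multiplies the complex field by a unit-magnitude constant, which is why the reconstructed intensity is unchanged.

```python
import cmath
import math

TWO_PI = 2.0 * math.pi


def circular_mean(phases):
    """Circular mean of angles in radians, wrapped to [0, 2*pi)."""
    resultant = sum(cmath.exp(1j * p) for p in phases)
    return cmath.phase(resultant) % TWO_PI


def align_phase(phases, reference=math.pi):
    """Shift all phases by one constant so the circular mean sits at `reference`."""
    offset = reference - circular_mean(phases)
    return [(p + offset) % TWO_PI for p in phases]


# Toy phase values clustered around the 0 / 2*pi wrap-around point.
hologram = [0.1, 6.2, 0.3, 6.0, 0.2]
aligned = align_phase(hologram)
```

In the toy input, the values look bimodal (near 0 and near 2π) even though they form a single cluster on the circle; after the shift they sit together around π, away from the wrap-around point.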

Abstract:
In this paper, we propose a new method called Multi-Modal Gradual Domain Osmosis, which aims to solve the problem of smooth knowledge migration from the source domain to the target domain in Gradual Domain Adaptation (GDA). Traditional Gradual Domain Adaptation methods mitigate domain bias by introducing intermediate domains and self-training strategies but often face the challenges of inefficient knowledge migration or missing data in intermediate domains. In this paper, we design an optimization framework based on the hyperparameter λ by dynamically balancing the loss weights of the source and target domains, which enables the model to progressively adjust the strength of knowledge migration (λ incrementing from 0 to 1) during the training process, thus achieving cross-domain generalization more efficiently. Specifically, the method incorporates self-training to generate pseudo-labels and iteratively updates the model by minimizing a weighted loss function to ensure stability and robustness during progressive adaptation in the intermediate domain. The experimental part validates the effectiveness of the method on rotated MNIST, color-shifted MNIST, portrait dataset, and forest cover type dataset, and the results show that it outperforms existing baseline methods. The paper further analyses the impact of the dynamic tuning strategy of the hyperparameter λ on the performance through ablation experiments, confirming the advantages of progressive domain penetration in mitigating domain bias and enhancing the model generalization capability. The study provides theoretical support and a practical framework for asymptotic domain adaptation and expands its application potential in dynamic environments.
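
A minimal sketch of a loss weighted by a hyperparameter λ that rises from 0 to 1 is given below. The abstract does not state the exact weighting or schedule, so the convex combination `(1 - lam) * source + lam * target` and the linear schedule are assumptions, as are the function names and the loss values.

```python
def lambda_schedule(step, total_steps):
    """Assumed linear schedule: lambda goes from 0 at the first step to 1 at the last."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def weighted_loss(source_loss, target_loss, lam):
    """Assumed convex combination of source-domain and target-domain losses."""
    return (1.0 - lam) * source_loss + lam * target_loss


total_steps = 5
# Made-up constant losses, only to show how the weighting shifts over training.
history = [
    weighted_loss(source_loss=0.4, target_loss=1.0,
                  lam=lambda_schedule(step, total_steps))
    for step in range(total_steps)
]
```

Early in training the objective is dominated by the source-domain loss; by the final step it depends only on the target-domain (pseudo-labeled) loss.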

Abstract:
Current advances in text-driven 3D scene editing tasks typically render the 3D representations into multi-view images and modify the images with the text instructions. Context consistency across multiple views and cross-modal consistency within a single view are the keys to effective 3D editing. Accordingly, existing methods introduce additional image constraints and apply pre-trained 2D editing models. However, they fix the same text instruction across all views and freeze the pre-trained 2D model for single-view editing, leading to deficient modeling of 3D scene views and resulting in inconsistent generations with visual artifacts. To address these limitations, we introduce a discretized 3D view modeling method and a diffusion-based multi-view consistent editing pipeline for text-driven 3D Gaussian splatting editing, abbreviated as D2Gaussian. Specifically, our approach constructs a codebook that encodes continuous 3D view information into discrete token embeddings to model the spatial feature expressions. Then, the token embeddings are used to guide and fine-tune the diffusion-based image editing model with the dynamic addition of control conditions, yielding a multi-view consistent editing pipeline. Finally, we introduce a 3D editing dataset generation approach along with a 3D-CLIP-SIM metric to form a benchmark, 3D-MagicBrush, to provide more diverse evaluation scenarios for future 3D editing works. Experiments demonstrate that our method achieves better visual results and multi-view consistency than previous state-of-the-art methods.

Abstract:
With the rapid progress of diffusion-based Text-to-Image Generation (TIG), Text-to-Image Editing (TIE) has become increasingly important for enabling controllable visual content creation. A core challenge in TIE is generating text-guided edits while preserving the spatial structure of the original image. Recent methods attempt to address this by leveraging self-attention maps from diffusion models, as these encode rich spatial information. However, we identify two key limitations: (1) not all self-attention maps contribute meaningfully to spatial structure, and (2) over-reliance on them can suppress desired editing effects. To address this, we propose the Spatial Information Score (SIS), a novel metric that quantifies the spatial structure encoded in each self-attention map. Leveraging SIS, we develop Selective Self-Attention-based Image Manipulation (SSAIM), which selectively utilizes self-attention maps with effective spatial structure (high SIS) to preserve the structure of the original image and reduces reliance on self-attention maps with ineffective spatial structure (low SIS) to enhance editing performance in TIE tasks. Extensive experiments across diverse TIE tasks demonstrate that SSAIM significantly improves both structural fidelity and editing quality.

Abstract:
Virtual reality (VR) cloud gaming is growing rapidly in the gaming industry. Yet, the performance of the congestion control algorithms on top of which these systems build remains under-explored. In this study, we implement two industry-standard network congestion control algorithms, Google Congestion Control (GCC) and Network-Assisted Dynamic Adaptation (NADA), according to their Requests for Comments (RFCs), and integrate them into an open-source VR gaming system (ALVR). Including ALVR's congestion control (ALVR-ABR), we conduct extensive experiments on real-world networks to evaluate each algorithm's frame latency, target-to-receiving bitrate gap, dropped frames, image quality, and fairness among heterogeneous competing flows. GCC decreases frame latency by 35% compared to NADA and by 42% compared to ALVR-ABR. NADA and ALVR-ABR present significant gaps between the selected and received bitrate, causing substantial congestion-induced frame drops, while GCC has a minimal gap, resulting in minor frame drops, suggesting its suitability for game-player interaction. GCC exhibits a 2.7% and 5% decrease in image quality compared to NADA and ALVR-ABR, respectively, indicating slight immersion degradation. However, only NADA ensures a fair bandwidth share against loss-based flows due to its bitrate response to loss-induced congestion signals and lower sensitivity to delay gradients compared to GCC.

Abstract:
Low-latency interactive video streaming services critically depend on robust congestion control algorithms (CCAs). However, existing CCAs often exhibit frequent self-induced oscillations from unconstrained probing, causing periodic heavy queuing and degrading the user experience. We propose Themis, a novel end-to-end CCA designed to achieve stabilized near-zero queuing delay with high bitrate. By precisely quantifying the trade-off between bitrate and queuing delay within a utility feedback mechanism, Themis effectively controls the amplitude and frequency of probing while ensuring fairness. In parallel, Themis incorporates awareness of the queuing state through an adaptive-pacing method, combined with utility feedback to guide a three-phase bitrate adjustment strategy. This enables rapid and stable convergence to the optimal utility. We implemented Themis in QUIC, evaluated it on the Mahimahi emulator, and conducted a 90-day large-scale A/B test in a real-world network. Compared to state-of-the-art CCAs, Themis effectively suppresses self-induced oscillations, increases the average frame bitrate by 66.8%, and reduces the average frame delay by 13.5%, demonstrating a superior trade-off between high bitrate and low queuing delay.

Abstract:
For thousands of years, Chinese culture has regarded calligraphy as both a form of painting and a deeply physical, philosophical practice. While traditionally manifesting as a two-dimensional art form, Chinese calligraphy emerges from three-dimensional, embodied movements infused with breath, emotion, and intention. This demo/video paper introduces Embodied Ink, a multimedia interactive installation that reinterprets Chinese calligraphy through motion capture and generative AI. By translating the audience movements into real-time dynamic visuals and soundscapes, the project reveals the hidden kinetics and philosophical depth underlying calligraphy. Embodied Ink reimagines the static medium of calligraphy as a dynamic interplay of forces and particles, inviting viewers to experience calligraphy as a living, ever-changing art form.

Abstract:
The rapid proliferation of AI-generated media, particularly hyper-realistic deepfakes, has underscored the critical need for robust detection systems to mitigate risks such as misinformation and identity theft. However, state-of-the-art deepfake detectors remain vulnerable to adversarial attacks, i.e., subtle perturbations designed to evade classification. To address this gap, we organized the Adversarial Attacks on Deepfake Detectors (AADD-2025) challenge, a competitive evaluation aimed at advancing methodologies to expose and strengthen weaknesses in deepfake detection models. The challenge tasked participants with generating adversarial examples capable of evading four diverse classifiers (including ResNet, DenseNet, and two blind models) while preserving structural similarity to original deepfakes. A dataset comprising 16 subsets of high- and low-quality deepfake images generated by GAN-based and diffusion models (e.g., StableDiffusion, StyleGAN3) was provided. Participants were evaluated using a weighted combination of Structural Similarity Index (SSIM) and attack success rates across all classifiers. Thirteen teams proposed innovative solutions leveraging techniques such as latent-space manipulation, ensemble gradient optimization, surrogate modeling, and frequency-domain perturbation. Top-performing approaches, including MR-CAS (1st place), Safe AI (2nd place), and RoMa (3rd place), achieved high SSIM scores (0.74-0.93) while successfully misleading classifiers. Notably, MR-CAS's latent diffusion model inversion strategy and Safe AI's consensus-orthogonal gradient weighting framework demonstrated superior transferability across architectures, including Vision Transformers. The challenge revealed critical insights: latent-space attacks outperformed pixel-level methods, ensemble-based strategies enhanced cross-model robustness, and adversarial perturbations optimized for both CNNs and transformers proved most effective.
However, gaps persist in generalizing attacks across heterogeneous models and maintaining perceptual fidelity, highlighting the urgency of developing adaptive defenses and hybrid detection mechanisms. By fostering collaboration and innovation, AADD-2025 provides a benchmark for evaluating adversarial robustness in deepfake detection and underscores the need for resilient systems in the era of AI-generated media.

Abstract:
Major Depressive Disorder (MDD) is a prevalent and severe psychiatric disorder, and its detection remains challenging due to the complexity and variability of its symptoms. Traditional single-modality methods often fail to capture the full spectrum of depressive cues, which has led to the rise of multimodal methods. The ACM Multimedia 2025 "Multimodal Personality-Aware Depression Detection Challenge" (MPDD 2025) aims to advance the development of more accurate depression detection models by incorporating multimodal data. In this paper, we propose a Multi-Level Segment Fusion Based on Adaptive Time-Window Selection (MSF-ATS) method for the MPDD-Elderly Track. To address the challenge of sparse and transient depressive symptoms, we fuse segment-level classifications to obtain subject-level classifications. An adaptive time-window selection based on mean class variance is employed to choose the window with the smallest variance for more stable detection results. Our method achieved an average score of 0.8576 on the MPDD 2025 official test set, significantly outperforming the baseline score of 0.6675.
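
One way to read the window selection and segment fusion described above is sketched here. The abstract does not define "mean class variance" or the fusion rule, so both are assumptions: variance is taken per class across segment-level probabilities and then averaged over classes, and fusion is a plain average followed by argmax. The window names and probabilities are made up.

```python
def mean_class_variance(segment_probs):
    """Average, over classes, of the variance of segment-level probabilities (assumed definition)."""
    num_segments = len(segment_probs)
    num_classes = len(segment_probs[0])
    total = 0.0
    for c in range(num_classes):
        column = [row[c] for row in segment_probs]
        mean = sum(column) / num_segments
        total += sum((x - mean) ** 2 for x in column) / num_segments
    return total / num_classes


def fuse_subject(segment_probs):
    """Assumed fusion rule: average segment probabilities, then take the argmax class."""
    num_classes = len(segment_probs[0])
    averaged = [
        sum(row[c] for row in segment_probs) / len(segment_probs)
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=lambda c: averaged[c])


# Hypothetical segment-level class probabilities for two candidate window lengths.
candidates = {
    "5s": [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]],
    "10s": [[0.3, 0.7], [0.35, 0.65], [0.25, 0.75]],
}
best_window = min(candidates, key=lambda w: mean_class_variance(candidates[w]))
subject_label = fuse_subject(candidates[best_window])
```

In this toy input the longer window gives much more consistent segment predictions, so it is selected and its segments are fused into the subject-level label.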

Abstract:
Classifying anatomical regions in endoscopic ENT (ear, nose, and throat) images is challenging due to strong inter-class similarities, bilateral symmetry, and the scarcity of annotated datasets. To overcome these issues, we present HyMoENet, a novel hybrid deep learning architecture that combines convolutional neural networks (CNNs) for localized feature extraction with Vision Transformers to represent global context. Furthermore, it leverages a sparse Mixture-of-Experts (MoE) technique to improve multi-perspective specialization. Our architecture makes use of parallel CNN-Transformer encoders, which are incorporated into a dynamic MoE layer that adaptively routes representations to the most appropriate experts. Concurrently, a semantic-preserving skip connection preserves global coherence. When tested on a clinically annotated ENT endoscopy dataset from Thong Nhat Hospital in Vietnam, HyMoENet outperformed both single-stream and conventional hybrid models with an accuracy of 97.50. These results demonstrate that integrating local-global representation learning and expert modularization enhances classification accuracy for anatomically similar structures. HyMoENet sets a new benchmark for automated ENT image processing, laying the groundwork for intelligent diagnostic systems in clinical endoscopy.

Abstract:
Estimating engagement in group interactions is crucial for building socially intelligent systems, such as in human-agent and human-robot interaction. However, precisely modeling the continuous frame-level fluctuations of engagement remains challenging, particularly when considering the complex multi-party signal interactions within groups. Our method employs an encoder that integrates BiLSTM with Transformer to effectively capture both local and global temporal dependencies of multimodal features. Crucially, we explicitly fuse signals from both the target participant and their conversational partners in the group to model the holistic group interaction dynamics. Furthermore, we introduce an 8x overlapped optimized sliding window strategy, constructing a "sliding pipeline", which significantly enhances the temporal smoothness, continuity, and stability of predictions. In the final regression stage, we replace the traditional multilayer perceptron (MLP) decoder with a Kolmogorov-Arnold Network (KAN), leveraging its superior function approximation capability to achieve more accurate engagement predictions. Evaluated on the test sets of the NoXi-base, NoXi-addition, MPIIGroupInteraction, and NoXi-J datasets from the Multimediate'25 Engagement Challenge, our approach achieves a highly competitive global Concordance Correlation Coefficient (CCC) of 0.678, approximately 56.1% higher than the baseline.
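
For readers unfamiliar with the evaluation metric, the Concordance Correlation Coefficient can be computed as in the short sketch below. It uses the standard definition with population (divide-by-n) statistics; the function name and sample values are illustrative, and the challenge's own evaluation script may differ in details.

```python
def concordance_cc(pred, target):
    """Concordance Correlation Coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        return float("nan")  # both sequences are constant and identical
    return 2.0 * cov / denom


perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

Unlike Pearson correlation, CCC penalizes a constant offset: the shifted prediction above is perfectly correlated with the target but scores 4/7 rather than 1.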

Abstract:
Point cloud processing and 3D vision have emerged as very hot topics in the multimedia community. Point clouds can give an immersive visual experience and provide accurate structural information of 3D objects and scenes in the applications including virtual reality/augmented reality (VR/AR), autonomous driving, robot navigation, and geo-information systems (GIS). Moreover, 3D Gaussian splatting has become a very powerful and popular tool for 3D reconstruction, generation and rendering, as well as compression, representation and understanding. 3D Gaussian splatting technology can also be deemed as an extension of point cloud processing technology. 3D vision technologies empower the developments of immersive media, embodied artificial intelligence (AI) and unmanned systems. Their challenges in processing, analysis, and applications have attracted significant interest from industry, academia, and standardization bodies. This workshop invites innovative contributions in point cloud processing and 3D Gaussian splatting to propel the advancements of 3D vision technologies.

Abstract:
We demonstrate an end-to-end system for real-time, multimodal industrial anomaly detection (IAD), built upon a custom hardware platform for synchronized 2D and 3D data acquisition. Our core contribution is a novel cross-modal residual mechanism that identifies defects by quantifying predictive errors between visual and geometric feature spaces. Instead of traditional concatenation, our dual-stream architecture mutually predicts features across modalities, leveraging the prediction residual's magnitude as a direct and robust anomaly indicator. The entire system achieves sub-second inference from acquisition to decision, enabled by efficient depth map analysis that circumvents the complexity of direct point cloud processing, offering a deployable solution for high-speed inspection.

Abstract:
NEC is the leading ICT technology provider in the B-to-B market and is actively integrating cutting-edge technologies into its business solutions to drive innovation, enhance capabilities, and create new value for its customers in a broad spectrum of industrial segments. And the recent business focus of NEC is to support digital transformation of business processes of customer enterprises by leveraging technical capabilities in AI, Cyber Security and Communication. This keynote discusses the specific role of NEC's Research in such a business context by sharing a variety of generative and multimodal AI-related use cases that are aimed at solving critical customer challenges in the real-world. From the multimedia perspective, the topics will include world-leading facial recognition technology for security boost and enhanced customer experience, development of drive-recorder video analytics for insurance adjusters leveraging visual language model (VLM) and medical document generation AI service for genuinely supporting overworked clinical doctors. Meanwhile, distributed acoustic sensing technology using optical fiber cables is opening a new opportunity for infrastructure and incident monitoring solutions after integration with AI and ML algorithms. As the common denominator, our commitment of solving critical customer challenges requires (and justifies) nurturing both world-class excellence in performing academic research and accumulated experience and/or culture of application-oriented technology refinement as well as technology combination to ensure business-ready practicality. Also, being the industrial research organization, we are engaged at the forefront of customer co-creation and co-design that play an indispensable role in pinpointing customer's critical challenges. These expertise and practices are indeed the core ingredients of NEC's Research for creating new business opportunities from the technology innovation approach. 
Furthermore, we also envision that such an industrial lab model in the Generative AI era will become the driver of a new technology paradigm - industry segment-oriented customizable foundation models and business transforming Agentic AI framework.

Abstract:
Recent advances in Molecular Graph-Language Models (MGLMs) have demonstrated promising capabilities in molecular understanding tasks. However, existing approaches face critical limitations: (1) shallow alignment methods which employ identical processing modules for both modalities, resulting in compromised expressiveness and catastrophic forgetting of pre-trained language capabilities; and (2) over-reliance on high-level molecular representations that inadequately capture fine-grained structural information essential for comprehensive molecular understanding. To address these challenges, we present DeepMolTex, a novel framework for Deep fusion of Molecular structure and Textual representations across multiple scales. Our approach introduces a Mixture of Modality Experts (MoME) architecture that facilitates deep alignment between molecular graph features and large language models while preserving language capabilities, and a multi-scale graph projector that extracts and aligns molecular features at atom, motif, and molecule levels. Experimental results demonstrate that DeepMolTex significantly outperforms existing methods on fundamental molecular understanding tasks, including molecule description generation and IUPAC name prediction, while effectively preserving the language capabilities of the pre-trained LLM.

Abstract:
Rain and snow in real scenes have diverse appearances that are tightly related to shooting angle, light sources, and background elements. Most existing learning-based methods utilize synthetic weather/clean pairs to handle a single weather type. However, synthetic scenarios focus more on the properties of the image or weather layers independently without considering their mutually exclusive relationship, which leads to a domain gap between real and synthetic scenes. This makes it difficult for existing methods to generalize to complex real-world weather conditions, even though they perform well in synthetic scenes. In this paper, we highlight the importance of contextual influence information and utilize virtual reality to effectively simulate real scenes. Specifically, we propose a novel weather removal framework, called Contextual-Weather Correlation Pairing (CWCP), which contains two modules, Weather Style Adaptation (WSA) and Continual Contextual Learning (CCL). WSA learns the style correlation of noises by pairing local blocks with weather similarity information, which is highly effective for handling a large amount of similar noise. CCL learns knowledge from the entire image while contrasting it with the features of local blocks based on contextual similarity pairing, which reduces the color difference among adjacent local blocks. Moreover, we propose a novel complex weather dataset, namely SkyLine-Weather, a virtual rain and snow removal dataset containing ~10K noisy images and their corresponding ground truth images, built using a 3D computer graphics platform. Experiments on 8 datasets demonstrate that our framework achieves SOTA results on deraining and desnowing tasks across real, synthetic, and virtual scenes.

Abstract:
Lightweight models are currently the focal point in image super-resolution (ISR) research, of which the application on resource-limited devices is constrained by heavy computational requirements. As an efficient approach to enhance the inference efficiency of deep learning models, low-bit quantization has garnered significant interest. In this paper, we emphasize that low-bit ISR is not merely a parody of its full-precision version and explore binary quantization in ISR from the perspective of information transmission, pushing the limits of binarized ISR. Specifically, we propose a Maximum Entropy Routing (MER) mechanism to dynamically control activation distribution, maximizing the information entropy of binarized feature maps. Additionally, a Learnable Deviation Compensation (LDC) and an Adaptive Step-size Estimation (ASE) are introduced to reduce information loss during the forward and backward passes, respectively. By enabling smoother information transmission through more flexible binarized activation representations and more precise gradient estimation, the performance gap between binarized and full-precision models is narrowed to less than 0.3 dB. Extensive experiments demonstrate that our proposed binarization method achieves state-of-the-art results in Peak Signal-to-Noise Ratio (PSNR) across all popular benchmarks.
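
The link between activation distribution and the information entropy of a binarized feature map can be illustrated with a small sketch. This is not the MER mechanism: shifting the binarization threshold to the median is an assumption chosen only to show that a balanced split of +1 and -1 values maximizes entropy, and the activation values are made up.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a binary variable with P(+1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def binarize(activations, threshold=0.0):
    """Map each activation to +1 or -1 relative to a threshold."""
    return [1 if a >= threshold else -1 for a in activations]


def positive_fraction(bits):
    return sum(1 for b in bits if b == 1) / len(bits)


# Made-up activations that are mostly positive, e.g. after a biased layer.
activations = [0.9, 1.4, 0.2, 2.1, -0.3, 0.7, 1.1, 0.5]

fixed = binarize(activations)  # threshold 0: 7 of 8 become +1
median = sorted(activations)[len(activations) // 2]
shifted = binarize(activations, threshold=median)  # 4 of 8 become +1

entropy_fixed = binary_entropy(positive_fraction(fixed))
entropy_shifted = binary_entropy(positive_fraction(shifted))
```

With the fixed threshold the binarized map carries about 0.54 bits per element; with the balanced split it reaches the 1-bit maximum, which is the intuition behind controlling the activation distribution before binarization.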

Abstract:
Emotion recognition based on multimodal physiological signals is playing an increasingly important role in areas such as human-computer interaction and disease diagnosis, attracting growing attention from the research community. Current studies primarily focus on emotion recognition under unified data collection paradigms, overlooking the prevalent issue of imperfect modality matching in real-world scenarios. In particular, existing methods fail to effectively utilize these mismatched modalities, leading to incomplete emotional representations. This limits the model's ability to accurately capture the multidimensional semantic features of emotions, thereby constraining its effectiveness and applicability in practical settings. To address this challenge, we propose MoCERNet. At the modality level, it first reduces the domain gap among matched modalities and then aligns mismatched modalities in a semantics-aware manner, guided by the matched ones. At the decision level, it further mitigates the global distribution discrepancies to achieve a more complete emotional representation. In addition, we design a Nervous System Functional Structure Transformer (NFSformer) that enables the model to focus on the correlation between different brain regions and peripheral physiological signals under various emotional states, thereby enhancing its capacity to model complex emotional processes. Experiments on three multimodal emotion datasets demonstrate that MoCERNet outperforms state-of-the-art baselines under imperfect modality matching scenarios.

Abstract:
Image enhancement is a classical and enduring challenge in computer vision, seeking to produce high-quality images from corrupted observations. Unlike existing methods that target specific tasks, this work focuses on endowing the model with generic inductive capabilities, enabling fast adaptation to previously unseen enhancement tasks. Specifically, we investigate inter-task weaving from both structural and parametric perspectives. Structurally, we establish inter-task weaving under a Hadamard view by designing a unified architecture called Degradation Unraveling Network (DUNet) tailored for diverse enhancement tasks, which incorporates a progressive degradation unraveling mechanism for fine-grained enhancement. Parametrically, we reveal the task-agnostic nature of degradation estimation parameters and treat them as meta-representations. A Bilevel Purify Modeling (BPM) framework is then proposed to reinforce their latent unified representation, where only the degradation-related parameters are optimized as meta-representations. Based on this design, a task-aware adaptation solution is further introduced, in which only the remaining parameters are fine-tuned, enabling fast and efficient task adaptation. Extensive performance evaluations on three representative image enhancement tasks demonstrate the effectiveness and superiority of our method. The adaptability of our method is further verified by a series of algorithm analyses.

Abstract:
Current video mirror detection models demonstrate satisfactory performance by analyzing different attributes of mirrors and incorporating temporal information. However, these models still struggle to detect mirrors in complex and dynamic scenarios. A simple yet critical visual cue is that objects reflected in a mirror appear to be farther away than the mirror itself. Motivated by this observation, some studies propose to explicitly analyze the Depth of Mirror (DOM) to effectively localize mirrors - DOM refers to distinct perceived distances that make mirror regions appear farther away from their surroundings. However, merely analyzing the DOM is insufficient in some scenes where the object behind the mirror also appears distant. Meanwhile, the changes in the DOM across different video frames are also important for video mirror detection, yet this aspect has not been fully explored. To address these issues, we devise a novel framework called FTM-Net, which includes two main contributions: a Pattern-Compensated DOM estimation strategy and a Dual-Granularity Affinity module. The Pattern-Compensated DOM estimation strategy uses multiple visual mirror patterns to refine the DOM, enhancing the accuracy of mirror localization in a single image. Furthermore, the Dual-Granularity Affinity module can effectively detect mirrors in video sequences by tracking and integrating DOM changes across video frames. Experimental results on two benchmark datasets show that our model significantly outperforms other state-of-the-art methods in the video mirror detection task. We shall release our trained models, code, and results.

Abstract:
Heterogeneous graphs (HGs) are graph topologies with multiple types of nodes and edges, which have attracted increasing attention in graph representation learning. Many heterogeneous graph neural networks (HGNNs) assume that the graph structure information is complete when processing heterogeneous graph data. However, due to factors such as privacy protection and high data acquisition costs, heterogeneous graphs often suffer from missing attributes. Existing methods that rely on local neighbor aggregation to obtain missing attributes have achieved good performance in various downstream tasks, yet when neighbor types differ from the target node type, the completed attributes become highly unreliable. Therefore, this paper proposes a novel method for Attribute Completion in Heterogeneous Graph with Integration of External Knowledge from Large Language Models (HGACLLM). HGACLLM simplifies local attribute aggregation while maintaining model performance, and is the first to leverage LLMs to learn external knowledge of nodes with missing attributes in the attribute completion task on HGs. First, HGACLLM uses a mean aggregator to precalculate neighbor attributes along different metapaths and applies a Transformer-based semantic fusion module to integrate node attributes from various metapaths, generating local attributes. Second, it infers implicit external knowledge of target nodes based on textual descriptions of attribute-rich neighbors along metapaths. Finally, HGACLLM combines local attributes and external knowledge to generate complete attribute representations. We evaluated our method on six real-world datasets, where HGACLLM achieved up to a 1.94% improvement in evaluation metrics for the downstream tasks. The experimental results demonstrate that HGACLLM achieves state-of-the-art performance.
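
The first step, mean aggregation of neighbor attributes along a metapath, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not HGACLLM's implementation: the function name, the dense 0/1 adjacency matrices, and the toy paper-author graph are all hypothetical, and multi-hop metapaths are weighted by path count.

```python
import numpy as np

def metapath_mean(features, adjacency_chain):
    """Mean-aggregate end-node attributes back to start nodes along one metapath.

    features: (n_end, d) attributes of the metapath's end-node type.
    adjacency_chain: list of 0/1 matrices whose product maps start nodes to end nodes.
    """
    reach = adjacency_chain[0].astype(float)
    for adj in adjacency_chain[1:]:
        reach = reach @ adj  # entries count the paths between node pairs
    counts = reach.sum(axis=1, keepdims=True)
    # Row-normalize; start nodes with no metapath neighbors get a zero vector.
    weights = np.divide(reach, counts, out=np.zeros_like(reach), where=counts > 0)
    return weights @ features

# Toy graph: 2 papers without attributes, 3 authors with 2-dimensional attributes.
paper_author = np.array([[1, 1, 0],
                         [0, 0, 0]])
author_features = np.array([[1.0, 0.0],
                            [3.0, 2.0],
                            [5.0, 5.0]])
paper_features = metapath_mean(author_features, [paper_author])
```

Because the aggregation has no learnable parameters, it can be computed once per metapath before training, which is what makes precalculation possible.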

Abstract:
Preserving the semantic integrity of image details is difficult in neural image compression. Failure to do so can result in miscompressions: reconstruction errors that change the meaning between the original and reconstructed images. Undetected miscompressions can compromise the reliability of reconstructed images and potentially reduce the accuracy of downstream computer vision tasks. To advance research on this problem, we present SCLIC, a curated dataset of 18k human-annotated miscompressions generated by 12 neural compression models. It includes images from three common benchmark datasets, compressed and reconstructed using codecs based on CNNs, GANs, diffusion models, and image transformers for different perceptual metrics and rate-distortion settings. We envision that this dataset will facilitate the development of strategies to mitigate miscompressions and enable more reliable neural image compression codecs.

Abstract:
We introduce the AR2 O Painter, an interactive intelligent system designed for real-time, highly realistic oil painting creation. This Agent can faithfully reproduce any portrait image, allowing users to visually enjoy the stroke-by-stroke painting process immersively as the artwork is completed within two minutes. It consists of two modules: the Oil Painting Stroke Sequence Planner, which performs multi-level semantic-based brushstroke sequence decomposition on portrait images, mimicking the logic of artist painting, and the Oil Painting Rendering Engine, which receives the brushstroke sequence, models the pigment via fluid dynamics, simulates its interaction with the canvas and brush, and applies a tailored PBR model with microfacet BRDF, Fresnel effects, and stroke-level geometry, enabling perceptually plausible gloss and fine-grained surface relief. To the best of our knowledge, it is the first real-time intelligent painting system to generate realistic oil paintings with high interactivity and artistic fidelity. The demo video is available at https://youtu.be/aN-W06GmnP8.

Abstract:
RhythmGate reimagines the elevator as a spatial medium for sensing and expressing the unspoken codes of shared space. By combining pressure-sensitive flooring with overhead visual sensing, the system captures bodily presence and spatial movement in real time. These inputs are processed to classify position and trigger one of five pre-recorded human voices, selectively layered into a responsive soundscape. As riders shift their stance, they "cut" between temporal zones mapped across the floor, composing a living acoustic montage. The resulting auditory feedback surfaces subtle social dynamics, such as avoidance, boundary-setting, or latent tension, and gently dissolves them. In doing so, RhythmGate transforms a routine ascent into a fleeting, multisensory conversation among strangers, system, and space.

Abstract:
With the rapid growth of digital content and the increasing demand for high-resolution displays, the efficient compression of screen content images characterized by text, graphics, and UI elements has become an important research field. This paper introduces a new dataset named SCID-Compress900 specially designed for image compression research. The dataset consists of 900 high-quality screen content images, including 500 4K images and 400 1080P images. All these images are mainly composed of text/graphics, reflecting typical screen content scenarios such as office documents, software interfaces, and presentation slides. The dataset covers a diverse range of content, including various font sizes, graphic styles, and color modes, providing a comprehensive testbed for compression algorithms. To demonstrate the effectiveness of SCID-Compress900, we conduct benchmark tests using several deep learning-based image compression methods commonly employed by researchers. The experimental results show that SCID-Compress900 can well differentiate the performance of different compression algorithms. Compared with existing datasets, SCID-Compress900 offers higher resolution, larger scale, and more targeted content, making it an ideal resource for developing and evaluating advanced image compression algorithms for screen content. This dataset will not only promote the research and development of screen content compression technology but also contribute to the standardization and optimization of compression algorithms in practical applications. The project is available at https://openi.pcl.ac.cn/OpenDatasets/SCID-Compress900.

Abstract:
We present MedAI Hub, an integrated multimodal medical platform designed to bridge clinical practice and research by transforming patient-doctor interactions into structured scientific data. This platform supports comprehensive management of multimodal medical records (including clinical notes, medical images, and patient-reported outcomes) while implementing privacy-preserving data sharing mechanisms. Building upon this infrastructure, we introduce two novel AI-driven modules: (1) ITERATE (Image-Text Enhancement, Retrieval, and Alignment): An evolutionary algorithm inspired by Visual Genome that optimizes medical image-text alignment through iterative cross-modal refinement. Leveraging LLM-guided "DNA evolution" and multimodal feedback, ITERATE enhances ultrasound image quality for diagnostic tasks, achieving 3.5-7% accuracy gains on ScienceQA and ARC-Easy benchmarks. (2) MedQuery: A graph-driven literature retrieval system that constructs multimodal knowledge graphs from medical literature (text, figures, tables). By aligning PubMed documents with complex clinical queries through semantic relationship modeling, it achieves >90% answer quality win rates and 13-36% accuracy improvements on PubMedQA and MedInquiry datasets. MedAI Hub demonstrates that synergistic integration of clinical data platforms with evolutionary vision-language optimization and multimodal knowledge graphs significantly advances medical AI capabilities, enabling more accurate diagnostics and research insights. The platform and algorithms are publicly available to accelerate innovation in medical AI.

Abstract:
AI as a concept has been around since the 1950s. With the recent advancements in machine learning technology, and the availability of big data and large computing resources, the scene is set for the explosive growth of AI. In particular, the emergence of Multimodal Foundation Models that offer significant capabilities in content comprehension, generation and reasoning has opened up opportunities for multimodal research and applications. The talk first reviews the trends and developments in Multimodal Foundation Models. It then outlines the advances in multilingual and multimodal alignments and discusses the emergence of language- and media-agnostic signals that appear to represent abstract concepts commonly used in human languages. These signals have been shown to have a positive impact on both the performance and safety of the resulting models, especially in enhancing those with low-training samples. To further improve the performance of the models, most current approaches employ reinforcement learning with various reward functions to achieve better controllable content generation and trust. To facilitate effective reinforcement learning, quality assessment is of pivotal importance, while it has largely been overlooked. This talk further presents recent approaches to quality assessment and its role in the generation of textual descriptions, videos, and 3D media. As research on Multimodal Foundation Models is still in an early stage, this talk concludes with directions for future research.

Abstract:
Integrating multimodal learning (ML) with polysomnography (PSG) has emerged as a research hotspot for reliable sleep staging. However, the complexity of these signals and the discomfort associated with wearing multi-lead devices somewhat limit the feasibility of daily and ubiquitous sleep monitoring. Unfortunately, most existing ML paradigms are constrained by consistent and fixed input patterns. When fewer modalities are available than the ML framework requires, inference bias easily arises, resulting in significant performance degradation. To this end, we propose an elastic multimodal sleep staging network (ElaSleepNet), consisting of multimodal information completion (MIC) and adaptive cross-modal (ACM) interaction. Specifically, MIC maximizes the consistency of multimodal signals at the intra-epoch temporal level and the inter-epoch contextual level, thereby enhancing the ability of the available modalities to reason about and complete the unavailable ones. Moreover, we introduce learnable parameters and design the ACM attention mechanism, which handles multimodal information interaction while maintaining robustness in the absence of certain modalities. ElaSleepNet achieves state-of-the-art performance on three multimodal sleep datasets. Compared with previous methods, ElaSleepNet can achieve better performance with fewer testing modalities, making it flexible for daily monitoring.

Abstract:
Deep neural networks (DNNs) are susceptible to Universal Adversarial Perturbation (UAP), which significantly increases the likelihood of deceiving DNNs. Current UAP generation methods are categorized into data-dependent, relaxed-data-free, and data-free attacks based on different data dependencies. However, these strategies exhibit poor transferability in the black-box setting. To address this limitation, we propose BTUAP, a novel UAP generation method designed to enhance the transferability of UAP in the black-box setting. BTUAP employs an ensemble strategy with min-max weight adjustment mechanisms to reduce the impact of model characteristics and introduces a self-supervised optimization strategy to maximize the distance of predicted logits between benign samples and adversarial examples. Experimental results demonstrate that BTUAP significantly improves transferability in different data dependency settings under black-box constraints. We also quantify the impact of the distribution shift and provide a new metric to measure the robustness of models. The source code is available at https://github.com/RaymondDawn/BTUAP.

Abstract:
Multimodal Large Language Models (MLLMs) have achieved impressive performance across a range of tasks by leveraging Multimodal In-Context Learning (MICL), which uses a few task-specific examples as demonstrations. However, existing approaches assume the availability of pre-prepared curated datasets that serve as support sets, limiting the adaptability of MICL to novel and unseen tasks where dedicated data is unavailable. To fill this research gap, we first explore the effectiveness of MICL using non-customized data. Through systematic evaluations across 17 datasets and five state-of-the-art MLLMs, we demonstrate significant performance gains with MICL compared to zero-shot evaluation. To more thoroughly understand underlying reasons behind this phenomenon, we posit and validate two hypotheses: 1) multimodal demonstrations facilitate cross-modal interactions and 2) demonstrations provide transferable knowledge. Building on these insights, we explore factors that affect MICL and arrive at several key takeaways. First, to address the limitations of existing retrieval methods in MICL without dedicated data, we propose a Fast Maximum Mean Discrepancy based (FMMD) retrieval metric and a Semantics-Modality Relation-Aware (SMRA) retrieval metric to perform inter- and intra-dataset retrieval, respectively. Additionally, we find that increasing demonstrations, combining demonstrations from diverse datasets, and providing instructions for query samples can further boost MICL. We hope this study can inspire future works on improving MICL in real-world scenarios.
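
For readers unfamiliar with the quantity behind the FMMD metric, the sketch below computes the standard biased estimate of squared Maximum Mean Discrepancy with an RBF kernel. It is not the paper's fast variant: the kernel choice, the bandwidth, and the synthetic "dataset embeddings" are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma):
    """Biased estimate of squared MMD between sample sets x and y."""
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2 * rbf_kernel(x, y, gamma).mean()
    )

rng = np.random.default_rng(0)
dim = 8
query_set = rng.normal(0.0, 1.0, (200, dim))  # embeddings of the query task (synthetic)
near_set = rng.normal(0.0, 1.0, (200, dim))   # candidate dataset with a similar distribution
far_set = rng.normal(3.0, 1.0, (200, dim))    # candidate dataset with a shifted distribution

gamma = 1.0 / (2 * dim)  # assumed bandwidth heuristic
d_near = mmd2(query_set, near_set, gamma)
d_far = mmd2(query_set, far_set, gamma)
```

A retrieval step built on such a score would prefer the candidate dataset with the smaller discrepancy, here `near_set`. The exact estimate costs quadratic time in the number of samples, which is the usual motivation for faster approximations.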

Abstract:
We present VibeSpace, a novel method for the fully unsupervised construction of interpretable embedding spaces applicable to arbitrary domains. Our approach automates costly data acquisition by leveraging the knowledge embedded in large language models (LLMs), facilitating similarity assessments between entities for meaningful positioning within vector spaces, while also enabling intelligent mappings between vector space representations of disparate domains through a novel form of cross-domain similarity analysis. First, we demonstrate that our data collection methodology yields comprehensive and rich datasets across multiple domains, including songs, books, and movies. We validate the reliability of the automatically generated data via cross-checks with domain-specific catalogues. Second, we show that our method generates single-domain embedding spaces that are separable by domain-specific features, providing a robust foundation for classification tasks, recommendation systems, and other downstream applications. These spaces can be interactively queried for semantic information about different regions in embedding spaces. Lastly, by exploiting the unique capabilities of current state-of-the-art large language models, we produce cross-domain mappings that capture contextual relationships between heterogeneous entities that may not be attainable through traditional methods. This approach facilitates the creation of embedding spaces of any domain, which circumvents the need to collect and calibrate sensitive user data and provides deeper insights and better interpretations of multi-domain data.

Abstract:
Semi-supervised singing melody extraction (SSME) is one of the key tasks in the field of music information retrieval (MIR). However, two critical issues remain to be addressed in data-limited scenarios. First, prior unsupervised domain adaptation methods for SSME typically rely on learning domain-agnostic features at the holistic level, which ignores the associations between holistic information (i.e., fundamental frequency) and fine-grained information (i.e., tone and octave). Second, fine-grained information can be used to judge the availability of unlabeled data, which prior methods ignore: no existing consistency regularization method uses fine-grained information to validate the availability of unlabeled data. To address these issues, we propose a novel two-stage decoupling unsupervised domain adaptation framework for semi-supervised singing melody extraction, termed DUDA. Specifically, in the first stage, we decouple the holistic information into fine-grained information, namely tone and octave, and narrow the domain gap at the tone and octave levels, respectively. This enables the model to align the tone-octave information between source and target domains for better feature distribution. Then, we leverage the learned domain-agnostic fine-grained features as additional information to obtain domain-agnostic holistic features. We also propose aligning intra-domain, inter-domain, and sample-level features to further improve performance. In the second stage, we propose a novel tone-octave consistency regularization method that leverages the extracted fine-grained information to judge the availability of unlabeled data. We evaluate our proposed framework on several well-known public datasets, and the experiments demonstrate the effectiveness of our method.
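
The decoupling of a fundamental frequency into tone and octave can be made concrete with a small sketch. It assumes that "tone" means the pitch class (0 to 11, with C = 0) and that "octave" follows scientific pitch notation (A4 = 440 Hz, MIDI note 69); the paper's exact quantization may differ, and the function names are hypothetical.

```python
import math

def f0_to_tone_octave(f0_hz):
    """Quantize a fundamental frequency to the nearest semitone, then split it."""
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))
    tone = midi % 12          # pitch class: 0 = C, 9 = A
    octave = midi // 12 - 1   # scientific pitch notation: MIDI 60 is C4
    return tone, octave

def tone_octave_to_f0(tone, octave):
    """Recombine tone and octave into the frequency of the quantized note."""
    midi = 12 * (octave + 1) + tone
    return 440.0 * 2 ** ((midi - 69) / 12)

a4 = f0_to_tone_octave(440.0)        # the note A4
middle_c = f0_to_tone_octave(261.63)  # approximately C4
```

Two notes an octave apart share the same tone and differ only in octave, which is why the two components can be aligned across domains separately and then recombined.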

Abstract:
Millimeter-wave (mmWave) radar enables privacy-preserving gesture recognition but suffers from limited training data, particularly for lying postures. Existing mmWave radar data generation methods are ineffective due to insufficient 2D video data. To this end, we design a novel system named Venus to generate realistic radar data for lying postures using only a few 2D videos. Venus addresses two key challenges: (i) simulating diverse reflected signals and (ii) the low data fidelity caused by scarce real-world data. Venus consists of two key components: (i) a gesture sequence generation and signal simulation network, which combines a movement information extractor, a spatio-temporal latent diffusion model, and an mmWave signal simulator to generate diverse gesture vertex sequences under given conditions and simulate signal propagation characteristics to obtain coarse radar data; and (ii) a meta-learning domain adaptation network that generates realistic radar data from a small amount of real-world data via a meta-learning strategy. Extensive experiments on both generated and self-collected datasets demonstrate that Venus significantly outperforms state-of-the-art methods in recognizing gestures performed in lying postures.

Abstract:
Brain-inspired Spiking Neural Networks (SNNs) have garnered significant attention due to their bio-plausibility and low power consumption compared to Artificial Neural Networks (ANNs). However, the application of SNNs in computer vision remains limited, primarily due to their inferior performance. In this work, we aim to bridge the performance gap between ANNs and SNNs in object detection with our Advanced SpikingYOLOX. The proposed approach extends SpikingYOLOX with two key innovations: PSA-SNN and 2D-Spiking Transformer, both designed to enhance object detection performance. PSA-SNN extends spike-based self-attention by incorporating high-speed partial self-attention with an SNN-based 2D-Spiking Transformer in the deepest layer of the backbone, significantly improving feature extraction. The 2D-Spiking Transformer redefines the role of spiking neurons in Transformer sequences (Key, Query, Value), demonstrating that applying an additional spiking layer solely to the Value sequence yields the best performance while maintaining computational efficiency in spike-driven Transformers. We conduct extensive experiments on static images, where Advanced SpikingYOLOX achieves state-of-the-art performance among SNN-based object detection methods. This work paves the way for more advanced SNN applications in object detection and broader computer vision tasks.

Abstract:
This paper presents CrePoster, a data-driven framework to generate aesthetic posters for Chinese cultural relics, aiming to enhance the exhibition experience and promote cultural spread. CrePoster comprises three modules: (1) object segmentation module, (2) content generation module, and (3) poster generation module. Upon processing a cultural relic image, the object segmentation module first leverages a cascaded U2Net-SAM structure to obtain the visual target. Secondly, the content generation module utilizes a multi-target learning-enabled caption generator to produce professional captions. Thirdly, the Multimodal Large Language Model (MLLM) based poster generation module adaptively creates aesthetic parameters, including layout and color scheme, ultimately rendering them into refined posters.

Abstract:
Motion artifacts in structural and functional magnetic resonance imaging (MRI) pose a significant challenge for both clinical use and machine learning (ML)-based image analysis. Existing ML approaches for artifact correction require paired clean and corrupted datasets, which are difficult to acquire. We present py-simpace, an open-source, pip-installable MRI motion artifact simulation toolkit with native ML integration. py-simpace supports structural MRI and functional MRI (fMRI) simulation, offering configurable k-space and image-space motion, ghosting, Gibbs ringing, and physiological noise. It provides an end-to-end pipeline with a ready-to-use PyTorch Dataset interface for ML training. We describe the design of py-simpace v2.0, compare it with existing tools, and demonstrate its utility for robust artifact correction model development.
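
To illustrate what k-space motion simulation involves, the sketch below corrupts selected k-space rows of a 2D image as if the subject had been translated while those rows were acquired. It relies only on NumPy and the Fourier shift theorem; it does not use or reproduce the py-simpace API, and the function name, parameters, and square phantom are hypothetical.

```python
import numpy as np

def simulate_translation_artifact(image, shift_px, corrupted_rows):
    """Apply a translation by shift_px = (dy, dx) to the chosen k-space rows only."""
    ny, nx = image.shape
    kspace = np.fft.fft2(image)
    ky = np.fft.fftfreq(ny)[:, None]  # spatial frequency in cycles per pixel
    kx = np.fft.fftfreq(nx)[None, :]
    dy, dx = shift_px
    # Fourier shift theorem: a translation in image space is a linear phase ramp in k-space.
    ramp = np.exp(-2j * np.pi * (ky * dy + kx * dx))
    mask = np.zeros((ny, 1), dtype=bool)
    mask[np.asarray(corrupted_rows, dtype=int)] = True
    corrupted = np.where(mask, kspace * ramp, kspace)
    return np.abs(np.fft.ifft2(corrupted))

# Square phantom on a 32 x 32 grid.
phantom = np.zeros((32, 32))
phantom[10:20, 12:22] = 1.0

# Motion during every other phase-encode row produces ghosting-like artifacts.
ghosted = simulate_translation_artifact(phantom, (3, 0), range(0, 32, 2))
# Motion "during" all rows is just a clean shifted image.
shifted = simulate_translation_artifact(phantom, (3, 0), range(32))
```

Pairs such as `(ghosted, phantom)` are the kind of corrupted/clean training examples that are difficult to acquire from real scans.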

Abstract:
Multimodal human understanding is an evolving interdisciplinary field integrating computer science, psychology, and social sciences to model human perception, behaviour, and biases in multimodal data. While recent advancements in multimodal learning excel in tasks like image-text synthesis, they often overlook nuanced human-centric dynamics, such as cultural, political, and individual influences on how modalities (e.g., text and images) interact, complement, or contradict each other. The 4th International Workshop on Multimodal Human Understanding (MUWS) aims at addressing these challenges, fostering novel solutions that explicitly model human perception, behaviour, and biases in multimodal data, with a particular emphasis on real-world challenges in web and social media analysis. This year's edition covers two tracks: (1) human-centred multimodal understanding, such as quantifying social biases, analysing sentiment and hate speech, and modelling cross-modal interactions through interdisciplinary theories (e.g., semiotics, gestalt psychology); and (2) multimodal understanding of global events, supported by a newly curated dataset covering news articles with diverse stances, which facilitates research on cultural framing, societal impact, and bias mitigation in vision-language models. The event features two keynotes from renowned experts from journalism and computer science, research presentations for six accepted papers, and interactive discussions to explore cutting-edge methodologies and applications in multimodal human understanding. The workshop proceedings can be found at: https://dl.acm.org/doi/proceedings/10.1145/3728481

Abstract:
This workshop addresses next-generation methods in multimedia research, with a focus on content generation, quality assessment, and dataset development. These three areas are foundational for advancing multimedia technologies and applications. Emerging approaches in multimedia content generation, powered by generative AI and multimodal learning, are reshaping domains such as entertainment, advertising, education, and healthcare. At the same time, robust quality assessment is essential to ensure that generated content achieves high standards of perceptual fidelity, semantic consistency, and user satisfaction, thereby determining the real-world impact of multimedia systems. Datasets remain indispensable for training and evaluating algorithms, and innovative strategies in dataset construction, ranging from augmentation and annotation to addressing issues of bias and small-sample imbalance, are driving the development of more reliable and ethical multimedia applications. By convening leading researchers and practitioners, this workshop provides a platform to explore state-of-the-art methods, share best practices, and discuss open challenges in next-generation multimedia research. The goal is to foster interdisciplinary collaboration and inspire innovative solutions that advance the creation, evaluation, and application of multimedia content, setting new benchmarks for the field and shaping the future of multimedia technologies.

Abstract:
Existing LLM-based motion models fail to fully leverage large models' planning capabilities for motion-related tasks, exhibiting poor generalization, limited text-motion alignment, and an inability to perform motion generation jointly driven by multimodal conditions. We argue that these issues arise from the modality gap and the highly coupled nature of motion tokens. To address this, we propose the hybrid motion sentence, which consists of fine-grained motion descriptions and atomic body-part motion tokens and can bridge the gap between motion and text. To obtain a large corpus of hybrid motion sentences, we introduce a novel motion-to-text generation method that combines atomic motion operators with GPT-4o, resulting in 68.2 million fine-grained textual descriptions across diverse modalities. To reconstruct high-quality motion from hybrid sentences and improve motion-text alignment, we introduce Semantic-Aware Decoupled Motion Tokenization. Furthermore, we propose MotionUPG based on LLaMA, leveraging the MotionWords dataset for both pretraining and instruction tuning. Our method achieves strong fine-grained text-motion alignment and impressive zero-shot motion generation, and is the first to support motion generation tasks jointly driven by multimodal conditions.

Abstract:
Video reframing, which converts landscape-oriented (LO) video to portrait-oriented (PO) video for PO devices such as smartphones and tablets, remains challenging. Existing approaches mainly follow a multi-step pipeline that preserves video content but ignores composition quality, owing to the lack of large-scale datasets. To address these challenges, we construct a composition-aware dataset through a fully automated pipeline that uses vision-language models and image composition assessment models to pair LO videos with high-quality PO versions. We then propose an end-to-end model with an attention-aware backbone and a time-aware consistency module. Experiments show that our approach outperforms others in efficiency and effectiveness, indicating that composition awareness and end-to-end modeling are critical for video reframing.

Abstract:
As a form of multimedia creation, the visual novel (VN) conveys engaging narratives through the integrated presentation of text, images, and music, and has shown promise across various application domains. Recent advances in generative AI have fueled interest in automating VN creation using LLMs and other foundation models. However, fully end-to-end VN creation (i.e., from user description to executable VN) remains underexplored and presents several key challenges: 1) the hallucination and limited capacity of LLMs hinder the generation of long and coherent plots; 2) current models lack effective mechanisms for ensuring cross-modal consistency between plot, visual, and audio elements. To address these issues, we propose a hierarchical end-to-end framework for automatic VN generation and assembly, which employs an outline-guided autoregressive generation mechanism that transforms high-level user prompts into coherent plots, while a vision LLM-based self-correction mechanism ensures consistency between multimedia assets and plot content. Additionally, we introduce a script validation mechanism to ensure the executability of the final VN application. Experiments demonstrate that our framework generates high-quality VN applications with coherent storylines and consistent multimedia content.

Abstract:
AI-based fitness coaching systems are typically monolithic and opaque, limiting adaptability, transparency, and embodied interaction. We propose a protocol-integrated Digital Twin (DT) architecture that reimagines fitness coaching as a distributed, explainable, and emotionally adaptive ecosystem. The framework adopts a CrewAI-inspired multi-agent design, where specialized agents for posture analysis, speech, physiological sensing, and personalized recommendation collaborate through the Agent-to-Agent (A2A) protocol to enable secure and interoperable task delegation. Context is maintained through short- and long-term memory modules, while the Model Context Protocol (MCP) supports flexible tool and model invocation across heterogeneous AI resources.

Abstract:
The rapid development of generative AI, and in particular deepfake technology, enables the seamless creation and manipulation of visual content. As the resulting syntheses are often indistinguishable from authentic images, they threaten the integrity of visual evidence. While forensic detectors can be used to detect syntheses, they can become targets of adversarial attacks. In the "Adversarial Attacks on Deepfake Detectors" challenge, competitors were tasked with perturbing a dataset of AI-synthesized images so that four classifiers would mistakenly accept them as authentic. In this paper, we introduce our solution, a white-box adversarial framework that injects globally distributed, data-driven noise perturbations optimized via additional surrogate Vision Transformer and EfficientNet classifiers. Empirical comparisons to both conventional post-processing transforms and localized adversarial patches demonstrate that our approach based on globally distributed noise achieves the highest attack success rates across all public detectors while preserving superior SSIM, confirming its efficacy and visual imperceptibility. In the final evaluation of the challenge, our proposed approach placed third with a final score of 2679.

Abstract:
Salient object detection in optical remote sensing images (ORSI-SOD) faces unique challenges due to complex backgrounds, diverse scales, and multi-directional objects. Existing methods primarily rely on visual features, often struggling to distinguish salient objects from visually similar backgrounds. To address this limitation, we leverage large language models (LLMs) to expand existing ORSI-SOD datasets with detailed textual annotations, creating a more comprehensive benchmark for image-text ORSI-SOD. Building upon this foundation, we propose the Frequency Meets Semantics Network (FMS-Net), a novel framework that integrates text-visual fusion with directional spectral enhancement for ORSI-SOD. FMS-Net consists of two key innovations: the Hierarchical Multi-Modal Dual-Channel Fusion (HMDF) module and the Adaptive Directional Spectral Enhancement (ADSE) module. The HMDF module enables bidirectional interactions between visual and textual features via parallel global-local attention mechanisms, progressively enriching visual representations with semantic context. Meanwhile, the ADSE module enhances feature representations in the frequency domain, capturing directional patterns and boundary details critical for accurate saliency detection. Extensive experiments on two public datasets, ORSSD and EORSSD, demonstrate that FMS-Net outperforms state-of-the-art methods, particularly in complex scenes with ambiguous boundaries. Our work paves the way for integrating multi-modal and frequency-based approaches in the interpretation of optical remote sensing images (ORSI).

Abstract:
Three-dimensional (3D) light field displays (LFDs) provide immersive visual experiences and have attracted increasing attention. However, visual fatigue remains an important concern when users watch 3D LFDs, which limits their development and application. In this paper, we propose a comprehensive methodology that integrates subjective and objective data to establish a robust dataset and employs eye movement data to systematically investigate visual fatigue in 3D LFDs. First, a multimodal dataset is constructed by combining subjective fatigue scores with objective eye movement data. Then, we propose the Deep Correlation Data Analysis Model (DCDAM), which uses Spearman's rank correlation coefficient to analyze correlations between key objective metrics and subjective fatigue curves, validating the effectiveness of these metrics. Furthermore, to comprehensively assess visual fatigue, we develop a specialized model, the Temporo-Spatial Synergy Network (TSSNet), which uses temporal and spatial eye movement features to predict subjective fatigue curves. Through validation across diverse videos, the model achieves R² > 0.98 (±0.005) and RMSE of 0.02 (±0.05) between actual and predicted values, demonstrating high precision and valid generalization across different video content. The proposed model provides a foundational framework for future research on visual fatigue assessment tasks of 3D LFDs.
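
As an illustration of the correlation step mentioned in this abstract, the following is a minimal sketch of Spearman's rank correlation coefficient in plain Python. It is not the authors' DCDAM; the variable names and the sample values (a hypothetical blink-rate series and fatigue scores) are assumptions made only for this example.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(rank(x), rank(y))


# Hypothetical data: an eye movement metric and subjective fatigue scores.
blink_rate = [12.0, 14.5, 15.1, 18.3, 21.0]
fatigue_score = [1.0, 1.5, 2.5, 3.0, 4.5]
print(spearman(blink_rate, fatigue_score))
```

Because the coefficient depends only on ranks, it captures any monotonic relationship between a metric and the fatigue curve, not just a linear one.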

Abstract:
Considerable effort in Quality of Experience (QoE) research has gone into subjective assessment to determine the perceived quality of video. Most laboratory experiments follow guidelines from the ITU-T Recommendations, which suggest Absolute Category Rating (ACR) as a method to conduct such experiments. However, this method of video assessment radically limits confounding variables, is unnatural, and is far from how people cope with quality degradation and the cues they receive in these situations. This paper addresses this issue and proposes a more realistic subjective experiment based on the participant's behavior. Instead of having participants passively rate the degraded quality of silent videos, we let them react to the annoying quality of chosen Netflix movies and rewarded them by increasing the quality to the best possible. To cope with the data obtained, we adapted the method of fitting psychometric functions known from neuroscience, auditory science, animal science, and psychology. As a result, we obtained a more comprehensive picture of the participants and their perceived quality, including their consistency, lapses, and differences. We estimated the parameters using Maximum Likelihood Estimation (MLE) and evaluated the goodness-of-fit of three S-shaped functions: Weibull, cumulative normal, and logistic. Finally, we analyze the most commonly reported properties of the fitted functions, such as the Point of Subjective Equality (PSE), confidence intervals, and slope (β). We examine their role in describing individual differences among 34 subjects.
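
To make the fitting step concrete, here is a minimal sketch of maximum likelihood estimation for a two-parameter logistic psychometric function, using a coarse grid search in plain Python. It is an assumption-laden illustration rather than the authors' procedure: the parameterization (`alpha` as the 50% point, `beta` as the slope), the omission of guess and lapse rates, the grid ranges, and the response counts are all hypothetical.

```python
import math


def logistic(x, alpha, beta):
    """Logistic psychometric function; alpha is the 50% point, beta the slope."""
    return 1.0 / (1.0 + math.exp(-beta * (x - alpha)))


def neg_log_likelihood(alpha, beta, levels, successes, trials):
    """Binomial negative log-likelihood (constant terms dropped)."""
    nll = 0.0
    for x, k, n in zip(levels, successes, trials):
        # Clamp probabilities so log() never receives 0.
        p = min(max(logistic(x, alpha, beta), 1e-9), 1.0 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return nll


def fit_logistic(levels, successes, trials):
    """Grid-search MLE over alpha in [0, 8] and beta in [0.1, 5], step 0.05."""
    best = None
    for i in range(0, 161):
        alpha = i * 0.05
        for j in range(2, 101):
            beta = j * 0.05
            nll = neg_log_likelihood(alpha, beta, levels, successes, trials)
            if best is None or nll < best[0]:
                best = (nll, alpha, beta)
    return best[1], best[2]


# Hypothetical data: degradation level, number of reactions, number of trials.
levels = [1, 2, 3, 4, 5, 6, 7]
successes = [11, 47, 182, 500, 818, 953, 989]
trials = [1000] * 7
alpha_hat, beta_hat = fit_logistic(levels, successes, trials)
print(alpha_hat, beta_hat)
```

In this simplified form, `alpha` is the stimulus level at which the response probability is 0.5, which is one common way to read off a PSE-like threshold. A practical fit would typically use a numerical optimizer and add lapse and guess parameters.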

Abstract:
Conventional 2D inpainting models are trained using masks confined to 2D scenarios, resulting in meaningless content when applied to 3D-specific masks. These 3D-specific masks, termed Unknown Pixels (UP) masks, represent unseen pixels from novel viewpoints that remain obscured in the original input image. Existing methods attempt to mitigate this issue by employing post-processing techniques to transform UP masks into 2D equivalents, which frequently introduces unnatural distortions. To address these issues, we investigate the efficacy of directly training 2D inpainting models with UP masks to circumvent such distortions. In this paper, we introduce a novel framework designed to generate unbounded 3D scenes from a single image, guided by textual descriptions. Our approach leverages fine-tuned inpainting models that iteratively reconstruct incomplete images originating from pure projection. The generated points are then seamlessly integrated into the original point cloud via pixel-wise depth alignment. Extensive evaluations demonstrate that our framework outperforms existing methods in scene quality, processing speed, and memory efficiency.

Abstract:
The proliferation of generative models, particularly Generative Adversarial Networks (GANs) and Diffusion Models, has reshaped multimedia content creation. Alongside creative and commercial opportunities, they have introduced unprecedented risks through the production of highly realistic synthetic content, or deepfakes. These artifacts challenge visual and auditory trust, with major implications for media, security, politics, and law. This workshop provides a forum to examine deepfake technology from forensic, technical, legal, and social perspectives. It will bring together experts to advance robust and explainable detection methods, define benchmarking practices, and address ethical and regulatory frameworks. Topics include detection and attribution, adversarial countermeasures, multimodal analysis, model traceability, legal admissibility of synthetic content, as well as real-world deployment challenges and dataset creation. Further information about the workshop is available at https://iplab.dmi.unict.it/mfs/acm-dff-ws-2025/

Abstract:
With the popularization of the Internet and the diversification of attack methods, web security has become an important part of information security. As a carrier of network behavior, traffic can reveal attack behaviors in the Web environment through malicious traffic detection. Since images can fully express spatial features and local associations, it is feasible to visualize traffic as images and retrieve key feature information to detect malicious traffic. However, existing methods have three limitations. First, they are prone to redundancy during feature extraction. Second, relying on a single perspective makes it difficult to learn generalizable patterns from diverse feature information. Third, the selection of segmentation thresholds during preprocessing strongly affects the model's information retrieval, and the commonly adopted preset thresholds cope poorly with changes in traffic data, limiting the applicability of existing methods. Therefore, this paper proposes a Multi-Module-Based Composite Robust Model for Network Attack Detection (MCNAD). The model adopts depthwise separable convolution (DSC) to reduce redundant information, proposes a multi-scale feature learning module to enhance the model's representation ability, and proposes a gray level co-occurrence matrix segmentation algorithm with adaptive threshold (GLCM-AT) to optimize data preprocessing. The results show that MCNAD improves detection performance with better detection efficiency, generalization ability, and robustness, demonstrating its wide applicability in multiple scenarios.
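
For readers unfamiliar with the gray level co-occurrence matrix underlying GLCM-AT, the sketch below computes a standard GLCM for a single pixel offset in plain Python. It does not reproduce the authors' adaptive-threshold segmentation; the function name, the offset convention, and the tiny sample image are assumptions for illustration only.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels (i, j) at offset (dx, dy).

    image: 2D list of integer gray levels in range(levels).
    Returns a levels x levels matrix where entry [i][j] is the number of
    pixel pairs whose first pixel has level i and whose neighbor at the
    given offset has level j.
    """
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Skip pairs whose neighbor falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


# Hypothetical 3x3 traffic image quantized to 3 gray levels.
sample = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
for row in glcm(sample, levels=3):
    print(row)
```

Texture statistics such as contrast or homogeneity are usually derived from this matrix after normalizing it by the total number of pairs.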

Abstract:
SUMAC 2025 is the 7th edition of the workshop on analySis, Understanding and proMotion of heritAge Contents. It is held in Dublin, Ireland, on 27 October and is co-located with the 33rd ACM International Conference on Multimedia. The workshop's objective is to present and discuss the latest and most significant trends, challenges, and advances in the fields of machine learning, signal processing, multimodal techniques, and human-machine interaction. The workshop is dedicated to the valorization of cultural heritage, with an emphasis on unlocking and providing access to the big data of the past. The works presented cover a representative range of Computer Science methodologies dedicated to the processing and exploitation of multimedia heritage contents, with the ambition of advancing and raising awareness about this rapidly developing research field.