Abstract Paper Portal of ACM Multimedia 2025

Paperid: 1,   https://arxiv.org/pdf/2507.21167
Authors:Donglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Title: ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions
Abstract:
Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM$^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM$^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM$^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM$^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.
Authors:Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation
Abstract:
Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.
Authors:Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
Title: Open-Set Image Tagging with Multi-Grained Text Supervision
Abstract:
In this paper, we introduce the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. Previous approaches (e.g., CLIP) primarily utilize global text supervision paired with images, leading to sub-optimal performance in recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly integrates individual tag supervision with global text supervision, all within a unified alignment framework. This integration not only ensures efficient recognition of predefined tag categories, but also enhances generalization capabilities for diverse open-set categories. Furthermore, RAM++ employs large language models (LLMs) to convert semantically constrained tag supervision into more expansive tag description supervision, thereby enriching the scope of open-set visual description concepts. Comprehensive evaluations on various image recognition benchmarks demonstrate that RAM++ exceeds existing state-of-the-art (SOTA) open-set image tagging models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond the predefined ones, RAM++ records improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM, respectively, on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.
Authors:Yuhong Zhang, Liyao Wang, Han Wang, Danni Wu, Zuzeng Lin, Feng Wang, Li Song
Title: AnimeColor: Reference-based Animation Colorization with Diffusion Transformers
Abstract:
Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at https://github.com/IamCreateAI/AnimeColor.
Authors:Qiang Chen, Zhongze Wu, Ang He, Xi Lin, Shuo Jiang, Shan You, Chang Xu, Yi Chen, Xiu Su
Title: Graph Unlearning Meets Influence-aware Negative Preference Optimization
Abstract:
Recent advancements in graph unlearning models have enhanced model utility by keeping the node representations essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce INPO, an Influence-aware Negative Preference Optimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first show that NPO has a slower divergence speed and theoretically demonstrate that unlearning high-influence edges can reduce the impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that the INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model's utility. Codes are available at https://github.com/sh-qiangchen/INPO.
Authors:Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu
Title: DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration
Abstract:
Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework that integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block, which leverages component information from content images to guide the style transfer process, and the relation attention block, which further refines spatial relationships by interacting the content feature with both the original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to further improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at https://github.com/wrchen2001/DA-Font.
Authors:Yitong Yang, Yinglin Wang, Tian Zhang, Jing Wang, Shuting He
Title: Prompt-Softbox-Prompt: A Free-Text Embedding Control for Image Editing
Abstract:
While text-driven diffusion models demonstrate remarkable performance in image editing, the critical components of their text embeddings remain underexplored. The ambiguity and entanglement of these embeddings pose challenges for precise editing. In this paper, we provide a comprehensive analysis of text embeddings in Stable Diffusion XL, offering three key insights: (1) the aug embedding (obtained by combining the pooled output of the final text encoder with the timestep embeddings; see https://github.com/huggingface/diffusers) retains complete textual semantics but contributes minimally to image generation, as it is only fused via the ResBlocks. More text information weakens its local semantics while preserving most global semantics. (2) BOS and padding embeddings do not contain any semantic information. (3) EOS holds the semantic information of all words as well as stylistic information. Each word embedding is important and does not interfere with the semantic injection of other embeddings. Based on these insights, we propose PSP (Prompt-Softbox-Prompt), a training-free image editing method that leverages free-text embedding. PSP enables precise image editing by modifying text embeddings within the cross-attention layers and using Softbox to control the specific area for semantic injection. This technique enables the addition and replacement of objects without affecting other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experiments show that PSP performs remarkably well in tasks such as object replacement, object addition, and style transfer. Our code is available at https://github.com/yangyt46/PSP.
Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
Title: ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Abstract:
Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and beyond. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises two primary components: 1) the Multimodal Pose Encoder, a contrastive-learning-based encoder that accounts for the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts; 2) the Immersive Drama Transformer, a flow-based Mamba-Transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.
Authors:Lei Yao, Yi Wang, Yi Zhang, Moyun Liu, Lap-Pui Chau
Title: GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
Abstract:
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Although existing self-supervised pre-training methods demonstrate promising performance, model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable representations and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
Authors:Shouxing Ma, Yawen Zeng, Shiqing Wu, Guandong Xu
Title: Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation
Abstract:
Multi-modal recommender systems focus on utilizing rich modal information (i.e., images and textual descriptions) of items to improve recommendation performance. Current methods have achieved remarkable success thanks to the powerful structure modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography (i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer from two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homography relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose a novel framework for REfining multi-modAl contRastive learning and hoMography relations (REARM). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. Extensive experiments on three real-world datasets demonstrate the superiority of REARM over various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available at https://github.com/MrShouxingMa/REARM.
Authors:Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Title: LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
Abstract:
In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build LAVA, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. LAVA comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that LAVA improves F$_1$-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves a top-k precision of 86%, while processing videos 9.6× faster than the most accurate baseline. Our code and dataset are available at https://github.com/yuyanrui/LAVA.
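As a rough illustration of the bandit-style segment sampling idea mentioned in the abstract (not LAVA's actual sampler), a minimal UCB1 loop over video segments might look like the sketch below; `observe_reward` is a hypothetical callable standing in for a query-relevance check on a frame sampled from a segment:

```python
import math
import random

def ucb1_segment_sampling(num_segments, budget, observe_reward, c=2.0):
    """Illustrative UCB1 loop for deciding which video segments to probe.

    observe_reward(i) returns a reward in [0, 1] (e.g., whether a frame sampled
    from segment i matched the query). Generic sketch, not the paper's method.
    """
    counts = [0] * num_segments
    values = [0.0] * num_segments

    # Initialization: probe every segment once.
    for i in range(num_segments):
        counts[i], values[i] = 1, observe_reward(i)

    # Spend the remaining budget on the segments with the highest upper bound.
    for t in range(num_segments, budget):
        scores = [values[i] + math.sqrt(c * math.log(t + 1) / counts[i])
                  for i in range(num_segments)]
        i = max(range(num_segments), key=scores.__getitem__)
        reward = observe_reward(i)
        counts[i] += 1
        values[i] += (reward - values[i]) / counts[i]  # running mean update
    return counts, values

# Toy run: segments 3 and 7 contain the queried objects more often.
counts, _ = ucb1_segment_sampling(
    num_segments=10, budget=200,
    observe_reward=lambda i: float(random.random() < (0.8 if i in (3, 7) else 0.1)))
```

The upper-confidence term steers sampling toward segments that have returned relevant hits while still occasionally probing under-explored ones, which is the general trade-off such segment-level localization exploits.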
Authors:Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Title: RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
Abstract:
After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. We compare our dataset with other widely used datasets of equivalent scale for CLIP training. Models pre-trained on RealSyn consistently achieve state-of-the-art performance across various downstream tasks, including linear probe, zero-shot transfer, zero-shot robustness, and zero-shot retrieval. Furthermore, extensive experiments confirm that RealSyn significantly enhances contrastive vision-language representation learning and demonstrates robust scalability. To facilitate future research, the RealSyn dataset and pretrained model weights are released at https://github.com/deepglint/RealSyn.
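For context on the CLIP-style training that RealSyn targets, the standard symmetric contrastive (InfoNCE) objective over a batch of $N$ image-text pairs, with normalized embeddings $I_i$ and $T_j$, similarity $\mathrm{sim}(\cdot,\cdot)$, and temperature $\tau$, can be written as:

\[
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\!\big(\mathrm{sim}(I_i,T_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(I_i,T_j)/\tau\big)} + \log\frac{\exp\!\big(\mathrm{sim}(I_i,T_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(I_j,T_i)/\tau\big)}\right]
\]

This is the generic objective; RealSyn's contribution lies in how the image-text pairs fed into it are constructed from interleaved documents (retrieved realistic texts plus synthetic texts), not in the loss itself.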
Authors:Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
Title: HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
Abstract:
The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
Authors:Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Title: Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection
Abstract:
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at https://github.com/ZhuWenjie98/KRNFT.
Authors:Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, Huiying Li
Title: PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction
Abstract:
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, limiting their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucination, making it very costly for them to strictly adhere to the expected output format in tasks involving multiple point predictions and challenging to achieve precise point positioning. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender across 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.
Authors:Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, Linlin Shen
Title: DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
Abstract:
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
Authors:Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Title: Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations
Abstract:
The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale, reasoning-capability-driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large-scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLMs). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on GitHub at https://github.com/Parul-Gupta/MultiFakeVerse.
Authors:Chao Yin, Hao Li, Kequan Yang, Jide Li, Pinpin Zhu, Xiaoqiang Li
Title: Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation
Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves state-of-the-art segmentation results on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The codes will be available at https://github.com/ycyinchao/RDVP-MSD.
Authors:Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
Title: JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering
Abstract:
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of visual image perturbation and textual prompt steering. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attackers' intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show that JPS sets a new state of the art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.
Authors:Mingwei Li, Pu Pang, Hehe Fan, Hua Huang, Yi Yang
Title: TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors
Abstract:
Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard α-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over α-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset are available at https://longxiang-ai.github.io/TSGS/.
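For background, the standard 3DGS α-blending weights over which such a first-surface depth extraction operates can be written as follows (a generic formulation; the sliding-window choice of $\mathcal{W}$ is specific to TSGS):

\[
w_i = \alpha_i \prod_{j<i} (1-\alpha_j), \qquad C = \sum_i c_i\, w_i, \qquad \hat{D} = \frac{\sum_{i\in\mathcal{W}} d_i\, w_i}{\sum_{i\in\mathcal{W}} w_i},
\]

where $c_i$, $d_i$, and $\alpha_i$ are the color, depth, and opacity of the $i$-th Gaussian along a ray, and $\mathcal{W}$ denotes the window of blending weights identified as the most likely first surface. For transparent materials the full-sum depth $\sum_i d_i w_i$ mixes front and back surfaces, which is exactly the transparency-depth dilemma the abstract describes.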
Authors:Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan, Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou
Title: Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
Abstract:
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
Authors:Guankun Wang, Han Xiao, Huxin Gao, Renrui Zhang, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li, Hongliang Ren
Title: CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection
Abstract:
Endoscopic submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries high risks of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Large Visual-Language Models (LVLMs) offer promising decision support and predictive planning capabilities for robotic systems, which can augment the accuracy of ESD and reduce procedural risks. However, existing datasets for multi-level fine-grained ESD surgical motion understanding are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training LVLMs as the robotic Co-Pilot of Endoscopic Submucosal Dissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, from over 35 hours of ESD videos for both robot-assisted and conventional surgeries. CoPESD enables granular analysis of ESD motions, focusing on the complex task of submucosal dissection. Extensive experiments on the LVLMs demonstrate the effectiveness of CoPESD in training LVLMs to predict the subsequent surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD instruction-following and surgical automation. The dataset is available at https://github.com/gkw0010/CoPESD.
Authors:Zhihao Luo, Luojun Lin, Zheng Lin
Title: Synthetic-to-Real Camouflaged Object Detection
Abstract:
Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image datasets are insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training on synthetic datasets rather than real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real-world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specifically, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domains. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcrafted annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.
Authors:M. H. Izadimehr, Milad Ghanbari, Guodong Chen, Wei Zhou, Xiaoshuai Hao, Mallesham Dasari, Christian Timmerer, Hadi Amirpour
Title: SVD: Spatial Video Dataset
Abstract:
Stereoscopic video has long been the subject of research due to its capacity to deliver immersive three-dimensional content across a wide range of applications, from virtual and augmented reality to advanced human-computer interaction. The dual-view format inherently provides binocular disparity cues that enhance depth perception and realism, making it indispensable for fields such as telepresence, 3D mapping, and robotic vision. Until recently, however, end-to-end pipelines for capturing, encoding, and viewing high-quality 3D video were neither widely accessible nor optimized for consumer-grade devices. Today's smartphones, such as the iPhone Pro, and modern Head-Mounted Displays (HMDs), like the Apple Vision Pro (AVP), offer built-in support for stereoscopic video capture, hardware-accelerated encoding, and seamless playback on devices like the Apple Vision Pro and Meta Quest 3, requiring minimal user intervention. Apple refers to this streamlined workflow as spatial video. Making the full stereoscopic video process available to everyone has made new applications possible. Despite these advances, there remains a notable absence of publicly available datasets that include the complete spatial video pipeline. In this paper, we introduce SVD, a spatial video dataset comprising 300 five-second video sequences, 150 captured using an iPhone Pro and 150 with an AVP. Additionally, 10 longer videos with a minimum duration of 2 minutes have been recorded. The SVD dataset is publicly released under an open-access license to facilitate research in codec performance evaluation, subjective and objective quality of experience (QoE) assessment, depth-based computer vision, stereoscopic video streaming, and other emerging 3D applications such as neural rendering and volumetric capture. Link to the dataset: https://cd-athena.github.io/SVD/
Authors:Kun Ding, Ying Wang, Shiming Xiang
Title: EvoVLMA: Evolutionary Vision-Language Model Adaptation
Abstract:
Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and expertise. Inspired by recent advances in Large Language Model (LLM)-based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of the searching process, we propose low-precision code conversion, web-based code execution and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA.
Authors:Yaoxun Xu, Hangting Chen, Jianwei Yu, Wei Tan, Rongzhi Gu, Shun Lei, Zhiwei Lin, Zhiyong Wu
Title: MuCodec: Ultra Low-Bitrate Music Codec
Abstract:
Music codecs are a vital aspect of audio codec research, and ultra-low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely relying on modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction tasks at ultra-low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with residual vector quantization (RVQ), and obtains Mel-VAE features via flow-matching. The music is then reconstructed using a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at an ultra-low (0.35 kbps) or a higher (1.35 kbps) bitrate, achieving the best results to date in both subjective and objective metrics. Code and demo: https://xuyaoxun.github.io/MuCodec_demo/.
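As a generic illustration of the RVQ step (not MuCodec's exact module), residual vector quantization encodes a feature by letting each successive codebook quantize the residual left by the previous stages, so only a handful of code indices per vector need to be transmitted:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Generic residual vector quantization (RVQ) sketch; illustrative only.

    x: (dim,) feature vector; codebooks: list of (num_codes, dim) arrays.
    Each stage quantizes the residual left over by the previous stages.
    """
    residual = x.astype(np.float64).copy()
    indices, quantized = [], np.zeros_like(residual)
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        quantized += codebook[idx]
        residual -= codebook[idx]
    return indices, quantized

# Toy usage: 4 stages with 256 codewords each (i.e., 4 x 8 bits per vector).
rng = np.random.default_rng(0)
x = rng.normal(size=64)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
indices, x_hat = rvq_encode(x, codebooks)
```

The bitrate is then set by how many stages are kept and how often feature vectors are emitted, which is the lever ultra-low-bitrate codecs tune.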
Authors:Jiahao Zhang, Wenzhe Yin, Shujian Yu
Title: Cross-Modal Retrieval with Cauchy-Schwarz Divergence
Abstract:
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
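For reference, the Cauchy-Schwarz divergence between two densities $p$ and $q$ (the standard definition the paper builds on) is:

\[
D_{\mathrm{CS}}(p\,\|\,q) = -\log \frac{\left(\int p(x)\, q(x)\, dx\right)^{2}}{\int p^{2}(x)\, dx \, \int q^{2}(x)\, dx},
\]

which follows from the Cauchy-Schwarz inequality and is symmetric, non-negative, and zero iff $p = q$; since it involves only the two distributions themselves, it is hyperparameter-free in the sense emphasized above. The generalized tri-modal extension via Hölder's inequality is specific to this paper.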
Authors:Xiaoyu Guo, Pengzhi Zhong, Hao Zhang, Defeng Huang, Huikai Shao, Qijun Zhao, Shuiwang Li
Title: Camouflaged Object Tracking: A Benchmark
Abstract:
Visual tracking has seen remarkable advancements, largely driven by the availability of large-scale training datasets that have enabled the development of highly accurate and robust algorithms. While significant progress has been made in tracking general objects, research on more challenging scenarios, such as tracking camouflaged objects, remains limited. Camouflaged objects, which blend seamlessly with their surroundings or other objects, present unique challenges for detection and tracking in complex environments. This challenge is particularly critical in applications such as military, security, agriculture, and marine monitoring, where precise tracking of camouflaged objects is essential. To address this gap, we introduce the Camouflaged Object Tracking Dataset (COTD), a specialized benchmark designed specifically for evaluating camouflaged object tracking methods. The COTD dataset comprises 200 sequences and approximately 80,000 frames, each annotated with detailed bounding boxes. Our evaluation of 20 existing tracking algorithms reveals significant deficiencies in their performance with camouflaged objects. To address these issues, we propose a novel tracking framework, HiPTrack-MLS, which demonstrates promising results in improving tracking performance for camouflaged objects. COTD and code are available at https://github.com/openat25/HIPTrack-MLS.
Authors:Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li
Title: EventVAD: Training-Free Event-Aware Video Anomaly Detection
Abstract:
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require substantial amounts of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
Authors:Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Title: Regularizing Subspace Redundancy of Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.
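For context, the standard LoRA reparameterization whose projection subspaces ReSoRA regularizes updates a frozen weight $W_0$ with a low-rank product (the de-redundancy constraints themselves are specific to ReSoRA):

\[
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d\times r},\; A \in \mathbb{R}^{r\times k},\; r \ll \min(d, k),
\]

where only $A$ and $B$ are trained and $\alpha$ is a scaling factor. The redundancy discussed above arises among the $r$ rank-one directions spanned by this product, which is what the subspace de-redundancy constraints act on.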
Authors:Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: Multi-Agent System for Comprehensive Soccer Understanding
Abstract:
Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Concretely, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K multimodal (text, image, video) multi-choice QA pairs across 13 distinct tasks; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and comparisons with representative MLLMs on SoccerBench highlight the superiority of our agentic system.
Authors:Xinyang Li, Chengjie Yi, Jiawei Lai, Mingbao Lin, Yansong Qu, Shengchuan Zhang, Liujuan Cao
Title: SynergyAmodal: Deocclude Anything with Text Control
Abstract:
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.
Authors:Aryana Hou, Li Lin, Justin Li, Shu Hu
Title: Rethinking Individual Fairness in Deepfake Detection
Abstract:
Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake makers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at https://github.com/Purdue-M2/Individual-Fairness-Deepfake-Detection.
Authors:Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun
Title: Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts
Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M^2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments demonstrate the effectiveness of MM-RAIT by significantly improving the quality of responses generated by different RAG models, achieving gains of 34% and 33% over MiniCPM-V 2.6 and Qwen2-VL, respectively. All data and code are available at https://github.com/NEUIR/M2RAG.
Authors:Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
Title: LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis
Abstract:
Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.
Authors:Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Xingyu Zhu, Yanbin Hao
Title: Accelerating Diffusion Transformer via Error-Optimized Cache
Abstract:
Diffusion Transformer (DiT) is a crucial method for content generation, but its sampling process is time-consuming. Many studies have attempted to use caching to reduce sampling time. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next; however, they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality as caching intensity increases. To solve this problem, we propose the Error-Optimized Cache (EOC). This method introduces three key improvements: (1) Prior knowledge extraction: extract and process the caching differences; (2) A judgment method for cache optimization: determine whether certain caching steps need to be optimized; (3) Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching, especially excessive caching. On the ImageNet dataset, without substantially increasing the computational load, this method improves the FID of the generated images for the rule-based model FORA at caching levels of 75%, 50%, and 25%, and for the training-based model Learning-to-cache at a caching level of 22%. Specifically, the FID values improve from 30.454 to 21.690 (28.8%), from 6.857 to 5.821 (15.1%), from 3.870 to 3.692 (4.6%), and from 3.539 to 3.451 (2.5%), respectively. Code is available at https://github.com/qiujx0520/EOC_MM2025.git.
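As a rough illustration of the caching idea that EOC builds on, the sketch below reuses cached per-block residuals of a DiT-style sampler on non-refresh steps. The refresh schedule, the residual form, and the `block(x, t)` interface are assumptions for illustration, not the paper's error-optimization procedure.

```python
def sample_with_feature_cache(blocks, x, timesteps, cache_interval=3):
    """Toy residual-caching loop for a DiT-style sampler (illustrative only;
    EOC additionally estimates and corrects the error that caching introduces)."""
    cached_residuals = {}
    for step, t in enumerate(timesteps):
        refresh = (step % cache_interval == 0)      # fully recompute on these steps
        for i, block in enumerate(blocks):
            if refresh or i not in cached_residuals:
                out = block(x, t)                   # full block computation
                cached_residuals[i] = out - x       # residual this block adds
            else:
                out = x + cached_residuals[i]       # reuse the cached residual
            x = out
    return x
```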
Authors:Lin Zhang, Yi Tian, XiYun Wang, Wanru Xu, Yi Jin, Yaping Huang
Title: Differential Contrastive Training for Gaze Estimation
Abstract:
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of CLIP. Accordingly, a Differential Contrastive Gaze Estimation network (DCGaze), composed of a Visual Appearance-aware branch and a Semantic Difference-aware branch, is introduced. The Visual Appearance-aware branch is essentially a primary gaze estimation network and incorporates an Adaptive Feature-refinement Unit (AFU) and a Double-head Gaze Regressor (DGR), both of which help the primary network extract informative, gaze-related appearance features. Moreover, the Semantic Difference-aware branch is designed on the basis of CLIP's text encoder to reveal the semantic differences between gazes. This branch further empowers the Visual Appearance-aware branch with the capability of characterizing gaze-related semantic information. Extensive experimental results on four challenging datasets over both within- and cross-domain tasks demonstrate the effectiveness of our DCGaze. The code is available at https://github.com/LinZhang-bjtu/DCGaze.
Authors:Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen
Title: Slot Attention with Re-Initialization and Self-Distillation
Abstract:
Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) we reduce redundancy in the aggregated slots and re-initialize an extra aggregation to update the remaining slots; (ii) we drive the poor attention map at the first aggregation iteration to approximate the good one at the last iteration, enabling self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available at https://github.com/Genera1Z/DIAS.
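A minimal sketch of the slot-redundancy idea: near-duplicate slots can be detected by pairwise cosine similarity and dropped before a fresh aggregation round. The threshold and the greedy pruning rule are assumptions for illustration, not DIAS's exact re-initialization mechanism.

```python
import torch.nn.functional as F

def prune_redundant_slots(slots, sim_thresh=0.9):
    """Greedily drop slots that duplicate an earlier kept slot.
    `slots` has shape [num_slots, slot_dim]."""
    keep = []
    for i in range(slots.shape[0]):
        is_dup = any(F.cosine_similarity(slots[i], slots[j], dim=0) > sim_thresh
                     for j in keep)
        if not is_dup:
            keep.append(i)
    return slots[keep]
```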
Authors:Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang
Title: TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
Abstract:
Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: https://github.com/SleepyLin/TASR
Authors:Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
Title: Boosting Temporal Sentence Grounding via Causal Inference
Abstract:
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Although existing studies have made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework based on causal intervention and counterfactual reasoning, which utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
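The visual counterfactual step can be pictured as subtracting the prediction obtainable from video features alone from the fused prediction, which removes the video-only shortcut effect. The simple subtraction and the weight `alpha` below are illustrative assumptions rather than the paper's exact causal-effect estimator.

```python
def counterfactual_debias(fused_logits, video_only_logits, alpha=1.0):
    """Remove the effect attributable to video features alone from the fused
    moment-prediction logits (a toy rendering of counterfactual debiasing)."""
    return fused_logits - alpha * video_only_logits
```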
Authors:Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang
Title: Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
Abstract:
With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, accurately assessing the quality of AI-generated images has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missed perception of fine-grained details. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles these challenges through two core modules. The Text-assisted Semantic Alignment Module (TSAM) leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check. The Frequency-domain Fine-Grained Degradation Perception Module (FFDPM) draws inspiration from Human Visual System (HVS) properties, employing frequency-domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and capture fine-grained visual quality details. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
Authors:Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, Yuhui Zheng
Title: RemoteSAM: Towards Segment Anything for Earth Observation
Abstract:
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architectures trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.
Authors:Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li
Title: Graph Canvas for Controllable 3D Scene Generation
Abstract:
Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets, and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce GraphCanvas3D, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D enables dynamic adaptability without the need for retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website: https://github.com/ILGLJ/Graph-Canvas.
Authors:Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Title: Vector-Quantized Vision Foundation Models for Object-Centric Learning
Abstract:
Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision of reconstructing the input from slots struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply sharing quantized VFM representations across OCL aggregation and decoding. Experiments show that across different VFMs, aggregators, and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
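The shared quantization at the core of VVO can be illustrated with a standard vector-quantization step: each VFM feature vector is replaced by its nearest codebook entry, with a straight-through estimator so the encoder still receives gradients. The codebook handling below is a generic VQ sketch, not the exact VVO module.

```python
import torch

def vector_quantize(features, codebook):
    """Nearest-codebook quantization of feature vectors.
    `features`: [n, d], `codebook`: [k, d]."""
    dists = torch.cdist(features, codebook)       # pairwise L2 distances [n, k]
    idx = dists.argmin(dim=-1)                    # nearest code per feature
    quantized = codebook[idx]
    # straight-through estimator: forward uses codes, backward passes gradients through
    return features + (quantized - features).detach(), idx
```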
Authors:Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro
Title: MM-HSD: Multi-Modal Hate Speech Detection in Videos
Abstract:
While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd
Authors:Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo
Title: DiffusionMat: Alpha Matting as Sequential Refinement Learning
Abstract:
In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page at https://cnnlstm.github.io/DiffusionMat
Authors:Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, Mingkui Tan
Title: Test-Time Model Adaptation for Quantized Neural Networks
Abstract:
Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts, and this degradation is significantly more pronounced compared with their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation, an operation that is unsupported on quantized models due to vanishing gradients as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing the interference between different domains and fostering knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA achieves a 5.0% improvement over the state-of-the-art FOA on the ImageNet-C dataset. The source code is available at https://github.com/DengZeshuai/ZOA.
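The two-forward-pass idea can be sketched with a standard SPSA-style zeroth-order update: perturb the adapted parameters along a random sign direction, evaluate an unsupervised test-time loss (e.g., prediction entropy) at the plus and minus perturbations, and step along the estimated gradient. The perturbation scale, learning rate, and loss choice below are assumptions for illustration, not the paper's ZOA.

```python
import torch

@torch.no_grad()
def zeroth_order_step(model, params, loss_fn, x, eps=1e-3, lr=1e-4):
    """One SPSA-style update of `params` using exactly two forward passes.
    `loss_fn` maps model outputs to a scalar (e.g., entropy for test-time adaptation)."""
    directions = [torch.randint_like(p, 0, 2) * 2 - 1 for p in params]   # random +/-1 signs

    for p, d in zip(params, directions):
        p.add_(eps * d)
    loss_plus = loss_fn(model(x))                  # forward pass 1 (theta + eps * d)

    for p, d in zip(params, directions):
        p.sub_(2 * eps * d)
    loss_minus = loss_fn(model(x))                 # forward pass 2 (theta - eps * d)

    grad_scale = (loss_plus - loss_minus) / (2 * eps)
    for p, d in zip(params, directions):
        p.add_(eps * d)                            # restore the original parameters
        p.sub_(lr * grad_scale * d)                # apply the estimated gradient step
    return float(loss_plus), float(loss_minus)
```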
Authors:Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun
Title: Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
Abstract:
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.
Authors:Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
Title: Decoupled Global-Local Alignment for Improving Compositional Understanding
Abstract:
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, this effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
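The frozen-teacher constraint can be illustrated with the usual exponential-moving-average update, where the teacher slowly tracks the learnable student encoder; the momentum value is an assumption, not the DeGLA default.

```python
import torch

@torch.no_grad()
def update_ema_teacher(student, teacher, momentum=0.999):
    """Move each teacher parameter a small step toward the student (EMA update)."""
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```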
Authors:Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
Title: HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
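A minimal sketch of the Decision Agent's consistency voting: normalize candidate answers from the retrieval agents, accept the majority answer when agreement is high enough, and otherwise hand the candidates to a refinement callable. The agreement threshold and the `refine_fn` hook are hypothetical stand-ins for the paper's Expert Model Refinement.

```python
from collections import Counter

def decision_agent(candidate_answers, refine_fn=None, min_agreement=0.5):
    """Merge answers from multiple retrieval agents by consistency voting."""
    votes = Counter(a.strip().lower() for a in candidate_answers)
    answer, count = votes.most_common(1)[0]
    if count / len(candidate_answers) >= min_agreement:
        return answer
    # agents disagree: defer to an expert model to resolve the discrepancy
    return refine_fn(candidate_answers) if refine_fn else answer
```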
Authors:Hyebin Cho, Jaehyup Lee
Title: Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation
Abstract:
Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a new large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
Authors:Feng Liu, Lingna Gu, Chen Shi, Xiaolan Fu
Title: Action Unit Enhance Dynamic Facial Expression Recognition
Abstract:
Dynamic Facial Expression Recognition (DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of Action Units (AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, this knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of an AU loss. The design is incorporated into existing top-performing models for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art (SOTA) methods without additional computational cost and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, i.e., minor-class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.
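One way to picture the AU loss is as an auxiliary target built from a prior AU-expression weight matrix: each expression label indexes a row of assumed AU contributions (normalized to [0, 1]), which supervises the network's AU predictions. The binary-cross-entropy form and the normalization are illustrative assumptions, not the paper's exact quantification.

```python
import torch.nn.functional as F

def au_prior_loss(au_logits, expr_labels, au_weight_matrix):
    """Auxiliary AU loss from prior knowledge.
    `au_logits`: [batch, num_aus]; `expr_labels`: [batch] long tensor;
    `au_weight_matrix`: [num_expressions, num_aus], values in [0, 1]."""
    target_au = au_weight_matrix[expr_labels]      # per-sample AU target profile
    return F.binary_cross_entropy_with_logits(au_logits, target_au)
```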
Authors:Peng Wang, Minh Huy Pham, Zhihao Guo, Wei Zhou
Title: A Spatial Relationship Aware Dataset for Robotics
Abstract:
Robotic task planning in real-world environments requires not only object recognition but also a nuanced understanding of spatial relationships between objects. We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. Captured using a Boston Dynamics Spot robot and labelled with a custom annotation tool, the dataset reflects complex scenarios with similar or identical objects and intricate spatial arrangements. We benchmark six state-of-the-art scene-graph generation models on this dataset, analysing their inference speed and relational accuracy. Our results highlight significant differences in model performance and demonstrate that integrating explicit spatial relationships into foundation models, such as ChatGPT 4o, substantially improves their ability to generate executable, spatially-aware plans for robotics. The dataset and annotation tool are publicly available at https://github.com/PengPaulWang/SpatialAwareRobotDataset, supporting further research in spatial reasoning for robotics.
Authors:Changsheng Gao, Wei Zhou, Guosheng Lin, Weisi Lin
Title: Compressed Feature Quality Assessment: Dataset and Baselines
Abstract:
The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process inevitably introduces semantic degradation that is difficult to quantify with traditional metrics. To address this, we formalize the research problem of Compressed Feature Quality Assessment (CFQA), aiming to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12,000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance degradation is provided as true semantic distortion for evaluating CFQA metrics. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. Our findings demonstrate the representativeness of the proposed dataset while underscoring the need for more sophisticated metrics capable of measuring semantic distortion in compressed features. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA. To foster further research, we release the dataset and all associated source code at https://github.com/chansongoal/Compressed-Feature-Quality-Assessment.
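The three baseline metrics assessed in the paper are all easy to state; the sketch below computes MSE, mean cosine similarity, and standard linear CKA between an original and a compressed feature batch of shape [n, d] (the batching convention is an assumption).

```python
import torch.nn.functional as F

def feature_quality_metrics(orig, comp):
    """MSE, mean cosine similarity, and linear CKA between feature batches [n, d]."""
    mse = F.mse_loss(comp, orig).item()
    cos = F.cosine_similarity(comp, orig, dim=-1).mean().item()

    x = orig - orig.mean(dim=0, keepdim=True)      # center over the sample axis
    y = comp - comp.mean(dim=0, keepdim=True)
    cka = (y.T @ x).norm() ** 2 / ((x.T @ x).norm() * (y.T @ y).norm())
    return {"mse": mse, "cosine": cos, "cka": cka.item()}
```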
Authors:Zhaochen Guo, Zhixiang Shen, Xuanting Xie, Liangjian Wen, Zhao Kang
Title: Disentangling Homophily and Heterophily in Multimodal Graph Clustering
Abstract:
Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework -- Disentangled Multimodal Graph Clustering (DMGC) -- which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a Multimodal Dual-frequency Fusion mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.
Authors:Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong
Title: RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data
Abstract:
Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch
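The retrieval step can be sketched as a plain cosine-similarity lookup: embed the tactile query, score it against a bank of tactile-recaptioned visual-textual embeddings, and keep the top-k entries for integration. The embedding shapes and the top-k rule are assumptions for illustration, not RA-Touch's exact encoders or integration module.

```python
import torch.nn.functional as F

def retrieve_tactile_context(query_emb, bank_embs, top_k=5):
    """Return indices and scores of the top-k bank entries most similar to the query.
    `query_emb`: [d]; `bank_embs`: [m, d]."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), bank_embs, dim=-1)   # [m]
    scores, idx = sims.topk(top_k)
    return idx, scores
```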
Authors:Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang
Title: CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation
Abstract:
Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address them, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as a guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.
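The entropy-based separation can be pictured as ranking training samples by the entropy of CLIP's zero-shot logits and splitting the ranking into a likely-clean and a suspect subset; the split ratio and the direction of the ranking are assumptions here, not CGD's exact criterion.

```python
def entropy_split(clip_logits, clean_ratio=0.5):
    """Split sample indices by the entropy of CLIP zero-shot logits [n, num_classes]."""
    probs = clip_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    order = entropy.argsort()                      # low entropy = confident CLIP prediction
    n_clean = int(clean_ratio * len(order))
    return order[:n_clean], order[n_clean:]        # (likely-clean, suspect) index tensors
```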
Authors:Qirui Yang, Fangpu Zhang, Yeying Jin, Qihua Cheng, Peng-Tao Jiang, Huanjing Yue, Jingyu Yang
Title: DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy
Abstract:
With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation and achieves an inference speed 2.4x faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at https://xxxxxxxxdsdnet.github.io/DSDNet/.
Authors:Ke Xing, Hanwen Liang, Dejia Xu, Yuyang Yin, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei
Title: TiP4GEN: Text to Immersive Panorama 4D Scene Generation
Abstract:
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at https://ke-xing.github.io/TiP4GEN/.
Authors:Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xiantao Hu, Liangtao Shi, Shuxiang Song
Title: Explicit Context Reasoning with Supervision for Visual Tracking
Abstract:
Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target's evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
Authors:Kun-Hsiang Lin, Yu-Wen Tseng, Kang-Yang Huang, Jhih-Ciang Wu, Wen-Huang Cheng
Title: InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Abstract:
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
Authors:Chen Qian, Danyang Li, Xinran Yu, Zheng Yang, Qiang Ma
Title: OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
Abstract:
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.
Authors:Na Zhang, Moran Li, Chengming Xu, Han Feng, Xiaobin Hu, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yanwei Fu
Title: StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
Abstract:
Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at https://github.com/fighting-Zhang/StrandDesigner.
Authors:Chunyan Wang, Dong Zhang, Jinhui Tang
Title: Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation
Abstract:
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at https://github.com/ChunyanWang1/DGKD-WLSS.
Authors:Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Abstract:
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
Authors:Zichi Liu, Yinggui Wang, Tao Wei, Chao Ma
Title: AnchorSync: Global Consistency Optimization for Long Video Editing
Abstract:
Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
Authors:Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang
Title: SLGaussian: Fast Language Gaussian Splatting in Sparse Views
Abstract:
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
Authors:Jiajing Lin, Zhenzhong Wang, Dejun Xu, Shu Jiang, YunPeng Gong, Min Jiang
Title: Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception
Abstract:
4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representation surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects' surfaces and interiors. Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.
Authors:Ruixiang Jiang, Changwen Chen
Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Abstract:
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact - the kind that resonates with viewers on a deeper, more meaningful level - remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these hallucinations can be suppressed by employing an evidence-based and objective reasoning process, as substantiated by our proposed baseline, ArtCoT. MLLMs prompted by this principle produce multifaceted, in-depth aesthetic reasoning that aligns significantly better with human judgment. These findings have direct applications in areas such as AI art tutoring and as reward models for image generation. Ultimately, we hope this work paves the way for AI systems that can truly understand, appreciate, and contribute to art that aligns with human aesthetic values. Project homepage: https://github.com/songrise/MLLM4Art.
Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
Title: SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
Abstract:
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our code has been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion
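The 'Forecast-then-verify' loop can be illustrated with a toy sketch. The PyTorch code below extrapolates the next feature from the last two fully computed references and accepts the forecast only when the predicted update is small relative to the current feature; both the linear forecaster and this acceptance test are illustrative stand-ins, not SpeCa's actual predictor or its parameter-free verification rule, and step_fn is a hypothetical per-step compute.

import torch
import torch.nn as nn

def forecast_then_verify(x, step_fn, num_steps, tol=0.05):
    """Toy forecast-then-verify loop over denoising steps.

    step_fn(x, t) is the expensive full computation of the next feature.
    """
    history = []                 # fully computed reference features
    full_calls = 0
    for t in range(num_steps):
        if len(history) >= 2:
            # Forecast the next feature by extrapolating the last two references.
            pred = history[-1] + (history[-1] - history[-2])
            # Cheap acceptance check: only accept small, smooth updates.
            rel = (pred - history[-1]).norm() / (history[-1].norm() + 1e-8)
            if rel < tol:
                x = pred
                history.append(pred)
                continue
        # Rejected (or not enough references yet): run the full step.
        x = step_fn(x, t)
        full_calls += 1
        history.append(x)
    return x, full_calls

# Usage with a dummy block standing in for one transformer denoising step.
block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
x0 = torch.randn(1, 64)
out, calls = forecast_then_verify(x0, lambda x, t: block(x), num_steps=50)
print(f"full forward passes: {calls}/50")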
Authors:Kotaro Kikuchi, Naoto Inoue, Mayu Otani, Edgar Simo-Serra, Kota Yamaguchi
Title: Multimodal Markup Document Models for Graphic Design Completion
Abstract:
This paper presents multimodal markup document models (MarkupDM) that can generate both markup language and images within interleaved multimodal documents. Unlike existing vision-and-language multimodal models, our MarkupDM tackles unique challenges critical to graphic design tasks: generating partial images that contribute to the overall appearance, often involving transparency and varying sizes, and understanding the syntax and semantics of markup languages, which play a fundamental role as a representational format of graphic designs. To address these challenges, we design an image quantizer to tokenize images of diverse sizes with transparency and modify a code language model to process markup languages and incorporate image modalities. We provide in-depth evaluations of our approach on three graphic design completion tasks: generating missing attribute values, images, and texts in graphic design templates. Results corroborate the effectiveness of our MarkupDM for graphic design tasks. We also discuss the strengths and weaknesses in detail, providing insights for future research on multimodal document generation.
Authors:Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun
Title: Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
Abstract:
Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method performs the intractable posterior sampling in Diffusion-DPO via deterministic inversion from winning and losing samples to noise, thereby deriving a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
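For readers unfamiliar with the DPO side of the method, the following minimal sketch shows the preference objective on paired samples, assuming per-sample log-likelihood terms for the winning and losing images are already available (in Inversion-DPO these would come from the deterministic DDIM-inversion reformulation, which is not reproduced here). The tensor names and the beta value are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization objective on paired samples.

    logp_* are (batch,) log-likelihood terms of the winning/losing images under
    the trained model and a frozen reference model.
    """
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with random stand-in log-likelihoods.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())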
Authors:Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Title: Towards Harmless Multimodal Assistants with Blind Preference Optimization
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
Authors:Shaochen Zhang, Zekun Qi, Runpei Dong, Xiuxiu Bai, Xing Wei
Title: Positional Prompt Tuning for Efficient 3D Representation Learning
Abstract:
We rethink the role of positional encoding in 3D representation learning and fine-tuning. We argue that using positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. Additionally, we explore parameter-efficient fine-tuning (PEFT) through the lens of prompts and adapters, introducing a straightforward yet effective method called PPT for point cloud analysis. PPT incorporates increased patch tokens and trainable positional encoding while keeping most pre-trained model parameters frozen. Extensive experiments validate that PPT is both effective and efficient. Our proposed PEFT method, PPT, with only 1.05M trainable parameters, achieves state-of-the-art results on several mainstream datasets, such as 95.01% accuracy on the ScanObjectNN OBJ_BG dataset. Codes and weights will be released at https://github.com/zsc000722/PPT.
Authors:Qingqing Fang, Wenxi Lv, Qinliang Su
Title: AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation
Abstract:
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often neglect to optimize visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios through extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
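The adapter-plus-aggregation idea admits a small sketch. Below is an illustrative PyTorch module (not the paper's exact design) that averages neighborhood context over a few window sizes, adapts patch tokens with a residual bottleneck projection, and scores each patch against two text prototypes standing in for the learnable normal/abnormal prompts; the dimensions, pooling scales, and class name MultiScaleAdapter are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAdapter(nn.Module):
    """Illustrative adapter: aggregate neighborhood context at several window
    sizes, then project with a small bottleneck MLP (dimensions are made up)."""
    def __init__(self, dim=768, scales=(1, 3, 5)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, dim))

    def forward(self, patch_tokens, grid):           # (B, N, D), N = grid * grid
        B, N, D = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(B, D, grid, grid)
        agg = sum(F.avg_pool2d(x, k, stride=1, padding=k // 2)
                  for k in self.scales) / len(self.scales)
        agg = agg.reshape(B, D, N).transpose(1, 2)
        return patch_tokens + self.proj(agg)          # residual adaptation

# Toy anomaly map from adapted patch features and two text prototypes.
B, grid, D = 1, 14, 768
patches = torch.randn(B, grid * grid, D)
text = F.normalize(torch.randn(2, D), dim=-1)         # [normal, abnormal] prompts
adapted = F.normalize(MultiScaleAdapter(D)(patches, grid), dim=-1)
logits = adapted @ text.t()                            # (B, N, 2)
anomaly_map = logits.softmax(-1)[..., 1].reshape(B, grid, grid)
print(anomaly_map.shape)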
Authors:Ling Yang, Bohan Zeng, Jiaming Liu, Hong Li, Minghao Xu, Wentao Zhang, Shuicheng Yan
Title: EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
Abstract:
Diffusion models have significantly improved the performance of image editing. Existing methods realize various approaches to achieve high-quality image editing, including but not limited to text control, dragging operation, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature in the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded by various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, our EditWorld trains the model on the curated dataset and improves its instruction-following ability with a designed post-edit strategy. Extensive experiments demonstrate our method significantly outperforms existing editing methods in this new task. Our dataset and code will be available at https://github.com/YangLing0818/EditWorld
Authors:Chenghanyu Zhang, Zekun Li, Peipei Li, Xing Cui, Shuhan Xia, Weixiang Yan, Yiqiao Zhang, Qianyu Zhuang
Title: SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis
Abstract:
With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance on spinal tasks, highlighting the limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.
Authors:Zijing Zhao, Zhu Xu, Qingchao Chen, Yuxin Peng, Yang Liu
Title: Investigating Domain Gaps for Indoor 3D Object Detection
Abstract:
As a fundamental task for indoor scene understanding, 3D object detection has been extensively studied, and the accuracy on indoor point cloud data has been substantially improved. However, existing research has been conducted on limited datasets, where the training and testing sets share the same distribution. In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. Since indoor point cloud datasets are collected and constructed in different ways, the object detectors are likely to overfit to specific factors within each dataset, such as point cloud quality, bounding box layout and instance features. We conduct experiments across datasets on different adaptation scenarios including synthetic-to-real adaptation, point cloud quality adaptation, layout adaptation and instance feature adaptation, analyzing the impact of different domain gaps on 3D object detectors. We also introduce several approaches to improve adaptation performance, providing baselines for domain adaptive indoor 3D object detection, hoping that future works may propose detectors with stronger generalization ability across domains. Our project homepage can be found at https://jeremyzhao1998.github.io/DAVoteNet-release/.
Authors:Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
Title: MultiRef: Controllable Image Generation with Multiple Visual References
Abstract:
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine, RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
Authors:Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
Title: FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
Abstract:
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: https://fantasy-amap.github.io/fantasy-talking/.
Authors:Qi Zheng, Haozhi Wang, Zihao Liu, Jiaming Liu, Peiye Liu, Zhijian Hao, Yanheng Lu, Dimin Niu, Jinjia Zhou, Minge Jing, Yibo Fan
Title: Unicorn: Unified Neural Image Compression with One Number Reconstruction
Abstract:
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is reaching an impasse, with bitrate reduction leveling off at the cost of tremendous complexity, while the latter suffers from excessively smoothed quality as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub Unicorn (Unified Neural Image Compression with One Number Reconstruction). By conceptualizing images as index-image pairs and learning the inherent distribution of these pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from randomly generated noise with only one index number. The neural model serves as the unified decoder of images, while the noises and indexes correspond to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitative and qualitative experimental results demonstrate that our prototype achieves significant bitrate reduction compared with EIC and IIC algorithms. More impressively, benefiting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. We will release our code at https://github.com/uniqzheng/Unicorn-Laduree.
Authors:Yue Guo, Haoxiang Liao, Haibin Ling, Bingyao Huang
Title: NeuroPump: Simultaneous Geometric and Color Rectification for Underwater Images
Abstract:
Underwater image restoration aims to remove geometric and color distortions due to water refraction, absorption and scattering. Previous studies focus on restoring either the color or the geometry, but to the best of our knowledge, not both. However, in practice it may be cumbersome to address the two rectifications one-by-one. In this paper, we propose NeuroPump, a self-supervised method to simultaneously optimize and rectify underwater geometry and color as if water were pumped out. The key idea is to explicitly model refraction, absorption and scattering in the Neural Radiance Field (NeRF) pipeline, such that it not only performs simultaneous geometric and color rectification, but also enables synthesis of novel views and optical effects by controlling the decoupled parameters. In addition, to address the lack of real paired ground-truth images, we propose an underwater 360 benchmark dataset that has real paired (i.e., with and without water) images. Our method clearly outperforms other baselines both quantitatively and qualitatively. Our project page is available at: https://ygswu.github.io/NeuroPump.github.io/.
Authors:Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, Liqiang Nie
Title: OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available at https://zivchen-ty.github.io/OFFSET.github.io/
Authors:Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Title: LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Abstract:
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
Authors:Xuewen Liu, Zhikai Li, Minhao Jiang, Mengjuan Chen, Jianquan Li, Qingyi Gu
Title: DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation
Abstract:
Model quantization is a promising method for accelerating and compressing diffusion models. Nevertheless, since post-training quantization (PTQ) fails catastrophically at low-bit cases, quantization-aware training (QAT) is essential. Unfortunately, the wide range and time-varying activations in diffusion models sharply increase the complexity of quantization, making existing QAT methods inefficient. Equivalent scaling can effectively reduce the activation range, but previous methods leave the overall quantization error unchanged. More critically, these methods significantly disrupt the original weight distribution, resulting in poor weight initialization and challenging convergence during QAT. In this paper, we propose a novel QAT framework for diffusion models, called DilateQuant. Specifically, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through equivalent scaling. WD decreases the activation range while preserving the original weight range, which steadily reduces the quantization error and ensures model convergence. To further enhance accuracy and efficiency, we design a Temporal Parallel Quantizer (TPQ) to address the time-varying activations and introduce a Block-wise Knowledge Distillation (BKD) to reduce resource consumption in training. Extensive experiments demonstrate that DilateQuant significantly outperforms existing methods in terms of accuracy and efficiency. Code is available at http://github.com/BienLuky/DilateQuant.
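Equivalent scaling itself is easy to verify numerically. The sketch below scales each input channel of a linear layer's weight up while dividing the corresponding activation channel by the same factor, which leaves the layer output unchanged while shrinking (or preserving) the activation range; the rule used here to pick the scales (pushing small-magnitude weight channels toward a shared bound) is only an illustration of the dilation idea, not the paper's exact Weight Dilation algorithm.

import torch

torch.manual_seed(0)
X = torch.randn(8, 16)           # activations: (batch, in_features)
W = torch.randn(32, 16) * 0.1    # weights: (out_features, in_features)

# Illustrative scale rule: dilate under-utilized weight channels toward a
# shared bound; clamping keeps every scale >= 1 so activations never grow.
bound = W.abs().max()
per_channel = W.abs().amax(dim=0)              # max |w| per input channel
s = (bound / per_channel).clamp(min=1.0)

W_scaled = W * s                 # dilated weights (per in-channel scaling)
X_scaled = X / s                 # inversely scaled activations

out_ref = X @ W.t()
out_eq = X_scaled @ W_scaled.t()
print(torch.allclose(out_ref, out_eq, atol=1e-5))          # True: output preserved
print(X.abs().max().item(), X_scaled.abs().max().item())    # activation range shrinks or stays put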
Authors:Xihang Hu, Fuming Sun, Jiazhe Liu, Feilong Xu, Xiaoli Zhang
Title: ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection
Abstract:
Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.
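A minimal sketch of the self-training side of such a framework might look as follows: predictions on unlabeled images are filtered by a confidence rule, and each kept mask is turned into a box prompt as a crude stand-in for the hybrid prompts fed back to the model. The thresholds, the confidence rule, and the prompt format are all assumptions, and SAM itself is not invoked here.

import torch

def mask_to_box(mask):
    """Bounding-box prompt from a binary mask (a stand-in for the hybrid prompts)."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return torch.stack([xs.min(), ys.min(), xs.max(), ys.max()])

def select_pseudo_labels(prob_maps, conf_thresh=0.9, min_area=50):
    """Keep predictions whose confident region is large and unambiguous."""
    selected = []
    for i, p in enumerate(prob_maps):                 # p: (H, W) probabilities
        confident = (p > conf_thresh) | (p < 1 - conf_thresh)
        mask = p > 0.5
        if confident.float().mean() > 0.95 and mask.sum() > min_area:
            selected.append((i, mask, mask_to_box(mask)))
    return selected

# Toy usage with random "model predictions" on unlabeled images.
probs = torch.rand(4, 64, 64)
probs[0, 20:40, 20:40] = 0.99           # one image with a clearly confident object
probs[0][probs[0] < 0.9] = 0.01         # and confidently predicted background
pseudo = select_pseudo_labels(probs)
print(f"{len(pseudo)} pseudo-labeled images kept for the next self-training round")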
Authors:Bolun Zheng, Xinjie Liu, Qianyu Zhang, Canjin Wang, Fangni Chen, Mingen Xu
Title: EHPE: A Segmented Architecture for Enhanced Hand Pose Estimation
Abstract:
3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. The accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of the Distal Phalanx Tip (TIP) and Wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of the TIP and wrist, thus alleviating the effect of error accumulation on TIP prediction and further reducing the predictive errors for all joints on this basis. EHPE consists of two key stages: In the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; In the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance. Code is available at https://github.com/SereinNout/EHPE.
Authors:Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
Title: TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Abstract:
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
Authors:Jinhong He, Minglong Xue, Zhipu Liu, Mingliang Zhou, Aoxiang Ning, Palaiahnakote Shivakumara
Title: Degradation-Consistent Learning via Bidirectional Diffusion for Low-Light Image Enhancement
Abstract:
Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion (from low to normal light and from normal to low light) during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model's ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code is available at https://github.com/hejh8/BidDiff
Authors:Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan
Title: LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning
Abstract:
While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.
Authors:Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin
Title: LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
Abstract:
The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations on scale or alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs' understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models in terms of perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.
Authors:Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Title: ICAS: Detecting Training Data from Autoregressive Image Generative Models
Abstract:
Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, our experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than that from other autoregressive paradigms. Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
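The adaptive score aggregation step can be sketched in a few lines. Given per-token scores for a query image (random stand-ins below for the implicit classification scores), the sketch sorts them and gives the lowest-scoring tokens larger weights before averaging into one membership score; the exponential rank weighting and the alpha value are illustrative choices, not the paper's exact scheme.

import torch

def adaptive_aggregate(token_scores, alpha=2.0):
    """Aggregate per-token scores, emphasizing the lowest-scoring tokens.

    token_scores: (num_tokens,) implicit classification scores of one image.
    """
    sorted_scores, _ = torch.sort(token_scores)                   # ascending
    ranks = torch.arange(len(sorted_scores), dtype=torch.float)
    weights = torch.exp(-alpha * ranks / len(sorted_scores))      # low scores get big weights
    weights = weights / weights.sum()
    return (weights * sorted_scores).sum()

# A member-like image (higher token scores) vs. a non-member-like one.
member = adaptive_aggregate(torch.rand(256) * 0.4 + 0.6)
non_member = adaptive_aggregate(torch.rand(256) * 0.4)
print(member.item(), non_member.item())   # higher final score => more likely in the training set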
Authors:Zhihan Zhang, Yixin Cao, Lizi Liao
Title: Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement
Abstract:
Translating chart images into executable plotting scripts-referred to as the chart-to-code generation task-requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs-making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.
Authors:Zhiwen Yang, Yuxin Peng
Title: SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion
Abstract:
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.
Authors:Meng Yu, Kun Zhan
Title: Frequency Regulation for Exposure Bias Mitigation in Diffusion Models
Abstract:
Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of predicted noisy samples in the reverse process continuously declines compared to perturbed samples in the forward process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) The subband energy of reverse-process reconstructed samples is consistently lower than that of forward-process ones, and both are lower than the original data samples. Based on the first finding, we introduce a dynamic frequency regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we derive the rigorous mathematical form of exposure bias. It is worth noting that our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and frameworks with negligible computational cost. The source code is available at https://github.com/kunzhan/wpp.
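To make the subband idea concrete, the sketch below applies a one-level 2D Haar transform, rescales the low- and high-frequency subbands by separate gains, and inverts the transform; the Haar choice and the fixed gain values are illustrative only, whereas the paper derives its regulation dynamically from the exposure-bias analysis.

import torch

def haar_dwt2(x):
    """One-level 2D Haar transform of a (B, C, H, W) tensor (H, W even)."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (exact reconstruction)."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    x = torch.zeros(*ll.shape[:-2], ll.shape[-2] * 2, ll.shape[-1] * 2, dtype=ll.dtype)
    x[..., 0::2, 0::2] = a; x[..., 0::2, 1::2] = b
    x[..., 1::2, 0::2] = c; x[..., 1::2, 1::2] = d
    return x

def regulate(x_pred, low_gain=1.02, high_gain=1.05):
    """Boost low- and high-frequency subbands of a predicted sample by separate
    (illustrative) gains to compensate the observed energy decline."""
    ll, lh, hl, hh = haar_dwt2(x_pred)
    return haar_idwt2(ll * low_gain, lh * high_gain, hl * high_gain, hh * high_gain)

x = torch.randn(1, 3, 64, 64)
y = regulate(x)
print(torch.allclose(haar_idwt2(*haar_dwt2(x)), x, atol=1e-5))   # perfect reconstruction
print(x.pow(2).mean().item(), y.pow(2).mean().item())            # energy increased after regulation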
Authors:Wenhao Zheng, Chenwei Sun, Wenbo Zhang, Jiancheng Lv, Xianggen Liu
Title: Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation
Abstract:
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.
Authors:Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Title: Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
Abstract:
Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control over the generation process and correction of the results. Moreover, we jointly optimize the holistic framework to enable streaming processing, real-time inference, and low first-frame delay, offering functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
Authors:Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Title: Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
Abstract:
Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
Authors:Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng, Xiang Tao, Ting Yu, Yan Wang, Qingli Li
Title: Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification
Abstract:
Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level Placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.
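A generic two-stage patch compression can be sketched as follows: a parameter-free filter keeps the highest-variance patch embeddings, and a learnable step compresses the survivors onto a small set of query tokens via cross-attention. Both stages, the dimensions, and the class name TwoStagePatchCompressor are illustrative stand-ins rather than the paper's actual modules.

import torch
import torch.nn as nn

class TwoStagePatchCompressor(nn.Module):
    def __init__(self, dim=512, keep=1024, num_queries=64):
        super().__init__()
        self.keep = keep
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patches):                      # (B, N, D) patch features of one WSI
        # Stage 1 (parameter-free): keep the highest-variance patches.
        var = patches.var(dim=-1)                    # (B, N)
        idx = var.topk(self.keep, dim=1).indices
        kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        # Stage 2 (learnable): compress the kept patches onto a few query tokens.
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        compressed, _ = self.attn(q, kept, kept)
        return compressed                            # (B, num_queries, D)

wsi_patches = torch.randn(1, 20000, 512)             # toy bag of patch embeddings
print(TwoStagePatchCompressor()(wsi_patches).shape)  # torch.Size([1, 64, 512])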
Paperid: 100,   https://arxiv.org/pdf/2505.11013     GitHub
Authors:Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang
Title: Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
Abstract:
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff
Paperid: 101,   https://arxiv.org/pdf/2504.17414     GitHub
Authors:Min Wei, Chaohui Yu, Jingkai Zhou, Fan Wang
Title: 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
Abstract:
Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at https://2y7c3.github.io/3DV-TON/
Paperid: 102,   https://arxiv.org/pdf/2509.12653     GitHub
Authors:Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
Title: Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Abstract:
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
Paperid: 103,   https://arxiv.org/pdf/2507.08557     GitHub
Authors:Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu
Title: FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Abstract:
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
Paperid: 104,   https://arxiv.org/pdf/2510.00647     GitHub
Authors:Jinlan Fu, Shenzhen Huangfu, Hao Fei, Yichong Huang, Xiaoyu Shen, Xipeng Qiu, See-Kiong Ng
Title: MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation
Abstract:
The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO
Paperid: 105,   https://arxiv.org/pdf/2507.09747     GitHub
Authors:Dongyang Li, Haoyang Qin, Mingyang Wu, Chen Wei, Quanying Liu
Title: BrainFLORA: Uncovering Brain Concept Representation via Multimodal Neural Embeddings
Abstract:
Understanding how the brain represents visual information is a fundamental challenge in neuroscience and artificial intelligence. While AI-driven decoding of neural data has provided insights into the human visual system, integrating multimodal neuroimaging signals, such as EEG, MEG, and fMRI, remains a critical hurdle due to their inherent spatiotemporal misalignment. Current approaches often analyze these modalities in isolation, limiting a holistic view of neural representation. In this study, we introduce BrainFLORA, a unified framework for integrating cross-modal neuroimaging data to construct a shared neural representation. Our approach leverages multimodal large language models (MLLMs) augmented with modality-specific adapters and task decoders, achieving state-of-the-art performance in the joint-subject visual retrieval task, with the potential to extend to multitasking. Combining neuroimaging analysis methods, we further reveal how visual concept representations align across neural modalities and with real-world object perception. We demonstrate that the brain's structured visual concept representations exhibit an implicit mapping to physical-world stimuli, bridging neuroscience and machine learning from different modalities of neural imaging. Beyond methodological advancements, BrainFLORA offers novel implications for cognitive neuroscience and brain-computer interfaces (BCIs). Our code is available at https://github.com/ncclab-sustech/BrainFLORA.
Paperid: 106,   https://arxiv.org/pdf/2504.12576     GitHub
Authors:Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu
Title: CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework
Abstract:
Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-Event fusion-based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for pre-training. Extensive experiments on five downstream tasks fully demonstrate the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.
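The contrastive alignment component can be pictured as a symmetric InfoNCE loss between paired RGB and event embeddings. The sketch below is a generic formulation under that assumption, not the released CM3AE code; batch and feature sizes are illustrative.

```python
# Illustrative sketch of cross-modal contrastive alignment: a symmetric
# InfoNCE loss between L2-normalized RGB and event embeddings.
import torch
import torch.nn.functional as F

def contrastive_align(rgb_feat, event_feat, temperature=0.07):
    rgb = F.normalize(rgb_feat, dim=-1)       # (B, D)
    evt = F.normalize(event_feat, dim=-1)     # (B, D)
    logits = rgb @ evt.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Matching pairs sit on the diagonal; pull them together in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_align(torch.randn(8, 256), torch.randn(8, 256))
```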
Paperid: 107,   https://arxiv.org/pdf/2507.21977     GitHub
Authors:Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang
Title: Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition
Abstract:
Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.
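The general idea of motion-guided modulation can be sketched as follows; this is an assumption about the mechanism (frame-to-frame joint differences used as FiLM-style scale and shift signals), not the released MMN code, and all shapes are illustrative.

```python
# Assumed sketch: derive motion cues as frame-to-frame joint differences and
# use them to modulate skeleton features with learned scale/shift parameters.
import torch
import torch.nn as nn

class MotionModulation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Linear(3, channels)  # motion (dx, dy, dz) -> scale
        self.to_shift = nn.Linear(3, channels)  # motion (dx, dy, dz) -> shift

    def forward(self, feats, joints):
        # feats:  (B, T, J, C) skeletal features
        # joints: (B, T, J, 3) joint coordinates
        motion = joints[:, 1:] - joints[:, :-1]              # (B, T-1, J, 3)
        motion = torch.cat([motion[:, :1], motion], dim=1)   # pad first frame
        return feats * (1 + self.to_scale(motion)) + self.to_shift(motion)

mod = MotionModulation(64)
out = mod(torch.randn(2, 16, 25, 64), torch.randn(2, 16, 25, 3))
```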
Paperid: 108,   https://arxiv.org/pdf/2408.10883     GitHub
Authors:Xinqi Su, Zitong Yu, Yawen Cui, Ajian Liu, Xun Lin, Yuhao Wang, Haochen Liang, Wenhui Li, Li Shen, Xiaochun Cao
Title: Dynamic Analysis and Adaptive Discriminator for Fake News Detection
Abstract:
In the current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection methods can generally be classified into knowledge-based and semantic-based approaches. However, these methods rely heavily on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM commentary on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using a soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach. The code will be available at: https://github.com/SuXinqi/DAAD.
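A soft-routing mixture over specialized discriminators can be illustrated with a short sketch; the names, shapes, and the use of plain linear discriminators below are hypothetical, chosen only to show the gating idea.

```python
# Hypothetical sketch of soft routing over deceit-pattern discriminators:
# a gating network produces per-sample mixture weights and the final score
# is the weighted combination of the discriminator outputs.
import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    def __init__(self, dim, discriminators):
        super().__init__()
        self.discriminators = nn.ModuleList(discriminators)
        self.gate = nn.Linear(dim, len(discriminators))

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                      # (B, K)
        scores = torch.stack([d(x) for d in self.discriminators], dim=-1)  # (B, 1, K)
        return (scores * weights.unsqueeze(1)).sum(dim=-1)                 # (B, 1)

router = SoftRouter(128, [nn.Linear(128, 1) for _ in range(4)])
fake_score = router(torch.randn(8, 128))
```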
Paperid: 109,   https://arxiv.org/pdf/2502.14834     GitHub
Authors:Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Abstract:
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
Paperid: 110,   https://arxiv.org/pdf/2504.12704     GitHub
Authors:Qianqian Sun, Jixiang Luo, Dell Zhang, Xuelong Li
Title: SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Abstract:
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
Paperid: 111,   https://arxiv.org/pdf/2507.04061     GitHub
Authors:Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Title: Consistent and Invariant Generalization Learning for Short-video Misinformation Detection
Abstract:
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) we involve cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) we design a diffusion model that adds noise and then, through cross-modal guided denoising, retains the core multi-modal features and enhances domain-invariant features. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
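One plausible reading of cross-modal feature interpolation is a mixup-style blend of the two modalities in a shared space; the sketch below is only that assumption made concrete, not the DOCTOR implementation.

```python
# Assumed sketch of cross-modal feature interpolation: mix video and audio
# features (already projected into a shared space) with a Beta-sampled ratio,
# so training sees points between the two modalities.
import torch

def interpolate_modalities(video_feat, audio_feat, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * video_feat + (1 - lam) * audio_feat

mixed = interpolate_modalities(torch.randn(8, 512), torch.randn(8, 512))
```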
Paperid: 112,   https://arxiv.org/pdf/2503.08116     GitHub
Authors:Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Title: ACE: Concept Editing in Diffusion Models without Performance Degradation
Abstract:
Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at https://github.com/littlelittlenine/ACE-zero.git
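To make the null-space idea concrete, here is the generic building block such methods rely on: projecting a weight update onto the null space of preserved-concept embeddings so those concepts are unaffected. This is an assumption about the flavor of technique, not ACE's exact cross null-space formulation.

```python
# Generic null-space projection sketch (not ACE's exact method): project a
# weight update onto the null space of preserved-concept embeddings K, so the
# edited layer responds identically to those preserved directions.
import torch

def project_to_null_space(delta_w, K):
    # K: (n_preserved, d); rows are embeddings whose outputs must not change.
    # P = I - K^T (K K^T)^-1 K projects any vector onto the null space of K.
    KKt_inv = torch.linalg.pinv(K @ K.t())
    P = torch.eye(K.size(1)) - K.t() @ KKt_inv @ K
    return delta_w @ P  # the update no longer affects preserved directions

delta = project_to_null_space(torch.randn(320, 768), torch.randn(10, 768))
```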
Paperid: 113,   https://arxiv.org/pdf/2505.22053     GitHub
Authors:Yan Rong, Jinting Wang, Guangzhi Lei, Shan Yang, Li Liu
Title: AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation
Abstract:
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling the above issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for detailed, comprehensive multimodal understanding and dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that our AudioGenie achieves state-of-the-art (SOTA) or comparable performance across 9 metrics in 8 tasks. A user study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetics. The project website with audio samples can be found at https://audiogenie.github.io/.
Paperid: 114,   https://arxiv.org/pdf/2508.16217     GitHub
Authors:Hohyun Na, Seunghoo Hong, Simon S. Woo
Title: PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting
Abstract:
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
Paperid: 115,   https://arxiv.org/pdf/2504.12867     GitHub
Authors:Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Title: EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Abstract:
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
Paperid: 116,   https://arxiv.org/pdf/2507.23219     GitHub
Authors:Yang Ren, Hai Jiang, Wei Li, Menglong Yang, Heng Zhang, Zehua Sheng, Qingsheng Ye, Shuaicheng Liu
Title: Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction
Abstract:
Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless attribute of the wavelet transform to perform arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve the structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss that aligns high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and combine it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
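The information-preserving property the framework builds on can be demonstrated with a one-level 2D wavelet decomposition and exact reconstruction in PyWavelets; the array here is a stand-in for a RAW channel, and the LASDM/HFPM modules themselves are not shown.

```python
# One-level 2D wavelet decomposition and exact reconstruction with PyWavelets,
# illustrating the lossless transform underlying the framework.
import numpy as np
import pywt

raw = np.random.rand(256, 256).astype(np.float64)   # stand-in for one RAW channel
cA, (cH, cV, cD) = pywt.dwt2(raw, "haar")            # low-freq + 3 high-freq bands
recon = pywt.idwt2((cA, (cH, cV, cD)), "haar")       # inverse transform

print(np.allclose(raw, recon))   # True: the decomposition loses no information
```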
Paperid: 117,   https://arxiv.org/pdf/2504.16455     GitHub
Authors:Shun Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guojun Qi
Title: Cross Paradigm Representation and Alignment Transformer for Image Deraining
Abstract:
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.
Paperid: 118,   https://arxiv.org/pdf/2507.04630     GitHub
Authors:Shengli Zhou, Yang Liu, Feng Zheng
Title: Learn 3D VQA Better with Active Selection and Reannotation
Abstract:
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.
Paperid: 119,   https://arxiv.org/pdf/2507.05939     GitHub
Authors:Bing Wang, Ximing Li, Mengzhe Ye, Changchun Li, Bo Fu, Jianfeng Qu, Lin Yuanbo Wu
Title: Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors
Abstract:
Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.
Paperid: 120,   https://arxiv.org/pdf/2507.07015     GitHub
Authors:Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
Title: MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
Abstract:
Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
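Instance-level routing over a teacher ensemble can be sketched as a routed distillation loss; the code below is a hypothetical illustration of that idea (router outputs, temperatures, and shapes are assumptions), not the MST-Distill implementation.

```python
# Hypothetical sketch of instance-level routing over specialized teachers:
# per-sample routing weights mix the teachers' softened predictions, and the
# student is distilled toward the routed mixture.
import torch
import torch.nn.functional as F

def routed_distillation(student_logits, teacher_logits_list, route_weights, T=4.0):
    # student_logits: (B, C); teacher_logits_list: K tensors of (B, C);
    # route_weights: (B, K) softmax outputs of an instance-level router.
    teachers = torch.stack(teacher_logits_list, dim=1)                     # (B, K, C)
    mixture = (route_weights.unsqueeze(-1) * (teachers / T).softmax(-1)).sum(1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, mixture, reduction="batchmean") * T * T

loss = routed_distillation(torch.randn(8, 10),
                           [torch.randn(8, 10) for _ in range(3)],
                           torch.softmax(torch.randn(8, 3), dim=-1))
```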
Paperid: 121,   https://arxiv.org/pdf/2507.07708     GitHub
Authors:Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Title: Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring
Abstract:
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting 3×3 convolutions to computationally efficient 1×1 convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
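The reason 1×1 convolutions enable pixel-level pruning is that they act independently per pixel, so computation can be restricted to the pixels the mask predictor marks as blurred. The gather-and-scatter sketch below illustrates that point only; it is not the paper's code, and in practice sharp pixels would be copied through rather than zeroed.

```python
# Illustrative sketch: a 1x1 conv is a per-pixel linear map, so it can be
# evaluated only on masked (blurred) pixels.
import torch
import torch.nn as nn

def masked_pointwise_conv(x, mask, conv1x1: nn.Conv2d):
    # x: (B, C, H, W); mask: (B, 1, H, W) with 1 = blurred pixel.
    B, C, H, W = x.shape
    flat = x.permute(0, 2, 3, 1).reshape(-1, C)            # (B*H*W, C)
    idx = mask.reshape(-1).bool()                          # pixels to process
    weight = conv1x1.weight.view(conv1x1.out_channels, C)  # (C_out, C)
    out = torch.zeros(flat.size(0), conv1x1.out_channels)
    out[idx] = flat[idx] @ weight.t() + conv1x1.bias       # only blurred pixels
    return out.view(B, H, W, -1).permute(0, 3, 1, 2)

conv = nn.Conv2d(64, 64, kernel_size=1)
y = masked_pointwise_conv(torch.randn(2, 64, 32, 32),
                          torch.randint(0, 2, (2, 1, 32, 32)), conv)
```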
Paperid: 122,   https://arxiv.org/pdf/2512.02792     GitHub
Authors:Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, Weili Guan
Title: HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval
Abstract:
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.
Paperid: 123,   https://arxiv.org/pdf/2504.13072     GitHub
Authors:Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, Zhaopeng Cui
Title: HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
Abstract:
Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
Paperid: 124,   https://arxiv.org/pdf/2502.21291     GitHub
Authors:Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
Title: MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
Abstract:
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation, then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a SOTA in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
Paperid: 125,   https://arxiv.org/pdf/2406.09181     GitHub
Authors:Yijun Bei, Hengrui Lou, Jinsong Geng, Erteng Liu, Lechao Cheng, Jie Song, Mingli Song, Zunlei Feng
Title: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection
Abstract:
With the rapid development of AI-generated content (AIGC) technology, the production of realistic fake facial images and videos that deceive human visual perception has become possible. Consequently, various face forgery detection techniques have been proposed to identify such fake facial content. However, evaluating the effectiveness and generalizability of these detection techniques remains a significant challenge. To address this, we have constructed a large-scale evaluation benchmark called DeepFaceGen, aimed at quantitatively assessing the effectiveness of face forgery detection and facilitating the iterative development of forgery detection technology. DeepFaceGen consists of 776,990 real face image/video samples and 773,812 face forgery image/video samples, generated using 34 mainstream face generation techniques. During the construction process, we carefully consider important factors such as content diversity, fairness across ethnicities, and availability of comprehensive labels, in order to ensure the versatility and convenience of DeepFaceGen. Subsequently, DeepFaceGen is employed in this study to evaluate and analyze the performance of 13 mainstream face forgery detection techniques from various perspectives. Through extensive experimental analysis, we derive significant findings and propose potential directions for future research. The code and dataset for DeepFaceGen are available at https://github.com/HengruiLou/DeepFaceGen.
Paperid: 126,   https://arxiv.org/pdf/2510.06619     GitHub
Authors:Tao Feng, Tingfa Xu, Haolin Qin, Tianhao Li, Shuaihao Han, Xuyang Zou, Zhan Lv, Jianan Li
Title: MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking
Abstract:
Visual object tracking in real-world scenarios presents numerous challenges, including occlusion, interference from similar objects, and complex backgrounds, all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes, including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes, spanning 55 object categories and 300 distinct natural scenes, far exceeding the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale, with 300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.
Paperid: 127,   https://arxiv.org/pdf/2506.00991     GitHub
Authors:Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
Title: GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
Abstract:
The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities concerning fine-grained physical principles, especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curate high-quality prompts of geometric optical scenarios and use MLLMs to construct the GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding. Database and codes are publicly available at https://github.com/aiben-ch/GOBench.
Paperid: 128,   https://arxiv.org/pdf/2508.07554     GitHub
Authors:Xusheng He, Wei Liu, Shanshan Ma, Qian Liu, Chenghao Ma, Jianlong Wu
Title: FineBadminton: A Multi-Level Dataset for Fine-Grained Badminton Video Understanding
Abstract:
Fine-grained analysis of complex and high-speed sports like badminton presents a significant challenge for Multimodal Large Language Models (MLLMs), despite their notable advancements in general video understanding. This difficulty arises primarily from the scarcity of datasets with sufficiently rich and domain-specific annotations. To bridge this gap, we introduce FineBadminton, a novel and large-scale dataset featuring a unique multi-level semantic annotation hierarchy (Foundational Actions, Tactical Semantics, and Decision Evaluation) for comprehensive badminton understanding. The construction of FineBadminton is powered by an innovative annotation pipeline that synergistically combines MLLM-generated proposals with human refinement. We also present FBBench, a challenging benchmark derived from FineBadminton, to rigorously evaluate MLLMs on nuanced spatio-temporal reasoning and tactical comprehension. Together, FineBadminton and FBBench provide a crucial ecosystem to catalyze research in fine-grained video understanding and advance the development of MLLMs in sports intelligence. Furthermore, we propose an optimized baseline approach incorporating Hit-Centric Keyframe Selection to focus on pivotal moments and Coordinate-Guided Condensation to distill salient visual information. The results on FBBench reveal that while current MLLMs still face significant challenges in deep sports video analysis, our proposed strategies nonetheless achieve substantial performance gains. The project homepage is available at https://finebadminton.github.io/FineBadminton/.
Paperid: 129,   https://arxiv.org/pdf/2508.15429     GitHub
Authors:Yulin Sun, Qisheng Xu, Yi Su, Qian Zhu, Yong Dou, Xinwang Liu, Kele Xu
Title: AudioSet-R: A Refined AudioSet with Multi-Stage LLM Label Reannotation
Abstract:
AudioSet is a widely used benchmark in the audio research community and has significantly advanced various audio-related tasks. However, persistent issues with label accuracy and completeness remain critical bottlenecks that limit performance in downstream applications. To address these challenges, we propose a three-stage reannotation framework that harnesses general-purpose audio-language foundation models to systematically improve the label quality of AudioSet. The framework employs a cross-modal prompting strategy, inspired by the concept of prompt chaining, wherein prompts are sequentially composed to execute subtasks (audio comprehension, label synthesis, and semantic alignment). Leveraging this framework, we construct AudioSet-R, a high-quality, structured relabeled version of AudioSet. Extensive experiments conducted on representative audio classification models, including AST, PANNs, SSAST, and AudioMAE, consistently demonstrate substantial performance improvements, thereby validating the generalizability and effectiveness of the proposed approach in enhancing label reliability. The code is publicly available at: https://github.com/colaudiolab/AudioSet-R.
Paperid: 130,   https://arxiv.org/pdf/2506.18372     GitHub
Authors:Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
Abstract:
We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.
Paperid: 131,   https://arxiv.org/pdf/2509.05592     GitHub
Authors:Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, Joey Tianyi Zhou, Xiaoshuai Hao
Title: MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios
Abstract:
Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of unknown advanced forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. MFFI enhances realism along four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates 50 different forgery methods and contains 1024K image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at https://github.com/inclusionConf/MFFI.
Paperid: 132,   https://arxiv.org/pdf/2506.12830     GitHub
Authors:Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, Kai Zhang
Title: ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies
Abstract:
Text-driven image editing has achieved remarkable success in following single instructions. However, real-world scenarios often involve complex, multi-step instructions, particularly ``chain'' instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. ComplexBench-Edit also features a new vision consistency evaluation method that accurately assesses non-modified regions by excluding edited areas. Furthermore, we propose a simple yet powerful Chain-of-Thought (CoT)-based approach that significantly enhances the ability of existing models to follow complex instructions. Our extensive experiments demonstrate ComplexBench-Edit's efficacy in differentiating model capabilities and highlight the superior performance of our CoT-based method in handling complex edits. The data and code are released at https://github.com/llllly26/ComplexBench-Edit.
Paperid: 133,   https://arxiv.org/pdf/2506.23852     GitHub
Authors:Jianing Jin, Jiangyong Ying, Huiyu Duan, Liu Yang, Sijing Wu, Yunhao Li, Yushuo Zheng, Xiongkuo Min, Guangtao Zhai
Title: RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment
Abstract:
As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We innovatively propose the concept of Robotic-Generated Content (RGC) to term these videos generated from egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is conducted subsequently to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: https://github.com/IntMeGroup/RGC-VQA.
Paperid: 134,   https://arxiv.org/pdf/2409.14072     GitHub
Authors:Shuai Zhang, Guanjun Wu, Zhoufeng Xie, Xinggang Wang, Bin Feng, Wenyu Liu
Title: Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects
Abstract:
Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects, but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for basic geometry representation and use sparse-controlled points to capture the 2D Gaussian's deformation. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, we remove floaters that are prone to occur during reconstruction and can extract high-quality dynamic mesh sequences of dynamic objects. Experiments demonstrate that our D-2DGS is outstanding in reconstructing detailed and smooth high-quality meshes from sparse inputs. The code is available at https://github.com/hustvl/Dynamic-2DGS.
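The floater-removal step can be pictured with a trivial masking operation; this is an assumed illustration of the described idea (mask the rendered depth with the object mask), not the released D-2DGS code.

```python
# Simple illustration of the floater-removal step: keep rendered depth only
# where the object mask is active, so spurious Gaussians outside the object
# do not enter the extracted mesh.
import numpy as np

def mask_depth(depth, object_mask, background=np.inf):
    # depth: (H, W) rendered depth map; object_mask: (H, W) boolean mask
    return np.where(object_mask, depth, background)

clean_depth = mask_depth(np.random.rand(480, 640),
                         np.random.rand(480, 640) > 0.5)
```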
Paperid: 135,   https://arxiv.org/pdf/2508.02113     GitHub
Authors:Yihang Huang, Yuanfei Huang, Junhui Lin, Hua Huang
Title: DeflareMamba: Hierarchical Vision Mamba for Contextually Consistent Lens Flare Removal
Abstract:
Lens flare removal remains challenging because of the information confusion between the underlying image background and the optical flares, caused by complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling the flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. In particular, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.
Paperid: 136,   https://arxiv.org/pdf/2504.12782     GitHub
Authors:Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong
Title: Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
Abstract:
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
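The stated insight about reversing the condition direction during mid-to-late denoising can be roughly illustrated as a sign flip in classifier-free guidance after a timestep threshold; this is an assumption-laden sketch, not the released ANT code, and the threshold and scale values are illustrative.

```python
# Rough sketch: keep standard classifier-free guidance during early denoising
# so structure is preserved, then reverse the condition direction for the
# unwanted concept in mid-to-late steps to steer away from it.
import torch

def guided_noise(eps_uncond, eps_cond, t, t_switch=600, scale=7.5):
    # t counts down from T to 0; t_switch and scale are illustrative values.
    direction = eps_cond - eps_uncond
    sign = 1.0 if t > t_switch else -1.0
    return eps_uncond + sign * scale * direction

eps = guided_noise(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64), t=300)
```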
Paperid: 137,   https://arxiv.org/pdf/2508.02538     GitHub
Authors:Zhengxin Pan, Haishuai Wang, Fangyu Wu, Peng Zhang, Jiajun Bu
Title: Hubness Reduction with Dual Bank Sinkhorn Normalization for Cross-Modal Retrieval
Abstract:
The past decade has witnessed rapid advancements in cross-modal retrieval, with significant progress made in accurately measuring the similarity between cross-modal pairs. However, the persistent hubness problem, a phenomenon where a small number of targets frequently appear as nearest neighbors to numerous queries, continues to hinder the precision of similarity measurements. Despite several proposed methods to reduce hubness, their underlying mechanisms remain poorly understood. To bridge this gap, we analyze the widely-adopted Inverted Softmax approach and demonstrate its effectiveness in balancing target probabilities during retrieval. Building on these insights, we propose a probability-balancing framework for more effective hubness reduction. We contend that balancing target probabilities alone is inadequate and, therefore, extend the framework to balance both query and target probabilities by introducing Sinkhorn Normalization (SN). Notably, we extend SN to scenarios where the true query distribution is unknown, showing that current methods, which rely solely on a query bank to estimate target hubness, produce suboptimal results due to a significant distributional gap between the query bank and targets. To mitigate this issue, we introduce Dual Bank Sinkhorn Normalization (DBSN), incorporating a corresponding target bank alongside the query bank to narrow this distributional gap. Our comprehensive evaluation across various cross-modal retrieval tasks, including image-text retrieval, video-text retrieval, and audio-text retrieval, demonstrates consistent performance improvements, validating the effectiveness of both SN and DBSN. All codes are publicly available at https://github.com/ppanzx/DBSN.
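The core Sinkhorn Normalization step can be shown in a few lines: alternately normalizing the rows (queries) and columns (targets) of exp(sim / tau) so that no target dominates as a hub. The sketch below covers only this single-bank step under assumed shapes; the dual-bank (DBSN) estimation of the target-side marginal is not shown.

```python
# Minimal sketch of Sinkhorn Normalization on a retrieval similarity matrix:
# alternate row (query) and column (target) normalization in log space to
# balance both query and target probabilities.
import torch

def sinkhorn(sim, tau=0.05, n_iters=10):
    log_p = sim / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row-normalize
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column-normalize
    return log_p.exp()   # doubly-balanced scores used for ranking

scores = sinkhorn(torch.randn(100, 100))
```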
Paperid: 138,   https://arxiv.org/pdf/2508.00391     GitHub
Authors:Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu
Title: Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition
Abstract:
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
Paperid: 139,   https://arxiv.org/pdf/2507.16343     GitHub
Authors:Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
Title: Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Abstract:
Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip-level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
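The frame-level retrieval formulation above can be pictured with a small sketch: per-frame audio features are matched against query vectors (derived from text or audio prompts) to produce per-frame event scores. Shapes and the sigmoid readout below are assumptions for illustration, not DASM's code.

```python
# Illustrative sketch of query-based, frame-level sound event retrieval:
# each audio frame is scored against one query vector per event class.
import torch
import torch.nn.functional as F

def frame_level_retrieval(frame_feats: torch.Tensor,   # (T, D) per-frame audio features
                          query_vecs: torch.Tensor     # (C, D) one query per event class
                          ) -> torch.Tensor:
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_vecs = F.normalize(query_vecs, dim=-1)
    # Cosine similarity per (frame, query) pair -> per-frame presence probability
    return torch.sigmoid(frame_feats @ query_vecs.T)   # (T, C)

scores = frame_level_retrieval(torch.randn(1000, 512), torch.randn(10, 512))
```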
Paperid: 140,   https://arxiv.org/pdf/2508.04197     GitHub
Authors:Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi
Title: Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective
Abstract:
Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework, which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain accurate reading results for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.
Paperid: 141,   https://arxiv.org/pdf/2504.17432     GitHub
Authors:Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
Title: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Abstract:
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
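A rough sketch of the second-stage idea (hard negative enhanced contrastive learning) is given below; the false-negative threshold, top-k value, and loss form are illustrative assumptions rather than UniME's implementation.

```python
# Hedged sketch of an in-batch contrastive loss that filters likely false
# negatives and keeps only the k hardest negatives per query.
import torch
import torch.nn.functional as F

def hard_negative_infonce(q, t, tau=0.05, k=8, false_neg_thresh=0.95):
    """q, t: (B, D) paired embeddings (e.g., query and target sides)."""
    q, t = F.normalize(q, dim=-1), F.normalize(t, dim=-1)
    sim = q @ t.T                                   # (B, B) cosine similarities
    pos = sim.diag()                                # matched pairs
    neg = sim.clone()
    neg.fill_diagonal_(float('-inf'))               # exclude the positive itself
    neg[neg > false_neg_thresh] = float('-inf')     # drop likely false negatives
    hard = neg.topk(k, dim=1).values                # k hardest negatives per query
    logits = torch.cat([pos.unsqueeze(1), hard], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = hard_negative_infonce(torch.randn(32, 256), torch.randn(32, 256))
```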
Paperid: 142,   https://arxiv.org/pdf/2507.19076     GitHub
Authors:Rui Pan, Ruiying Lu
Title: SP-Mamba: Spatial-Perception State Space Model for Unsupervised Medical Anomaly Detection
Abstract:
Radiography imaging protocols target specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. The window-sliding prototype learning and Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we exploit the concentration and contrast characteristics of anomaly maps to improve anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method's state-of-the-art performance, validating its efficacy and robustness. The code is available at https://github.com/Ray-RuiPan/SP-Mamba.
Paperid: 143,   https://arxiv.org/pdf/2508.04273     GitHub
Authors:Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang, Jixiang Zhu, Sanyuan Zhang, Jianfeng Dong
Title: Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval
Abstract:
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical, since not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that is meaningless for determining the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art performance among audio-video fusion VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
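The importance-weighting step can be sketched as a simple gate: a predictor scores how useful the audio is, and that score scales the audio contribution before fusion with the visual stream. Shapes, the pooling choice, and the additive fusion below are assumptions for illustration, not the IMG architecture.

```python
# Minimal sketch of importance-aware audio gating before audio-visual fusion.
import torch
import torch.nn as nn

class AudioImportanceGate(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                       nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (B, T, D) text-guided visual and audio features
        score = self.predictor(audio.mean(dim=1))      # (B, 1) predicted audio importance
        return visual + score.unsqueeze(1) * audio     # down-weight noisy audio

fused = AudioImportanceGate()(torch.randn(2, 64, 512), torch.randn(2, 64, 512))
```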
Paperid: 144,   https://arxiv.org/pdf/2505.02331     GitHub
Authors:Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Title: VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Abstract:
Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
Paperid: 145,   https://arxiv.org/pdf/2512.23546     GitHub
Authors:Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Title: PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation
Abstract:
Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at https://github.com/AI-Researcher-Team/PurifyGen.
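The dual-space transformation has a clean linear-algebra reading, sketched below: remove the components spanned by toxic concept embeddings (null-space projection) while reinforcing components in the span of clean concepts (range-space projection). The matrix names, toy dimensions, and the weighted combination are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged NumPy sketch of null-space / range-space projection of a token embedding.
import numpy as np

def null_space_projector(C: np.ndarray) -> np.ndarray:
    """P such that P @ e removes any component in the row space of C (toxic concepts)."""
    d = C.shape[1]
    return np.eye(d) - C.T @ np.linalg.pinv(C @ C.T) @ C

def range_space_projector(C: np.ndarray) -> np.ndarray:
    """P that keeps only the component of e lying in the row space of C (clean concepts)."""
    return C.T @ np.linalg.pinv(C @ C.T) @ C

d = 64
toxic = np.random.randn(5, d)    # embeddings of predefined toxic concepts (toy)
clean = np.random.randn(5, d)    # embeddings of predefined clean concepts (toy)
token = np.random.randn(d)       # a risky token embedding

purified = null_space_projector(toxic) @ token                     # subtract unsafe semantics
purified = purified + 0.5 * range_space_projector(clean) @ token   # reinforce safe semantics
```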
Paperid: 146,   https://arxiv.org/pdf/2507.06656     GitHub
Authors:Hongjie Wu, Mingqin Zhang, Linchao He, Ji-Zhe Zhou, Jiancheng Lv
Title: Enhancing Diffusion Model Stability for Image Restoration via Gradient Management
Abstract:
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.
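The two gradient-management ideas named above can be illustrated with a small loop: a warm-up factor ramps the likelihood guidance in gradually, and an exponential moving average smooths the likelihood gradient over steps. The schedule, decay, and step size below are assumptions, not the exact SPGD formulation.

```python
# Illustrative sketch of likelihood warm-up plus momentum-smoothed guidance.
import torch

def guided_update(x, prior_grad, lik_grad, step, total_steps,
                  momentum_buf, beta=0.9, lr=1e-2):
    warmup = min(1.0, step / (0.3 * total_steps))            # progressive likelihood warm-up
    momentum_buf.mul_(beta).add_(lik_grad, alpha=1 - beta)    # EMA-smoothed likelihood gradient
    x_next = x - lr * (prior_grad + warmup * momentum_buf)    # combined denoising/guidance step
    return x_next, momentum_buf

x = torch.randn(1, 3, 64, 64)          # toy latent/image state
buf = torch.zeros_like(x)
for t in range(100):
    # Random tensors stand in for the prior (denoiser) and likelihood gradients.
    x, buf = guided_update(x, torch.randn_like(x), torch.randn_like(x), t, 100, buf)
```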
Paperid: 147,   https://arxiv.org/pdf/2507.21585     GitHub
Authors:Hao Ye, Mengshi Qi, Zhaohong Liu, Liang Liu, Huadong Ma
Title: SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation
Abstract:
In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety of autonomous driving systems, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks, and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.
Paperid: 148,   https://arxiv.org/pdf/2501.01212     GitHub
Authors:Yitong Zhu, Zhuowen Liang, Yiming Wu, Tangyao Li, Yuyang Wang
Title: Towards Consumer-Grade Cybersickness Prediction: Multi-Model Alignment for Real-Time Vision-Only Inference
Abstract:
Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4% accuracy, closely matching EEG-based approaches (89.16%), while reducing deployment complexity. With an average inference latency of 90 ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.
Paperid: 149,   https://arxiv.org/pdf/2508.01644     GitHub
Authors:Peiyuan Jiang, Yao Liu, Qiao Liu, Zongshun Zhang, Jiaye Yang, Lu Liu, Daibing Yao
Title: DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
Abstract:
Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.
Paperid: 150,   https://arxiv.org/pdf/2504.13153     GitHub
Authors:Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, Liujuan Cao
Title: Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs
Abstract:
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster. Our code will be available at https://github.com/Atrovast/THGS.
Paperid: 151,   https://arxiv.org/pdf/2507.19778     GitHub
Authors:Kanglin Qu, Pan Gao, Qun Dai, Yuanhao Sun
Title: HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
Abstract:
The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.
Paperid: 152,   https://arxiv.org/pdf/2508.11673     GitHub
Authors:Haojie Zhang, Yixiong Liang, Hulin Kuang, Lihui Cen, Zhe Qu, Yigang Cen, Min Zeng, Shichao Kan
Title: Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning
Abstract:
Multimodal Biomedical Image Incremental Learning (MBIIL) is essential for handling diverse tasks and modalities in the biomedical domain, as training separate models for each modality or task significantly increases inference costs. Existing incremental learning methods focus on task expansion within a single modality, whereas MBIIL seeks to train a unified model incrementally across modalities. MBIIL faces two challenges: I) how to preserve previously learned knowledge during incremental updates, and II) how to effectively leverage knowledge acquired from existing modalities to support new modalities. To address these challenges, we propose MSLoRA-CR, a method that fine-tunes Modality-Specific LoRA modules while incorporating Contrastive Regularization to enhance intra-modality knowledge sharing and promote inter-modality knowledge differentiation. Our approach builds upon a large vision-language model (LVLM), keeping the pretrained model frozen while incrementally adapting new LoRA modules for each modality or task. Experiments on the incremental learning of biomedical images demonstrate that MSLoRA-CR outperforms both the state-of-the-art (SOTA) approach of training separate models for each modality and the general incremental learning method (incrementally fine-tuning LoRA). Specifically, MSLoRA-CR achieves a 1.88% improvement in overall performance compared to unconstrained incremental learning methods while maintaining computational efficiency. Our code is publicly available at https://github.com/VentusAislant/MSLoRA_CR.
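A minimal sketch of the modality-specific LoRA setup described above is shown below: the base linear layer stays frozen while small low-rank adapters, keyed by modality, are trained incrementally. The rank, scaling, and dictionary-based keying are illustrative assumptions rather than the MSLoRA-CR implementation (and the contrastive regularization term is omitted).

```python
# Hedged sketch of a modality-specific LoRA adapter over a frozen linear layer.
import torch
import torch.nn as nn

class ModalityLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, modalities, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # keep the LVLM weights frozen
        self.scale = alpha / rank
        self.A = nn.ParameterDict({m: nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
                                   for m in modalities})
        self.B = nn.ParameterDict({m: nn.Parameter(torch.zeros(base.out_features, rank))
                                   for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        delta = (x @ self.A[modality].T) @ self.B[modality].T  # low-rank update for this modality
        return self.base(x) + self.scale * delta

layer = ModalityLoRALinear(nn.Linear(768, 768), ["xray", "pathology"])
out = layer(torch.randn(4, 768), modality="xray")
```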
Paperid: 153,   https://arxiv.org/pdf/2503.06678     GitHub
Authors:Hantao Zhou, Rui Yang, Longxiang Tang, Guanyi Qin, Runze Hu, Xiu Li
Title: Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts
Abstract:
Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present Gamma, a Generic imAge assessMent model using Mixture of Assessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Codes are available at https://github.com/zht8506/Gamma.
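The shared-plus-adaptive expert idea can be sketched compactly: every input passes through a shared expert, while a router mixes several dataset-adaptive experts. Expert sizes and the softmax gating below are assumptions for illustration, not the exact MoAE design.

```python
# Minimal sketch of a mixture of shared and adaptive assessment experts.
import torch
import torch.nn as nn

class MoAE(nn.Module):
    def __init__(self, dim: int = 768, n_adaptive: int = 6):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(n_adaptive)])
        self.router = nn.Linear(dim, n_adaptive)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, D) image features
        gates = self.router(x).softmax(dim=-1)             # (B, n_adaptive) mixing weights
        adaptive = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n_adaptive, D)
        return self.shared(x) + (gates.unsqueeze(-1) * adaptive).sum(dim=1)

feat = MoAE()(torch.randn(4, 768))
```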
Paperid: 154,   https://arxiv.org/pdf/2507.09500     GitHub
Authors:Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu, Zijia Lin, Shuaicheng Niu, Sicheng Zhao, Jungong Han, Guiguang Ding
Title: Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations
Abstract:
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under real-world distribution shifts. Code: https://github.com/Evelyn1ywliang/ReTA.
Paperid: 155,   https://arxiv.org/pdf/2508.05269     GitHub
Authors:Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim
Title: B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
Abstract:
Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/
Paperid: 156,   https://arxiv.org/pdf/2504.14245     GitHub
Authors:Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Abstract:
Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.
Paperid: 157,   https://arxiv.org/pdf/2506.21109     GitHub
Authors:Luosheng Xu, Dalin Zhang, Zhaohui Song
Title: Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
Abstract:
Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD (the name suggests that a quick flick yields great results), which pushes the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.
Paperid: 158,   https://arxiv.org/pdf/2504.12151     GitHub
Authors:Miaosen Luo, Yuncheng Jiang, Sijie Mai
Title: Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis
Abstract:
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture. Our code is released on https://github.com/LuoMSen/KAN-MCP.
Paperid: 159,   https://arxiv.org/pdf/2507.06671     GitHub
Authors:Boyuan Tian, Qizhe Gao, Siran Xianyu, Xiaotong Cui, Minjia Zhang
Title: FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting
Abstract:
3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
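The two training-free ingredients named above, attribute-wise mixed-precision quantization and attribute-discriminative pruning, can be illustrated with a simplified NumPy sketch. The importance heuristic (opacity quantile), the 8-bit target, and the attribute layout are assumptions for illustration, not FlexGaussian's actual pipeline.

```python
# Simplified sketch: keep positions in float32, quantize appearance attributes
# to uint8, and prune Gaussians with low opacity-based importance.
import numpy as np

def quantize_uint8(x: np.ndarray):
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return q, (lo, hi)                                   # keep range to dequantize later

N = 100_000
positions = np.random.randn(N, 3).astype(np.float32)    # kept at full precision
opacity = np.random.rand(N).astype(np.float32)
sh_coeffs = np.random.randn(N, 48).astype(np.float32)   # spherical-harmonic color coefficients

keep = opacity > np.quantile(opacity, 0.3)               # prune low-impact Gaussians (30% here)
sh_q, sh_range = quantize_uint8(sh_coeffs[keep])         # 4x smaller than float32
op_q, op_range = quantize_uint8(opacity[keep])
compressed = {"xyz": positions[keep], "sh": sh_q, "opacity": op_q}
```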
Paperid: 160,   https://arxiv.org/pdf/2507.15401     GitHub
Authors:Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li
Title: Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond
Abstract:
Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as a dense semantic prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as a sparse geometric prior to mitigate intrinsic noise in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model's ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.
Paperid: 161,   https://arxiv.org/pdf/2508.06382     GitHub
Authors:Xiangyu Wu, Feng Yu, Yang Yang, Jianfeng Lu
Title: Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning
Abstract:
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.
Paperid: 162,   https://arxiv.org/pdf/2507.17456     GitHub
Authors:Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci
Title: Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has begun to tap this potential and has even proposed training-free methods, key gaps remain. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
Paperid: 163,   https://arxiv.org/pdf/2508.01427     GitHub
Authors:Peirong Zhang, Kai Ding, Lianwen Jin
Title: Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification
Abstract:
In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.
Paperid: 164,   https://arxiv.org/pdf/2507.18929     GitHub
Authors:Jian Chen, Yuxuan Hu, Haifeng Lu, Wei Wang, Min Yang, Chengming Li, Xiping Hu
Title: MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition
Abstract:
Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on two public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains a clear improvement of 5.4% in F1 and 4.0% in accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.
Paperid: 165,   https://arxiv.org/pdf/2508.14058     GitHub
Authors:Jingmao Zhang, Zhiting Zhao, Yunqi Lin, Jianghong Ma, Tianjun Wei, Haijun Zhang, Xiaofeng Zhang
Title: Dual-Phase Playtime-guided Recommendation: Interest Intensity Exploration and Multimodal Random Walks
Abstract:
The explosive growth of the video game industry has created an urgent need for recommendation systems that can scale with expanding catalogs and maintain user engagement. While prior work has explored accuracy and diversity in recommendations, existing models underutilize playtime, a rich behavioral signal unique to gaming platforms, and overlook the potential of multimodal information to enhance diversity. In this paper, we propose DP2Rec, a novel Dual-Phase Playtime-guided Recommendation model designed to jointly optimize accuracy and diversity. First, we introduce a playtime-guided interest intensity exploration module that separates strong and weak preferences via dual-beta modeling, enabling fine-grained user profiling and more accurate recommendations. Second, we present a playtime-guided multimodal random walks module that simulates player exploration using transitions guided by both playtime-derived interest similarity and multimodal semantic similarity. This mechanism preserves core preferences while promoting cross-category discovery through latent semantic associations and adaptive category balancing. Extensive experiments on a real-world game dataset show that DP2Rec outperforms existing methods in both recommendation accuracy and diversity.
Paperid: 166,   https://arxiv.org/pdf/2508.02243     GitHub
Authors:Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
Title: I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Abstract:
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges: the unnecessary incorporation of image data in certain scenarios, and reliance on a single one-time extraction of visual features, both of which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
Paperid: 167,   https://arxiv.org/pdf/2506.16495     GitHub
Authors:Changsheng Gao, Zijie Liu, Li Li, Dong Liu, Xiaoyan Sun, Weisi Lin
Title: DT-UFC: Universal Large Model Feature Coding via Peaky-to-Balanced Distribution Transformation
Abstract:
Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage burden. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unexplored. In this paper, we present the first systematic study on universal feature coding for large models. The key challenge lies in the inherently diverse and distributionally incompatible nature of features extracted from different models. For example, features from DINOv2 exhibit highly peaky, concentrated distributions, while those from Stable Diffusion 3 (SD3) are more dispersed and uniform. This distributional heterogeneity severely hampers both compression efficiency and cross-model generalization. To address this, we propose a learned peaky-to-balanced distribution transformation, which reshapes highly skewed feature distributions into a common, balanced target space. This transformation is non-uniform, data-driven, and plug-and-play, enabling effective alignment of heterogeneous distributions without modifying downstream codecs. With this alignment, a universal codec trained on the balanced target distribution can effectively generalize to features from different models and tasks. We validate our approach on three representative large models (LLaMA3, DINOv2, and SD3) across multiple tasks and modalities. Extensive experiments show that our method achieves notable improvements in both compression efficiency and cross-model generalization over task-specific baselines. All source code has been made available at https://github.com/chansongoal/DT-UFC.
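One simple way to realize a peaky-to-balanced transform is to map feature values through their empirical CDF so that the transformed distribution becomes approximately uniform. The paper learns this mapping; the hand-crafted quantile-based stand-in below is only an illustration of the balancing idea, and the Laplace toy data is an assumption.

```python
# Illustrative sketch: reshape a peaky feature distribution toward a balanced
# (roughly uniform) target space via an empirical CDF mapping.
import numpy as np

def fit_cdf_transform(features: np.ndarray, n_points: int = 1024):
    """Estimate quantiles of a peaky feature distribution."""
    qs = np.linspace(0.0, 1.0, n_points)
    return np.quantile(features, qs), qs

def apply_transform(features: np.ndarray, quantiles: np.ndarray, qs: np.ndarray):
    """Map raw values into a balanced [0, 1] space via the fitted CDF."""
    return np.interp(features.ravel(), quantiles, qs).reshape(features.shape)

peaky = np.random.laplace(loc=0.0, scale=0.1, size=(1000, 256))  # concentrated, DINOv2-like toy data
quantiles, qs = fit_cdf_transform(peaky.ravel())
balanced = apply_transform(peaky, quantiles, qs)   # roughly uniform, easier for a shared codec
```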
Paperid: 168,   https://arxiv.org/pdf/2508.02374     GitHub
Authors:Shuo Lu, Yanyin Chen, Wei Feng, Jiahao Fan, Fengheng Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Jian Liang
Title: Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation
Abstract:
Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
Paperid: 169,   https://arxiv.org/pdf/2505.19958     GitHub
Authors:Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Title: UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space
Abstract:
Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
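The "partial channel shift along the temporal dimension" that the RTS module builds on can be shown in a few lines: a fraction of channels is shifted forward and another fraction backward in time so neighbouring frames exchange information without explicit temporal layers. The 1/8 fraction and the shapes are assumptions for illustration.

```python
# Compact sketch of a temporal channel-shift operation on video features.
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (B, T, C, H, W) video features; shifts 2/fold_div of the channels in time."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                      # shift one chunk forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]      # shift another chunk backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                 # leave remaining channels untouched
    return out

shifted = temporal_shift(torch.randn(1, 5, 64, 32, 32))
```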
Paperid: 170,   https://arxiv.org/pdf/2507.07902     GitHub
Authors:Jinhong Wang, Tajamul Ashraf, Zongyan Han, Jorma Laaksonen, Rao Mohammad Anwer
Title: MIRA: A Novel Framework for Fusing Modalities in Medical RAG
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
Paperid: 171,   https://arxiv.org/pdf/2510.05586     GitHub
Authors:Bin Kang, Bin Chen, Junjie Wang, Yulin Li, Junzhi Zhao, Zhuotao Tian
Title: CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval
Abstract:
Existing Visual Language Models (VLMs) suffer structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
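A speculative sketch of the general idea of damping dominant tokens before global aggregation; the quantile threshold, damping factor, and weighted pooling below are assumptions for illustration rather than CalibCLIP's actual CVE/DCC design:

import torch

def calibrated_pool(tokens: torch.Tensor, contributions: torch.Tensor,
                    dominance_quantile: float = 0.95, damping: float = 0.3) -> torch.Tensor:
    """Toy calibration of dominant tokens before global pooling.

    tokens:        (N, D) patch/word token features.
    contributions: (N,) per-token aggregation weights (e.g., attention to a global token).
    Tokens whose contribution exceeds the chosen quantile are damped so that
    discriminative but low-weight tokens are not drowned out.
    """
    thresh = torch.quantile(contributions, dominance_quantile)
    weights = torch.where(contributions > thresh, contributions * damping, contributions)
    weights = weights / weights.sum()
    return (weights.unsqueeze(-1) * tokens).sum(dim=0)

feats = torch.randn(196, 512)
attn = torch.rand(196)
pooled = calibrated_pool(feats, attn)
print(pooled.shape)  # torch.Size([512])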
Paperid: 172,   https://arxiv.org/pdf/2408.09186     GitHub
Authors:Qile Liu, Weishan Ye, Lingli Zhang, Zhen Liang
Title: EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition
Abstract:
Emotion recognition using electroencephalography (EEG) signals has attracted increasing attention in recent years. However, existing methods often lack generalization in cross-corpus settings, where a model trained on one dataset is directly applied to another without retraining, due to differences in data distribution and recording conditions. To tackle the challenge of cross-corpus EEG-based emotion recognition, we propose a novel framework termed Soft Contrastive Masked Modeling (SCMM). Grounded in the theory of emotional continuity, SCMM integrates soft contrastive learning with a hybrid masking strategy to effectively capture emotion dynamics (i.e., short-term continuity). Specifically, in the self-supervised learning stage, we propose a soft weighting mechanism that assigns similarity scores to sample pairs, enabling fine-grained modeling of emotional transitions and capturing the temporal continuity of human emotions. To further enhance representation learning, we design a similarity-aware aggregator that fuses complementary information from semantically related samples based on pairwise similarities, thereby improving feature expressiveness and reconstruction quality. This dual design contributes to a more discriminative and transferable representation, which is crucial for robust cross-corpus generalization. Extensive experiments on the SEED, SEED-IV, and DEAP datasets show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy of 4.26% under both same-class and different-class cross-corpus settings. The source code is available at https://github.com/Kyler-RL/SCMM.
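A minimal sketch of a soft-weighted contrastive objective in the spirit described above, where pairwise similarity scores replace hard positive/negative labels; the exact weighting, masking, and normalization choices here are illustrative assumptions, not the paper's loss:

import torch
import torch.nn.functional as F

def soft_contrastive_loss(z: torch.Tensor, soft_sim: torch.Tensor, temperature: float = 0.1):
    """Soft-weighted InfoNCE-style loss (illustrative, not SCMM's exact formulation).

    z:        (N, D) embeddings of N samples in a batch.
    soft_sim: (N, N) similarity scores in [0, 1] used as soft positive weights
              (e.g., derived from distances between the raw EEG segments).
    """
    z = F.normalize(z, dim=-1)
    logits = z @ z.t() / temperature
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(mask, -1e9)          # exclude self-pairs
    log_prob = F.log_softmax(logits, dim=-1)
    weights = soft_sim.masked_fill(mask, 0.0)
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return -(weights * log_prob).sum(dim=-1).mean()

z = torch.randn(8, 128)
sim = torch.rand(8, 8)
print(soft_contrastive_loss(z, sim).item())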
Paperid: 173,   https://arxiv.org/pdf/2411.19475     GitHub
Authors:Ruoqi Wang, Haitao Wang, Qiong Luo
Title: GalaxAlign: Mimicking Citizen Scientists' Multimodal Guidance for Galaxy Morphology Analysis
Abstract:
Galaxy morphology analysis involves studying galaxies based on their shapes and structures. For such studies, fundamental tasks include identifying and classifying galaxies in astronomical images, as well as retrieving visually or structurally similar galaxies through similarity search. Existing methods either directly train domain-specific foundation models on large, annotated datasets or fine-tune vision foundation models on a smaller set of images. The former is effective but costly, while the latter is more resource-efficient but often yields lower accuracy. To address these challenges, we introduce GalaxAlign, a multimodal approach inspired by how citizen scientists identify galaxies in astronomical images by following textual descriptions and matching schematic symbols. Specifically, GalaxAlign employs a tri-modal alignment framework to align three types of data during fine-tuning: (1) schematic symbols representing galaxy shapes and structures, (2) textual labels for these symbols, and (3) galaxy images. By incorporating multimodal instructions, GalaxAlign eliminates the need for expensive pretraining and enhances the effectiveness of fine-tuning. Experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge. Code is available at https://github.com/RapidsAtHKUST/GalaxAlign.
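A minimal sketch of tri-modal contrastive alignment using CLIP-style pairwise losses between image, schematic-symbol, and text embeddings; treating the tri-modal objective as a sum of pairwise terms is an assumption for illustration, not necessarily GalaxAlign's exact training objective:

import torch
import torch.nn.functional as F

def clip_style_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two aligned embedding batches (row i matches row i)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def tri_modal_loss(img_emb, symbol_emb, text_emb):
    """Toy tri-modal alignment: sum of the three pairwise CLIP-style losses, where the
    embeddings are assumed to come from galaxy-image, schematic-symbol, and text encoders."""
    return (clip_style_loss(img_emb, symbol_emb)
            + clip_style_loss(img_emb, text_emb)
            + clip_style_loss(symbol_emb, text_emb))

batch = 16
print(tri_modal_loss(torch.randn(batch, 256), torch.randn(batch, 256), torch.randn(batch, 256)).item())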
Paperid: 174,   https://arxiv.org/pdf/2502.07239     GitHub
Authors:Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Shapiro, Kyle Olszewski
Title: Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation
Abstract:
Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints and improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1.
Paperid: 175,   https://arxiv.org/pdf/2507.22003     GitHub
Authors:Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li
Title: See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as generated textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.
Paperid: 176,   https://arxiv.org/pdf/2503.17116     GitHub
Authors:Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
Title: The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Abstract:
Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.
Paperid: 177,   https://arxiv.org/pdf/2506.03007     GitHub
Authors:Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Title: DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models
Abstract:
With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) up-to-date generators, with fake images produced by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.
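A toy version of a mixture-of-agents combination step: each LMM agent contributes a probability that the image is fake, and the fused score drives the decision. The simple weighted average below is illustrative only; the paper's combined probability strategy may differ:

import torch

def mixture_of_agents_decision(agent_probs, weights=None):
    """Combine per-agent probabilities that an image is fake (illustrative strategy).

    agent_probs: list of floats in [0, 1], one P(fake) estimate per LMM agent
                 (e.g., parsed from each model's yes/no token probabilities).
    weights:     optional per-agent reliability weights; uniform if omitted.
    Returns the fused probability and the binary decision.
    """
    p = torch.tensor(agent_probs, dtype=torch.float32)
    w = torch.ones_like(p) if weights is None else torch.tensor(weights, dtype=torch.float32)
    w = w / w.sum()
    fused = float((w * p).sum())
    return fused, fused > 0.5

print(mixture_of_agents_decision([0.9, 0.6, 0.4]))  # roughly (0.63, True)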
Paperid: 178,   https://arxiv.org/pdf/2506.14642     GitHub
Authors:Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu
Title: 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore, we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS-specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.
Paperid: 179,   https://arxiv.org/pdf/2411.03314     GitHub
Authors:Ziliang Gan, Yu Lu, Dong Zhang, Haohan Li, Che Liu, Jian Liu, Ji Liu, Haipang Wu, Chaoyou Fu, Zenglin Xu, Rongjunchen Zhang, Yong Dai
Title: MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Abstract:
In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual, open-ended, and practical usage-oriented Visual Question Answering (VQA) benchmark. The defining characteristics of our benchmark are its financial focus and expertise, which include constructing charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creating questions according to the preferences in financial domain inquiries, and annotating questions by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is first introduced in the multi-modal evaluation process. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks do not perform well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.
Paperid: 180,   https://arxiv.org/pdf/2503.20756     GitHub
Authors:Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Title: ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.
Paperid: 181,   https://arxiv.org/pdf/2505.24458     GitHub
Authors:Tianlong Yu, Chenghang Ye, Zheyu Yang, Ziyi Zhou, Cui Tang, Zui Tao, Jun Zhang, Kailong Wang, Liting Zhou, Yang Yang, Ting Bi
Title: SEAR: A Multimodal Dataset for Analyzing AR-LLM-Driven Social Engineering Behaviors
Abstract:
The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR's alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at https://github.com/INSLabCN/SEAR-Dataset.
Paperid: 182,   https://arxiv.org/pdf/2509.20715     GitHub
Authors:Ruixu Zhang, Yuran Wang, Xinyi Hu, Chaoyu Mai, Wenxuan Liu, Danni Xu, Xian Zhong, Zheng Wang
Title: Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset
Abstract:
Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.
Paperid: 183,   https://arxiv.org/pdf/2506.03635     GitHub
Authors:Yinfan Wang, Jie Gui, Baosheng Yu, Qi Li, Zhenan Sun, Juho Kannala, Guoying Zhao
Title: FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition
Abstract:
A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M -- the largest available finger vein dataset -- containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91% performance gain across multiple benchmarks. The dataset is publicly available at: https://github.com/EvanWang98/FingerVeinSyn-5M.
Paperid: 184,   https://arxiv.org/pdf/2508.01712     GitHub
Authors:Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
Title: HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
Abstract:
Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or as one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
Paperid: 185,   https://arxiv.org/pdf/2509.15753     GitHub
Authors:Yang Li, Tingfa Xu, Shuyan Bai, Peifu Liu, Jianan Li
Title: MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.
Paperid: 186,   https://arxiv.org/pdf/2508.00497     GitHub
Authors:Jinghui Zhang, Kaiyang Wan, Longwei Xu, Ao Li, Zongfang Liu, Xiuying Chen
Title: From Individuals to Crowds: Dual-Level Public Response Prediction in Social Media
Abstract:
Public response prediction is critical for understanding how individuals or groups might react to specific events, policies, or social phenomena, making it highly valuable for crisis management, policy-making, and social media analysis. However, existing works face notable limitations. First, they lack micro-level personalization, producing generic responses that ignore individual user preferences. Moreover, they overlook macro-level sentiment distribution and only deal with individual-level sentiment, constraining them from analyzing broader societal trends and group sentiment dynamics. To address these challenges, we propose SocialAlign, a unified framework that predicts real-world responses at both micro and macro levels in social contexts. At the micro level, SocialAlign employs SocialLLM with an articulate Personalized Analyze-Compose LoRA (PAC-LoRA) structure, which deploys specialized expert modules for content analysis and response generation across diverse topics and user profiles, enabling the generation of personalized comments with corresponding sentiments. At the macro level, it models group sentiment distributions and aligns predictions with real-world sentiment trends derived from social media data. To evaluate SocialAlign in real-world scenarios, we introduce SentiWeibo, a large-scale dataset curated from authentic social interactions on the Weibo platform. Experimental results on our SentiWeibo and related LaMP benchmark demonstrate that SocialAlign surpasses strong baselines, showing improved accuracy, interpretability, and generalization in public response prediction. We hope our work inspires further research in public response prediction and computational social science: https://github.com/Znull-1220/SocialAlign.
Paperid: 187,   https://arxiv.org/pdf/2402.13185     GitHub
Authors:Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian
Title: UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
Abstract:
Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, is underexplored. In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, based on the insights that temporal and spatial self-attention layers encode inter-frame and intra-frame dependency respectively, we introduce auxiliary motion-reference and reconstruction branches to produce text-guided motion and source features respectively. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers. Extensive experiments demonstrate that UniEdit covers video motion editing and various appearance editing scenarios, and surpasses the state-of-the-art methods. Our code will be publicly available.
Paperid: 188,   https://arxiv.org/pdf/2504.13650     GitHub
Authors:Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Title: EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model
Abstract:
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with a tailored dataset, benchmark, and model: First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, thoroughly optimized for fine-grained ophthalmic visual understanding, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.
Paperid: 189,   https://arxiv.org/pdf/2505.05741     GitHub
Authors:Zhangchi Hu, Peixi Wu, Jie Chen, Huyue Zhu, Yijun Wang, Yansong Peng, Hebei Li, Xiaoyan Sun
Title: Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection
Abstract:
Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code is available at https://github.com/RicePasteM/Dome-DETR.
Paperid: 190,   https://arxiv.org/pdf/2508.05213     GitHub
Authors:Jianming Liu, Wenlong Qiu, Haitao Wei
Title: Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
Abstract:
Few-Shot Segmentation (FSS) aims at efficient segmentation of new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
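A minimal sketch of a TVEA-style objective that pulls adapter-produced visual features toward frozen text prototypes (e.g., CLIP text embeddings of class names) selected by pseudo-labels; the pseudo-labeling, temperature, and cross-entropy form are assumptions for illustration, not the paper's exact module:

import torch
import torch.nn.functional as F

def text_visual_alignment_loss(visual_feats, text_prototypes, pseudo_labels, temperature=0.07):
    """Toy text-visual alignment objective for source-free adaptation.

    visual_feats:    (N, D) features produced by the trainable adapter.
    text_prototypes: (C, D) frozen per-class text embeddings (e.g., from CLIP).
    pseudo_labels:   (N,) class indices assigned to each visual feature.
    Only the adapter receives gradients; the text prototypes stay fixed.
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_prototypes, dim=-1)
    logits = v @ t.t() / temperature
    return F.cross_entropy(logits, pseudo_labels)

loss = text_visual_alignment_loss(torch.randn(32, 512), torch.randn(5, 512),
                                  torch.randint(0, 5, (32,)))
print(loss.item())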
Paperid: 191,   https://arxiv.org/pdf/2504.15545     GitHub
Authors:Zizhi Chen, Xinyu Zhang, Minghao Han, Yizhou Liu, Ziyun Qian, Weifeng Zhang, Xukun Zhang, Jingwei Wei, Lihua Zhang
Title: VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining
Abstract:
In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at: https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR
Paperid: 192,   https://arxiv.org/pdf/2509.00843     GitHub
Authors:Xueyang Kang, Zhengkang Xiang, Zezheng Zhang, Kourosh Khoshelham
Title: Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion
Abstract:
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views -- even in loop closure scenarios -- while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.
Paperid: 193,   https://arxiv.org/pdf/2505.10576     GitHub
Authors:Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh
Title: Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View
Abstract:
High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ a single-view mesh-rendered image prior to enhance gesture generation quality. However, the spatial complexity of hand gestures and the inherent limitations of single-view rendering make it difficult to capture complete gesture information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion. This multi-view prior with a dedicated dual stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which fuses gesture localization features and multi-modal features to enhance the location awareness of the MUFEN features with respect to gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations. The source code is available at https://github.com/fuqifan/MUFEN.
Paperid: 194,   https://arxiv.org/pdf/2507.07526     GitHub
Authors:Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Gangming Zhao, Zhao Lv
Title: DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction
Abstract:
Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
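For reference, one common way to compute the Pearson correlation metric reported above for mel spectrogram reconstruction is per mel bin over time, then averaged; the paper's exact evaluation protocol may differ from this sketch:

import numpy as np

def mel_pearson(pred: np.ndarray, target: np.ndarray) -> float:
    """Average Pearson correlation between predicted and reference mel spectrograms.

    pred, target: arrays of shape (time, n_mels). The coefficient is computed
    per mel bin over time and then averaged across bins.
    """
    assert pred.shape == target.shape
    corrs = []
    for band in range(pred.shape[1]):
        x, y = pred[:, band], target[:, band]
        if x.std() == 0 or y.std() == 0:
            continue  # skip constant bands to avoid division by zero
        corrs.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(corrs)) if corrs else 0.0

rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 80))
noisy = ref + 2.0 * rng.standard_normal((1000, 80))
print(round(mel_pearson(noisy, ref), 3))  # moderate positive correlation, about 0.45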
Paperid: 195,   https://arxiv.org/pdf/2510.20327     GitHub
Authors:Fengyuan Yu, Yuyuan Li, Xiaohua Feng, Junjie Fang, Tao Wang, Chaochao Chen
Title: LEGO: A Lightweight and Efficient Multiple-Attribute Unlearning Framework for Recommender Systems
Abstract:
With the growing demand for safeguarding sensitive user information in recommender systems, recommendation attribute unlearning is receiving increasing attention. Existing studies predominantly focus on single-attribute unlearning. However, privacy protection requirements in the real world often involve multiple sensitive attributes and are dynamic. Existing single-attribute unlearning methods cannot meet these real-world requirements due to i) CH1: the inability to handle multiple unlearning requests simultaneously, and ii) CH2: the lack of efficient adaptability to dynamic unlearning needs. To address these challenges, we propose LEGO, a lightweight and efficient multiple-attribute unlearning framework. Specifically, we divide the multiple-attribute unlearning process into two steps: i) Embedding Calibration removes information related to a specific attribute from user embedding, and ii) Flexible Combination combines these embeddings into a single embedding, protecting all sensitive attributes. We frame the unlearning process as a mutual information minimization problem, providing LEGO a theoretical guarantee of simultaneous unlearning, thereby addressing CH1. With the two-step framework, where Embedding Calibration can be performed in parallel and Flexible Combination is flexible and efficient, we address CH2. Extensive experiments on three real-world datasets across three representative recommendation models demonstrate the effectiveness and efficiency of our proposed framework. Our code and appendix are available at https://github.com/anonymifish/lego-rec-multiple-attribute-unlearning.
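A toy illustration of the two-step structure described above: per-attribute calibrated embeddings are produced independently (hence parallelizable) and then merged into a single embedding. The convex combination below is a placeholder for the paper's Flexible Combination operator, not its actual formulation:

import torch

def flexible_combination(calibrated_embs, weights=None):
    """Toy second step of a two-step multiple-attribute unlearning pipeline.

    calibrated_embs: list of (U, D) tensors, one per unlearned attribute,
                     each produced independently by an embedding-calibration step.
    weights:         optional per-attribute mixing weights; uniform if omitted.
    """
    stacked = torch.stack(calibrated_embs)  # (A, U, D)
    if weights is None:
        weights = torch.full((len(calibrated_embs),), 1.0 / len(calibrated_embs))
    return torch.einsum('a,aud->ud', weights, stacked)

gender_free = torch.randn(1000, 64)  # user embeddings calibrated w.r.t. one attribute
age_free = torch.randn(1000, 64)     # calibrated w.r.t. another attribute
combined = flexible_combination([gender_free, age_free])
print(combined.shape)  # torch.Size([1000, 64])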
Paperid: 196,   https://arxiv.org/pdf/2405.18991     GitHub
Authors:Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang
Title: EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
Abstract:
This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.
Paperid: 197,   https://arxiv.org/pdf/2506.23121     GitHub
Authors:Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge
Title: CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation
Abstract:
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
Paperid: 198,   https://arxiv.org/pdf/2411.18659     GitHub
Authors:Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang
Title: DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models
Abstract:
Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.
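A minimal sketch of the kind of lightweight probe such an approach implies: a small classifier over cross-modal attention statistics, trained separately from the frozen LVLM. The feature layout (per-layer, per-head mean attention from answer tokens to image tokens) and sizes below are assumptions for illustration, not DHCP's exact detector:

import torch
import torch.nn as nn

class AttentionPatternDetector(nn.Module):
    """Lightweight probe that classifies hallucination vs. non-hallucination
    from cross-modal attention statistics; no LVLM fine-tuning is involved."""
    def __init__(self, n_layers: int = 32, n_heads: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_layers * n_heads, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # {non-hallucinated, hallucinated}
        )

    def forward(self, attn_features: torch.Tensor) -> torch.Tensor:
        # attn_features: (batch, n_layers * n_heads) attention summary statistics.
        return self.net(attn_features)

detector = AttentionPatternDetector()
logits = detector(torch.rand(4, 32 * 32))
print(logits.argmax(dim=-1))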
Paperid: 199,   https://arxiv.org/pdf/2404.17360     GitHub
Authors:Maoxun Yuan, Bo Cui, Tianyi Zhao, Jiayi Wang, Shan Fu, Xue Yang, Xingxing Wei
Title: UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning
Abstract:
Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions including low-illumination and adverse weather. However, due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adapter to effectively complement the ViT features with the contextual multi-scale features. During the training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance. The source code and results are available at https://github.com/PoTsui99/UniRGB-IR.git.
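The freeze-backbone, train-adapter-only recipe described above can be expressed in a few lines; the module names and the tiny stand-in model below are placeholders, not the UniRGB-IR code:

import torch
import torch.nn as nn

class TinyViTWithAdapter(nn.Module):
    # Stand-in for a frozen ViT backbone plus MFP/SFI-style adapter modules.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(768, 768)
        self.mfp_adapter = nn.Linear(768, 768)
        self.sfi_adapter = nn.Linear(768, 768)

def freeze_backbone_train_adapter(model, adapter_keywords=('mfp', 'sfi')):
    """Keep pre-trained weights frozen; optimize only parameters whose names
    contain an adapter keyword (the keywords here are placeholders)."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name.lower() for k in adapter_keywords)
        if param.requires_grad:
            trainable.append(param)
    return trainable

model = TinyViTWithAdapter()
params = freeze_backbone_train_adapter(model)
optimizer = torch.optim.AdamW(params, lr=1e-4)
print(sum(p.numel() for p in params), "trainable parameters")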
Paperid: 200,   https://arxiv.org/pdf/2507.05963     GitHub
Authors:Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
Title: Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
Abstract:
Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://ali-videoai.github.io/Tora2_page/.
Paperid: 201,   https://arxiv.org/pdf/2507.23362     GitHub
Authors:Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Title: Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
Abstract:
Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
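A generic layer-pruning heuristic related to the discussion above, scoring each layer by how little it changes its input; this cosine-similarity criterion is a common baseline from the NLP layer-pruning literature, not necessarily the SVL criterion itself:

import torch

def layer_redundancy_scores(hidden_states):
    """Score each transformer layer by the cosine similarity between its input
    and output hidden states, averaged over tokens. High similarity suggests a
    candidate for pruning.

    hidden_states: list of (T, D) tensors, one per layer boundary
                   (length L + 1 for an L-layer model).
    """
    scores = []
    for inp, out in zip(hidden_states[:-1], hidden_states[1:]):
        cos = torch.nn.functional.cosine_similarity(inp, out, dim=-1)  # (T,)
        scores.append(cos.mean().item())
    return scores

states = [torch.randn(128, 4096) for _ in range(33)]  # e.g., a 32-layer decoder
scores = layer_redundancy_scores(states)
prune_candidates = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:4]
print(prune_candidates)  # indices of the four most redundant layers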
Paperid: 202,   https://arxiv.org/pdf/2504.12773     GitHub
Authors:Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Title: Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that automatically generates step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM) using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Codes are available at https://github.com/ycpNotFound/GeoGen.
Paperid: 203,   https://arxiv.org/pdf/2504.11379     GitHub
Authors:Liu Yang, Huiyu Duan, Yucheng Zhu, Xiaohong Liu, Lu Liu, Zitong Xu, Guangji Ma, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Title: Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model
Abstract:
360° omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360° Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, comprising 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni$^2$), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks. Both the Any2Omni dataset and the Omni$^2$ model are publicly available at: https://github.com/IntMeGroup/Omni2.
Paperid: 204,   https://arxiv.org/pdf/2512.23453     GitHub
Authors:Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Title: CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.
Paperid: 205,   https://arxiv.org/pdf/2508.11531     GitHub
Authors:Shilei Wang, Gong Cheng, Pujian Lai, Dong Gao, Junwei Han
Title: Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction
Abstract:
Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.
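The following is a minimal PyTorch sketch of the multi-state flow only: per-state lightweight enhancement followed by an adaptive cross-state mixing step. The actual HSA-SSD design is more involved; the module and parameter names here are placeholders.

```python
# Minimal sketch, assuming multi-state feature maps are already produced by
# the backbone; the enhancement and interaction here are simplified stand-ins.
import torch
import torch.nn as nn

class MultiStateBlock(nn.Module):
    def __init__(self, channels: int, num_states: int):
        super().__init__()
        # One tiny depthwise conv per state ("state-specific enhancement").
        self.sse = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
            for _ in range(num_states)
        )
        # Learned mixing weights ("cross-state interaction").
        self.mix = nn.Parameter(torch.zeros(num_states, num_states))

    def forward(self, states):
        enhanced = [conv(s) for conv, s in zip(self.sse, states)]
        w = torch.softmax(self.mix, dim=-1)           # (S, S)
        stacked = torch.stack(enhanced, dim=0)        # (S, B, C, H, W)
        mixed = torch.einsum("st,tbchw->sbchw", w, stacked)
        return mixed.sum(dim=0)                       # aggregate states

if __name__ == "__main__":
    block = MultiStateBlock(channels=64, num_states=3)
    feats = [torch.randn(2, 64, 16, 16) for _ in range(3)]
    print(block(feats).shape)                         # torch.Size([2, 64, 16, 16])
```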
Paperid: 206,   https://arxiv.org/pdf/2509.02357     GitHub
Authors:Zeren Xiong, Zikun Chen, Zedong Zhang, Xiang Li, Ying Tai, Jian Yang, Jun Li
Title: Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion
Abstract:
In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and a text description from another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations, such as shark(3D)-crocodile(text) in the first row of Fig. 1. A project page is available at: https://xzr52.github.io/C33D/
Paperid: 207,   https://arxiv.org/pdf/2509.04118     GitHub
Authors:Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun
Title: EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding
Abstract:
Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy to utilize encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with a random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released at: https://github.com/bytedance/NEVC.
Paperid: 208,   https://arxiv.org/pdf/2507.19807     GitHub
Authors:Guiping Cao, Xiangyuan Lan, Wenjian Huang, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Title: DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
Abstract:
Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed queries is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed-query setup into a flexible one. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing "query ambiguity" and "ROT" issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
Paperid: 209,   https://arxiv.org/pdf/2507.11119     GitHub
Authors:Hankun Liu, Yujian Zhao, Guanglin Niu
Title: Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID
Abstract:
Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model's robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model's discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.
Paperid: 210,   https://arxiv.org/pdf/2507.09184     GitHub
Authors:Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
Title: MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models
Abstract:
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance the instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed at https://github.com/ErikZ719/MCA-LLaVA.
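To make the distance contrast concrete, the sketch below compares a 1-D raster-order decay (which favors bottom-right image tokens) with a 2-D Manhattan-distance decay over an image token grid. This is only an illustration of the distance notions; MCA-LLaVA's actual integration with RoPE differs, and the decay rate gamma and grid layout are assumptions.

```python
# Illustrative comparison of 1-D sequence-distance decay vs. 2-D Manhattan
# decay from an instruction token (assumed adjacent to the bottom-right of
# the image grid) to every image token.
import torch

def decay_maps(grid_h: int, grid_w: int, gamma: float = 0.05):
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    # 2-D Manhattan distance to the grid corner nearest the instruction token.
    dist_2d = (grid_h - 1 - ys).abs() + (grid_w - 1 - xs).abs()
    # 1-D distance in raster order, for comparison.
    dist_1d = (grid_h * grid_w - 1) - (ys * grid_w + xs)
    return torch.exp(-gamma * dist_2d.float()), torch.exp(-gamma * dist_1d.float())

if __name__ == "__main__":
    d2, d1 = decay_maps(4, 4)
    print(d2)  # decays evenly in both spatial directions
    print(d1)  # heavily favors bottom-right (late) tokens
```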
Paperid: 211,   https://arxiv.org/pdf/2410.09864     GitHub
Authors:Guoqiang Liang, Qingnan Fan, Bingtao Fu, Jinwei Chen, Hong Gu, Lin Wang
Title: AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior
Abstract:
Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from poor-quality ones, recent research endeavors predominantly rely on facial image priors from the powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial details, thus rendering them less practical for real-world applications. In this paper, we propose a novel framework, namely AuthFace, that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on the dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Following the key criteria of quality-first and photography-guided annotation, we conduct the retouching and reviewing process under the guidance of photographers to obtain high-quality images with rich facial features. The photography-guided annotation system fully explores the potential of these high-quality photographic images. In this way, the potent natural image priors from pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas, such as eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on the synthetic and real-world BFR datasets demonstrate the superiority of our approach.
Paperid: 212,   https://arxiv.org/pdf/2507.11334     GitHub
Authors:Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, Jiajun Lv
Title: CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking
Abstract:
Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.
Paperid: 213,   https://arxiv.org/pdf/2509.11642     GitHub
Authors:Qiyuan Guan, Qianfeng Yang, Xiang Chen, Tianyu Song, Guiyue Jin, Jiyu Jin
Title: WeatherBench: A Real-World Benchmark Dataset for All-in-One Adverse Weather Image Restoration
Abstract:
Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.
Paperid: 214,   https://arxiv.org/pdf/2506.06631     GitHub
Authors:Minghao Zou, Qingtian Zeng, Yongping Miao, Shangkun Liu, Zilong Wang, Hantao Liu, Wei Zhou
Title: PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments
Abstract:
Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task processes. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.
Paperid: 215,   https://arxiv.org/pdf/2507.18932     GitHub
Authors:Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao
Title: MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks
Abstract:
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill the gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset targeted at evaluating multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.
Paperid: 216,   https://arxiv.org/pdf/2507.03990     GitHub
Authors:Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova
Title: LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts
Abstract:
We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated Compressed Video Quality Assessment) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants; 1.8M pairwise and 1.5k MOS ratings are fused into a single quality scale, and part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenge and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/
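The abstract does not spell out RDAE's exact formula, so the snippet below is only a hypothetical illustration of the underlying idea: counting how often a VQA model's scores violate the expected "higher bitrate, no worse quality" ordering within a single codec-preset ladder. The function name and the violation-rate formulation are assumptions, not the paper's definition.

```python
# Hypothetical sketch of bitrate-quality ordering agreement; not the official
# RDAE definition from the paper.
import numpy as np

def ordering_violation_rate(ladders):
    """ladders: list of dicts, each with 'bitrate' and 'score' arrays for one
    codec-preset ladder of the same source video."""
    violations, pairs = 0, 0
    for ladder in ladders:
        order = np.argsort(ladder["bitrate"])
        scores = np.asarray(ladder["score"])[order]
        for i in range(len(scores)):
            for j in range(i + 1, len(scores)):
                pairs += 1
                if scores[j] < scores[i]:   # higher bitrate scored worse
                    violations += 1
    return violations / max(pairs, 1)

if __name__ == "__main__":
    ladders = [{"bitrate": [1, 2, 4, 8], "score": [60, 72, 70, 85]}]
    print(ordering_violation_rate(ladders))  # 1 violating pair out of 6
```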
Paperid: 217,   https://arxiv.org/pdf/2506.20103     GitHub
Authors:Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Title: BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
Abstract:
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.
Paperid: 218,   https://arxiv.org/pdf/2506.12538     GitHub
Authors:Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C. H. Ngai
Title: RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
Abstract:
Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and strike a balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
Paperid: 219,   https://arxiv.org/pdf/2508.06564     GitHub
Authors:Guanyu Hu, Dimitrios Kollias, Xinyu Yang
Title: Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC
Abstract:
Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
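A minimal sketch of the anchoring idea, assuming CLIP image embeddings of facial exemplars have already been extracted: class-level anchors are built from random exemplar subsets (a simplified stand-in for stochastic anchor sampling) and a fused feature is matched to anchors by cosine similarity. Dimensions and the sampling rule are assumptions.

```python
# Simplified sketch of class-level visual anchoring with precomputed
# CLIP image embeddings; not the full VEGA mechanism.
import torch
import torch.nn.functional as F

def build_anchors(exemplar_embs, k: int = 8):
    """exemplar_embs: dict mapping emotion label -> (N, D) image embeddings."""
    anchors = {}
    for label, embs in exemplar_embs.items():
        idx = torch.randperm(embs.size(0))[:k]      # stochastic exemplar subset
        anchors[label] = F.normalize(embs[idx].mean(dim=0), dim=-1)
    return anchors

def classify(fused_feat, anchors):
    fused_feat = F.normalize(fused_feat, dim=-1)
    sims = {label: (fused_feat @ a).item() for label, a in anchors.items()}
    return max(sims, key=sims.get)

if __name__ == "__main__":
    exemplars = {e: torch.randn(32, 512) for e in ["happy", "sad", "angry"]}
    anchors = build_anchors(exemplars)
    print(classify(torch.randn(512), anchors))
```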
Paperid: 220,   https://arxiv.org/pdf/2507.06071     GitHub
Authors:Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang
Title: MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
Abstract:
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.
Paperid: 221,   https://arxiv.org/pdf/2509.11589     GitHub
Authors:Yanyun Pu, Kehan Li, Zeyi Huang, Zhijie Zhong, Kaixiang Yang
Title: MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment
Abstract:
With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig.1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meanwhile, incorporating an explicit reasoning process during VQA training substantially boosts zero-shot generalization. Code and dataset will be available at https://github.com/Controller01-ai/MVQA-68K
Paperid: 222,   https://arxiv.org/pdf/2507.18969     GitHub
Authors:Zeyi Lu, Xiaoxiao Ma, Yujun Huang, Minxiao Chen, Bin Chen, Baoyi An, Shu-Tao Xia
Title: EDPC: Accelerating Lossless Compression via Lightweight Probability Models and Decoupled Parallel Dataflow
Abstract:
The explosive growth of multi-source multimedia data has significantly increased the demands for transmission and storage, placing substantial pressure on bandwidth and storage infrastructures. While Autoregressive Compression Models (ACMs) have markedly improved compression efficiency through probabilistic prediction, current approaches remain constrained by two critical limitations: suboptimal compression ratios due to insufficient fine-grained feature extraction during probability modeling, and real-time processing bottlenecks caused by high resource consumption and low compression speeds. To address these challenges, we propose Efficient Dual-path Parallel Compression (EDPC), a hierarchically optimized compression framework that synergistically enhances modeling capability and execution efficiency via coordinated dual-path operations. At the modeling level, we introduce the Information Flow Refinement (IFR) metric grounded in mutual information theory, and design a Multi-path Byte Refinement Block (MBRB) to strengthen cross-byte dependency modeling via heterogeneous feature propagation. At the system level, we develop a Latent Transformation Engine (LTE) for compact high-dimensional feature representation and a Decoupled Pipeline Compression Architecture (DPCA) to eliminate encoding-decoding latency through pipelined parallelization. Experimental results demonstrate that EDPC achieves comprehensive improvements over state-of-the-art methods, including a 2.7x faster compression speed, and a 3.2% higher compression ratio. These advancements establish EDPC as an efficient solution for real-time processing of large-scale multimedia data in bandwidth-constrained scenarios. Our code is available at https://github.com/Magie0/EDPC.
Paperid: 223,   https://arxiv.org/pdf/2504.17728     GitHub
Authors:Shucheng Gong, Lingzhe Zhao, Wenpu Li, Hong Xie, Yin Zhang, Shiyu Zhao, Peidong Liu
Title: Casual3DHDR: Deblurring High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
Abstract:
Photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), has gained significant attention for its superior performance. However, most existing methods rely on low dynamic range (LDR) images, limiting their ability to capture detailed scenes in high-contrast environments. While some prior works address high dynamic range (HDR) scene reconstruction, they typically require multi-view sharp images with varying exposure times captured at fixed camera positions, which is time-consuming and impractical. To make data acquisition more flexible, we propose Casual3DHDR, a robust one-stage method that reconstructs 3D HDR scenes from casually-captured auto-exposure (AE) videos, even under severe motion blur and unknown, varying exposure times. Our approach integrates a continuous-time camera trajectory into a unified physical imaging model, jointly optimizing exposure times, camera trajectory, and the camera response function (CRF). Extensive experiments on synthetic and real-world datasets demonstrate that Casual3DHDR outperforms existing methods in robustness and rendering quality. Our source code and dataset will be available at https://lingzhezhao.github.io/CasualHDRSplat/
Paperid: 224,   https://arxiv.org/pdf/2411.16619     GitHub
Authors:Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai
Title: Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric
Abstract:
AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 6,000 AGVs derived from 15 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released at https://github.com/zczhang-sjtu/GHVQ.git.
Paperid: 225,   https://arxiv.org/pdf/2508.06800     GitHub
Authors:Rui Liu, Haolin Zuo, Zheng Lian, Hongyu Yuan, Qi Fan
Title: Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities
Abstract:
Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model's ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER.
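The sketch below gives a simplified reading of the two stages described above: a hardness score that combines a direct term (reconstruction error) and an indirect term (here a stand-in for the cross-modal mutual-information score), followed by a sampling schedule that gradually emphasizes harder samples. The weighting and temperature schedule are assumptions, not the paper's exact rules.

```python
# Simplified hardness-aware curriculum sampling sketch.
import numpy as np

def hardness(recon_err, mi_score, alpha: float = 0.5):
    """Direct hardness from reconstruction error, indirect hardness from a
    (stand-in) mutual-information score; both normalized to [0, 1]."""
    direct = (recon_err - recon_err.min()) / (np.ptp(recon_err) + 1e-8)
    indirect = 1.0 - (mi_score - mi_score.min()) / (np.ptp(mi_score) + 1e-8)
    return alpha * direct + (1 - alpha) * indirect

def curriculum_probs(h, epoch: int, total_epochs: int):
    # Early epochs sample almost uniformly; later epochs emphasize hard samples.
    temperature = 1.0 + 4.0 * epoch / max(total_epochs - 1, 1)
    logits = temperature * h
    p = np.exp(logits - logits.max())
    return p / p.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = hardness(rng.random(1000), rng.random(1000))
    probs = curriculum_probs(h, epoch=9, total_epochs=10)
    batch = rng.choice(1000, size=32, replace=False, p=probs)
    print(batch[:8])
```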
Paperid: 226,   https://arxiv.org/pdf/2507.20518     GitHub
Authors:Yili Li, Gang Xiong, Gaopeng Gou, Xiangyan Qu, Jiamin Zhuang, Zhen Li, Junzheng Shi
Title: T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval
Abstract:
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.
Paperid: 227,   https://arxiv.org/pdf/2507.05715     GitHub
Authors:Guohao Li, Li Jing, Jia Wu, Xuefei Li, Kai Zhu, Yue He
Title: From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation
Abstract:
Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recommendation. Therefore, this paper systematically deconstructs the pros and cons of ID features: (i) they provide initial embedding but lack semantic richness, (ii) they provide a unique identifier for each user and item but hinder generalization to untrained data, and (iii) they assist in aligning and fusing multimodal features but may lead to representation shift. Based on these insights, this paper proposes IDFREE, an ID-free multimodal collaborative Filtering REcommEndation baseline. IDFREE replaces ID features with multimodal features and positional encodings to generate semantically meaningful ID-free embeddings. For ID-free multimodal collaborative filtering, it further proposes an adaptive similarity graph module to construct dynamic user-user and item-item graphs based on multimodal features. Then, an augmented user-item graph encoder is proposed to produce more effective user and item encodings. Finally, IDFREE achieves inter-multimodal alignment based on contrastive learning and uses Softmax loss as the recommendation loss. Basic experiments on three public datasets demonstrate that IDFREE outperforms existing ID-based MCFRec methods, achieving an average performance gain of 72.24% across standard metrics (Recall@5, 10, 20, 50 and NDCG@5, 10, 20, 50). Exploratory and extended experiments further validate our findings on the limitations of ID features in MCFRec. The code is released at https://github.com/G-H-Li/IDFREE.
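In the spirit of the adaptive similarity graph module, the sketch below builds a dynamic user-user (or item-item) kNN graph from multimodal feature cosine similarity. The choice of k and the symmetrization step are illustrative assumptions.

```python
# Minimal sketch of a kNN similarity graph over multimodal embeddings.
import torch
import torch.nn.functional as F

def knn_graph(features: torch.Tensor, k: int = 10) -> torch.Tensor:
    """features: (N, D) multimodal embeddings. Returns an (N, N) adjacency
    with each node connected to its top-k cosine neighbors."""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t()
    sim.fill_diagonal_(-1.0)                       # exclude self-loops
    topk = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)
    return ((adj + adj.t()) > 0).float()           # symmetrize

if __name__ == "__main__":
    user_feats = torch.randn(100, 256)
    print(knn_graph(user_feats).sum(dim=1)[:5])    # node degrees (>= k)
```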
Paperid: 228,   https://arxiv.org/pdf/2504.04840     GitHub
Authors:Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Hongliang Li
Title: Unsupervised Ego- and Exo-centric Dense Procedural Activity Captioning via Gaze Consensus Adaptation
Abstract:
Even from an early age, humans naturally adapt between exocentric (Exo) and egocentric (Ego) perspectives to understand daily procedural activities. Inspired by this cognitive ability, we propose a novel Unsupervised Ego-Exo Dense Procedural Activity Captioning (UE^2DPAC) task, which aims to transfer knowledge from the labeled source view to predict the time segments and descriptions of action sequences for the target view without annotations. Despite previous works endeavoring to address the fully-supervised single-view or cross-view dense video captioning, they fall short on the proposed task due to the significant inter-view gap caused by temporal misalignment and irrelevant object interference. Hence, we propose a Gaze Consensus-guided Ego-Exo Adaptation Network (GCEAN) that injects the gaze information into the learned representations for the fine-grained Ego-Exo alignment. Specifically, we propose a Score-based Adversarial Learning Module (SALM) that incorporates a discriminative scoring network and compares the scores of distinct views to learn unified view-invariant representations from a global level. Then, the Gaze Consensus Construction Module (GCCM) utilizes the gaze to progressively calibrate the learned representations to highlight the regions of interest and extract the corresponding temporal contexts. Moreover, we adopt hierarchical gaze-guided consistency losses to construct gaze consensus for the explicit temporal and spatial adaptation between the source and target views. To support our research, we propose a new EgoMe-UE^2DPAC benchmark, and extensive experiments demonstrate the effectiveness of our method, which outperforms many related methods by a large margin. Code is available at https://github.com/ZhaofengSHI/GCEAN.
Paperid: 229,   https://arxiv.org/pdf/2504.09885     GitHub
Authors:Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Abstract:
Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics. Our project is available at https://monkek123king.github.io/S2C_page/.
Paperid: 230,   https://arxiv.org/pdf/2508.03201     GitHub
Authors:Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
Title: AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Abstract:
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
Paperid: 231,   https://arxiv.org/pdf/2507.09334     GitHub
Authors:Wencan Huang, Daizong Liu, Wei Hu
Title: Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding
Abstract:
While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D
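The following is a schematic sketch of attention-guided token pruning in the spirit of GAP and SAP: a lightweight scorer rates object tokens, an input-dependent budget is derived from how concentrated the scores are, and only the top-scoring tokens are kept. The scorer architecture and budget rule are placeholders, not Fast3D's trained components.

```python
# Schematic token-pruning sketch with a placeholder importance predictor.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (B, N, D) -> (B, N)
        return self.mlp(tokens).squeeze(-1)

def prune_tokens(tokens: torch.Tensor, scorer: TokenScorer, min_keep: float = 0.3) -> torch.Tensor:
    scores = scorer(tokens)                                    # predicted importance
    # Sample-adaptive budget: more concentrated scores -> fewer tokens kept.
    entropy = -(scores.softmax(-1) * scores.log_softmax(-1)).sum(-1)          # (B,)
    max_entropy = torch.log(torch.tensor(float(tokens.size(1))))
    keep_ratio = min_keep + (1 - min_keep) * (entropy / max_entropy)
    k = int(keep_ratio.mean().clamp(min_keep, 1.0).item() * tokens.size(1))
    idx = scores.topk(k, dim=-1).indices
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

if __name__ == "__main__":
    toks = torch.randn(2, 256, 1024)                           # object-centric visual tokens
    print(prune_tokens(toks, TokenScorer(1024)).shape)
```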
Paperid: 232,   https://arxiv.org/pdf/2507.19201     GitHub
Authors:Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang
Title: Joint Holistic and Lesion Controllable Mammogram Synthesis via Gated Conditional Diffusion Model
Abstract:
Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/
Paperid: 233,   https://arxiv.org/pdf/2507.07395     GitHub
Authors:Yongtang Bao, Chengjie Tang, Yuze Wang, Haojie Li
Title: Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections
Abstract:
Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
Paperid: 234,   https://arxiv.org/pdf/2507.15765     GitHub
Authors:Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang
Title: Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization
Abstract:
Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
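As one simple reading of "gradient-sensitivity-based" loss balancing, the sketch below scales the contrastive term so that both losses contribute gradients of comparable magnitude on shared parameters. The actual DSM rule may differ; the helper name and the stand-in contrastive loss are assumptions.

```python
# Rough sketch of balancing two losses by their gradient norms on shared parameters.
import torch
import torch.nn as nn

def balanced_loss(cls_loss, con_loss, shared_params):
    g_cls = torch.autograd.grad(cls_loss, shared_params, retain_graph=True)
    g_con = torch.autograd.grad(con_loss, shared_params, retain_graph=True)
    n_cls = torch.sqrt(sum((g ** 2).sum() for g in g_cls))
    n_con = torch.sqrt(sum((g ** 2).sum() for g in g_con))
    # Scale the contrastive term so both losses push with comparable force.
    w = (n_cls / (n_con + 1e-8)).detach()
    return cls_loss + w * con_loss

if __name__ == "__main__":
    model = nn.Linear(16, 4)
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    logits = model(x)
    cls = nn.functional.cross_entropy(logits, y)
    con = logits.pow(2).mean()                     # stand-in contrastive loss
    loss = balanced_loss(cls, con, list(model.parameters()))
    loss.backward()
    print(loss.item())
```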
Paperid: 235,   https://arxiv.org/pdf/2505.06152     GitHub
Authors:Wenqi Zeng, Yuqi Sun, Chenxi Ma, Weimin Tan, Bo Yan
Title: MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks
Abstract:
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, a specialized dermatology VLM capable of delivering professional and detailed diagnostic analysis remains underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of the current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT) and zero-shot classification tasks across 8 datasets reveal its exceptional performance for skin diseases in comparison to both general and medical VLMs. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin
Paperid: 236,   https://arxiv.org/pdf/2504.09540     GitHub
Authors:Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Title: EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
Abstract:
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.
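A small sketch of the plane-regularization idea: the component of each Gaussian's position update along its estimated surface normal is suppressed, with a strength that decays in high-curvature regions so complex geometry is left free to adapt. The exponential weighting rule is an assumption for illustration, not the paper's exact formula.

```python
# Sketch of plane-regularized Gaussian position updates.
import torch
import torch.nn.functional as F

def plane_regularized_update(delta, normals, curvature, beta: float = 5.0):
    """delta: (N, 3) raw position updates; normals: (N, 3); curvature: (N,)."""
    n = F.normalize(normals, dim=-1)
    along_normal = (delta * n).sum(-1, keepdim=True) * n     # normal component
    in_plane = delta - along_normal
    w = torch.exp(-beta * curvature).unsqueeze(-1)           # flat regions -> strong constraint
    return w * in_plane + (1 - w) * delta

if __name__ == "__main__":
    delta = torch.randn(4, 3)
    normals = torch.tensor([[0., 0., 1.]]).repeat(4, 1)
    curvature = torch.tensor([0.0, 0.1, 1.0, 5.0])
    print(plane_regularized_update(delta, normals, curvature))
```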
Paperid: 237,   https://arxiv.org/pdf/2508.20758     GitHub
Authors:Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
Title: SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Abstract:
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatial-limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantic-relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details during the conversion from 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
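A basic sketch of the projection step underlying proposal-guided multi-view selection: a proposal's 3D points are projected into each calibrated view with a pinhole model, and views where enough points land inside the image can be kept. Depth tests and occlusion handling are omitted, and the coverage criterion is an assumption.

```python
# Pinhole projection of proposal points into a calibrated view.
import numpy as np

def project_points(points_w, K, T_cw):
    """points_w: (N, 3) world coords; K: (3, 3) intrinsics; T_cw: (4, 4) world-to-camera."""
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_cw @ pts_h.T).T[:, :3]
    valid = pts_c[:, 2] > 1e-3                     # in front of the camera
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    return uv, valid

def view_coverage(points_w, K, T_cw, img_w, img_h):
    uv, valid = project_points(points_w, K, T_cw)
    inside = valid & (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return inside.mean()                           # fraction of visible proposal points

if __name__ == "__main__":
    pts = np.random.rand(200, 3) * 2 + np.array([0, 0, 3])
    K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
    print(view_coverage(pts, K, np.eye(4), 640, 480))
```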
Paperid: 238,   https://arxiv.org/pdf/2511.22033     GitHub
Authors:Chunzheng Zhu, Yangfang Lin, Jialin Shao, Jianxin Lin, Yijun Wang
Title: Pathology-Aware Prototype Evolution via LLM-Driven Semantic Disambiguation for Multicenter Diabetic Retinopathy Diagnosis
Abstract:
Diabetic retinopathy (DR) grading plays a critical role in early clinical intervention and vision preservation. Recent explorations predominantly focus on visual lesion feature extraction through data processing and domain decoupling strategies. However, they generally overlook domain-invariant pathological patterns and underutilize the rich contextual knowledge of foundation models, relying solely on visual information, which is insufficient for distinguishing subtle pathological variations. Therefore, we propose integrating fine-grained pathological descriptions to complement prototypes with additional context, thereby resolving ambiguities in borderline cases. Specifically, we propose a Hierarchical Anchor Prototype Modulation (HAPM) framework to facilitate DR grading. First, we introduce a variance spectrum-driven anchor prototype library that preserves domain-invariant pathological patterns. We further employ a hierarchical differential prompt gating mechanism, dynamically selecting discriminative semantic prompts from both LVLM and LLM sources to address semantic confusion between adjacent DR grades. Finally, we utilize a two-stage prototype modulation strategy that progressively integrates clinical knowledge into visual prototypes through a Pathological Semantic Injector (PSI) and a Discriminative Prototype Enhancer (DPE). Extensive experiments across eight public datasets demonstrate that our approach achieves pathology-guided prototype evolution while outperforming state-of-the-art methods. The code is available at https://github.com/zhcz328/HAPM.
Paperid: 239,   https://arxiv.org/pdf/2510.18409     GitHub
Authors:Yuheng Wu, Thanh-Tung Nguyen, Lucas Liebe, Quang Tau, Pablo Espinosa Campos, Jinghan Cheng, Dongman Lee
Title: How2Compress: Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
Abstract:
With the rapid proliferation of the Internet of Things, video analytics has become a cornerstone application in wireless multimedia sensor networks. To support such applications under bandwidth constraints, learning-based adaptive quantization for video compression has demonstrated strong potential in reducing bitrate while maintaining analytical accuracy. However, existing frameworks often fail to fully exploit the fine-grained quality control enabled by modern block-based video codecs, leaving significant compression efficiency untapped. In this paper, we present How2Compress, a simple yet effective framework designed to enhance video compression efficiency through precise, fine-grained quality control at the macroblock level. How2Compress is a plug-and-play module and can be seamlessly integrated into any existing edge video analytics pipeline. We implement How2Compress on the H.264 codec and evaluate its performance across diverse real-world scenarios. Experimental results show that How2Compress achieves up to 50.4% bitrate savings and outperforms baselines by up to 3.01× without compromising accuracy, demonstrating its practical effectiveness and efficiency. Code is available at https://github.com/wyhallenwu/how2compress and a reproducible docker image at https://hub.docker.com/r/wuyuheng/how2compress.
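The macroblock-level quality control that How2Compress builds on can be illustrated with a toy sketch: a per-pixel importance map (e.g., detector regions of interest) is reduced to one QP value per 16x16 macroblock, with important blocks encoded at higher quality. The interface, QP bounds, and linear mapping below are assumptions for illustration, not the paper's implementation.

# Toy illustration of per-macroblock quality control (assumed interface, not How2Compress itself)
import numpy as np

def importance_to_qp_map(importance, mb_size=16, qp_min=22, qp_max=42):
    """Map a per-pixel importance map in [0, 1] to one QP per macroblock.

    High-importance macroblocks get a low QP (high quality); background gets a high QP.
    """
    h, w = importance.shape
    n_rows, n_cols = h // mb_size, w // mb_size
    qp_map = np.empty((n_rows, n_cols), dtype=np.int32)
    for r in range(n_rows):
        for c in range(n_cols):
            block = importance[r * mb_size:(r + 1) * mb_size,
                               c * mb_size:(c + 1) * mb_size]
            qp_map[r, c] = round(qp_max - block.mean() * (qp_max - qp_min))
    return qp_map  # would be handed to an encoder that accepts per-macroblock QP values

# Example: a 720x1280 frame where a detected object occupies the top-left region.
importance = np.zeros((720, 1280))
importance[:160, :320] = 1.0
print(importance_to_qp_map(importance).shape)  # (45, 80) macroblock grid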
Paperid: 240,   https://arxiv.org/pdf/2408.05477     GitHub
Authors:Yiying Yang, Fukun Yin, Jiayuan Fan, Xin Chen, Wanzhang Li, Gang Yu
Title: Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE
Abstract:
As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing to efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input presents a challenge due to the complexities involved in ensuring consistency across extrapolated views generated by models. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we initially warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, thus we utilize the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at https://yiyingyang12.github.io/Scene123.github.io/.
Paperid: 241,   https://arxiv.org/pdf/2509.12701     GitHub
Authors:Wenzhuo Jin, Qianfeng Yang, Xianhao Wu, Hongming Chen, Pengpeng Li, Xiang Chen
Title: SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes
Abstract:
Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scene setups and smoke concentrations. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.
Paperid: 242,   https://arxiv.org/pdf/2508.05016     GitHub
Authors:Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu
Title: AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
Abstract:
AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant constraint in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. AU-IQA is available at https://github.com/WNNGGU/AU-IQA-Dataset.
Paperid: 243,   https://arxiv.org/pdf/2507.02316     GitHub
Authors:Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
Title: Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
Abstract:
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways for scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. The project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
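The correlation analysis mentioned above (VQA metrics versus the four alignment dimensions) can be sketched in a few lines; the column names and the random stand-in data below are assumptions purely for illustration.

# Minimal sketch of correlating a VQA metric with semantic-alignment annotations (illustrative data layout)
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vqa_metric = rng.random(200)                 # stand-in for a per-video VQA score
alignment = {                                # stand-in annotated alignment scores (1-5) per dimension
    "object_scene": rng.integers(1, 6, 200),
    "action": rng.integers(1, 6, 200),
    "attribute": rng.integers(1, 6, 200),
    "prompt_fidelity": rng.integers(1, 6, 200),
}

for dim, scores in alignment.items():
    rho, p = spearmanr(vqa_metric, scores)   # rank correlation: does the VQA metric predict alignment?
    print(f"{dim:>15s}: Spearman rho = {rho:+.3f} (p = {p:.3g})")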
Paperid: 244,   https://arxiv.org/pdf/2504.10331     GitHub
Authors:Hao Sun, Fenggen Yu, Huiyao Xu, Tao Zhang, Changqing Zou
Title: LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis
Abstract:
Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure sequences--which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from a learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constraints and a diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.
Paperid: 245,   https://arxiv.org/pdf/2509.14268     GitHub
Authors:Jiachen Fu, Chun-Le Guo, Chongyi Li
Title: DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models
Abstract:
The rapid advancement of large language models (LLMs) has drawn urgent attention to the task of machine-generated text detection (MGTD). However, existing approaches struggle in complex real-world scenarios: zero-shot detectors rely heavily on the scoring model's output distribution, while training-based detectors are often constrained by overfitting to the training data, limiting generalization. We find that the performance bottleneck of training-based detectors stems from the misalignment between the training objective and task needs. To address this, we propose Direct Discrepancy Learning (DDL), a novel optimization strategy that directly optimizes the detector with task-oriented knowledge. DDL enables the detector to better capture the core semantics of the detection task, thereby enhancing both robustness and generalization. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance across diverse LLMs. To ensure a reliable evaluation, we construct MIRAGE, the most diverse multi-task MGTD benchmark. MIRAGE samples human-written texts from 10 corpora across 5 text domains, which are then re-generated or revised using 17 cutting-edge LLMs, covering a wide spectrum of proprietary models and textual styles. Extensive experiments on MIRAGE reveal the limitations of existing methods in complex environments. In contrast, DetectAnyLLM consistently outperforms them, achieving over a 70% performance improvement under the same training data and base scoring model, underscoring the effectiveness of our DDL. Project page: https://fjc2005.github.io/detectanyllm.
Paperid: 246,   https://arxiv.org/pdf/2508.16291     GitHub
Authors:Fengshun Wang, Qiurui Wang, Peilin Zhao
Title: Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Abstract:
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, previous works treat video and audio cues as common features for both TES and PCS prediction, without considering the prior evaluation criteria of figure skating. Secondly, action elements in competitions are separated in time; TES should be derived from each element's score, yet existing methods attempt an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS, separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and to enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, our proposed multi-scale Mamba pyramid and TES head effectively address the challenges of localizing and evaluating action elements at various temporal scales and produce score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is well suited to handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
Paperid: 247,   https://arxiv.org/pdf/2512.23483     GitHub
Authors:Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, Jun Xie
Title: TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
Abstract:
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
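A minimal sketch of the two mechanisms named in the abstract, under assumed forms: an exponential time-decay term down-weights segments far from a query's temporal anchor, and grayscale entropy is used to keep information-dense key frames among evenly spaced candidates. The decay function, parameters, and frame representation are illustrative assumptions, not TV-RAG's actual formulation.

# Illustrative time-decayed retrieval and entropy-weighted key-frame sampling (assumed forms)
import numpy as np

def time_decayed_scores(query_emb, seg_embs, seg_times, query_time, decay=0.05):
    """Cosine similarity modulated by an exponential temporal decay."""
    q = query_emb / np.linalg.norm(query_emb)
    s = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    cos = s @ q
    return cos * np.exp(-decay * np.abs(seg_times - query_time))         # distant segments are penalized

def entropy_weighted_keyframes(frames, n_keep=8, bins=32):
    """Pick evenly spaced candidates, then keep those with the highest grayscale entropy."""
    candidates = np.linspace(0, len(frames) - 1, n_keep * 2, dtype=int)  # uniform temporal coverage
    scored = []
    for idx in candidates:
        hist, _ = np.histogram(frames[idx], bins=bins, range=(0, 255))
        p = hist / hist.sum()
        p = p[p > 0]
        scored.append((-np.sum(p * np.log(p)), idx))                     # Shannon entropy of pixel histogram
    top = sorted(scored, reverse=True)[:n_keep]
    return sorted(idx for _, idx in top)                                 # kept frame indices in temporal order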
Paperid: 248,   https://arxiv.org/pdf/2507.20574     GitHub
Authors:Yanyin Guo, Runxuan An, Junwei Li, Zhiyuan Zhang
Title: LSFDNet: A Single-Stage Fusion and Detection Network for Ships Using SWIR and LWIR
Abstract:
Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and boost the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at https://github.com/Yanyin-Guo/LSFDNet.
Paperid: 249,   https://arxiv.org/pdf/2509.11264     GitHub
Authors:Kerun Mi, Guoliang Kang, Guangyu Li, Lin Zhao, Tao Zhou, Chen Gong
Title: Cross-Domain Attribute Alignment with CLIP: A Rehearsal-Free Approach for Class-Incremental Unsupervised Domain Adaptation
Abstract:
Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes while continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target samples from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, memory consumption continuously increases and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, using CLIP, we extract class-agnostic properties, which we name "attributes". In our framework, we learn a "key-value" pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift by encouraging visual attention consistency and prediction consistency. Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.
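A minimal sketch of the "key-value" attribute idea described above, where each attribute's key is a visual prototype and its value a textual-prompt embedding, with one dictionary per domain. The class layout, EMA update rule, and feature dimension are assumptions for illustration rather than the authors' implementation.

# Hypothetical attribute dictionary with visual-prototype keys and text-prompt values (not the VisTA code)
import numpy as np

class AttributeDictionary:
    def __init__(self, num_attributes=16, dim=512, momentum=0.9):
        self.keys = np.random.randn(num_attributes, dim).astype(np.float32)    # visual prototypes (keys)
        self.values = np.random.randn(num_attributes, dim).astype(np.float32)  # text-prompt embeddings (values)
        self.momentum = momentum

    def match(self, visual_feat):
        """Return the index of the closest visual prototype by cosine similarity."""
        k = self.keys / np.linalg.norm(self.keys, axis=1, keepdims=True)
        v = visual_feat / np.linalg.norm(visual_feat)
        return int(np.argmax(k @ v))

    def update_key(self, idx, visual_feat):
        """EMA update so prototypes slowly track the current domain's visual statistics."""
        self.keys[idx] = self.momentum * self.keys[idx] + (1 - self.momentum) * visual_feat

# One dictionary per domain; cross-domain alignment would compare matched attributes across the two.
source_dict, target_dict = AttributeDictionary(), AttributeDictionary()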
Paperid: 250,   https://arxiv.org/pdf/2509.04844     GitHub
Authors:Xinkui Lin, Yongxiu Xu, Minghao Tang, Shilong Zhang, Hongbo Xu, Hao Xu, Yubin Wang
Title: REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts
Abstract:
Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce a mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performance on almost all metrics across two other public MRE datasets. We release our resources at https://github.com/Nikol-coder/REMOTE.
Paperid: 251,   https://arxiv.org/pdf/2507.19253     GitHub
Authors:An Xiang, Zixuan Huang, Xitong Gao, Kejiang Ye, Cheng-zhong Xu
Title: BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection
Abstract:
Industrial anomaly detection (AD) for 2D objects has gained significant attention, and 2D AD methods have achieved notable progress. However, identifying 3D depth anomalies using only 2D information is insufficient. Whether depth information is explicitly fused into RGB images or extracted with point cloud backbone networks, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of three key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) methods on the MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet
Paperid: 252,   https://arxiv.org/pdf/2508.05507     GitHub
Authors:Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang
Title: Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events
Abstract:
The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the physical sampling process of event cameras, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, preserving the representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
Paperid: 253,   https://arxiv.org/pdf/2508.04723     GitHub
Authors:Sha Zhao, Song Yi, Yangxuan Zhou, Jiadong Pan, Jiquan Wang, Jie Xia, Shijian Li, Shurong Dong, Gang Pan
Title: Wearable Music2Emotion : Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion
Abstract:
Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music's accessibility for emotion induction, three key limitations persist: (1) Stimulus Constraints: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. (2) Modality Specificity: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion. (3) Portability Limitation: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. With MEEtBrain, music stimuli can be automatically generated by AI at large scale, eliminating subjective selection biases while ensuring music diversity. Our portable device, designed as a lightweight headband with dry electrodes, simultaneously collects EEG and fNIRS recordings. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework's efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest version) and have made it publicly available to promote further research and practical applications. The dataset is available at https://zju-bmi-lab.github.io/ZBra.
Paperid: 254,   https://arxiv.org/pdf/2508.16448     GitHub
Authors:Lianchen Jia, Chaoyang Li, Ziqi Yuan, Jiahui Chen, Tianchi Huang, Jiangchuan Liu, Lifeng Sun
Title: Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models
Abstract:
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce ComTree, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that ComTree significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.
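The selection loop described above (enumerate performance-satisfying decision trees, have an LLM rate their comprehensibility, keep the best one) can be outlined as follows; the LLM scorer is a stand-in placeholder, since the abstract does not specify the actual prompting or evaluation protocol.

# Schematic of comprehensibility-aware tree selection (the scorer below is a placeholder assumption)
from dataclasses import dataclass

@dataclass
class CandidateTree:
    rules: str          # textual rendering of the decision tree
    qoe: float          # bitrate-adaptation performance, e.g., mean QoE over network traces

def llm_comprehensibility_score(tree_text: str) -> float:
    """Placeholder for an LLM call that rates how easy the rules are for a developer to follow (0-1)."""
    # A real system would prompt an LLM; this stand-in simply favors shorter rule sets.
    return 1.0 / (1.0 + len(tree_text.splitlines()))

def select_tree(candidates, qoe_floor):
    """Among trees meeting the performance floor, pick the most comprehensible one."""
    eligible = [t for t in candidates if t.qoe >= qoe_floor]
    return max(eligible, key=lambda t: llm_comprehensibility_score(t.rules)) if eligible else None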
Paperid: 255,   https://arxiv.org/pdf/2508.10655     GitHub
Authors:Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou, Xiaojun Wu, Josef Kittler
Title: Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking
Abstract:
Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix data from all sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. (2) The unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, dedicated analyses show that the performance degradation is negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.
Paperid: 256,   https://arxiv.org/pdf/2507.08340     GitHub
Authors:Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Title: Single Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement
Abstract:
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
Paperid: 257,   https://arxiv.org/pdf/2507.12062     GitHub
Authors:Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding
Title: MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
Abstract:
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe an inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions via generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a clear margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
Paperid: 258,   https://arxiv.org/pdf/2507.22477     GitHub
Authors:Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mengfei Shi, Xia Xie, Shengyong Chen
Title: LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks
Abstract:
Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
Paperid: 259,   https://arxiv.org/pdf/2508.11058     GitHub
Authors:Wentao Mo, Qingchao Chen, Yuxin Peng, Siyuan Huang, Yang Liu
Title: Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
Abstract:
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.
Paperid: 260,   https://arxiv.org/pdf/2510.25332     GitHub
Authors:Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
Title: StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
Abstract:
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT dataset and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
Paperid: 261,   https://arxiv.org/pdf/2509.10312     GitHub
Authors:Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Abstract:
Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet they still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching feature computations from previous timesteps and reusing them in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates its information to all the other tokens, which reduces the number of computed tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without any training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
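A toy sketch of the spatial-clustering idea described above: tokens are clustered with plain k-means, one representative token per cluster is recomputed, and its update is shared with the remaining tokens of that cluster. The propagation rule (adding the representative's delta) and all hyperparameters are illustrative assumptions, not the official ClusCa implementation.

# Toy sketch of cluster-driven token reuse (assumed propagation rule, not the official ClusCa code)
import torch

def cluster_tokens(tokens, n_clusters=16, iters=10):
    """Plain k-means over token features; returns a cluster assignment per token."""
    n, _ = tokens.shape
    centers = tokens[torch.randperm(n)[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centers[c] = tokens[mask].mean(dim=0)
    return assign

def propagate_cluster_update(cached_tokens, assign, compute_fn):
    """Recompute one representative token per cluster and add its delta to the cluster's members."""
    updated = cached_tokens.clone()
    for c in assign.unique():
        members = (assign == c).nonzero(as_tuple=True)[0]
        rep = members[0]                                   # representative token of this cluster
        new_rep = compute_fn(cached_tokens[rep:rep + 1])   # full computation for a single token only
        delta = new_rep.squeeze(0) - cached_tokens[rep]
        updated[members] = cached_tokens[members] + delta  # share the representative's update
    return updated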
Paperid: 262,   https://arxiv.org/pdf/2510.15752     GitHub
Authors:Yitong Sun, Yao Huang, Ruochen Zhang, Huanran Chen, Shouwei Ruan, Ranjie Duan, Xingxing Wei
Title: NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
Abstract:
Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which can detect and mitigate implicit malicious intent in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that identifies malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual content mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.
Paperid: 263,   https://arxiv.org/pdf/2502.00425     GitHub
Authors:JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan
Title: MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
Abstract:
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; and Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. MQuant effectively bridges the gap toward efficient and accurate MLLM inference on resource-constrained devices. Code has been released at https://github.com/StiphyJay/MQuant.
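The Modality-Specific Static Quantization idea can be illustrated numerically: a separate static scale is calibrated for visual and textual activations and applied without per-token dynamic scaling at inference. The symmetric int8 scheme, percentile calibration, and random stand-in data below are assumptions for illustration, not MQuant's actual recipe.

# Illustrative modality-specific static quantization (assumed int8 symmetric scheme, not MQuant itself)
import numpy as np

def calibrate_scale(calib_acts, percentile=99.9):
    """One static scale per modality, computed from a calibration set of activations."""
    return np.percentile(np.abs(calib_acts), percentile) / 127.0

def quantize(acts, scale):
    return np.clip(np.round(acts / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Visual and textual tokens typically have different activation ranges, hence distinct static scales.
visual_calib = np.random.randn(4096, 768) * 6.0      # stand-in for wide-range visual-token activations
text_calib = np.random.randn(4096, 768) * 1.5        # stand-in for narrower text-token activations
s_vis, s_txt = calibrate_scale(visual_calib), calibrate_scale(text_calib)

x_vis = np.random.randn(16, 768) * 6.0
err = np.abs(dequantize(quantize(x_vis, s_vis), s_vis) - x_vis).mean()
print(f"visual scale={s_vis:.4f}, text scale={s_txt:.4f}, mean abs quantization error={err:.4f}")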
Paperid: 264,   https://arxiv.org/pdf/2508.17857     GitHub
Authors:Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji
Title: VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
Abstract:
In this study, we introduce a novel method called group-wise VIsual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method preserves more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
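A minimal sketch of the graph-based aggregation step: a cosine-similarity graph links each removed token to its most similar kept tokens, and the removed token's feature is folded into those kept tokens with softmax edge weights. The top-k edge construction and normalization are illustrative assumptions, not the paper's exact formulation.

# Illustrative graph-based visual token aggregation (assumed weighting scheme, not the official VISA code)
import torch
import torch.nn.functional as F

def aggregate_removed_into_kept(kept, removed, top_k=4):
    """Fold removed-token information into kept tokens via a similarity-weighted graph.

    kept:    (K, D) visual tokens that survive selection
    removed: (R, D) visual tokens scheduled for pruning
    """
    sim = F.normalize(removed, dim=-1) @ F.normalize(kept, dim=-1).T   # (R, K) cosine similarities
    topk_sim, topk_idx = sim.topk(top_k, dim=-1)                       # each removed token keeps k edges
    weights = topk_sim.softmax(dim=-1)                                 # (R, k) edge weights

    updated = kept.clone()
    counts = torch.ones(kept.size(0), device=kept.device)              # each kept token counts itself once
    for r in range(removed.size(0)):
        for j in range(top_k):
            k_idx = topk_idx[r, j]
            updated[k_idx] += weights[r, j] * removed[r]               # accumulate weighted information
            counts[k_idx] += weights[r, j]
    return updated / counts.unsqueeze(-1)                              # normalize to keep the feature scale

kept, removed = torch.randn(64, 1024), torch.randn(512, 1024)
print(aggregate_removed_into_kept(kept, removed).shape)                # torch.Size([64, 1024])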
Paperid: 265,   https://arxiv.org/pdf/2507.07939     GitHub
Authors:Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan
Title: SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Abstract:
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
Paperid: 266,   https://arxiv.org/pdf/2506.02380     GitHub
Authors:Zihao Ding, Cheng-Tse Lee, Mufeng Zhu, Tao Guan, Yuan-Chun Sun, Cheng-Hsin Hsu, Yao Liu
Title: EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR
Abstract:
3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites using Meta Quest Pro headsets, recording head pose and eye gaze data for each rendered frame during free, standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/.
Paperid: 267,   https://arxiv.org/pdf/2508.09404     GitHub
Authors:Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
Title: Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
Abstract:
Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo
Paperid: 268,   https://arxiv.org/pdf/2508.03497     GitHub
Authors:Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu
Title: EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation
Abstract:
Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.
Paperid: 269,   https://arxiv.org/pdf/2509.20961     GitHub
Authors:Sarmistha Das, R E Zera Marveen Lyngkhoi, Sriparna Saha, Alka Maurya
Title: Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Abstract:
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
Paperid: 270,   https://arxiv.org/pdf/2508.00421     GitHub
Authors:Runmin Cong, Zongji Yu, Hao Fang, Haoyan Sun, Sam Kwong
Title: UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken
Abstract:
Underwater Instance Segmentation (UIS) is crucial for detection in complex underwater scenes. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, the particular characteristics of underwater scenes pose many challenges when applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model, UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation onto the instances themselves through an Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both the UIIS and USIS10K datasets, while maintaining a low number of parameters and low computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.
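As an illustration of tree-guided scanning, the sketch below builds a minimum spanning tree over patch-feature affinities and traverses it to obtain a continuity-preserving scan order; the dynamic patch offsetting and scaling of the actual DTS module are not reproduced, and all names are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, depth_first_order

def tree_scan_order(patch_feats, h, w):
    """patch_feats: (h*w, c) array of patch features laid out on an h-by-w grid.
    Returns a scan order that follows a minimum spanning tree of the grid."""
    n = h * w
    rows, cols, weights = [], [], []
    for i in range(n):
        r, c = divmod(i, w)
        for dr, dc in ((0, 1), (1, 0)):              # 4-neighbour grid edges
            rr, cc = r + dr, c + dc
            if rr < h and cc < w:
                j = rr * w + cc
                # Edge weight = feature distance, so the MST prefers
                # connecting visually similar (continuous) patches.
                wgt = np.linalg.norm(patch_feats[i] - patch_feats[j]) + 1e-8
                rows.append(i); cols.append(j); weights.append(wgt)
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))
    mst = minimum_spanning_tree(graph)
    # Traverse the tree from patch 0 to obtain a continuity-preserving order.
    order, _ = depth_first_order(mst + mst.T, i_start=0, directed=False)
    return order
```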
Paperid: 271,   https://arxiv.org/pdf/2505.13419     GitHub
Authors:Zhuozhao Hu, Kaishen Yuan, Xin Liu, Zitong Yu, Yuan Zong, Jingang Shi, Huanjing Yue, Jingyu Yang
Title: FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning
Abstract:
Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at https://github.com/953206211/FEALLM.
Paperid: 272,   https://arxiv.org/pdf/2509.11884     GitHub
Authors:Zhenni Yu, Li Zhao, Guobao Xiao, Xiaoqin Zhang
Title: SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection
Abstract:
This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a training-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.
Paperid: 273,   https://arxiv.org/pdf/2508.07766     GitHub
Authors:Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Title: UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
Abstract:
Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and, represented as SVG code, are frequently employed in computer vision and artistic design. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) has demonstrated their capability to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.
Paperid: 274,   https://arxiv.org/pdf/2412.17632     GitHub
Authors:Renyang Liu, Ziyu Lyu, Wei Zhou, See-Kiong Ng
Title: D-Judge: How Far Are We? Assessing the Discrepancies Between AI-synthesized and Natural Images through Multimodal Guidance
Abstract:
In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), a central challenge is distinguishing AI-synthesized images from natural ones. Despite the impressive capabilities of advanced generative models in producing visually compelling images, significant discrepancies remain when compared to natural images. To systematically investigate and quantify these differences, we construct a large-scale multimodal dataset, D-ANI, comprising 5,000 natural images and over 440,000 AIGI samples generated by nine representative models using both unimodal and multimodal prompts, including Text-to-Image (T2I), Image-to-Image (I2I), and Text-and-Image-to-Image (TI2I). We then introduce an AI-Natural Image Discrepancy assessment benchmark (D-Judge) to address the critical question: how far are AI-generated images (AIGIs) from truly realistic images? Our fine-grained evaluation framework assesses the D-ANI dataset across five dimensions: naive visual quality, semantic alignment, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive experiments reveal substantial discrepancies across these dimensions, highlighting the importance of aligning quantitative metrics with human judgment to achieve a comprehensive understanding of AI-generated image quality. Code: https://github.com/ryliu68/DJudge ; Data: https://huggingface.co/datasets/Renyang/DANI.
Paperid: 275,   https://arxiv.org/pdf/2405.08427     GitHub
Authors:Yuanchen Shi, Biao Ma, Longyin Zhang, Fang Kong
Title: Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline
Abstract:
Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other's recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code are available on https://github.com/FakerBoom/MSAIRS-Dataset.
Paperid: 276,   https://arxiv.org/pdf/2509.01533    
Authors:Jiao Chen, Jiayi He, Fangfang Chen, Zuohong Lv, Jianhua Tang
Abstract:
Catastrophic forgetting remains a central challenge in continual learning (CL) with pre-trained models. While existing approaches typically freeze the backbone and fine-tune a small number of parameters to mitigate forgetting, they still rely on iterative error backpropagation and gradient-based optimization, which can be computationally intensive and less suitable for resource-constrained environments. To address this, we propose FoRo, a forward-only, gradient-free continual learning method. FoRo consists of a lightweight prompt tuning strategy and a novel knowledge encoding mechanism, both designed without modifying the pre-trained model. Specifically, prompt embeddings are inserted at the input layer and optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which mitigates distribution shifts and extracts high-quality task representations. Subsequently, task-specific knowledge is encoded into a knowledge encoding matrix via nonlinear random projection and recursive least squares, enabling incremental updates to the classifier without revisiting prior data. Experiments show that FoRo significantly reduces average forgetting and improves accuracy. Thanks to forward-only learning, FoRo reduces memory usage and run time while maintaining high knowledge retention across long task sequences. These results suggest that FoRo could serve as a promising direction for exploring continual learning with pre-trained models, especially in real-world multimedia applications where both efficiency and effectiveness are critical.
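A minimal sketch of forward-only knowledge encoding via a nonlinear random projection followed by recursive least squares, in the spirit described above; the exact formulation, dimensions, and the CMA-ES prompt-tuning stage are not reproduced, and all names are illustrative:

```python
import numpy as np

class RLSHead:
    def __init__(self, feat_dim, num_classes, proj_dim=2048, lam=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_normal((feat_dim, proj_dim)) / np.sqrt(feat_dim)
        self.W = np.zeros((proj_dim, num_classes))   # classifier weights
        self.P = np.eye(proj_dim) / lam              # inverse correlation matrix

    def _project(self, f):
        return np.tanh(f @ self.R)                   # nonlinear random projection

    def update(self, f, y_onehot):
        """One gradient-free update from a single (feature, one-hot label) pair."""
        x = self._project(f)[:, None]                # (proj_dim, 1)
        Px = self.P @ x
        g = Px / (1.0 + x.T @ Px)                    # RLS gain
        self.W += g @ (y_onehot[None, :] - x.T @ self.W)
        self.P -= g @ Px.T                           # Sherman-Morrison downdate

    def predict(self, f):
        return np.argmax(self._project(f) @ self.W, axis=-1)
```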
Paperid: 277,   https://arxiv.org/pdf/2507.07796    
Authors:Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, Min Xu
Abstract:
Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two conceptual corner cases in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the number of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
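An illustrative sketch of one way to derive instance-aware prompts with PCA and fuse them with dataset-level prompts, using torch.pca_lowrank as the PCA backend; this is a reading of the abstract, not ViaPT's exact design:

```python
import torch
import torch.nn as nn

class InstanceAwarePrompts(nn.Module):
    def __init__(self, embed_dim, num_prompts):
        super().__init__()
        # Dataset-level prompts shared across all inputs.
        self.dataset_prompts = nn.Parameter(torch.zeros(num_prompts, embed_dim))
        nn.init.trunc_normal_(self.dataset_prompts, std=0.02)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # fusion weight

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, D) patch embeddings of a batch of images."""
        q = self.dataset_prompts.shape[0]
        instance_prompts = []
        for tokens in patch_tokens:
            # Top-q principal directions of this instance's patch embeddings
            # serve as its instance-aware prompt tokens.
            _, _, V = torch.pca_lowrank(tokens, q=q)   # V: (D, q)
            instance_prompts.append(V.T)               # (q, D)
        instance_prompts = torch.stack(instance_prompts)  # (B, q, D)
        fused = self.alpha * instance_prompts + (1 - self.alpha) * self.dataset_prompts
        return torch.cat([fused, patch_tokens], dim=1)     # prepend prompts
```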
Paperid: 278,   https://arxiv.org/pdf/2507.10358    
Authors:Hongxu Ma, Chenbo Zhang, Lu Zhang, Jiaogen Zhou, Jihong Guan, Shuigeng Zhou
Abstract:
Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works mainly address coarse-grained object detection, where the classes are visually quite different and thus relatively easy to distinguish. However, in real life we often face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished, for example, different kinds of birds, fish, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera, and 1,432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.
Paperid: 279,   https://arxiv.org/pdf/2509.12204    
Authors:Zhongrui Gui, Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman
Abstract:
Animated movies are captivating for their unique character designs and imaginative storytelling, yet they pose significant challenges for existing recognition systems. Unlike the consistent visual patterns detected by conventional face recognition methods, animated characters exhibit extreme diversity in their appearance, motion, and deformation. In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. Central to our approach is the automatic construction of an audio-visual character bank from online sources. This bank contains both visual exemplars and voice (audio) samples for each character, enabling subsequent multi-modal character recognition despite long-tailed appearance distributions. Building on accurate character recognition, we explore two downstream applications: Audio Description (AD) generation for visually impaired audiences, and character-aware subtitling for the hearing impaired. To support research in this domain, we introduce CMD-AM, a new dataset of 75 animated movies with comprehensive annotations. Our character-centric pipeline demonstrates significant improvements in both accessibility and narrative comprehension for animated content over prior face-detection-based approaches. For the code and dataset, visit https://www.robots.ox.ac.uk/~vgg/research/animated_ad/.
Paperid: 280,   https://arxiv.org/pdf/2507.19085    
Authors:Mulin Chen, Bocheng Wang, Jiaxin Zhong, Zongcheng Miao, Xuelong Li
Abstract:
Attribute-missing graph clustering has emerged as a significant unsupervised task, where attribute vectors are available for only part of the nodes while the graph structure is intact. The related models generally follow the two-step paradigm of imputation and refinement. However, most imputation approaches fail to capture class-relevant semantic information, leading to sub-optimal imputation for clustering. Moreover, existing refinement strategies optimize the learned embedding through graph reconstruction, while neglecting the fact that some attributes are uncorrelated with the graph. To remedy these problems, we establish the Clustering-oriented Generative Imputation with reliable Refinement (CGIR) model. Concretely, the subcluster distributions are estimated to reveal the class-specific characteristics precisely and constrain the sampling space of the generative adversarial module, such that the imputed nodes are driven to align with the correct clusters. Afterwards, multiple subclusters are merged to guide the proposed edge attention network, which identifies the edge-wise attributes for each class, so that redundant attributes in graph reconstruction do not disturb the refinement of the overall embedding. To sum up, CGIR splits attribute-missing graph clustering into the search and mergence of subclusters, which guides node imputation and refinement within a unified framework. Extensive experiments prove the advantages of CGIR over state-of-the-art competitors.
Paperid: 281,   https://arxiv.org/pdf/2412.18977    
Authors:Chenxi Zhang, Qing Zhang, Jiayun Wu, Youwei Pang
Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into their surroundings. The inherent visual complexity of camouflaged objects, including their low contrast with the background, diverse textures, and subtle appearance variations, often obscures semantic cues, making accurate segmentation highly challenging. Existing methods primarily rely on visual features, which are insufficient to handle the variability and intricacy of camouflaged objects, leading to unstable object perception and ambiguous segmentation results. To tackle these limitations, we introduce a novel task, class-guided camouflaged object detection (CGCOD), which extends traditional COD task by incorporating object-specific class knowledge to enhance detection robustness and accuracy. To facilitate this task, we present a new dataset, CamoClass, comprising real-world camouflaged objects with class annotations. Furthermore, we propose a multi-stage framework, CGNet, which incorporates a plug-and-play class prompt generator and a simple yet effective class-guided detector. This establishes a new paradigm for COD, bridging the gap between contextual understanding and class-guided detection. Extensive experimental results demonstrate the effectiveness of our flexible framework in improving the performance of proposed and existing detectors by leveraging class-level textual information.
Paperid: 282,   https://arxiv.org/pdf/2406.20078    
Authors:Yingxin Lai, Zitong Yu, Jing Yang, Bin Li, Xiangui Kang, Linlin Shen
Abstract:
Existing face forgery detection methods usually follow the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios with a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. For the commonality representation, we use CLIP to extract common features that better align visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. Given the lack of study on multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models' generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets, conducted under traditional protocols as well as the proposed benchmark, demonstrate the effectiveness of our approach.
Paperid: 283,   https://arxiv.org/pdf/2405.13335    
Authors:Yuguang Zhang, Qihang Fan, Huaibo Huang
Abstract:
In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S$^3$A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S$^3$A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets.
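A simplified, single-head sketch of attention restricted to small windows around predefined anchor positions, illustrating the general idea of anchor-based sparse scanning; it omits projections, multi-head structure, and the exact S$^3$A anchor selection, and all names are assumptions:

```python
import torch

def anchor_local_attention(x, anchors, window=3):
    """Single-head attention where each query token attends only to tokens
    inside a small window around its predefined anchor positions.
    x: (B, N, C) token features; anchors: (N, A) long tensor of anchor
    indices assigned to each query token."""
    B, N, C = x.shape
    q = k = v = x  # linear projections omitted for brevity
    offsets = torch.arange(-(window // 2), window // 2 + 1, device=x.device)
    idx = (anchors.unsqueeze(-1) + offsets).clamp(0, N - 1)   # (N, A, window)
    idx = idx.reshape(N, -1)                                  # (N, A*window)
    k_sel = k[:, idx]                                         # (B, N, A*window, C)
    v_sel = v[:, idx]
    attn = torch.einsum('bnc,bnkc->bnk', q, k_sel) / C ** 0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum('bnk,bnkc->bnc', attn, v_sel)
```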
Paperid: 284,   https://arxiv.org/pdf/2510.24767    
Authors:Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang
Abstract:
Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
Paperid: 285,   https://arxiv.org/pdf/2509.24299    
Authors:Hanqi Chen, Zhongyin Zhao, Ye Chen, Zhujin Liang, Bingbing Ni
Abstract:
Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.
Paperid: 286,   https://arxiv.org/pdf/2507.17687    
Authors:Jiazhen Chen, Zheng Ma, Sichao Fu, Mingbin Feng, Tony S. Wirjanto, Weihua Ou
Abstract:
Graph class-incremental learning (GCIL) allows graph neural networks (GNNs) to adapt to evolving graph analytical tasks by incrementally learning new class knowledge while retaining knowledge of old classes. Existing GCIL methods primarily focus on a closed-set assumption, where all test samples are presumed to belong to previously known classes. Such an assumption restricts their applicability in real-world scenarios, where unknown classes naturally emerge during inference, and are absent during training. In this paper, we explore a more challenging open-set graph class-incremental learning scenario with two intertwined challenges: catastrophic forgetting of old classes, which impairs the detection of unknown classes, and inadequate open-set recognition, which destabilizes the retention of learned knowledge. To address the above problems, a novel OGCIL framework is proposed, which utilizes pseudo-sample embedding generation to effectively mitigate catastrophic forgetting and enable robust detection of unknown classes. To be specific, a prototypical conditional variational autoencoder is designed to synthesize node embeddings for old classes, enabling knowledge replay without storing raw graph data. To handle unknown classes, we employ a mixing-based strategy to generate out-of-distribution (OOD) samples from pseudo in-distribution and current node embeddings. A novel prototypical hypersphere classification loss is further proposed, which anchors in-distribution embeddings to their respective class prototypes, while repelling OOD embeddings away. Instead of assigning all unknown samples into one cluster, our proposed objective function explicitly models them as outliers through prototype-aware rejection regions, ensuring a robust open-set recognition. Extensive experiments on five benchmarks demonstrate the effectiveness of OGCIL over existing GCIL and open-set GNN methods.
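An illustrative sketch of a prototypical hypersphere loss that pulls in-distribution embeddings inside a radius around their class prototype and pushes pseudo-OOD embeddings beyond a margin; the radii, margins, and normalization here are assumptions, not the exact OGCIL objective:

```python
import torch
import torch.nn.functional as F

def hypersphere_loss(z_id, labels, z_ood, prototypes, radius=0.5, margin=1.0):
    """
    z_id: (N, D) in-distribution embeddings with integer labels (N,)
    z_ood: (M, D) pseudo-OOD embeddings; prototypes: (C, D) class prototypes
    """
    # Distance of each in-distribution embedding to its own class prototype.
    d_id = (z_id - prototypes[labels]).norm(dim=-1)
    pull = F.relu(d_id - radius).pow(2).mean()          # stay inside the sphere

    # Distance of each OOD embedding to its nearest prototype.
    d_ood = torch.cdist(z_ood, prototypes).min(dim=-1).values
    push = F.relu(margin - d_ood).pow(2).mean()         # stay outside all spheres
    return pull + push
```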
Paperid: 287,   https://arxiv.org/pdf/2507.00498    
Authors:Yifan Liu, Yu Fang, Zhouhan Lin
Abstract:
Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimize mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.
Paperid: 288,   https://arxiv.org/pdf/2506.02452    
Authors:Wenshuo Chen, Kuimou Yu, Haozhe Jia, Kaishen Yuan, Zexu Huang, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, Yutao Yue
Abstract:
While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose ANT, an Adaptive Neural Temporal-Aware architecture. ANT orchestrates semantic granularity through: (i) Semantic Temporally Adaptive (STA) Module: Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. (ii) Dynamic Classifier-Free Guidance scheduling (DCFG): Adaptively adjusts the conditional-to-unconditional guidance ratio, enhancing efficiency while maintaining fidelity. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
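A minimal sketch of timestep-dependent classifier-free guidance, illustrating the idea behind DCFG; the abstract does not specify the schedule, so the cosine ramp, its direction, and the bounds below are assumptions:

```python
import math

def dynamic_cfg_weight(step, num_steps, w_min=1.0, w_max=7.5):
    """Guidance weight that ramps (cosine) from w_min at the first sampling
    step to w_max at the last; shape and direction are assumed here."""
    progress = step / max(num_steps - 1, 1)
    return w_min + 0.5 * (w_max - w_min) * (1.0 - math.cos(math.pi * progress))

def guided_noise(eps_cond, eps_uncond, step, num_steps):
    """Standard classifier-free guidance combination with a step-dependent weight."""
    w = dynamic_cfg_weight(step, num_steps)
    return eps_uncond + w * (eps_cond - eps_uncond)
```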
Paperid: 289,   https://arxiv.org/pdf/2501.15183    
Authors:Yanbiao Ji, Dan Luo, Chang Liu, Shaokai Wu, Jing Tong, Qicheng He, Deyi Ji, Hongtao Lu, Yue Ding
Abstract:
Multi-modal recommender systems (MMRS) have gained significant attention due to their ability to leverage information from various modalities to enhance recommendation quality. However, existing negative sampling techniques often struggle to effectively utilize the multi-modal data, leading to suboptimal performance. In this paper, we identify two key challenges in negative sampling for MMRS: (1) producing cohesive negative samples contrasting with positive samples and (2) maintaining a balanced influence across different modalities. To address these challenges, we propose NegGen, a novel framework that utilizes multi-modal large language models (MLLMs) to generate balanced and contrastive negative samples. We design three different prompt templates to enable NegGen to analyze and manipulate item attributes across multiple modalities, and then generate negative samples that introduce better supervision signals and ensure modality balance. Furthermore, NegGen employs a causal learning module to disentangle the effect of intervened key features and irrelevant item attributes, enabling fine-grained learning of user preferences. Extensive experiments on real-world datasets demonstrate the superior performance of NegGen compared to state-of-the-art methods in both negative sampling and multi-modal recommendation.
Paperid: 290,   https://arxiv.org/pdf/2504.09291    
Authors:Jiaying Qian, Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Guangtao Zhai, Xiongkuo Min
Abstract:
The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs), i.e., images with localized AI-driven edits, a largely unexplored field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.
Paperid: 291,   https://arxiv.org/pdf/2507.14206    
Authors:Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang
Abstract:
Electrocardiogram (ECG), a key bioelectrical time-series signal, is crucial for assessing cardiac health and diagnosing various diseases. Given its time-series format, ECG data is often incorporated into pre-training datasets for large-scale time-series model training. However, existing studies often overlook its unique characteristics and specialized downstream applications, which differ significantly from other time-series data, leading to an incomplete understanding of its properties. In this paper, we present an in-depth investigation of ECG signals and establish a comprehensive benchmark, which includes (1) categorizing its downstream applications into four distinct evaluation tasks; (2) identifying limitations in traditional evaluation metrics for ECG analysis and introducing a novel metric; and (3) benchmarking state-of-the-art time-series models and proposing a new architecture. Extensive experiments demonstrate that our proposed benchmark is comprehensive and robust. The results validate the effectiveness of the proposed metric and model architecture, which establish a solid foundation for advancing research in ECG signal analysis.
Paperid: 292,   https://arxiv.org/pdf/2504.21054    
Authors:Yangxu Yin, Honglong Chen, Yudong Gao, Peng Sun, Liantao Wu, Zhe Li, Weifeng Liu
Abstract:
Backdoor attacks pose a significant threat to deep neural networks, as backdoored models would misclassify poisoned samples with specific triggers into target classes while maintaining normal performance on clean samples. Among these, multi-target backdoor attacks can simultaneously target multiple classes. However, existing multi-target backdoor attacks all follow the dirty-label paradigm, where poisoned samples are mislabeled, and most of them require an extremely high poisoning rate. This makes them easily detectable by manual inspection. In contrast, clean-label attacks are more stealthy, as they avoid modifying the labels of poisoned samples. However, they generally struggle to achieve stable and satisfactory attack performance and often fail to scale effectively to multi-target attacks. To address this issue, we propose the Feature-based Full-target Clean-label Backdoor Attacks (FFCBA), which consists of two paradigms: Feature-Spanning Backdoor Attacks (FSBA) and Feature-Migrating Backdoor Attacks (FMBA). FSBA leverages class-conditional autoencoders to generate noise triggers that align perturbed in-class samples with the original category's features, ensuring the effectiveness, intra-class consistency, inter-class specificity, and natural-feature correlation of triggers. While FSBA supports swift and efficient attacks, its cross-model attack capability is relatively weak. FMBA employs a two-stage class-conditional autoencoder training process that alternates between using out-of-class samples and in-class samples. This allows FMBA to generate triggers with strong target-class features, making it highly effective for cross-model attacks. We conduct experiments on multiple datasets and models; the results show that FFCBA achieves outstanding attack performance and maintains desirable robustness against the state-of-the-art backdoor defenses.
Paperid: 293,   https://arxiv.org/pdf/2407.03243    
Authors:Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan
Abstract:
Unlike Object Detection, the Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities, further introducing various modules that modulate visual features to align with the language expressions and eliminate irrelevant redundant information. However, their loss functions, still adopting common Object Detection losses, solely govern the bounding-box regression output, failing to fully optimize for the above objectives. To tackle this problem, in this paper, we first analyze the attention mechanisms of transformer-based models. Building upon this, we further propose a novel framework named Attention-Driven Constraint Balancing (AttBalance) to optimize the behavior of visual features within language-relevant regions. Extensive experimental results show that our method brings impressive improvements. Specifically, we achieve consistent improvements over five different models evaluated on four different benchmarks. Moreover, we attain new state-of-the-art performance by integrating our method into QRNet.
Paperid: 294,   https://arxiv.org/pdf/2508.12663    
Authors:Seung Young Noh, Ju Yong Chang
Abstract:
Humans can infer the missing parts of an occluded object by leveraging prior knowledge and visible cues. However, enabling deep learning models to accurately predict such occluded regions remains a challenging task. De-occlusion addresses this problem by reconstructing both the mask and RGB appearance. In this work, we focus on human de-occlusion, specifically targeting the recovery of occluded body structures and appearances. Our approach decomposes the task into two stages: mask completion and RGB completion. The first stage leverages a diffusion-based human body prior to provide a comprehensive representation of body structure, combined with occluded joint heatmaps that offer explicit spatial cues about missing regions. The reconstructed amodal mask then serves as a conditioning input for the second stage, guiding the model on which areas require RGB reconstruction. To further enhance RGB generation, we incorporate human-specific textual features derived using a visual question answering (VQA) model and encoded via a CLIP encoder. RGB completion is performed using Stable Diffusion, with decoder fine-tuning applied to mitigate pixel-level degradation in visible regions -- a known limitation of prior diffusion-based de-occlusion methods caused by latent space transformations. Our method effectively reconstructs human appearances even under severe occlusions and consistently outperforms existing methods in both mask and RGB completion. Moreover, the de-occluded images generated by our approach can improve the performance of downstream human-centric tasks, such as 2D pose estimation and 3D human reconstruction. The code will be made publicly available.
Paperid: 295,   https://arxiv.org/pdf/2510.26759    
Authors:Shaokai Wu, Yapan Guo, Yanbiao Ji, Jing Tong, Yuxiang Lu, Mei Li, Suizhi Huang, Yue Ding, Hongtao Lu
Abstract:
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/
Paperid: 296,   https://arxiv.org/pdf/2506.05163    
Authors:Gabriele Magrini, Niccolò Marini, Federico Becattini, Lorenzo Berlincioni, Niccolò Biondi, Pietro Pala, Alberto Del Bimbo
Abstract:
Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.
Paperid: 297,   https://arxiv.org/pdf/2504.00606    
Authors:Ping Li, Chenhao Ping, Wenxiao Wang, Mingli Song
Abstract:
Knowledge Distillation (KD) compresses neural networks by training a small network (student) to absorb knowledge transferred from a pre-trained large network (teacher). Many efforts have been devoted to the image domain, while few works focus on video analysis, which requires much larger models that can hardly be deployed on resource-limited devices. Traditional methods, however, neglect two important problems: 1) because of the capacity gap between the teacher and the student, knowledge of difficult-to-transfer samples cannot be correctly transferred and may even harm the final performance of the student, and 2) as training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate these two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. It mainly consists of a sample distillation difficulty evaluation module and a sample adaptive distillation module. The former applies temporal interruption to frames, i.e., randomly dropping out or shuffling frames during training, which increases the learning difficulty of samples during distillation so as to better discriminate their distillation difficulty. The latter adaptively adjusts the distillation ratio at the sample level, such that the KD loss dominates training on easy-to-transfer samples while the vanilla loss dominates training on difficult-to-transfer samples. More importantly, we select only those samples with both low distillation difficulty and high diversity to train the student model, reducing computational cost. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method, which strikes a good balance between performance and efficiency.
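An illustrative sketch of the two ideas described above, temporal interruption and sample-adaptive blending of the KD and vanilla losses; the difficulty score and blending rule are assumptions rather than the paper's exact definitions:

```python
import torch
import torch.nn.functional as F

def temporal_interruption(frames, drop_p=0.2, shuffle_p=0.2):
    """frames: (B, T, C, H, W). Randomly shuffles the frame order or zeroes
    out random frames to raise the learning difficulty of a sample."""
    B, T = frames.shape[:2]
    out = frames.clone()
    for b in range(B):
        if torch.rand(1).item() < shuffle_p:
            out[b] = out[b][torch.randperm(T)]
        elif torch.rand(1).item() < drop_p:
            mask = (torch.rand(T, 1, 1, 1, device=frames.device) > drop_p).float()
            out[b] = out[b] * mask
    return out

def sample_adaptive_kd_loss(student_logits, teacher_logits, labels, tau=4.0):
    """Blends KD and vanilla CE per sample: easy-to-transfer samples lean on
    the KD term, difficult ones on the CE term (difficulty proxy: KD gap)."""
    ce = F.cross_entropy(student_logits, labels, reduction='none')          # (B,)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction='none').sum(-1) * tau * tau                     # (B,)
    difficulty = torch.sigmoid(kd.detach() - kd.detach().mean())            # in (0, 1)
    return ((1.0 - difficulty) * kd + difficulty * ce).mean()
```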
Paperid: 298,   https://arxiv.org/pdf/2508.03266    
Authors:Huaihai Lyu, Chaofan Chen, Yuheng Ji, Changsheng Xu
Abstract:
Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
Paperid: 299,   https://arxiv.org/pdf/2504.11008    
Authors:Qinyue Tong, Ziqian Lu, Jun Liu, Yangming Zheng, Zheming Lu
Abstract:
Despite remarkable advancements in pixel-level medical image perception, existing methods are either limited to specific tasks or heavily rely on accurate bounding boxes or text labels as input prompts. However, the medical knowledge required to provide such inputs is a major obstacle for the general public, which greatly reduces the universality of these methods. Rather than providing such domain-specialized auxiliary information, general users tend to rely on colloquial queries that require logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperform traditional medical referring segmentation methods.
Paperid: 300,   https://arxiv.org/pdf/2504.18349    
Authors:Hongyu Zhu, Sichu Liang, Wenwen Wang, Boheng Li, Tongxin Yuan, Fangqi Li, ShiLin Wang, Zhuosheng Zhang
Abstract:
With the surge of large language models (LLMs), Large Vision-Language Models (VLMs)--which integrate vision encoders with LLMs for accurate visual grounding--have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios--fine-tuning, access to ground-truth texts, and set-based inference--where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
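A generic sketch of quantifying the member/non-member distribution shift with optimal transport using the POT library; it illustrates an OT-based discrepancy but is not necessarily the paper's proposed metric, and all names are illustrative:

```python
import numpy as np
import ot  # POT: pip install pot

def ot_discrepancy(member_embeds, nonmember_embeds):
    """Both inputs: (n, d) arrays of image embeddings (e.g. from the VLM's
    vision encoder). Returns the exact Wasserstein transport cost between
    the two empirical distributions."""
    n, m = len(member_embeds), len(nonmember_embeds)
    a = np.full(n, 1.0 / n)                      # uniform weights on members
    b = np.full(m, 1.0 / m)                      # uniform weights on non-members
    M = ot.dist(member_embeds, nonmember_embeds, metric='euclidean')
    return ot.emd2(a, b, M)                      # exact EMD cost
```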
Paperid: 301,   https://arxiv.org/pdf/2507.22668    
Authors:Hongbin Lin, Yifan Jiang, Juangui Xu, Jesse Jiaxi Xu, Yi Lu, Zhengyu Hu, Ying-Cong Chen, Hao Wang
Abstract:
3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding. Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation. However, most augmentation strategies only focus on local transformations or semantic recomposition, lacking the consideration of global structural dependencies within scenes. To address this limitation, we propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis. Our method learns object relationship statistics from real-world data to construct guiding graphs for scene generation. Local-level constraints enforce geometric plausibility and semantic consistency between objects, while global-level constraints maintain the topological structure of the scene by aligning the generated layout with the guiding graph. Extensive experiments on indoor and outdoor datasets demonstrate that our framework generates diverse and high-quality augmented scenes, leading to consistent improvements in point cloud segmentation performance across various models.
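A minimal sketch of the statistics-gathering step: object co-occurrence counts from real scenes collected into a weighted guiding graph with networkx; how the full framework derives and enforces its dual-level constraints is not reproduced, and the function name is illustrative:

```python
from itertools import combinations
import networkx as nx

def build_guiding_graph(scenes):
    """scenes: iterable of lists of category labels present in each real scene.
    Nodes are categories; edge weights count how often two categories co-occur."""
    G = nx.Graph()
    for categories in scenes:
        for a, b in combinations(sorted(set(categories)), 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G

# Usage: frequently co-occurring neighbours of 'chair' can seed a synthetic layout.
# G = build_guiding_graph([['chair', 'table', 'lamp'], ['chair', 'sofa', 'table']])
# neighbours = sorted(G['chair'].items(), key=lambda kv: -kv[1]['weight'])
```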
Paperid: 302,   https://arxiv.org/pdf/2507.16877    
Authors:Yizhi Hu, Zezhao Tian, Xingqun Qi, Chen Su, Bingkun Yang, Junhui Yin, Muyi Sun, Man Zhang, Zhenan Sun
Abstract:
Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
Paperid: 303,   https://arxiv.org/pdf/2507.19882    
Authors:Xinshu Li, Ruoyu Wang, Erdun Gao, Mingming Gong, Lina Yao
Abstract:
Prompt learning has garnered attention for its efficiency over traditional model training and fine-tuning. However, existing methods, constrained by inadequate theoretical foundations, encounter difficulties in achieving causally invariant prompts, ultimately falling short of capturing robust features that generalize effectively across categories. To address these challenges, we introduce the DiCap model, a theoretically grounded Diffusion-based Counterfactual prompt learning framework, which leverages a diffusion process to iteratively sample gradients from the marginal and conditional distributions of the causal model, guiding the generation of counterfactuals that satisfy the minimal sufficiency criterion. Grounded in rigorous theoretical derivations, this approach guarantees the identifiability of counterfactual outcomes while imposing strict bounds on estimation errors. We further employ a contrastive learning framework that leverages the generated counterfactuals, thereby enabling the refined extraction of prompts that are precisely aligned with the causal features of the data. Extensive experimental results demonstrate that our method performs excellently across tasks such as image classification, image-text retrieval, and visual question answering, with particularly strong advantages in unseen categories.
Paperid: 304,   https://arxiv.org/pdf/2507.16472    
Authors:Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu
Abstract:
Shadows are a common factor degrading image quality. Single-image shadow removal (SR), particularly under challenging indirect illumination, is hampered by non-uniform content degradation and inherent ambiguity. Consequently, traditional methods often fail to simultaneously recover intra-shadow details and maintain sharp boundaries, resulting in inconsistent restoration and blurring that negatively affect both downstream applications and the overall viewing experience. To overcome these limitations, we propose DenseSR, which approaches the problem from a dense prediction perspective to emphasize restoration quality. This framework uniquely synergizes two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) in the decoder. The DFB employs adaptive component processing, using an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries, thereby directly tackling the inconsistent-restoration and blurring issues. These purposefully processed components are effectively fused, yielding an optimized feature representation preserving both consistency and fidelity. Extensive experimental results demonstrate the merits of our approach over existing methods. Our code is available at https://github.com/VanLinLin/DenseSR.
Paperid: 305,   https://arxiv.org/pdf/2504.11218    
Authors:Zeming Wei, Junyi Lin, Yang Liu, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
Abstract:
3D affordance reasoning is essential in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,677 Gaussian instances, 8,354 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities.
Paperid: 306,   https://arxiv.org/pdf/2508.01723    
Authors:Danyang Li, Zenghui Yang, Guangpeng Qi, Songtao Pang, Guangyong Shang, Qiang Ma, Zheng Yang
Abstract:
Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short in aligning free-form language commands with specific scene instances, due to limitations in both instance-level semantic consistency and instruction interpretation. We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose an LLM-assisted Instruction-to-Instance Grounding module that enables fine-grained instance selection by incorporating spatial context and expressive target descriptions. We evaluate OpenMap on ScanNet200 and Matterport3D, covering both semantic mapping and instruction-to-target retrieval tasks. Experimental results show that OpenMap outperforms state-of-the-art baselines in zero-shot settings, demonstrating the effectiveness of our method in bridging free-form language and 3D perception for embodied navigation.
Paperid: 307,   https://arxiv.org/pdf/2501.09617    
Authors:Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei
Abstract:
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
Paperid: 308,   https://arxiv.org/pdf/2504.14348    
Authors:Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Abstract:
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach incorporates two key coordinated components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to align with the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta prompting and generate a malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.
Paperid: 309,   https://arxiv.org/pdf/2311.14749    
Authors:Lin Li, Guikun Chen, Zhen Wang, Jun Xiao, Long Chen
Abstract:
Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify a vintage design for a "car" or an advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
Paperid: 310,   https://arxiv.org/pdf/2507.12498    
Authors:Beizhen Zhao, Yifan Zhou, Sicheng Yu, Zijian Wang, Hao Wang
Abstract:
3D Gaussian Splatting (3DGS) has revolutionized 3D scene reconstruction, effectively balancing rendering quality, efficiency, and speed. However, existing 3DGS approaches usually generate plausible outputs but face significant challenges in complex scene reconstruction, manifesting as incomplete holistic structural outlines and unclear local lighting effects. To address these issues simultaneously, we propose a novel decoupled optimization framework, which integrates wavelet decomposition into 3D Gaussian Splatting and 2D sampling. Technically, through 3D wavelet decomposition, our approach divides point clouds into high-frequency and low-frequency components, enabling targeted optimization for each. The low-frequency component captures global structural outlines and manages the distribution of Gaussians through voxelization. In contrast, the high-frequency component restores intricate geometric and textural details while incorporating a relight module to mitigate lighting artifacts and enhance photorealistic rendering. Additionally, a 2D wavelet decomposition is applied to the training images, simulating radiance variations. This provides critical guidance for high-frequency detail reconstruction, ensuring seamless integration of details with the global structure. Extensive experiments on challenging datasets demonstrate our method achieves state-of-the-art performance across various metrics, surpassing existing approaches and advancing the field of 3D scene reconstruction.
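To make the 2D guidance idea above concrete, the following minimal sketch (not the authors' code; the function name and shapes are assumptions) splits a training image into a low-frequency approximation and high-frequency detail sub-bands with a single-level 2D discrete wavelet transform using PyWavelets:

```python
# Minimal sketch: split a grayscale training view into low- and high-frequency
# sub-bands with a 2D discrete wavelet transform, as a guidance signal for
# high-frequency detail reconstruction. Illustrative only.
import numpy as np
import pywt

def wavelet_split(image: np.ndarray, wavelet: str = "haar"):
    """Return the low-frequency approximation and stacked high-frequency details."""
    # dwt2 yields the approximation (LL) and the horizontal/vertical/diagonal details.
    ll, (lh, hl, hh) = pywt.dwt2(image, wavelet)
    high = np.stack([lh, hl, hh], axis=0)  # detail sub-bands guide fine textures
    return ll, high

if __name__ == "__main__":
    img = np.random.rand(256, 256).astype(np.float32)  # stand-in for a training image
    low, high = wavelet_split(img)
    print(low.shape, high.shape)  # (128, 128) (3, 128, 128)
```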
Paperid: 311,   https://arxiv.org/pdf/2507.05805    
Authors:Xin Li, Mingming Gong, Yunfei Wu, Jianxin Dai, Antai Guo, Xinghua Jiang, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun
Abstract:
Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. A multitude of these researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these aforementioned limitations, we in this paper present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM transmutes the text image into a sequence of document reconstruction in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.
Paperid: 312,   https://arxiv.org/pdf/2511.03425    
Authors:Ilya Borovik, Dmitrii Gavrilev, Vladimir Viro
Abstract:
Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for developing and training affective and controllable symbolic piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. By design, it supports both unconditional generation and infilling of music performance features. For training, we use a curated, cleaned dataset of 2,968 hours of aligned musical scores and expressive MIDI performances. For text and emotion control, we integrate a piano performance emotion classifier and tune PianoFlow with the emotion-weighted Flan-T5 text embeddings provided as conditional inputs. Objective and subjective evaluations against transformer-based baselines and existing models show that PianoFlow not only outperforms other approaches, but also achieves performance quality comparable to that of human-recorded and transcribed MIDI samples. For emotion control, we present and analyze samples generated under different text conditioning scenarios. The developed model can be integrated into interactive applications, contributing to the creation of more accessible and engaging music performance systems.
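Conditional flow matching of the kind PianoFlow builds on can be summarized by a short training step. The sketch below is a generic illustration under the usual linear-interpolation path, not the SyMuPe implementation; the model signature, conditioning, and tensor shapes are assumptions:

```python
# Generic conditional flow matching training step (a sketch, not PianoFlow itself):
# sample a time t, interpolate between noise x0 and data x1, and regress the
# model's predicted velocity toward the straight-line target (x1 - x0).
import torch

def flow_matching_loss(model, x1, cond):
    """x1: (B, T, D) performance features; cond: conditioning embeddings (e.g. text/emotion)."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                         # point on the linear probability path
    target_v = x1 - x0                                   # velocity of the linear path
    pred_v = model(xt, t.squeeze(-1).squeeze(-1), cond)  # model predicts the velocity field
    return torch.mean((pred_v - target_v) ** 2)
```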
Paperid: 313,   https://arxiv.org/pdf/2507.18750    
Authors:Hyunwoo Oh, SeungJu Cha, Kwanyoung Lee, Si-Woo Kim, Dong-Jin Kim
Abstract:
We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.
Paperid: 314,   https://arxiv.org/pdf/2510.11175    
Authors:Xiang Ma, Litian Xu, Lexin Fang, Caiming Zhang, Lizhen Cui
Abstract:
Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable foundation for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. This non-semantic information primarily manifests as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.
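As a rough illustration of the semantic-probability weighting described above, the sketch below (illustrative only; shapes and names are assumptions, not the released code) weights each feature column by its estimated probability of carrying semantic information before computing cross-modal cosine similarity:

```python
# Sketch: each feature column gets a probability of being "semantic", and that
# probability weights the embedding interaction so style-dominated columns
# contribute less to the similarity.
import torch
import torch.nn.functional as F

def weighted_similarity(img_emb, txt_emb, semantic_prob):
    """img_emb, txt_emb: (B, D) embeddings; semantic_prob: (D,) values in [0, 1]."""
    w = semantic_prob.sqrt()              # split the weight across both sides of the product
    a = F.normalize(img_emb * w, dim=-1)
    b = F.normalize(txt_emb * w, dim=-1)
    return a @ b.t()                      # (B, B) weighted cosine similarity matrix
```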
Paperid: 315,   https://arxiv.org/pdf/2509.05949    
Authors:Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang
Abstract:
The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; and (2) static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We design an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
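A minimal sketch of the prompt-retrieval step described above is given below; it is not the released code, and the prompt-pool layout, key matching, and top-k size are assumptions:

```python
# Sketch: retrieve the k prompts from a learnable pool whose keys are most similar
# to the aggregated visual features of a layer, then concatenate the selected
# prompt tokens to that layer's text-encoder input.
import torch
import torch.nn.functional as F

def retrieve_prompts(visual_feat, prompt_keys, prompt_pool, k=4):
    """visual_feat: (D,) aggregated layer features; prompt_keys: (P, D); prompt_pool: (P, L, D)."""
    sims = F.cosine_similarity(visual_feat.unsqueeze(0), prompt_keys, dim=-1)  # (P,)
    topk = sims.topk(k).indices
    selected = prompt_pool[topk]          # (k, L, D) retrieved prompt tokens
    return selected.flatten(0, 1)         # (k*L, D) tokens to prepend to the text input
```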
Paperid: 316,   https://arxiv.org/pdf/2507.18632    
Authors:Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
Abstract:
Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP's embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.
Paperid: 317,   https://arxiv.org/pdf/2507.06735    
Authors:Guan Zheng, Xue Wang, Wenhua Qian, Peng Liu, Runzhuo Ma
Abstract:
Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground-truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model's sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of high-level vision tasks.
Paperid: 318,   https://arxiv.org/pdf/2504.13219    
Authors:Wenxuan Yang, Qingqu Wei, Chenxi Ma, Weimin Tan, Bo Yan
Abstract:
Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K-1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distillation counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrates these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, addressing a critical barrier to understanding vision model scaling behaviors and optimizing computational resource allocation.
Paperid: 319,   https://arxiv.org/pdf/2510.19653    
Authors:Yuxin Cheng, Binxiao Huang, Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Graziano Chesi, Ngai Wong
Abstract:
3D Gaussian Splatting (3D-GS) achieves real-time photorealistic novel view synthesis, yet struggles with complex scenes due to over-reconstruction artifacts, manifesting as local blurring and needle-shaped distortions. While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the primitive frozen phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. Our approach features: (1) an importance-aware densification criterion incorporating α-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revitalizes frozen primitives through adaptive parameter perturbations. Comprehensive experiments across diverse real-world datasets demonstrate that ReAct-GS effectively eliminates over-reconstruction artifacts and achieves state-of-the-art performance on standard novel view synthesis metrics while preserving intricate geometric details. Additionally, our re-activation mechanism yields consistent improvements when integrated with other 3D-GS variants such as Pixel-GS, demonstrating its broad applicability.
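The importance-aware densification criterion can be pictured with a short sketch. The code below is illustrative only, assuming per-Gaussian accumulators for view-space gradients and alpha-blending weights; the thresholds and interfaces are assumptions, not ReAct-GS's exact rule:

```python
# Sketch of an importance-aware densification test: accumulate each Gaussian's
# alpha-blending weight over several rendered views and re-activate densification
# for primitives that are visually important but stalled under the gradient test.
import torch

def select_for_densification(grad_accum, blend_weight_accum, n_views,
                             grad_thresh=2e-4, weight_thresh=0.1):
    """grad_accum: (N,) accumulated view-space gradient norms;
    blend_weight_accum: (N,) alpha-blend weights summed over views."""
    mean_weight = blend_weight_accum / max(n_views, 1)
    by_gradient = grad_accum > grad_thresh        # standard criterion: large positional gradients
    by_importance = mean_weight > weight_thresh   # re-activation: important but frozen primitives
    return by_gradient | by_importance            # boolean mask of Gaussians to split/clone
```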
Paperid: 320,   https://arxiv.org/pdf/2504.14868    
Authors:Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang
Abstract:
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.
Paperid: 321,   https://arxiv.org/pdf/2507.09887    
Authors:Huynh Dang Nguyen, Trong-Thang Pham, Ngan Le, Van Nguyen
Abstract:
The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
Paperid: 322,   https://arxiv.org/pdf/2502.18485    
Authors:Jiaqi Xu, Cuiling Lan, Yan Lu
Abstract:
The burgeoning growth of open-sourced vision-language models (VLMs) has catalyzed a plethora of applications across diverse domains. Ensuring the transparency and interpretability of these models is critical for fostering trustworthy and responsible AI systems. In this study, our objective is to delve into the internals of VLMs to interpret the functions of individual neurons. We observe the activations of neurons with respect to the input visual tokens and text tokens, and reveal some interesting findings. In particular, we find that some neurons respond only to visual information, some only to text information, and some to both; we refer to these as visual neurons, text neurons, and multi-modal neurons, respectively. We build a framework that automates the explanation of neurons with the assistance of GPT-4o. Meanwhile, for visual neurons, we propose an activation simulator to assess the reliability of their explanations. Systematic statistical analyses on a representative VLM, LLaVA, uncover the behaviors and characteristics of the different categories of neurons.
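The neuron categorization described above can be illustrated with a small sketch that compares each neuron's mean activation on visual versus text token positions; the threshold and tensor layout here are assumptions for illustration, not the paper's protocol:

```python
# Sketch: label neurons as visual, text, or multi-modal by comparing their mean
# activation over visual-token positions versus text-token positions.
import torch

def categorize_neurons(acts, visual_mask, thresh=0.5):
    """acts: (T, N) activations over T tokens and N neurons;
    visual_mask: (T,) boolean, True where the token is a visual token."""
    vis_mean = acts[visual_mask].mean(dim=0)     # (N,) mean activation on visual tokens
    txt_mean = acts[~visual_mask].mean(dim=0)    # (N,) mean activation on text tokens
    labels = []
    for v, t in zip(vis_mean.tolist(), txt_mean.tolist()):
        if v > thresh and t > thresh:
            labels.append("multi-modal")
        elif v > thresh:
            labels.append("visual")
        elif t > thresh:
            labels.append("text")
        else:
            labels.append("inactive")
    return labels  # a real analysis would calibrate thresholds per layer
```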
Paperid: 323,   https://arxiv.org/pdf/2504.19549    
Authors:Deng Li, Bohao Xing, Xin Liu, Baiqiang Xia, Bihan Wen, Heikki Kälviäinen
Abstract:
Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.
Paperid: 324,   https://arxiv.org/pdf/2507.11152    
Authors:Duoyou Chen, Yunqing Chen, Can Zhang, Zhou Wang, Cheng Chen, Ruoxiu Xiao
Abstract:
Computed Tomography (CT) is a widely utilized imaging modality in clinical settings. Using densely acquired rotational X-ray arrays, CT can capture 3D spatial features. However, it is confronted with challenges such as significant time consumption and high radiation exposure. CT reconstruction methods based on sparse-view X-ray images have garnered substantial attention from researchers as they present a means to mitigate costs and risks. In recent years, diffusion models, particularly the Latent Diffusion Model (LDM), have demonstrated promising potential in the domain of 3D CT reconstruction. Nonetheless, due to the substantial differences between the 2D latent representation of X-ray modalities and the 3D latent representation of CT modalities, the vanilla LDM is incapable of achieving effective alignment within the latent space. To address this issue, we propose the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to efficiently extract latent 3D information from 2D X-ray images and achieve latent space alignment between modalities. Experimental results indicate that CLS-DM outperforms classical and state-of-the-art generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing the effectiveness and economic viability of sparse X-ray reconstructed CT but can also be generalized to other cross-modal transformation tasks, such as text-to-image synthesis. We have made our code publicly available at https://anonymous.4open.science/r/CLS-DM-50D6/ to facilitate further research and applications in other domains.
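A rough sketch of the cross-modal contrastive alignment idea is shown below, using a symmetric InfoNCE loss between pooled 2D X-ray latents and 3D CT latents; the pooling, shapes, and temperature are assumptions rather than CLS-DM's actual design:

```python
# Sketch: symmetric InfoNCE alignment between pooled X-ray latents and CT latents
# from paired samples, so the two modalities land close together in latent space.
import torch
import torch.nn.functional as F

def cross_modal_contrastive(xray_latent, ct_latent, temperature=0.07):
    """xray_latent, ct_latent: (B, D) pooled latent vectors from paired samples."""
    a = F.normalize(xray_latent, dim=-1)
    b = F.normalize(ct_latent, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarities of all pairs
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    # Symmetric loss: match X-ray -> CT and CT -> X-ray.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```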
Paperid: 325,   https://arxiv.org/pdf/2508.19575    
Authors:Zhu Xu, Zhaowen Wang, Yuxin Peng, Yang Liu
Abstract:
Compositional Customized Image Generation aims to customize multiple target concepts within the generated content, and has gained attention for its wide application. Existing approaches mainly concentrate on preserving the target entities' appearance, while neglecting fine-grained interaction control among the target entities. To equip models with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human and object and semantic control of the interaction between them. Two primary challenges exist for CHOI: (1) simultaneous identity preservation and interaction control require the model to decompose the human and object into self-contained identity features and pose-oriented interaction features, while current HOI image datasets fail to provide ideal samples for such feature-decomposed learning; (2) an inappropriate spatial configuration between human and object may lead to the lack of desired interaction semantics. To tackle these challenges, we first process a large-scale dataset in which each sample encompasses the same human-object pair in different interactive poses. We then design a two-stage model, Interact-Custom, which first explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior, and then, under the guidance of this mask, generates the target human and object interacting while preserving their identity features. Furthermore, if users provide a background image and the union location where the target human and object should appear, Interact-Custom also offers the optional functionality to specify them, enabling high content controllability. Extensive experiments on our tailored metrics for the CHOI task demonstrate the effectiveness of our approach.
Paperid: 326,   https://arxiv.org/pdf/2509.03536    
Authors:Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, Wei Jiang
Abstract:
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Currently, existing GUI agents usually utilize sequential episodes of multi-step operations across pages as the prior GUI knowledge, which fails to capture the complex transition relationship between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and a tailored multi-agent framework PG-Agent with task decomposition strategy is proposed to be injected with the guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
Paperid: 327,   https://arxiv.org/pdf/2504.12222    
Authors:Yike Liu, Jianhui Zhang, Haipeng Li, Shuaicheng Liu, Bing Zeng
Abstract:
While recent video deblurring methods have advanced significantly, they often overlook two valuable sources of prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the coding-prior-augmented dataset will be open-sourced.
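The motion-vector-based alignment in the CPFP module can be pictured as warping the previous frame's features toward the current frame. The sketch below is a generic illustration under assumed conventions (per-pixel displacements given in pixels), not the released code:

```python
# Sketch: warp previous-frame features toward the current frame using per-pixel
# motion vectors decoded from the video stream.
import torch
import torch.nn.functional as F

def warp_with_motion_vectors(prev_feat, mv):
    """prev_feat: (B, C, H, W); mv: (B, 2, H, W) displacements in pixels (dx, dy)."""
    b, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=mv.device),
                            torch.arange(w, device=mv.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + mv[:, 0]   # sampling location for each output pixel
    grid_y = ys.unsqueeze(0) + mv[:, 1]
    # Normalize to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(prev_feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```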
Paperid: 328,   https://arxiv.org/pdf/2410.22373    
Authors:Yufei Zhang, Yicheng Xu, Hongxin Wei, Zhiping Lin, Xiaofeng Zou, Cen Chen, Huiping Zhuang
Abstract:
Test-Time Adaptation (TTA) enables pre-trained models to bridge the gap between source and target datasets using unlabeled test data, addressing domain shifts caused by corruptions like weather changes, noise, or sensor malfunctions at test time. Multi-Modal Continual Test-Time Adaptation (MM-CTTA), as an extension of standard TTA, further allows models to handle multi-modal inputs and adapt to continuously evolving target domains. However, MM-CTTA faces critical challenges such as catastrophic forgetting and reliability bias, which are rarely addressed effectively under multi-modal corruption scenarios. In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), to tackle MM-CTTA tasks. MDAA introduces analytic learning, a closed-form training technique, through Analytic Classifiers (ACs) to mitigate catastrophic forgetting. Furthermore, we design the Dynamic Late Fusion Mechanism (DLFM) to dynamically select and integrate reliable information from different modalities. Extensive experiments show that MDAA achieves state-of-the-art performance across the proposed tasks.
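Analytic learning with closed-form classifiers generally reduces to a ridge-regression solve over frozen features. The sketch below shows that generic recipe, not the MDAA implementation; the regularization constant and shapes are assumptions:

```python
# Generic closed-form analytic classifier: solve a ridge-regression problem over
# frozen backbone features instead of iterative gradient training.
import torch

def fit_analytic_classifier(features, labels, num_classes, gamma=1e-3):
    """features: (N, D) frozen backbone features; labels: (N,) integer class ids."""
    y = torch.nn.functional.one_hot(labels, num_classes).float()        # (N, C) targets
    d = features.size(1)
    gram = features.t() @ features + gamma * torch.eye(d, device=features.device)
    weights = torch.linalg.solve(gram, features.t() @ y)                # (D, C) closed-form weights
    return weights

def predict(features, weights):
    return (features @ weights).argmax(dim=-1)
```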
Paperid: 329,   https://arxiv.org/pdf/2506.11356    
Authors:Sahar Nasirihaghighi, Negin Ghamsarian, Leonie Peschek, Matteo Munari, Heinrich Husslein, Raphael Sznitman, Klaus Schoeffmann
Abstract:
Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations.
Paperid: 330,   https://arxiv.org/pdf/2510.20393    
Authors:Qing Wang, Chong-Wah Ngo, Yu Cao, Ee-Peng Lim
Abstract:
Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is a mix of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
Paperid: 331,   https://arxiv.org/pdf/2509.12250    
Authors:Yihong Ji, Yunze Liu, Yiyao Zhuo, Weijiang Yu, Fei Ma, Joshua Huang, Fei Yu
Abstract:
The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address this task, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba's powerful modeling capabilities for streaming data and the Memory mechanism's efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.
Paperid: 332,   https://arxiv.org/pdf/2408.07836    
Authors:Doğa Yılmaz, He Wang, Towaki Takikawa, Duygu Ceylan, Kaan Akşit
Abstract:
Emerging immersive display technologies efficiently utilize resources with perceptual graphics methods such as foveated rendering and denoising. Running multiple perceptual graphics methods challenges devices with limited power and computational resources. We propose a computationally-lightweight learned multitasking perceptual graphics model. Given RGB images and text-prompts, our model performs text-described perceptual tasks in a single inference step. Simply daisy-chaining multiple models or training dedicated models can lead to model management issues and exhaust computational resources. In contrast, our flexible method unlocks consistent high quality perceptual effects with reasonable compute, supporting various permutations at varied intensities using adjectives in text prompts (e.g. mildly, lightly). Text-guidance provides ease of use for dynamic requirements such as creative processes. To train our model, we propose a dataset containing source and perceptually enhanced images with corresponding text prompts. We evaluate our model on desktop and embedded platforms and validate perceptual quality through a user study.
Paperid: 333,   https://arxiv.org/pdf/2507.07394    
Authors:Zhimin Zhang, Bi'an Du, Caoyuan Ma, Zheng Wang, Wei Hu
Abstract:
Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.
Paperid: 334,   https://arxiv.org/pdf/2403.11535    
Authors:Jianzhi Liu, Junchen Zhu, Lianli Gao, Heng Tao Shen, Jingkuan Song
Abstract:
The open-domain video generation models are constrained by the scale of the training video datasets, and some less common actions still cannot be generated. Some researchers explore video editing methods and achieve action generation by editing the spatial information of the same action video. However, this method mechanically generates identical actions without understanding, which does not align with the characteristics of open-domain scenarios. In this paper, we propose AICL, which empowers the generative model with the ability to understand action information in reference videos, similar to how humans do, through in-context learning. Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance across three typical video diffusion models on five metrics when using randomly selected categories from non-training datasets.
Paperid: 335,   https://arxiv.org/pdf/2508.07723    
Authors:Ting Xiang, Changjian Chen, Zhuo Tang, Qifeng Zhang, Fei Lyu, Li Yang, Jiapeng Zhang, Kenli Li
Abstract:
The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never downgrades its performance. Moreover, its generalization approaches the optimum at a rate of $O(\sqrt{d\ln(n)/n})$. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by 7.9% on average over six natural image datasets and by 3.4% on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.
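Sample re-weighting of generated images plugs into training as a per-sample weighted loss. The sketch below shows only that generic plumbing; the triplet-connection-based weighting that produces the weights is the paper's contribution and is treated here simply as a given vector:

```python
# Sketch: per-sample re-weighted classification loss, where low weights suppress
# noisy generated images. How the weights are computed is left to the method.
import torch
import torch.nn.functional as F

def reweighted_loss(logits, labels, weights):
    """logits: (B, C); labels: (B,); weights: (B,) in [0, 1], low for noisy generated images."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # (B,) unweighted losses
    return (weights * per_sample).sum() / weights.sum().clamp_min(1e-8)
```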
Paperid: 336,   https://arxiv.org/pdf/2411.03795    
Authors:Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min
Abstract:
The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset - the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3 subsets and covers various video types, containing 157,755 instruction question-answer pairs. Then, leveraging this foundation, we present the VQA2 series models. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and results demonstrate that the VQA2 series models achieve excellent performance in both tasks. Notably, our final model, the VQA2-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while maintaining strong competitiveness in quality scoring tasks. Our work provides a foundation and feasible approach for integrating low-level video quality assessment and understanding with LMMs.
Paperid: 337,   https://arxiv.org/pdf/2508.04050    
Authors:Jitong Liao, Yulu Gao, Shaofei Huang, Jialin Gao, Jie Lei, Ronghua Liang, Si Liu
Abstract:
Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.
Paperid: 338,   https://arxiv.org/pdf/2411.17406    
Authors:Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
Abstract:
Recent advances in vision-language models (VLMs) have demonstrated remarkable capabilities in image classification by leveraging predefined sets of labels to construct text prompts for zero-shot reasoning. However, these approaches face significant limitations in undefined domains, where the label space is vocabulary-unknown and composite. We thus introduce Generative Semantic Labels (GSLs), a novel task that aims to predict a comprehensive set of semantic labels for an image without being constrained by a predefined label set. Unlike traditional zero-shot classification, GSLs generates multiple semantic-level labels, encompassing objects, scenes, attributes, and relationships, thereby providing a richer and more accurate representation of image content. In this paper, we propose Chain-of-Action (CoA), an innovative method designed to tackle the GSLs task. CoA is motivated by the observation that enriched contextual information significantly improves generative performance during inference. Specifically, CoA decomposes the GSLs task into a sequence of detailed actions. Each action extracts and merges key information from the previous step, passing enriched context to the next, ultimately guiding the VLM to generate comprehensive and accurate semantic labels. We evaluate the effectiveness of CoA through extensive experiments on widely-used benchmark datasets. The results demonstrate significant improvements across key performance metrics, validating the capability of CoA to generate accurate and contextually rich semantic labels. Our work not only advances the state-of-the-art in generative semantic labels but also opens new avenues for applying VLMs in open-ended and dynamic real-world scenarios.
Paperid: 339,   https://arxiv.org/pdf/2508.02320    
Authors:Gefan Ye, Lin Li, Kexin Li, Jun Xiao, Long Chen
Abstract:
Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in the videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning's progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework LogicCAR that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantical dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.
Paperid: 340,   https://arxiv.org/pdf/2508.09785    
Authors:Linpu He, Yanan Li, Bingze Li, Elvis Han Cui, Donghui Wang
Abstract:
Learning from large-scale pre-trained models with strong generalization ability has shown remarkable success in a wide range of downstream tasks recently, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. It aims to continually learn new concepts from limited training samples without forgetting the old ones at the same time. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaptation; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes. Specifically, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-art methods without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
Paperid: 341,   https://arxiv.org/pdf/2508.05060    
Authors:Yifeng Huang, Zhang Chen, Yi Xu, Minh Hoai, Zhong Li
Abstract:
We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.
Paperid: 342,   https://arxiv.org/pdf/2507.20342    
Authors:Zhipeng Tang, Sha Zhang, Jiajun Deng, Chenjie Wang, Guoliang You, Yuting Huang, Xinrui Lin, Yanyong Zhang
Abstract:
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
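The context-adaptive gating idea can be sketched as a small controller that queries the VLM only when the scene looks complex or the cached guidance has grown stale. The heuristic below is an assumption for illustration, not the paper's exact CAI-Gate rule:

```python
# Sketch: query the heavyweight VLM only when scene complexity is high or the
# cached guidance is stale; otherwise the real-time planner reuses the last guidance.
class ContextAdaptiveGate:
    def __init__(self, complexity_thresh=0.6, max_staleness=10):
        self.complexity_thresh = complexity_thresh
        self.max_staleness = max_staleness
        self.steps_since_query = 0

    def should_query_vlm(self, scene_complexity: float) -> bool:
        self.steps_since_query += 1
        stale = self.steps_since_query >= self.max_staleness
        if scene_complexity > self.complexity_thresh or stale:
            self.steps_since_query = 0
            return True
        return False
```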
Paperid: 343,   https://arxiv.org/pdf/2508.03039    
Authors:Yiran Meng, Junhong Ye, Wei Zhou, Guanghui Yue, Xudong Mao, Ruomei Wang, Baoquan Zhao
Abstract:
Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest's superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.
Paperid: 344,   https://arxiv.org/pdf/2407.15842    
Authors:Ruixiang Jiang, Changwen Chen
Abstract:
Artistic styles are defined by both their structural and appearance elements. Existing neural stylization techniques primarily focus on transferring appearance-level features such as color and texture, often neglecting the equally crucial aspect of structural stylization. To address this gap, we introduce DiffArtist, the first 2D stylization method to offer fine-grained, simultaneous control over both structure and appearance style strength. This dual controllability is achieved by representing structure and appearance generation as separate diffusion processes, necessitating no further tuning or additional adapters. To properly evaluate this new capability of dual stylization, we further propose a Multimodal LLM-based stylization evaluator that aligns significantly better with human preferences than existing metrics. Extensive analysis shows that DiffArtist achieves superior style fidelity and dual-controllability compared to state-of-the-art methods. Its text-driven, training-free design and unprecedented dual controllability make it a powerful and interactive tool for various creative applications. Project homepage: https://diffusionartist.github.io.
Paperid: 345,   https://arxiv.org/pdf/2512.14225    
Authors:Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang
Abstract:
Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.
Paperid: 346,   https://arxiv.org/pdf/2503.05185    
Authors:Fengbin Zhu, Junfeng Li, Liangming Pan, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat-Seng Chua
Abstract:
Finance decision-making often relies on in-depth data analysis across various data sources, including financial tables, news articles, stock prices, etc. In this work, we introduce FinTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation (RAG) systems in finance. Built from heterogeneous data of NASDAQ 100 companies, FinTMMBench offers three significant advantages. 1) Multi-modal Corpus: It encompasses a hybrid of financial tables, news articles, daily stock prices, and visual technical charts as the corpus. 2) Temporal-aware Questions: Each question requires the retrieval and interpretation of its relevant data over a specific time period, including daily, weekly, monthly, quarterly, and annual periods. 3) Diverse Financial Analysis Tasks: The questions involve 10 different financial analysis tasks designed by domain experts, including information extraction, trend analysis, sentiment analysis, and event detection. We further propose a novel TMMHybridRAG method, which first leverages LLMs to convert data from other modalities (e.g., tabular, visual and time-series data) into textual format and then incorporates temporal information in each node when constructing graphs and dense indexes. Its effectiveness has been validated in extensive experiments, but notable gaps remain, highlighting the challenges presented by our FinTMMBench.
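A rough sketch of the temporal-aware retrieval idea (not the TMMHybridRAG implementation): each indexed node stores a timestamp alongside its textualized content, retrieval first filters nodes to the question's time window, and only then ranks by relevance. The toy lexical scorer and field names below are assumptions.

```python
# Minimal sketch of temporal-aware retrieval: every indexed node carries a
# timestamp, and retrieval filters by the question's time window before ranking.
from dataclasses import dataclass
from datetime import date

@dataclass
class Node:
    text: str          # all modalities assumed already converted to text
    timestamp: date
    score: float = 0.0

def retrieve(nodes, start: date, end: date, query_terms, top_k: int = 3):
    in_window = [n for n in nodes if start <= n.timestamp <= end]
    for n in in_window:   # toy lexical relevance; a dense index would go here
        n.score = sum(term.lower() in n.text.lower() for term in query_terms)
    return sorted(in_window, key=lambda n: n.score, reverse=True)[:top_k]

corpus = [
    Node("AAPL daily close rose 2.1%", date(2023, 7, 3)),
    Node("Quarterly revenue beat expectations", date(2023, 7, 28)),
    Node("Analyst downgrades chip sector", date(2023, 9, 1)),
]
hits = retrieve(corpus, date(2023, 7, 1), date(2023, 7, 31), ["revenue", "quarterly"])
print([h.text for h in hits])
```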
Paperid: 347,   https://arxiv.org/pdf/2502.00702    
Authors:Sheng Lyu, Ruiming Huang, Sijie Ji, Yasar Abbas Ur Rehman, Lan Ma, Chenshu Wu
Abstract:
Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for the next-generation video streaming platforms. It enables various applications including remote health, online affective computing, and deepfake detection. Yet the physiological information encapsulated in the video streams has long been neglected. In this paper, we present the design and implementation of CardioLive, the first online cardiac monitoring system in video streaming platforms. We leverage the naturally co-existing video and audio streams and devise CardioNet, the first audio-visual network to learn the cardiac series. It incorporates multiple unique designs to extract temporal and spectral features, ensuring robust performance under realistic video streaming conditions. To enable Service-On-Demand online cardiac monitoring, we implement CardioLive as a plug-and-play middleware service and develop systematic solutions to practical issues including changing FPS and unsynchronized streams. Extensive experiments have been done to demonstrate the effectiveness of our system. We achieve a Mean Absolute Error (MAE) of 1.79 BPM, outperforming the video-only and audio-only solutions by 69.2% and 81.2%, respectively. Our CardioLive service achieves average throughputs of 115.97 and 98.16 FPS when implemented in Zoom and YouTube. We believe our work opens up new applications for video stream systems. We will release the code soon.
Paperid: 348,   https://arxiv.org/pdf/2508.07878    
Authors:Hanting Wang, Shengpeng Ji, Shulei Wang, Hai Huang, Xiao Jin, Qifei Zhang, Tao Jin
Abstract:
Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradation via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
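The low-rank prompt decomposition above can be sketched as follows: each task prompt is the sum of a shared low-rank (task-general) component and a small task-specific residual. Shapes, the rank, and the module name are assumptions rather than the paper's configuration.

```python
# Hedged sketch of low-rank decomposed soft prompts: each task prompt is the
# sum of a task-general low-rank component shared across tasks and a small
# task-specific residual.
import torch
import torch.nn as nn

class LowRankTaskPrompts(nn.Module):
    def __init__(self, num_tasks: int, prompt_len: int, dim: int, rank: int = 4):
        super().__init__()
        self.shared_u = nn.Parameter(torch.randn(prompt_len, rank) * 0.02)   # task-general
        self.shared_v = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.task_specific = nn.Parameter(torch.zeros(num_tasks, prompt_len, dim))

    def forward(self, task_id: int) -> torch.Tensor:
        general = self.shared_u @ self.shared_v             # (prompt_len, dim)
        return general + self.task_specific[task_id]        # (prompt_len, dim)

prompts = LowRankTaskPrompts(num_tasks=4, prompt_len=8, dim=256)
print(prompts(task_id=2).shape)   # torch.Size([8, 256])
```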
Paperid: 349,   https://arxiv.org/pdf/2509.06992    
Authors:Kun Zhai, Siheng Chen, Xingjun Ma, Yu-Gang Jiang
Abstract:
Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a "beacon") which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.
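A hedged sketch of the class-aware prompt generator described above: a global label embedding conditions a small network that maps text-prompt features to visual prompt tokens. The dimensions and MLP design are placeholders, not the FedAPT implementation.

```python
# Hedged sketch of a class-aware prompt generator: a global label embedding
# (aggregating cross-client label information) conditions a small network that
# emits visual prompt tokens from text-prompt features.
import torch
import torch.nn as nn

class ClassAwarePromptGenerator(nn.Module):
    def __init__(self, text_dim=512, label_dim=512, prompt_len=4, visual_dim=768):
        super().__init__()
        self.prompt_len = prompt_len
        self.visual_dim = visual_dim
        self.net = nn.Sequential(nn.Linear(text_dim + label_dim, 512),
                                 nn.GELU(),
                                 nn.Linear(512, prompt_len * visual_dim))

    def forward(self, text_prompt_feat, global_label_emb):
        cond = torch.cat([text_prompt_feat, global_label_emb], dim=-1)
        return self.net(cond).view(-1, self.prompt_len, self.visual_dim)

gen = ClassAwarePromptGenerator()
text_feat = torch.randn(2, 512)         # pooled text-prompt features (toy)
label_emb = torch.randn(2, 512)         # global label embedding ("beacon")
print(gen(text_feat, label_emb).shape)  # torch.Size([2, 4, 768])
```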
Paperid: 350,   https://arxiv.org/pdf/2507.09242    
Authors:Shiqi Jiang, Xinpeng Li, Xi Mao, Changbo Wang, Chenhui Li
Abstract:
Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.
Paperid: 351,   https://arxiv.org/pdf/2501.17547    
Authors:Xinzhe Xia, Weiguang Zhao, Yuyao Yan, Guanyu Yang, Rui Zhang, Kaizhu Huang, Xi Yang
Abstract:
3D open-world classification is a challenging yet essential task in dynamic and unstructured real-world scenarios, requiring both open-category and open-pose recognition. To address these challenges, recent approaches often adopt sophisticated 2D pre-trained models to provide enriched and stable representations. However, these methods largely rely on how 3D objects can be projected into 2D space, a problem that remains unsolved and thus significantly limits their performance. Unlike these existing efforts, in this paper we make a pioneering exploration of 3D generative models for 3D open-world classification. Drawing on abundant prior knowledge from 3D generative models, we additionally craft a rotation-invariant feature extractor. This innovative synergy endows our pipeline with the advantages of being training-free, open-category, and pose-invariant, thus well suited to 3D open-world classification. Extensive experiments on benchmark datasets demonstrate the potential of generative models in 3D open-world classification, achieving state-of-the-art performance on ModelNet10 and McGill with 32.0% and 8.7% overall accuracy improvement, respectively.
Paperid: 352,   https://arxiv.org/pdf/2502.11093    
Authors:Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
Abstract:
Referring Medical Image Sequence Segmentation (Ref-MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (e.g., endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. This task holds significant clinical potential and offers a user-friendly advancement in medical imaging interpretation. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for interactive, text-driven guidance. To address these limitations, we propose Text-Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation. TPP captures the intrinsic relationships among sequential images along with their associated textual descriptions. Specifically, it enables the recognition of referred objects through cross-modal referring interaction, and maintains continuous tracking across the sequence via Transformer-based triple propagation, using text embeddings as queries. To support this task, we curate a large-scale benchmark, Ref-MISS-Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation.
Paperid: 353,   https://arxiv.org/pdf/2508.08179    
Authors:Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi
Abstract:
Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both the physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson's correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
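The Pearson's correlation loss mentioned above is simple to write down: the metric is trained to maximize the correlation between its predictions and the physical-alignment annotations, i.e., to minimize 1 - r. The sketch below uses toy tensors and illustrative names.

```python
# Minimal sketch of a Pearson correlation loss: maximizing correlation between
# predicted fidelity scores and physical-alignment annotations is expressed as
# minimizing (1 - r).
import torch

def pearson_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    pred_c = pred - pred.mean()
    targ_c = target - target.mean()
    r = (pred_c * targ_c).sum() / (pred_c.norm() * targ_c.norm() + eps)
    return 1.0 - r

pred = torch.tensor([0.2, 0.5, 0.9, 0.4], requires_grad=True)
target = torch.tensor([0.1, 0.6, 0.8, 0.3])
loss = pearson_loss(pred, target)
loss.backward()
print(float(loss))
```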
Paperid: 354,   https://arxiv.org/pdf/2506.23607    
Authors:Shiqi Zhang, Sha Zhang, Jiajun Deng, Yedong Shen, Mingxiao MA, Yanyong Zhang
Abstract:
Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.
Paperid: 355,   https://arxiv.org/pdf/2508.08754    
Authors:Qianru Qiu, Jiafeng Mao, Xueting Wang
Abstract:
With the advent of diffusion models, Text-to-Image (T2I) generation has seen substantial advancements. Current T2I models allow users to specify object colors using linguistic color names, and some methods aim to personalize color-object association through prompt learning. However, existing models struggle to provide comprehensive control over the color schemes of an entire image, especially for background elements and less prominent objects not explicitly mentioned in prompts. This paper proposes a novel approach to enhance color scheme control by integrating color palettes as a separate guidance mechanism alongside prompt instructions. We investigate the effectiveness of palette guidance by exploring various palette representation methods within a diffusion-based image colorization framework. To facilitate this exploration, we construct specialized palette-text-image datasets and conduct extensive quantitative and qualitative analyses. Our results demonstrate that incorporating palette guidance significantly improves the model's ability to generate images with desired color schemes, enabling a more controlled and refined colorization process.
Paperid: 356,   https://arxiv.org/pdf/2508.15272    
Authors:Han Li, Shaofei Huang, Longfei Xu, Yulu Gao, Beipeng Mu, Si Liu
Abstract:
Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.
Paperid: 357,   https://arxiv.org/pdf/2510.18837    
Authors:Yubin Zheng, Pak-Hei Yeung, Jing Xia, Tianjie Ju, Peng Tang, Weidong Qiu, Jagath C. Rajapakse
Abstract:
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
Paperid: 358,   https://arxiv.org/pdf/2507.18243    
Authors:Longjian Zeng, Zunjie Zhu, Rongfeng Lu, Ming Lu, Bolun Zheng, Chenggang Yan, Anke Xue
Abstract:
In recent years, foundation models for monocular depth estimation have received increasing attention. Current methods mainly address typical daylight conditions, but their effectiveness notably decreases in low-light environments. There is a lack of robust foundational models for monocular depth estimation specifically designed for low-light scenarios. This largely stems from the absence of large-scale, high-quality paired depth datasets for low-light conditions and of an effective parameter-efficient fine-tuning (PEFT) strategy. To address these challenges, we propose DepthDark, a robust foundation model for low-light monocular depth estimation. We first introduce a flare-simulation module and a noise-simulation module to accurately simulate the imaging process under nighttime conditions, producing high-quality paired depth datasets for low-light conditions. Additionally, we present an effective low-light PEFT strategy that utilizes illumination guidance and multiscale feature fusion to enhance the model's capability in low-light environments. Our method achieves state-of-the-art depth estimation performance on the challenging nuScenes-Night and RobotCar-Night datasets, validating its effectiveness using limited training data and computing resources.
Paperid: 359,   https://arxiv.org/pdf/2504.20447    
Authors:Zhicheng Lian, Lizhi Wang, Hua Huang
Abstract:
Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-consuming manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception-guided MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available on a public repository upon publication.
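As background for the RVQ-based distortion modeling above, the following is a generic residual vector quantization sketch: each stage quantizes the residual left by the previous stage against its own codebook, so the reconstruction error shrinks stage by stage. Codebook sizes and the nearest-neighbour rule are generic choices, not the APG-MOS design.

```python
# Generic residual vector quantization (RVQ) sketch: stage k quantizes the
# residual left after stages 1..k-1 against its own codebook.
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x: np.ndarray, codebooks: list):
    """Return per-stage code indices and the final reconstruction of x."""
    recon = np.zeros_like(x)
    codes = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        residual = x - recon
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        recon += cb[idx]
    return codes, recon

dim = 16
codebooks = [rng.normal(size=(32, dim)) for _ in range(3)]
x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
print(codes, float(np.linalg.norm(x - recon)))
```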
Paperid: 360,   https://arxiv.org/pdf/2504.21646    
Authors:Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo
Abstract:
The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.
Paperid: 361,   https://arxiv.org/pdf/2508.03437    
Authors:Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang, Xiaojuan Ban
Abstract:
Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC's superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to 35% in integrity scores while maintaining consistent classification accuracy.
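The channel-dependent mask-and-impute formulation above can be illustrated by how a training pair might be built: a random subset of EEG channels is hidden to form a low-resolution input, and the full-resolution signal serves as the reconstruction target. The masking ratio and array layout are assumptions.

```python
# Hedged sketch of building a channel-dependent mask-and-impute training pair:
# a random subset of EEG channels is dropped to form a "low-resolution" input,
# and the model would be asked to reconstruct the full-resolution signal.
import numpy as np

rng = np.random.default_rng(42)

def make_imputation_pair(eeg: np.ndarray, mask_ratio: float = 0.4):
    """eeg: (channels, time). Returns (masked input, channel mask, target)."""
    num_ch = eeg.shape[0]
    num_masked = int(round(mask_ratio * num_ch))
    masked_ch = rng.choice(num_ch, size=num_masked, replace=False)
    channel_mask = np.ones(num_ch, dtype=bool)
    channel_mask[masked_ch] = False            # False = channel hidden from the model
    masked_input = eeg * channel_mask[:, None]
    return masked_input, channel_mask, eeg

eeg = rng.normal(size=(32, 256))               # 32 channels, 256 time steps
x_in, mask, target = make_imputation_pair(eeg)
print(x_in.shape, int(mask.sum()), "channels kept")
```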
Paperid: 362,   https://arxiv.org/pdf/2503.08714    
Authors:Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Zixin Zhu, Sanping Zhou, Ming Yang, Le Wang
Abstract:
In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements, where users not only need automatic generation of lip synchronization and basic gestures from audio input but also desire semantically accurate and expressive body movement that can be "directly guided" through text descriptions. Therefore, we present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures and even leg movements for whole-body images. Besides, we introduce a multi-modal controlled video diffusion that generates photorealistic videos, where speech signals govern lip synchronization, facial expressions, and head motions while body movements are guided by the 2D poses. Furthermore, we introduce a token2pose translator to smoothly map 3D motion tokens to 2D pose sequences. This design mitigates the stiffness resulting from direct 3D-to-2D conversion and enhances the details of the generated body movements. Extensive experiments show that VersaAnimator synthesizes lip-synced and identity-preserving videos while generating expressive and semantically meaningful whole-body motions.
Paperid: 363,   https://arxiv.org/pdf/2412.11715    
Authors:RunLin Yu, Yipu Gong, Wenrui Li, Aiwen Sun, Mengren Zheng
Abstract:
Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to identify unseen classes and perform well in video classification tasks. However, modal imbalance in (G)ZSL leads to over-reliance on the optimal modality, reducing discriminative capabilities for unseen classes. Some studies have attempted to address this issue by modifying parameter gradients, but two challenges still remain: (a) Quality discrepancies, where modalities offer differing quantities and qualities of information for the same concept. (b) Content discrepancies, where sample contributions within a modality vary significantly. To address these challenges, we propose a Discrepancy-Aware Attention Network (DAAN) for Enhanced Audio-Visual ZSL. Our approach introduces a Quality-Discrepancy Mitigation Attention (QDMA) unit to minimize redundant information in the high-quality modality and a Contrastive Sample-level Gradient Modulation (CSGM) block to adjust gradient magnitudes and balance content discrepancies. We quantify modality contributions by integrating optimization and convergence rate for more precise gradient modulation in CSGM. Experiments demonstrate that DAAN achieves state-of-the-art performance on benchmark datasets, with ablation studies validating the effectiveness of individual modules.
Paperid: 364,   https://arxiv.org/pdf/2508.02340    
Authors:Fan Hu, Zijie Xin, Xirong Li
Abstract:
Ad-hoc Video Search (AVS) involves using a textual query to search for multiple relevant videos in a large collection of unlabeled short videos. The main challenge of AVS is the visual diversity of relevant videos. A simple query such as "Find shots of a man and a woman dancing together indoors" can span a multitude of environments, from brightly lit halls and shadowy bars to dance scenes in black-and-white animations. It is therefore essential to retrieve relevant videos as comprehensively as possible. Current solutions for the AVS task primarily fuse multiple features into one or more common spaces, yet overlook the need for diverse spaces. To fully exploit the expressive capability of individual features, we propose LPD, short for Learning Partially Decorrelated common spaces. LPD incorporates two key innovations: feature-specific common space construction and the de-correlation loss. Specifically, LPD learns a separate common space for each video and text feature, and employs de-correlation loss to diversify the ordering of negative samples across different spaces. To enhance the consistency of multi-space convergence, we designed an entropy-based fair multi-space triplet ranking loss. Extensive experiments on the TRECVID AVS benchmarks (2016-2023) justify the effectiveness of LPD. Moreover, diversity visualizations of LPD's spaces highlight its ability to enhance result diversity.
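One hedged reading of the de-correlation loss above: for the same query and the same negatives, the similarity score vectors produced in two different common spaces are pushed toward low correlation, so the spaces order negatives differently. The sketch below encodes that reading with a squared-correlation penalty; it is illustrative, not the exact LPD loss.

```python
# Hedged sketch of a de-correlation penalty between two common spaces: for the
# same query and negative set, the similarity score vectors produced in each
# space are pushed toward low correlation.
import torch
import torch.nn.functional as F

def space_scores(query: torch.Tensor, negatives: torch.Tensor) -> torch.Tensor:
    """Cosine similarities of one query against a set of negatives."""
    return F.normalize(query, dim=-1) @ F.normalize(negatives, dim=-1).T

def decorrelation_penalty(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    a = scores_a - scores_a.mean()
    b = scores_b - scores_b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return corr ** 2          # near zero when the spaces rank negatives independently

q_a, q_b = torch.randn(1, 128), torch.randn(1, 128)        # query embedded in two spaces
neg_a, neg_b = torch.randn(20, 128), torch.randn(20, 128)  # same negatives, two spaces
penalty = decorrelation_penalty(space_scores(q_a, neg_a).squeeze(0),
                                space_scores(q_b, neg_b).squeeze(0))
print(float(penalty))
```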
Paperid: 365,   https://arxiv.org/pdf/2504.09588    
Authors:Zhicong Wu, Hongbin Xu, Gang Xu, Ping Nie, Zhixin Yan, Jinkai Zheng, Liangqiong Qu, Ming Li, Liqiang Nie
Abstract:
Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.
Paperid: 366,   https://arxiv.org/pdf/2504.13535    
Authors:Jiahao Song, Yuzhao Wang
Abstract:
Music generation aims to create music segments that align with human aesthetics based on diverse conditional information. Despite advancements in generating music from specific textual descriptions (e.g., style, genre, instruments), the practical application is still hindered by ordinary users' limited expertise or time to write accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching. We employ multiple Multi-Layer Perceptrons (MLPs) to align multimodal conditional information into the audio's CLAP embedding space. Conditional flow matching is trained to reconstruct the compressed Mel-spectrogram in the pretrained VAE latent space guided by the aligned feature embedding. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi-agent collaboration, we construct an intelligent data annotation workflow centered around a fine-tuned Qwen2-VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, with each sample containing a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image-to-music, story-to-music, caption-to-music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high-quality music pieces whether the input conditions are unimodal or multimodal. We hope this work can advance the application of music generation in the multimedia field, making music creation more accessible. Our generated samples, code and dataset are available at musflow.github.io.
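The conditional flow matching objective referenced above has a compact generic form: sample a time t, interpolate linearly between noise and the data latent, and regress the network's predicted velocity toward (data - noise) under the multimodal condition. The tiny MLP and shapes below are placeholders, not the MusFlow architecture.

```python
# Minimal sketch of a conditional flow matching objective: regress the model's
# velocity toward (data - noise) along the straight-line interpolation path,
# conditioned on an aligned multimodal embedding.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.size(0), 1)                  # uniform time
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                             # constant target velocity
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

model = VelocityNet(dim=64, cond_dim=32)
x1 = torch.randn(8, 64)                            # toy latent mel targets
cond = torch.randn(8, 32)                          # aligned multimodal condition
print(float(flow_matching_loss(model, x1, cond)))
```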
Paperid: 367,   https://arxiv.org/pdf/2507.10595    
Authors:Yaowen Hu, Wenxuan Tu, Yue Liu, Miaomiao Li, Wenpeng Lu, Zhigang Luo, Xinwang Liu, Ping Chen
Abstract:
Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Addressing this challenging issue is vital for practical applications. However, research in this area remains underexplored. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results, especially for nodes with insufficient known neighborhood information. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation (DCFP) initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-aware Imputation (HNAI) categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement (HRE) integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on six widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.
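A rough sketch of cluster-aware feature propagation in the spirit of DCFP (not the authors' code): known node attributes are propagated over the graph with intra-cluster edges up-weighted, and observed attributes are clamped after every step. The weighting scheme and iteration count are assumptions.

```python
# Hedged sketch of cluster-aware feature propagation for attribute imputation:
# known features spread along edges, intra-cluster edges are up-weighted, and
# observed attributes are clamped back after each propagation step.
import numpy as np

def propagate(features, known_mask, adjacency, clusters,
              intra_weight=2.0, steps=10):
    """features: (n, d); known_mask: (n,) bool; adjacency: (n, n) 0/1."""
    weights = adjacency * np.where(clusters[:, None] == clusters[None, :],
                                   intra_weight, 1.0)
    row_sum = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, row_sum, out=np.zeros_like(weights),
                        where=row_sum > 0)
    x = np.where(known_mask[:, None], features, 0.0)
    for _ in range(steps):
        x = weights @ x
        x[known_mask] = features[known_mask]       # clamp observed attributes
    return x

rng = np.random.default_rng(1)
n, d = 6, 4
features = rng.normal(size=(n, d))
known = np.array([True, True, False, True, False, True])
adj = (rng.random((n, n)) > 0.5).astype(float)
np.fill_diagonal(adj, 0)
clusters = np.array([0, 0, 0, 1, 1, 1])
print(propagate(features, known, adj, clusters).round(2))
```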
Paperid: 368,   https://arxiv.org/pdf/2509.03409    
Authors:Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec
Abstract:
Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited in generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS-R speech foundation model, used as a front-end feature extractor. For the downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
Paperid: 369,   https://arxiv.org/pdf/2510.18459    
Authors:Tong Liu, Zhiwei Fan, Guanyan Peng, Haodan Zhang, Yucheng Zhang, Zhen Wang, Pengjin Xie, Liang Liu
Abstract:
Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch time estimation method. Additionally, a Deep Reinforcement Learning (DRL) enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large-scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.
Paperid: 370,   https://arxiv.org/pdf/2504.13419    
Authors:Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou
Abstract:
Recent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, we observe that, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.
Paperid: 371,   https://arxiv.org/pdf/2506.03652    
Authors:Cheng Zhang, Hongxia xie, Bin Wen, Songhan Zuo, Ruoxuan Zhang, Wen-huang Cheng
Abstract:
With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX 1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset -- one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via our project website.
Paperid: 372,   https://arxiv.org/pdf/2508.06205    
Authors:Ruiyan Wang, Lin Zuo, Zonghao Lin, Qiang Wang, Zhengxue Cheng, Rong Xie, Jun Ling, Li Song
Abstract:
The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI datasets focus on affordance details, often neglecting the influence of objects' physical properties on long-term human motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects' physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interaction strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.
Paperid: 373,   https://arxiv.org/pdf/2506.06084    
Authors:Bowen Yuan, Selena Song, Javier Fernandez, Yadan Luo, Mahsa Baktashmotlagh, Zijian Wang
Abstract:
Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measuring tasks; and (3) an Instruction Fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management plan for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5 VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by a margin of 11.9% and 34.6%.
Paperid: 374,   https://arxiv.org/pdf/2508.10771    
Authors:Jieyu Li, Xin Zhang, Joey Tianyi Zhou
Abstract:
Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset's unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on https://huggingface.co/datasets/Clarifiedfish/AEGIS.
Paperid: 375,   https://arxiv.org/pdf/2510.02787    
Authors:Jan Zdenek, Wataru Shimoda, Kota Yamaguchi
Abstract:
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .
Paperid: 376,   https://arxiv.org/pdf/2505.03319    
Authors:Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris
Abstract:
In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Following, we extend a recently-introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of "video, summary and summary description" can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), that employs a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against SOTA approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's needs about their content.
Paperid: 377,   https://arxiv.org/pdf/2502.11128    
Authors:Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
Abstract:
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
Paperid: 378,   https://arxiv.org/pdf/2507.04600    
Authors:Zhipeng Liu, Peibo Duan, Binwu Wang, Xuan Tang, Qi Chu, Changsheng Zhang, Yongsheng Huang, Bin Zhang
Abstract:
Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with the accuracy improvement up to 9.71%.
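The two regularization terms above can be sketched generically: a consistency term pulls scale-shared representations toward their cross-scale mean, while a disparity term penalizes cosine similarity between scale-specific representations of different scales. Shapes and weighting are assumptions, not the DisMS-TS code.

```python
# Hedged sketch of the two regularizers: consistency across scale-shared
# representations and disparity (low cosine similarity) across scale-specific
# representations.
import torch
import torch.nn.functional as F

def consistency_loss(shared: torch.Tensor) -> torch.Tensor:
    """shared: (num_scales, batch, dim). Penalize deviation from the cross-scale mean."""
    mean = shared.mean(dim=0, keepdim=True)
    return F.mse_loss(shared, mean.expand_as(shared))

def disparity_loss(specific: torch.Tensor) -> torch.Tensor:
    """specific: (num_scales, batch, dim). Penalize cross-scale cosine similarity."""
    s = F.normalize(specific, dim=-1)
    num_scales = s.size(0)
    total, pairs = 0.0, 0
    for i in range(num_scales):
        for j in range(i + 1, num_scales):
            total = total + (s[i] * s[j]).sum(dim=-1).abs().mean()
            pairs += 1
    return total / max(pairs, 1)

shared = torch.randn(3, 16, 64)      # 3 scales, batch 16, 64-d features
specific = torch.randn(3, 16, 64)
print(float(consistency_loss(shared)), float(disparity_loss(specific)))
```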
Paperid: 379,   https://arxiv.org/pdf/2509.17757    
Authors:Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng
Abstract:
Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.
Paperid: 380,   https://arxiv.org/pdf/2506.18246    
Authors:Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ning Jiang, Quan Lu, Ming Tang, Jinqiao Wang
Abstract:
Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query an instance-level description across a large gallery and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language-Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding. The visual branch detects candidate objects and extracts their instance-level visual features. The most similar candidate to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language-Instance Alignment (CLIA) for improved retrieval across images. We will release our code and benchmark publicly.
Paperid: 381,   https://arxiv.org/pdf/2507.04758    
Authors:Jiayun Hu, Yueyi He, Tianyi Liang, Changbo Wang, Chenhui Li
Abstract:
Emotion alignment between music and palettes is crucial for effective multimedia content, yet misalignment creates confusion that weakens the intended message. However, existing methods often generate only a single dominant color, missing emotion variation. Others rely on indirect mappings through text or images, resulting in the loss of crucial emotion details. To address these challenges, we present Music2Palette, a novel method for emotion-aligned color palette generation via cross-modal representation learning. We first construct MuCED, a dataset of 2,634 expert-validated music-palette pairs aligned through Russell-based emotion vectors. To directly translate music into palettes, we propose a cross-modal representation learning framework with a music encoder and color decoder. We further propose a multi-objective optimization approach that jointly enhances emotion alignment, color diversity, and palette coherence. Extensive experiments demonstrate that our method outperforms current methods in interpreting music emotion and generating attractive and diverse color palettes. Our approach enables applications like music-driven image recoloring, video generation, and data visualization, bridging the gap between auditory and visual emotion experiences.
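A speculative sketch of a multi-objective palette loss combining the three terms named above (emotion alignment, color diversity, palette coherence); the exact formulations in the paper may differ, and the weights and distance choices here are invented.

    import torch
    import torch.nn.functional as F

    def palette_loss(pred_palette, music_emotion, palette_emotion, w=(1.0, 0.1, 0.1)):
        """pred_palette: (B, K, 3) RGB in [0, 1]; emotion vectors: (B, E) Russell-style."""
        align = 1 - F.cosine_similarity(music_emotion, palette_emotion, dim=-1).mean()
        # diversity: encourage colors within a palette to spread out
        dists = torch.cdist(pred_palette, pred_palette)            # (B, K, K)
        diversity = -dists.mean()
        # coherence: penalize abrupt jumps between neighboring palette slots
        coherence = (pred_palette[:, 1:] - pred_palette[:, :-1]).pow(2).mean()
        return w[0] * align + w[1] * diversity + w[2] * coherence

    loss = palette_loss(torch.rand(4, 5, 3), torch.randn(4, 8), torch.randn(4, 8))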
Paperid: 382,   https://arxiv.org/pdf/2508.05343    
Authors:Junyu Zhou, Yuyang Huang, Wenrui Dai, Junni Zou, Ziyang Zheng, Nuowen Kan, Chenglin Li, Hongkai Xiong
Abstract:
The recent prominence of 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable as a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both the efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with a simultaneously reduced number of primitives and lower memory consumption.
Paperid: 383,   https://arxiv.org/pdf/2509.20756    
Authors:Yuhong Zhang, Han Wang, Yiwen Wang, Rong Xie, Li Song
Abstract:
Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
Paperid: 384,   https://arxiv.org/pdf/2508.15535    
Authors:Guotao Liang, Juncheng Hu, Ximing Xing, Jing Zhang, Qian Yu
Abstract:
We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.
Paperid: 385,   https://arxiv.org/pdf/2507.06744    
Authors:Yafei Zhang, Yongle Shang, Huafeng Li
Abstract:
Weakly supervised text-to-person image matching, as a crucial approach to reducing models' reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model's ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching.
Paperid: 386,   https://arxiv.org/pdf/2507.19077    
Authors:Yangyang Xu, Xi Ye, Duo Su
Abstract:
Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. First, we propose intra-task experts that partition along intermediate hidden dimensions of MLPs, enabling finer decomposition of task information while maintaining parameter efficiency. Second, we introduce shared experts that consolidate common information across different contexts of the same task, reducing redundancy, and allowing routing experts to focus on unique aspects. Third, we design a global expert that facilitates adaptive knowledge transfer across tasks based on both input feature and task requirements, promoting beneficial information sharing while preventing harmful interference. In addition, we adopt a fine-tuning approach that improves parameter efficiency by training only the decoder parameters. Extensive experimental results show that the proposed FGMoE uses fewer parameters and significantly outperforms current MoE-based competitive MTL models on two dense prediction datasets (i.e., NYUD-v2, PASCAL-Context) in various metrics.
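A minimal, hypothetical PyTorch sketch of the three expert types described above (routed experts, an always-on shared expert, and a global expert); the real FGMoE partitions MLP hidden dimensions per task and is more involved, so this only shows the routing-plus-summation pattern.

    import torch
    import torch.nn as nn

    class TinyExpert(nn.Module):
        def __init__(self, dim, hidden):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        def forward(self, x):
            return self.net(x)

    class FGMoELayer(nn.Module):
        def __init__(self, dim=256, hidden=64, n_routed=4, top_k=2):
            super().__init__()
            self.routed = nn.ModuleList(TinyExpert(dim, hidden) for _ in range(n_routed))
            self.shared = TinyExpert(dim, hidden)          # common information within a task
            self.global_expert = TinyExpert(dim, hidden)   # cross-task knowledge transfer
            self.router = nn.Linear(dim, n_routed)
            self.top_k = top_k

        def forward(self, x):                              # x: (B, dim)
            gates = self.router(x).softmax(dim=-1)         # (B, n_routed)
            top_w, top_i = gates.topk(self.top_k, dim=-1)
            out = self.shared(x) + self.global_expert(x)   # always-on experts
            for k in range(self.top_k):                    # add the k-th routed expert per sample
                expert_out = torch.stack(
                    [self.routed[int(i)](x[b]) for b, i in enumerate(top_i[:, k])])
                out = out + top_w[:, k:k + 1] * expert_out
            return out

    y = FGMoELayer()(torch.randn(8, 256))                  # (8, 256)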
Paperid: 387,   https://arxiv.org/pdf/2501.06004    
Authors:Yin Wang, Zixuan Wang, Hao Lu, Zhen Qin, Hailiang Zhao, Guanjie Cheng, Ge Su, Li Kuang, Mengchu Zhou, Shuiguang Deng
Abstract:
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. However, the class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. Existing class-imbalanced semi-supervised learning (CISSL) methods mainly focus on rebalancing datasets but ignore the potential of using hard examples to enhance performance, making it difficult to fully harness the power of unlabeled data even with sophisticated algorithms. To address this issue, we propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi). This method distinguishes the entropy differences among logits of hard and easy examples, thereby identifying hard examples and increasing the utility of unlabeled data, better addressing the imbalance problem in CISSL. In addition, we maintain a class-balanced memory bank with confidence decay for storing high-confidence embeddings to enhance the pseudo-labels' reliability. Although our method is simple, it is effective and seamlessly integrates with existing approaches. We perform comprehensive experiments on standard CISSL benchmarks and experimentally demonstrate that our proposed SeMi outperforms existing state-of-the-art methods on multiple benchmarks, especially in reversed scenarios, where our best result shows approximately a 54.8% improvement over the baseline methods.
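Hedged sketch of the entropy-based hard-example selection idea described above (illustrative only; thresholds, confidence decay, and the memory bank are omitted): examples whose prediction entropy is highest are treated as hard.

    import torch

    def prediction_entropy(logits):
        p = logits.softmax(dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)

    def select_hard_examples(logits, ratio=0.2):
        ent = prediction_entropy(logits)
        k = max(1, int(ratio * len(ent)))
        hard_idx = ent.topk(k).indices        # highest-entropy (most ambiguous) examples
        return hard_idx, ent

    logits = torch.randn(100, 10)             # logits for an unlabeled batch
    hard_idx, ent = select_hard_examples(logits)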
Paperid: 388,   https://arxiv.org/pdf/2501.12640    
Authors:Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee
Abstract:
Tackling toxic behavior in digital communication continues to be a pressing concern for both academics and industry professionals. While significant research has explored toxicity on platforms like social networks and discussion boards, podcasts, despite their rapid rise in popularity, remain relatively understudied in this context. This work seeks to fill that gap by curating a dataset of political podcast transcripts and analyzing them with a focus on conversational structure. Specifically, we investigate how toxicity surfaces and intensifies through sequences of replies within these dialogues, shedding light on the organic patterns by which harmful language can escalate across conversational turns. Warning: Contains potentially abusive/toxic content.
Paperid: 389,   https://arxiv.org/pdf/2504.18087    
Authors:Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
Abstract:
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
Paperid: 390,   https://arxiv.org/pdf/2506.21298    
Authors:Atharva Mehta, Shivam Chauhan, Monojit Choudhury
Abstract:
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt, but lacks stability in notes, rhythm alignment, and aesthetics; it is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen offer faster training, are more efficient, and can produce better-quality output in comparison, but show slightly higher redundancy in their generations.
Paperid: 391,   https://arxiv.org/pdf/2510.16463    
Authors:Haocheng Tang, Ruoke Yan, Xinhui Yin, Qi Zhang, Xinfeng Zhang, Siwei Ma, Wen Gao, Chuanmin Jia
Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.
Paperid: 392,   https://arxiv.org/pdf/2410.01737    
Authors:Bingchen Miao, Wenqiao Zhang, Juncheng Li, Wangyu Wu, Siliang Tang, Zhaocheng Li, Haochen Shi, Jun Xiao, Yueting Zhuang
Abstract:
Multimodal Industrial Anomaly Detection (MIAD), which utilizes 3D point clouds and 2D RGB images to identify abnormal regions in products, plays a crucial role in industrial quality inspection. However, traditional MIAD settings assume that all 2D and 3D modalities are paired, ignoring the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Additionally, models trained on modality-incomplete data are prone to overfitting. Therefore, MIAD models that demonstrate robustness against modality-incomplete data are highly desirable in practice. To address this, we introduce a pioneering study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), and under the guidance of experts, we construct the MIIAD Bench with rich modality-missing settings to account for imperfect learning environments with incomplete multimodal information. As expected, we find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. To tackle this challenge, we propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR. Specifically: i) We propose Modality-incomplete Instruction to guide the multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on HyperNetwork. ii) Then, we construct a Double-Pseudo Hybrid Module to highlight the uniqueness of modality combinations, mitigating overfitting issues and further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly outperforms traditional MIAD methods on our newly created MIIAD dataset, proving its practical application value.
Paperid: 393,   https://arxiv.org/pdf/2507.04377    
Authors:Xiao Zhang, Johan Bos
Abstract:
Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
Paperid: 394,   https://arxiv.org/pdf/2509.16960    
Authors:Ruiyan Wang, Zhengxue Cheng, Zonghao Lin, Jun Ling, Yuzhou Liu, Yanru An, Rong Xie, Li Song
Abstract:
3D digital garment generation and editing play a pivotal role in fashion design, virtual try-on, and gaming. Traditional methods struggle to meet the growing demand due to technical complexity and high resource costs. Learning-based approaches offer faster, more diverse garment synthesis based on specific requirements and reduce human efforts and time costs. However, they still face challenges such as inconsistent multi-view geometry or textures and heavy reliance on detailed garment topology and manual rigging. We propose SemanticGarment, a 3D Gaussian-based method that realizes high-fidelity 3D garment generation from text or image prompts and supports semantic-based interactive editing for flexible user customization. To ensure multi-view consistency and garment fitting, we propose to leverage structural human priors for the generative model by introducing a 3D semantic clothing model, which initializes the geometry structure and lays the groundwork for view-consistent garment generation and editing. Without the need to regenerate or rely on existing mesh templates, our approach allows for rapid and diverse modifications to existing Gaussians, either globally or within a local region. To address the artifacts caused by self-occlusion in garment reconstruction from a single image, we develop a self-occlusion optimization strategy to mitigate holes and artifacts that arise when directly animating self-occluded garments. Extensive experiments are conducted to demonstrate our superior performance in 3D garment generation and editing.
Paperid: 395,   https://arxiv.org/pdf/2504.15624    
Authors:Jingzhi Li, Changjiang Luo, Ruoyu Chen, Hua Zhang, Wenqi Ren, Jianhou Gan, Xiaochun Cao
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, the versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.
Paperid: 396,   https://arxiv.org/pdf/2508.09912    
Authors:Chaoran Feng, Zhenyu Tang, Wangbo Yu, Yatian Pang, Yian Zhao, Jianbin Zhao, Li Yuan, Yonghong Tian
Abstract:
Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and
Paperid: 397,   https://arxiv.org/pdf/2505.06133    
Authors:Hongming Wang, Yifeng Wu, Huimin Huang, Hongtao Wu, Jia-Xuan Jiang, Xiaodong Zhang, Hao Zheng, Xian Wu, Yefeng Zheng, Jinping Xu, Jing Cheng
Abstract:
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.
Paperid: 398,   https://arxiv.org/pdf/2502.11897    
Authors:Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
Abstract:
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to the content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) A training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.
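A rough, assumption-laden sketch of a content-adaptive latent frame-rate scheduler in the spirit of the description above: chunks with more motion keep more latent frames. The complexity measure (mean inter-frame difference) and the thresholds are invented stand-ins for the paper's information-theoretic criterion.

    import torch

    def chunk_complexity(chunk):                      # chunk: (T, C, H, W)
        diffs = (chunk[1:] - chunk[:-1]).abs().mean(dim=(1, 2, 3))
        return diffs.mean().item()                    # mean inter-frame difference

    def schedule_frame_rates(video, chunk_len=16, thresholds=(0.05, 0.15), rates=(2, 4, 8)):
        """Return a latent frame count per chunk: low motion -> fewer latent frames."""
        plan = []
        for start in range(0, video.size(0) - 1, chunk_len):
            c = chunk_complexity(video[start:start + chunk_len])
            rate = rates[sum(c > t for t in thresholds)]
            plan.append((start, rate))
        return plan

    video = torch.rand(64, 3, 32, 32)
    print(schedule_frame_rates(video))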
Paperid: 399,   https://arxiv.org/pdf/2512.09335    
Authors:Seonghwa Choi, Moonkyeong Choi, Mingyu Jang, Jaekyung Kim, Jianfei Cai, Wen-Huang Cheng, Sanghoon Lee
Abstract:
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed as Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
Paperid: 400,   https://arxiv.org/pdf/2502.21245    
Authors:Haoran Zhang, Yong Liu, Yunzhong Qiu, Haixuan Liu, Zhongyi Pei, Jianmin Wang, Mingsheng Long
Abstract:
Time series analysis is crucial in diverse scenarios. Beyond forecasting, considerable real-world tasks are categorized into classification, imputation, and anomaly detection, underscoring different capabilities termed time series understanding in this paper. While GPT-style models have been positioned as foundation models for time series forecasting, the BERT-style architecture, which has made significant advances in natural language understanding, has not been fully unlocked for time series understanding, possibly attributed to the undesirable dropout of essential elements of BERT. In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. In addition to a natural adaptation of masked modeling, we propose a parallel task of functional token prediction to embody vital multi-granularity structures. Our model is pre-trained on 260 billion time points across diverse domains. Leveraging multi-granularity representations, TimesBERT achieves state-of-the-art performance across four typical downstream understanding tasks, outperforming task-specific models and language pre-trained backbones, positioning it as a versatile foundation model for time series understanding.
Paperid: 401,   https://arxiv.org/pdf/2504.19086    
Authors:Xiaoran Xu, Jiangang Yang, Wenyue Chong, Wenhui Shi, Shichu Sun, Jing Xing, Jian Liu
Abstract:
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level (e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8% mPC on Cityscapes-C and +7.9% mPC on DWD over baselines, demonstrating its efficacy.
Paperid: 402,   https://arxiv.org/pdf/2504.08259    
Authors:Ruohao Zhan, Yijin Li, Yisheng He, Shuo Chen, Yichen Shen, Xinyu Chen, Zilong Dong, Zhaoyang Huang, Guofeng Zhang
Abstract:
Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework CoProSketch, providing prominent controllability and details for sketch generation with diffusion models. A straightforward method is fine-tuning a pretrained image generation diffusion model with binarized sketch images. However, we find that the diffusion models fail to generate clear binary images, which makes the produced sketches chaotic. We thus propose to represent the sketches by unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users generate a rough sketch from a bounding box and a text prompt. The rough sketch can be manually edited and fed back into the model for iterative refinement and will be decoded to a detailed sketch as the final result. Additionally, we curate the first large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative workflows.
Paperid: 403,   https://arxiv.org/pdf/2507.09693    
Authors:Jiali Chen, Yujie Jia, Zihan Wu, Jinyu Yang, Jianpeng Chen, Xusen Hei, Jiayuan Xie, Yi Cai, Qing Li
Abstract:
Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines (i.e., science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (e.g., chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.
Paperid: 404,   https://arxiv.org/pdf/2507.12760    
Authors:Ruicheng Zhang, Haowei Guo, Kanghui Tian, Jun Zhou, Mingliang Yan, Zeyu Zhang, Shen Zhao
Abstract:
Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance, with an average Dice improvement of 3% over state-of-the-art methods.
Paperid: 405,   https://arxiv.org/pdf/2504.10068    
Authors:Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang
Abstract:
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
Paperid: 406,   https://arxiv.org/pdf/2510.22513    
Authors:Junran Wu, Beng Chin Ooi, Ke Xu
Abstract:
Signed Graph Neural Networks (SGNNs) are widely adopted to analyze complex patterns in signed graphs with both positive and negative links. Given the noisy nature of real-world connections, the robustness of SGNN has also emerged as a pivotal research area. Under the supervision of empirical properties, graph structure learning has shown its robustness on signed graph representation learning; however, there remains a paucity of research investigating a robust SGNN with theoretical guidance. Inspired by the success of the graph information bottleneck (GIB) in information extraction, we propose RIDGE, a novel framework for Robust sIgned graph learning through joint Denoising of Graph inputs and supervision targEts. Different from the basic GIB, we extend the GIB theory with the capability of target space denoising, given the co-existence of noise in both input and target spaces. In instantiation, RIDGE effectively cleanses input data and supervision targets via a tractable objective function produced by a reparameterization mechanism and variational approximation. We extensively validate our method on four prevalent signed graph datasets, and the results show that RIDGE clearly improves the robustness of popular SGNN models under various levels of noise.
Paperid: 407,   https://arxiv.org/pdf/2504.17349    
Authors:Yiyan Xu, Wuqiang Zheng, Wenjie Wang, Fengbin Zhu, Xinting Hu, Yang Zhang, Fuli Feng, Tat-Seng Chua
Abstract:
Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
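Hypothetical dual-tower sketch loosely following the disentanglement idea above: one tower embeds user style from history images, the other embeds semantics from the reference image, and a toy decoder reconstructs a target; all architectures and dimensions are invented and far simpler than the paper's LMM-based design.

    import torch
    import torch.nn as nn

    class DualTowerDisentangler(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            enc = lambda: nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, dim))
            self.style_tower, self.semantic_tower = enc(), enc()
            self.decoder = nn.Linear(2 * dim, 3 * 32 * 32)   # toy reconstruction head

        def forward(self, history_imgs, reference_img):
            style = self.style_tower(history_imgs).mean(dim=0, keepdim=True)  # pooled user style
            semantic = self.semantic_tower(reference_img)
            recon = self.decoder(torch.cat([style.expand_as(semantic), semantic], dim=-1))
            return style, semantic, recon.view(-1, 3, 32, 32)

    model = DualTowerDisentangler()
    style, semantic, recon = model(torch.rand(5, 3, 32, 32), torch.rand(1, 3, 32, 32))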
Paperid: 408,   https://arxiv.org/pdf/2508.04055    
Authors:Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Binbin Li, Xiaojun Bi, Yu Zhou
Abstract:
Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel Prior Pool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the Prior Fusion Module (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves comparable or even superior performance compared with task-specific expert models, and simultaneously retains task scalability for seamless adaptation to new tasks.
Paperid: 409,   https://arxiv.org/pdf/2505.03310    
Authors:Lei Liu, Zhenghao Chen, Dong Xu
Abstract:
3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and Tanks&Temples.
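Hypothetical sketch of the Mixture-of-Priors pattern described above: several lightweight MLPs map the hyperprior to candidate prior features, which a gating network blends; the layer sizes and gating form are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class MixtureOfPriors(nn.Module):
        def __init__(self, hyper_dim=64, prior_dim=64, n_priors=4):
            super().__init__()
            self.priors = nn.ModuleList(
                nn.Sequential(nn.Linear(hyper_dim, prior_dim), nn.ReLU(),
                              nn.Linear(prior_dim, prior_dim))
                for _ in range(n_priors)
            )
            self.gate = nn.Linear(hyper_dim, n_priors)

        def forward(self, hyper):                                         # hyper: (N, hyper_dim)
            feats = torch.stack([p(hyper) for p in self.priors], dim=1)   # (N, K, D)
            weights = self.gate(hyper).softmax(dim=-1).unsqueeze(-1)      # (N, K, 1)
            return (weights * feats).sum(dim=1)                           # blended MoP feature

    mop = MixtureOfPriors()
    prior_feature = mop(torch.randn(10, 64))   # usable for entropy modeling or step guidance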
Paperid: 410,   https://arxiv.org/pdf/2509.13722    
Authors:Dingwei Zhang, Dong Zhang, Jinhui Tang
Abstract:
Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.
Paperid: 411,   https://arxiv.org/pdf/2504.05137    
Authors:Jinxiang Lai, Wenlong Wu, Jiawei Zhan, Jian Li, Bin-Bin Gao, Jun Liu, Jie Zhang, Song Guo
Abstract:
Box-supervised instance segmentation methods aim to achieve instance segmentation with only box annotations. Recent methods have demonstrated the effectiveness of acquiring high-quality pseudo masks under the teacher-student framework. Building upon this foundation, we propose a BoxSeg framework involving two novel and general modules named the Quality-Aware Module (QAM) and the Peer-assisted Copy-paste (PC). The QAM obtains high-quality pseudo masks and better measures the mask quality to help reduce the effect of noisy masks, by leveraging the quality-aware multi-mask complementation mechanism. The PC imitates Peer-Assisted Learning to further improve the quality of the low-quality masks with the guidance of the obtained high-quality pseudo masks. Theoretical and experimental analyses demonstrate the proposed QAM and PC are effective. Extensive experimental results show the superiority of our BoxSeg over the state-of-the-art methods, and illustrate the QAM and PC can be applied to improve other models.
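Illustrative sketch of quality-aware fusion of several candidate pseudo masks, in the spirit of the QAM described above; the quality proxy here (mean foreground confidence) and the softmax weighting are assumptions rather than the paper's measure.

    import torch

    def quality_aware_fusion(masks):
        """masks: (K, H, W) soft masks in [0, 1] from different teacher heads/augmentations."""
        quality = masks.flatten(1).mean(dim=1)            # crude per-mask quality proxy
        weights = quality.softmax(dim=0).view(-1, 1, 1)
        fused = (weights * masks).sum(dim=0)              # quality-weighted complementation
        return (fused > 0.5).float(), quality

    masks = torch.rand(3, 64, 64)
    pseudo_mask, quality = quality_aware_fusion(masks)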
Paperid: 412,   https://arxiv.org/pdf/2512.05121    
Authors:Tianshun Han, Benjia Zhou, Ajian Liu, Yanyan Liang, Du Zhang, Zhen Lei, Jun Wan
Abstract:
PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It overcomes key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE) that captures both time and frequency-domain audio features for fine-grained emotion analysis, and an Emotional Style Modeling Module (ESMM) that models individual expression patterns based on voiceprint characteristics. To address data scarcity, the method leverages a newly constructed 3D-EmoStyle dataset. Evaluations demonstrate that PESTalk outperforms state-of-the-art methods in producing realistic and personalized facial animations.
Paperid: 413,   https://arxiv.org/pdf/2507.20980    
Authors:Shide Du, Chunming Wu, Zihan Fang, Wendi Zhao, Yilin Wu, Changwei Wang, Shiping Wang
Abstract:
Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
Paperid: 414,   https://arxiv.org/pdf/2508.01577    
Authors:Lei Xie, Junxiong Huang, Yuanjing Feng, Qingrun Zeng
Abstract:
The parcellation of Cranial Nerves (CNs) serves as a crucial quantitative methodology for evaluating the morphological characteristics and anatomical pathways of specific CNs. Multi-modal CNs parcellation networks have achieved promising segmentation performance, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI. However, insufficient exploration of diffusion MRI information has led to low performance of existing multi-modal fusion. In this work, we propose a tractography-guided Dual-label Collaborative Learning Network (DCLNet) for multi-modal CNs parcellation. The key contribution of our DCLNet is the introduction of coarse labels of CNs obtained from fiber tractography through a CN atlas, and collaborative learning with precise labels annotated by experts. Meanwhile, we introduce a Modality-adaptive Encoder Module (MEM) to achieve soft information swapping between structural MRI and diffusion MRI. Extensive experiments conducted on the publicly available Human Connectome Project (HCP) dataset demonstrate performance improvements compared to single-label networks. This systematic validation underscores the effectiveness of dual-label strategies in addressing inherent ambiguities in CNs parcellation tasks.
Paperid: 415,   https://arxiv.org/pdf/2510.18606    
Authors:Chunyu Qiao, Tong Liu, Yucheng Zhang, Zhiwei Fan, Pengjin Xie, Zhen Wang, Liang Liu
Abstract:
In large-scale short video platforms, CDN resource selection plays a critical role in maintaining Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in-the-wild network measurements during video playback in a production short video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video, underscoring a fundamental and dynamic trade-off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrates QoE and cost via a mathematical model and introduces an intra-video, control-theoretic CDN resource selection approach that balances QoE and cost under network dynamics. To reduce computational overhead, PIRA employs state space pruning and adaptive parameter adjustment to efficiently solve the high-dimensional optimization problem. In large-scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in start-up delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
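A toy sketch of the QoE-versus-cost trade-off at the heart of the selection problem above: per chunk, pick the CDN maximizing estimated QoE minus a price-weighted cost. All names and numbers are invented, and the real PIRA uses a control-theoretic formulation rather than this one-shot utility.

    def select_cdn(cdns, weight=0.5):
        """cdns: list of dicts with an estimated 'qoe' score and a 'unit_cost'."""
        def utility(c):
            return c["qoe"] - weight * c["unit_cost"]
        return max(cdns, key=utility)

    cdns = [
        {"name": "cdn_a", "qoe": 0.92, "unit_cost": 1.0},
        {"name": "cdn_b", "qoe": 0.88, "unit_cost": 0.6},
        {"name": "cdn_c", "qoe": 0.75, "unit_cost": 0.3},
    ]
    print(select_cdn(cdns)["name"])   # weight controls how aggressively cost is traded for QoE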
Paperid: 416,   https://arxiv.org/pdf/2509.11796    
Authors:Haodong Chen, Haojian Huang, XinXiang Yin, Dian Shao
Abstract:
Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.
Paperid: 417,   https://arxiv.org/pdf/2504.17395    
Authors:Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang
Abstract:
Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
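To make the prompt-synthesis step concrete, here is a minimal sketch of building a visual prompt for an unseen category as a similarity-weighted mixture of training-category prompts. It is not the released SDVPT code; the function name `synthesize_prompt`, the softmax temperature, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def synthesize_prompt(unseen_text_emb: torch.Tensor,
                      train_text_embs: torch.Tensor,
                      train_prompts: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """
    unseen_text_emb: (d,) text embedding of the unseen category
    train_text_embs: (K, d) text embeddings of training categories
    train_prompts:   (K, p, d) learned visual prompts per training category
    Returns a (p, d) prompt for the unseen category as a similarity-weighted mix.
    """
    sims = F.cosine_similarity(unseen_text_emb.unsqueeze(0), train_text_embs, dim=-1)  # (K,)
    weights = F.softmax(sims / temperature, dim=0)                                     # (K,)
    return torch.einsum("k,kpd->pd", weights, train_prompts)

# toy usage with random tensors
K, d, p = 10, 512, 4
prompt = synthesize_prompt(torch.randn(d), torch.randn(K, d), torch.randn(K, p, d))
print(prompt.shape)  # torch.Size([4, 512])
```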
Paperid: 418,   https://arxiv.org/pdf/2507.12883    
Authors:Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji
Abstract:
The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg's superior performance.
Paperid: 419,   https://arxiv.org/pdf/2508.05585    
Authors:Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin
Abstract:
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.
Paperid: 420,   https://arxiv.org/pdf/2508.13633    
Authors:Bowen Tian, Wenshuo Chen, Zexi Li, Songning Lai, Jiemin Wu, Yutao Yue
Abstract:
How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on CIFAR-100, Caltech-256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on GitHub.
Paperid: 421,   https://arxiv.org/pdf/2508.01181    
Authors:Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, Xun Yang
Abstract:
Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, where only one or all modalities reflect the true emotion. However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2) AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.
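A rough picture of modality-specific experts combined through a regularized gate (the MoSE idea) can be given in a few lines. This is a hedged sketch, not the authors' implementation: the two-expert layout, the gate architecture, and the uniform-gate balance penalty are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedModalityExperts(nn.Module):
    """Two modality-specific experts (audio, visual) mixed by a learned gate."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.audio_expert = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.visual_expert = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        # gate weights over the two modalities, per sample
        w = F.softmax(self.gate(torch.cat([audio_feat, visual_feat], dim=-1)), dim=-1)  # (B, 2)
        fused = w[:, :1] * self.audio_expert(audio_feat) + w[:, 1:] * self.visual_expert(visual_feat)
        # balance regularizer: keep the batch-average gate close to uniform (0.5/0.5)
        balance_loss = ((w.mean(dim=0) - 0.5) ** 2).sum()
        return fused, balance_loss

model = GatedModalityExperts(dim=128)
fused, reg = model(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape, reg.item())
```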
Paperid: 422,   https://arxiv.org/pdf/2507.13061    
Authors:Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng
Abstract:
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adapting to unseen, complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.
Paperid: 423,   https://arxiv.org/pdf/2502.07327    
Authors:Haowen Gao, Liang Pang, Shicheng Xu, Leigang Qu, Tat-Seng Chua, Huawei Shen, Xueqi Cheng
Abstract:
With the rapid development of AI-generated content (AIGC), the creation of high-quality AI-generated videos has become faster and easier, resulting in the Internet being flooded with all kinds of video content. However, the impact of these videos on the content ecosystem remains largely unexplored. Video information retrieval remains a fundamental approach for accessing video content. Building on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks, we investigate whether similar biases emerge in the context of challenging video retrieval, where temporal and visual factors may further influence model behavior. To explore this, we first construct a comprehensive benchmark dataset containing both real and AI-generated videos, along with a set of fair and rigorous metrics to assess bias. This benchmark consists of 13,000 videos generated by two state-of-the-art open-source video generation models. We meticulously design a suite of rigorous metrics to accurately measure this preference, accounting for potential biases arising from the limited frame rate and suboptimal quality of AIGC videos. We then apply three off-the-shelf video retrieval models to perform retrieval tasks on this hybrid dataset. Our findings reveal a clear preference for AI-generated videos in retrieval. Further investigation shows that incorporating AI-generated videos into the training set of retrieval models exacerbates this bias. Unlike the preference observed in image modalities, we find that video retrieval bias arises from both unseen visual and temporal information, making the root causes of video bias a complex interplay of these two factors. To mitigate this bias, we fine-tune the retrieval models using a contrastive learning approach. The results of this study highlight the potential implications of AI-generated videos on retrieval systems.
Paperid: 424,   https://arxiv.org/pdf/2504.19183    
Authors:Mi Zheng, Guanglei Yang, Zitong Huang, Zhenhua Guo, Kevin Han, Wangmeng Zuo
Abstract:
With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.
Paperid: 425,   https://arxiv.org/pdf/2508.15232    
Authors:Ruipu Wu, Yige Zhang, Jinyu Chen, Linjiang Huang, Shifeng Zhang, Xu Zhou, Liang Wang, Si Liu
Abstract:
Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support training and evaluation for DuAl-VLN, we construct HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and exchange only minimal coordinate information to ensure efficiency.
Paperid: 426,   https://arxiv.org/pdf/2504.09451    
Authors:Tianyi Wang, Harry Cheng, Ming-Hui Liu, Mohan Kankanhalli
Abstract:
Proactive Deepfake detection via robust watermarks has attracted growing attention ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while existing watermark-based methods demonstrate reasonable detection performance, they lack localization functionality and explainability in their detection results. Additionally, the unstable robustness of watermarks can significantly affect detection performance. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
Paperid: 427,   https://arxiv.org/pdf/2601.22498    
Authors:Wei Yang, Rui Zhong, Yiqun Chen, Shixuan Li, Heng Ping, Chi Lu, Peng Jiang
Abstract:
Multimodal recommendation aims to enhance user preference modeling by leveraging rich item content such as images and text. Yet dominant systems fuse modalities in the spatial domain, obscuring the frequency structure of signals and amplifying misalignment and redundancy. We adopt a spectral information-theoretic view and show that, under an orthogonal transform that approximately block-diagonalizes bandwise covariances, the Gaussian Information Bottleneck objective decouples across frequency bands, providing a principled basis for a separate-then-fuse paradigm. Building on this foundation, we propose FITMM, a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition to obtain orthogonal bands, and forms lightweight within-band multimodal components. A residual, task-adaptive gate aggregates bands into the final representation. To control redundancy and improve generalization, we regularize training with a frequency-domain IB term that allocates capacity across bands (Wiener-like shrinkage with shut-off of weak bands). We further introduce a cross-modal spectral consistency loss that aligns modalities within each band. The model is jointly optimized with the standard recommendation loss. Extensive experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.
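The Wiener-like shrinkage with shut-off of weak bands mentioned above can be illustrated with a tiny band-gating routine. This is an assumption-laden sketch rather than FITMM's actual regularizer; `wiener_band_gate`, the noise-variance argument, and the shut-off threshold are hypothetical.

```python
import torch

def wiener_band_gate(band_energy: torch.Tensor, noise_var: float, shutoff: float = 0.1) -> torch.Tensor:
    """
    band_energy: (B,) average signal energy per frequency band.
    Returns per-band gains in [0, 1]; bands whose gain falls below `shutoff` are shut off.
    """
    gain = band_energy / (band_energy + noise_var)   # Wiener-like shrinkage
    return torch.where(gain >= shutoff, gain, torch.zeros_like(gain))

def apply_band_gate(band_feats: torch.Tensor, gains: torch.Tensor) -> torch.Tensor:
    """band_feats: (B, d) per-band item representations; scale each band by its gain."""
    return band_feats * gains.unsqueeze(-1)

energy = torch.tensor([4.0, 1.0, 0.05])
gains = wiener_band_gate(energy, noise_var=0.5)
print(gains)                                          # strong bands kept, weak band shut off
print(apply_band_gate(torch.randn(3, 8), gains).shape)
```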
Paperid: 428,   https://arxiv.org/pdf/2509.25963    
Authors:Longzhen Yang, Zhangkai Ni, Ying Wen, Yihang Liu, Lianghua He, Heng Tao Shen
Abstract:
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) -- a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports -- outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding.
Paperid: 429,   https://arxiv.org/pdf/2508.09168    
Authors:Feiyu Wang, Zhiyuan Zhao, Yuandong Liu, Da Zhang, Junyu Gao, Hao Sun, Xuelong Li
Abstract:
Scalable Vector Graphics (SVG) is widely used in front-end development and UI/UX design due to its scalability, editability, and rendering efficiency. However, turning creative ideas into precise vector graphics remains a time-consuming challenge. To address this, we introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. Through advanced data augmentation and annotation, we create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs. Our approach ensures semantic accuracy and structural completeness, supported by curriculum learning and reinforcement learning optimization. Experiments show that SVGen outperforms general large models and traditional rendering methods in both effectiveness and efficiency. Code, model, and dataset are available on GitHub.
Paperid: 430,   https://arxiv.org/pdf/2510.12493    
Authors:An Zhao, Piaopiao Yu, Zhe Zhu, Mingqiang Wei
Abstract:
3D Gaussian Splatting (3DGS) has exhibited remarkable capabilities in 3D scene reconstruction. However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge. The performance of existing 3DGS-based deblurring methods is limited by their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and the inability to effectively control the erroneous densification of Gaussian primitives caused by motion blur. To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting (BSGS), to accurately reconstruct 3D scenes from motion-blurred images. BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with the rough camera poses fixed, Global Rigid Transformation further corrects motion-induced blur distortions. To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages. Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the art.
Paperid: 431,   https://arxiv.org/pdf/2504.20629    
Authors:Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung
Abstract:
In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at https://mm.kaist.ac.kr/projects/AlignDiT .
Paperid: 432,   https://arxiv.org/pdf/2505.06278    
Authors:Tongfei Bian, Mathieu Chollet, Tanaya Guha
Abstract:
The need for social robots and agents to interact with and assist humans is growing steadily. To successfully interact with humans, they need to understand and analyse socially interactive scenes from their (robot's) perspective. Works that model social situations between humans and agents are few, and even existing ones are often too computationally intensive to be suitable for real-time deployment or for real-world scenarios with limited available information. We propose a robust knowledge distillation framework that models social interactions through various multimodal cues, yet is robust against incomplete and noisy information during inference. Our teacher model is trained with multimodal input (body, face and hand gestures, gaze, raw images) and transfers knowledge to a student model that relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over relevant baselines on multiple downstream social understanding tasks, even with up to 51% of its input corrupted. The student model is highly efficient: it has less than 1% of the teacher model's parameters and uses roughly 0.5‰ of the teacher's FLOPs. Our code will be made public upon publication.
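The multimodal-teacher-to-pose-only-student transfer described above is compatible with standard soft-label distillation; the sketch below shows that generic recipe, not the paper's exact objective. The temperature, loss weighting, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """
    Standard soft-label distillation: the pose-only student mimics the multimodal
    teacher's tempered distribution while still fitting the ground-truth labels.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# toy usage with random logits for 5 social-understanding classes
loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5), torch.randint(0, 5, (8,)))
print(loss.item())
```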
Paperid: 433,   https://arxiv.org/pdf/2509.08008    
Authors:Bingjian Yang, Danni Xu, Kaipeng Niu, Wenxuan Liu, Zheng Wang, Mohan Kankanhalli
Abstract:
The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.
Paperid: 434,   https://arxiv.org/pdf/2506.06733    
Authors:Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Hong-Han Shuai, Wen-Huang Cheng
Abstract:
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.
Paperid: 435,   https://arxiv.org/pdf/2406.09356    
Authors:Chunyi Li, Xiele Wu, Haoning Wu, Donghui Feng, Zicheng Zhang, Guo Lu, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin
Abstract:
Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.
Paperid: 436,   https://arxiv.org/pdf/2508.18608    
Authors:Janet Wang, Xin Hu, Yunbei Zhang, Diabate Almamy, Vagamon Bamba, Konan Amos Sébastien Koffi, Yao Koffi Aubin, Zhengming Ding, Jihun Hamm, Rie R. Yotsu
Abstract:
Skin Neglected Tropical Diseases (NTDs) impose severe health and socioeconomic burdens in impoverished tropical communities. Yet, advancements in AI-driven diagnostic support are hindered by data scarcity, particularly for underrepresented populations and rare manifestations of NTDs. Existing dermatological datasets often lack the demographic and disease spectrum crucial for developing reliable recognition models of NTDs. To address this, we introduce eSkinHealth, a novel dermatological dataset collected on-site in Côte d'Ivoire and Ghana. Specifically, eSkinHealth contains 5,623 images from 1,639 cases and encompasses 47 skin diseases, focusing uniquely on skin NTDs and rare conditions among West African populations. We further propose an AI-expert collaboration paradigm to implement foundation language and segmentation models for efficient generation of multimodal annotations, under dermatologists' guidance. In addition to patient metadata and diagnosis labels, eSkinHealth also includes semantic lesion masks, instance-specific visual captions, and clinical concepts. Overall, our work provides a valuable new resource and a scalable annotation framework, aiming to catalyze the development of more equitable, accurate, and interpretable AI tools for global dermatology.
Paperid: 437,   https://arxiv.org/pdf/2506.01520    
Authors:Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, Wynne Hsu
Abstract:
Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
Paperid: 438,   https://arxiv.org/pdf/2503.04095    
Authors:Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Abstract:
Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.
Paperid: 439,   https://arxiv.org/pdf/2504.10358    
Authors:Rui Chen, Lei Sun, Jing Tang, Geng Li, Xiangxiang Chu
Abstract:
Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose FingER, a novel entity-level reasoning evaluation framework that first automatically generates Fine-grained Entity-level questions, and then answers those questions with a Reasoning model that produces scores, which can subsequently be weighted and summed into an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on specific entities of the content, thereby making answering or scoring much easier for MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of 11.8% on GenAI-Bench and 5.5% on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.
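The weighted aggregation of per-question scores into an overall score could look like the following sketch. The perspective names, weights, and averaging scheme are illustrative assumptions, not the released FingER pipeline.

```python
def overall_score(question_scores: dict[str, list[float]],
                  perspective_weights: dict[str, float]) -> float:
    """
    question_scores: per-perspective lists of entity-level question scores (e.g. in [0, 1]).
    perspective_weights: application-specific weights, assumed to sum to 1.
    Returns a weighted sum of per-perspective means as the overall video score.
    """
    total = 0.0
    for perspective, scores in question_scores.items():
        if scores:
            total += perspective_weights.get(perspective, 0.0) * (sum(scores) / len(scores))
    return total

# hypothetical perspectives and weights
scores = {"consistency": [0.9, 0.7], "physics": [0.4], "appearance": [0.8, 0.6, 1.0]}
weights = {"consistency": 0.4, "physics": 0.3, "appearance": 0.3}
print(round(overall_score(scores, weights), 3))
```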
Paperid: 440,   https://arxiv.org/pdf/2410.14334    
Authors:Taras Kucherenko, Derek Peristy, Judith Bütepage
Abstract:
Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. Additionally, we introduce and evaluate a set of better-correlated metrics that can drive progress in the field.
Paperid: 441,   https://arxiv.org/pdf/2507.12758    
Authors:Wangzheng Shi, Yinglin Zheng, Yuxin Lin, Jianmin Bao, Ming Zeng, Dong Chen
Abstract:
Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates an Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.
Paperid: 442,   https://arxiv.org/pdf/2501.02786    
Authors:Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
Abstract:
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and losing fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
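An audio-visual conditional normalisation layer of the kind described can be sketched as feature-wise modulation: normalize the audio features, then re-scale and shift them with statistics predicted from the visual context. The module below is a generic illustration under that reading, not the authors' layer; the class name and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AudioVisualConditionalNorm(nn.Module):
    """Normalize audio features, then modulate them with scale/shift predicted from visual context."""
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(audio_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(visual_dim, audio_dim)
        self.to_beta = nn.Linear(visual_dim, audio_dim)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, T, audio_dim); visual_feat: (B, visual_dim)
        gamma = self.to_gamma(visual_feat).unsqueeze(1)  # (B, 1, audio_dim)
        beta = self.to_beta(visual_feat).unsqueeze(1)
        return gamma * self.norm(audio_feat) + beta

layer = AudioVisualConditionalNorm(audio_dim=64, visual_dim=128)
out = layer(torch.randn(2, 100, 64), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 100, 64])
```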
Paperid: 443,   https://arxiv.org/pdf/2510.20189    
Authors:Xinyi Hu, Yuran Wang, Ruixu Zhang, Yue Li, Wenxuan Liu, Zheng Wang
Abstract:
Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
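A TPP-flavored suspicion score with long-term dependencies and cumulative, decaying contributions might be written as below. This is only a toy formulation consistent with the description above; the decay constant, the coefficients, and the function name are assumptions, not SPAN's actual formula.

```python
import math

def suspicion_score(events: list[tuple[float, float]], t: float,
                    base: float = 0.0, tau: float = 30.0) -> float:
    """
    events: (timestamp, coefficient) pairs for suspicious actions observed up to time t.
    Each past action contributes its coefficient, decayed exponentially with age,
    so suspicion accumulates over time but also fades, mimicking a TPP-style intensity.
    """
    return base + sum(c * math.exp(-(t - ts) / tau) for ts, c in events if ts <= t)

# hypothetical modulated suspicion coefficients for three observed actions
actions = [(10.0, 0.6), (25.0, 0.9), (40.0, 0.3)]
for t in (20.0, 45.0, 120.0):
    print(t, round(suspicion_score(actions, t), 3))
```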
Paperid: 444,   https://arxiv.org/pdf/2412.18945    
Authors:Sijie Xu, Runqi Wang, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Abstract:
Diffusion-based stylization methods typically denoise from a specific partial noise state for image-to-image and video-to-video tasks. This multi-step diffusion process is computationally expensive and hinders real-world application. A promising solution to speed up the process is to obtain few-step consistency models through trajectory distillation. However, current consistency models only force the initial-step alignment between the probability flow ODE (PF-ODE) trajectories of the student and the imperfect teacher models. This training strategy cannot ensure the consistency of whole trajectories. To address this issue, we propose single trajectory distillation (STD) starting from a specific partial noise state. We introduce a trajectory bank to store the teacher model's trajectory states, mitigating the time cost during training. Besides, we use an asymmetric adversarial loss to enhance the style and quality of the generated images. Extensive experiments on image and video stylization demonstrate that our method surpasses existing acceleration models in terms of style similarity and aesthetic evaluations. Our code and results will be available on the project page: https://single-trajectory-distillation.github.io.
Paperid: 445,   https://arxiv.org/pdf/2507.12795    
Authors:Penglei Sun, Yaoxian Song, Xiangru Zhu, Xiang Liu, Qiang Wang, Yue Liu, Changqun Xia, Tiefeng Li, Yang Yang, Xiaowen Chu
Abstract:
Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named SVM-City, derived from multi-Scale scenarios with multi-View and multi-Modal instruction tuning data. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design an LVLM named City-VLM. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than implementing directly explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs by an average of 18.14% on question-answering tasks. Our method demonstrates pragmatic and generalization performance across multiple outdoor scenes.
Paperid: 446,   https://arxiv.org/pdf/2504.09209    
Authors:Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
Abstract:
Masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.
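The speech-queried selection of frames to mask can be illustrated with a short attention-scoring routine: score each motion frame by its aggregated attention from speech queries and mask the top-scoring frames. The sketch is a plausible reading of the mechanism, not the authors' code; the shapes, mask ratio, and aggregation over queries are assumptions.

```python
import torch
import torch.nn.functional as F

def speech_queried_mask(motion_keys: torch.Tensor,
                        speech_queries: torch.Tensor,
                        mask_ratio: float = 0.4) -> torch.Tensor:
    """
    motion_keys:    (T, d) per-frame motion features (keys)
    speech_queries: (Q, d) learned speech query vectors
    Returns a boolean mask of shape (T,) that is True for frames to be masked,
    chosen as the frames with the highest aggregated speech-to-motion attention.
    """
    attn = F.softmax(speech_queries @ motion_keys.t() / motion_keys.shape[-1] ** 0.5, dim=-1)  # (Q, T)
    frame_scores = attn.sum(dim=0)                      # aggregate over queries -> (T,)
    k = max(1, int(mask_ratio * motion_keys.shape[0]))
    top_idx = frame_scores.topk(k).indices
    mask = torch.zeros(motion_keys.shape[0], dtype=torch.bool)
    mask[top_idx] = True
    return mask

mask = speech_queried_mask(torch.randn(120, 64), torch.randn(8, 64))
print(mask.sum().item(), "of", mask.numel(), "frames masked")
```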
Paperid: 447,   https://arxiv.org/pdf/2504.11286    
Authors:Pengcheng Zheng, Kecheng Chen, Jiaxin Huang, Bohao Chen, Ju Liu, Yazhou Ren, Xiaorong Pu
Abstract:
Medical image restoration tasks aim to recover high-quality images from degraded observations, a capability in growing demand in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle with rendering computationally-efficient reconstruction results. Moreover, they usually ignore the reliability of the restoration results, which is much more urgent in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently-reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugated symmetric property of FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results in various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
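The Monte Carlo prior-generation step can be approximated generically with MC dropout: run a dropout-bearing segmentation model several times and use the mean and variance of the sampled masks as a reliability-weighted prior and an uncertainty map. The code below uses a toy stand-in network, not MedSAM, and the sampling count is arbitrary.

```python
import torch
import torch.nn as nn

def mc_prior(model: nn.Module, image: torch.Tensor, n_samples: int = 8):
    """
    Run the (dropout-containing) segmentation model several times in train mode so
    dropout stays stochastic; the mean of the sampled masks acts as a reliability-weighted
    prior and the variance flags unreliable regions.
    """
    was_training = model.training
    model.train()                      # keep dropout active for stochastic sampling
    with torch.no_grad():
        samples = torch.stack([torch.sigmoid(model(image)) for _ in range(n_samples)], dim=0)
    model.train(was_training)
    return samples.mean(dim=0), samples.var(dim=0)   # prior, uncertainty

# toy stand-in segmentation head with dropout (hypothetical, only for demonstration)
toy = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.3), nn.Conv2d(8, 1, 1))
prior, uncert = mc_prior(toy, torch.randn(1, 1, 64, 64))
print(prior.shape, uncert.shape)
```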
Paperid: 448,   https://arxiv.org/pdf/2507.10293    
Authors:Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, Jingyuan Chen
Abstract:
Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.
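The inter-clip exponential blending strategy can be pictured as mixing overlapping frames from the previous clip into the current one with geometrically decaying weights. The sketch below is an interpretation of that idea, not the paper's implementation; the decay factor and the per-frame weighting scheme are assumptions.

```python
import torch

def exponential_blend(prev_clip_tail: torch.Tensor,
                      curr_clip_head: torch.Tensor,
                      decay: float = 0.5) -> torch.Tensor:
    """
    prev_clip_tail, curr_clip_head: (F, C, H, W) overlapping frames from the previous
    and current clip. Frame i of the overlap is blended with weight decay**(i+1) on the
    previous clip, so its influence fades as we move deeper into the current clip.
    """
    n_frames = curr_clip_head.shape[0]
    weights = torch.tensor([decay ** (i + 1) for i in range(n_frames)]).view(-1, 1, 1, 1)
    return weights * prev_clip_tail + (1.0 - weights) * curr_clip_head

blended = exponential_blend(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(blended.shape)  # torch.Size([4, 3, 64, 64])
```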
Paperid: 449,   https://arxiv.org/pdf/2311.16475    
Authors:Yu-Wei Zhan, Fan Liu, Xin Luo, Xin-Shun Xu, Liqiang Nie, Mohan Kankanhalli
Abstract:
Human-Object Interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection. The complexity of human behavior and the diverse contexts in which these interactions occur make it further challenging. Contextual cues, such as the participants involved, body language, and the surrounding environment, play crucial roles in predicting these interactions, especially those that are unseen or ambiguous. Moreover, large VLMs are trained on vast image and text data, enabling them to generate contextual cues that help in understanding real-world contexts, object relationships, and typical interactions. Building on this, in this paper we introduce ConCue, a novel approach for improving visual feature extraction in HOI detection. Specifically, we first design specialized prompts to utilize large VLMs to generate contextual cues within an image. To fully leverage these cues, we develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors. Extensive experiments and analyses demonstrate the effectiveness of using these contextual cues for HOI detection. The experimental results show that integrating ConCue with existing state-of-the-art methods significantly enhances their performance on two widely used datasets.
Paperid: 450,   https://arxiv.org/pdf/2505.09193    
Authors:Wei Jiang, Junru Li, Kai Zhang, Li Zhang
Abstract:
Recent forward prediction-based learned video compression (LVC) methods have achieved impressive results, even surpassing VVC reference software VTM under the Low Delay B (LDB) configuration. In contrast, learned bidirectional video compression (BVC) remains underexplored and still lags behind its forward-only counterparts. This performance gap is mainly due to the limited ability to extract diverse and accurate contexts: most existing BVCs primarily exploit temporal motion while neglecting non-local correlations across frames. Moreover, they lack the adaptability to dynamically suppress harmful contexts arising from fast motion or occlusion. To tackle these challenges, we propose BiECVC, a BVC framework that incorporates diversified local and non-local context modeling along with adaptive context gating. For local context enhancement, BiECVC reuses high-quality features from lower layers and aligns them using decoded motion vectors without introducing extra motion overhead. To model non-local dependencies efficiently, we adopt a linear attention mechanism that balances performance and complexity. To further mitigate the impact of inaccurate context prediction, we introduce Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, to dynamically filter contextual information based on conditional coding results. Extensive experiments demonstrate that BiECVC achieves state-of-the-art performance, reducing the bit-rate by 13.4% and 15.7% compared to VTM 13.2 under the Random Access (RA) configuration with intra periods of 32 and 64, respectively. To our knowledge, BiECVC is the first learned video codec to surpass VTM 13.2 RA across all standard test datasets.
Paperid: 451,   https://arxiv.org/pdf/2509.06552    
Authors:Zheqi Lv, Wenqiao Zhang, Kairui Fu, Qi Tian, Shengyu Zhang, Jiajie Su, Jingyuan Chen, Kun Kuang, Fei Wu
Abstract:
Real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona's effectiveness and generality.
Paperid: 452,   https://arxiv.org/pdf/2504.09915    
Authors:Yuxi Bi, Yunfan Gao, Haofen Wang
Abstract:
Advancements in Generative AI offer new opportunities for FashionAI, surpassing traditional recommendation systems that often lack transparency and struggle to integrate expert knowledge, leaving the potential of personalized fashion styling untapped. To address these challenges, we present PAFA (Principle-Aware Fashion), a multi-granular knowledge base that organizes professional styling expertise into three levels: metadata, domain principles, and semantic relationships. Using PAFA, we develop StePO-Rec, a knowledge-guided method for multi-step outfit recommendation. StePO-Rec provides structured suggestions using a scenario-dimension-attribute framework, employing recursive tree construction to align recommendations with both professional principles and individual preferences. A preference-trend re-ranking system further adapts to fashion trends while maintaining the consistency of the user's original style. Experiments on the widely used personalized outfit dataset IQON show a 28% increase in Recall@1 and a 32.8% increase in MAP. Furthermore, case studies highlight improved explainability, traceability, result reliability, and the seamless integration of expertise and personalization.
Paperid: 453,   https://arxiv.org/pdf/2507.01455    
Authors:Yuxing Liu, Ji Zhang, Zhou Xuchuan, Jingzhong Xiao, Huimin Yang, Jiaxin Zhong
Abstract:
Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variability in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models, acting as a plug-in to boost their performance. Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods.
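The adaptive dual-threshold idea, i.e. region-specific thresholds inside detected anomalous objects and a stricter global threshold elsewhere, can be sketched as follows. This is a simplified stand-in for ADT-Net (which predicts thresholds with a network); the quantile rule, box format, and default values are assumptions.

```python
import numpy as np

def dual_threshold_segment(anomaly_scores: np.ndarray,
                           boxes: list[tuple[int, int, int, int]],
                           fg_quantile: float = 0.5,
                           bg_threshold: float = 0.9) -> np.ndarray:
    """
    anomaly_scores: (H, W) pixel-wise anomaly scores in [0, 1].
    boxes: object-level detections (x1, y1, x2, y2) from the coarse stage.
    Inside each box, the threshold adapts to the scores in that box (a quantile);
    outside boxes, a strict global background threshold is applied.
    """
    seg = anomaly_scores >= bg_threshold
    for x1, y1, x2, y2 in boxes:
        region = anomaly_scores[y1:y2, x1:x2]
        local_thr = np.quantile(region, fg_quantile)   # region-specific threshold
        seg[y1:y2, x1:x2] = region >= local_thr
    return seg

scores = np.random.rand(100, 100)
mask = dual_threshold_segment(scores, boxes=[(10, 10, 40, 40)])
print(mask.sum(), "anomalous pixels")
```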
Paperid: 454,   https://arxiv.org/pdf/2507.17554    
Authors:Xide Xu, Sandesh Kamath, Muhammad Atif Butt, Bogdan Raducanu
Abstract:
The versatility of diffusion models in generating customized images from few samples raises significant privacy concerns, particularly regarding unauthorized modifications of private content. This concern has renewed efforts to develop protection mechanisms based on adversarial attacks, which generate effective perturbations to poison diffusion models. Our work is motivated by the observation that these models exhibit a high degree of abstraction within their semantic latent space (`h-space'), which encodes critical high-level features for generating coherent and meaningful content. In this paper, we propose a novel anti-customization approach, called HAAD (h-space based Adversarial Attack for Diffusion models), that leverages adversarial attacks to craft perturbations based on the h-space that can efficiently degrade the image generation process. Building upon HAAD, we further introduce a more efficient variant, HAAD-KV, that constructs perturbations solely based on the KV parameters of the h-space. This strategy offers stronger protection and is computationally less expensive. Despite their simplicity, our methods outperform state-of-the-art adversarial attacks, highlighting their effectiveness.
Paperid: 455,   https://arxiv.org/pdf/2507.11997    
Authors:Tairan Huang, Yili Wang, Qiutong Li, Changlong He, Jianliang Gao
Abstract:
Graph fraud detection has garnered significant attention as Graph Neural Networks (GNNs) have proven effective in modeling complex relationships within multimodal data. However, existing graph fraud detection methods typically use preprocessed node embeddings and predefined graph structures to reveal fraudsters, ignoring the rich semantic cues contained in raw textual information. Although Large Language Models (LLMs) exhibit powerful capabilities in processing textual information, it remains a significant challenge to perform multimodal fusion of processed textual embeddings with graph structures. In this paper, we propose a Multi-level LLM Enhanced Graph Fraud Detection framework called MLED. In MLED, we utilize LLMs to extract external knowledge from textual information to enhance graph fraud detection methods. To integrate LLMs with graph structure information and enhance the ability to distinguish fraudsters, we design a multi-level LLM enhanced framework including a type-level enhancer and a relation-level enhancer. The former enhances the difference between fraudsters and benign entities, while the latter enhances the importance of fraudsters across different relations. The experiments on four real-world datasets show that MLED achieves state-of-the-art performance in graph fraud detection as a generalized framework that can be applied to existing methods.
Paperid: 456,   https://arxiv.org/pdf/2507.04000    
Authors:Fan Zhang, Jinpeng Chen, Huan Li, Senzhang Wang, Yuan Cao, Kaimin Wei, JianXiang He, Feifei Kou, Jinqing Wang
Abstract:
Cross-domain recommendation (CDR) aims to address the persistent cold-start problem in Recommender Systems. Current CDR research concentrates on transferring cold-start users' information from the auxiliary domain to the target domain. However, these systems face two main issues: the underutilization of multimodal data, which hinders effective cross-domain alignment, and the neglect of side users who interact solely within the target domain, leading to inadequate learning of the target domain's vector space distribution. To address these issues, we propose a model leveraging Multimodal data and Side users for diffusion Cross-domain recommendation (MuSiC). We first employ a multimodal large language model to extract item multimodal features and leverage a large language model to uncover user features using prompt learning without fine-tuning. Secondly, we propose the cross-domain diffusion module to learn the generation of feature vectors in the target domain. This approach involves learning feature distribution from side users and understanding the patterns in cross-domain transformation through overlapping users. Subsequently, the trained diffusion module is used to generate feature vectors for cold-start users in the target domain, enabling the completion of cross-domain recommendation tasks. Finally, our experimental evaluation on the Amazon dataset confirms that MuSiC achieves state-of-the-art performance, significantly outperforming all selected baselines. Our code is available: https://anonymous.4open.science/r/MuSiC-310A/.
Paperid: 457,   https://arxiv.org/pdf/2412.19123    
Authors:Kaixing Yang, Xulong Tang, Haoyu Wu, Qinliang Xue, Biao Qin, Hongyan Liu, Zhaoxin Fan
Abstract:
Dance generation is crucial and challenging, particularly in domains like dance performance and virtual gaming. In the current body of literature, most methodologies focus on Solo Music2Dance. While there are efforts directed towards Group Music2Dance, these often suffer from a lack of coherence, resulting in aesthetically poor dance performances. Thus, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CoheDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code will be released.
Paperid: 458,   https://arxiv.org/pdf/2509.01275    
Authors:Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu
Abstract:
Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base-category training and open-vocabulary inference poses challenges in discriminative modeling of latent unseen categories. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which constitutes the bottleneck for OVSS. In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing a latent semantic-aware "agent" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing the latent semantic saliency.
Paperid: 459,   https://arxiv.org/pdf/2509.10359    
Authors:Matteo Trippodo, Federico Becattini, Lorenzo Seidenari
Abstract:
Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.
Paperid: 460,   https://arxiv.org/pdf/2511.09878    
Authors:Wenzhe He, Xiaojun Chen, Wentang Chen, Hongyu Wang, Ying Liu, Ruihui Li
Abstract:
Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce a RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise feature of the point cloud is progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37× compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).
Paperid: 461,   https://arxiv.org/pdf/2507.21830    
Authors:Kuiye Ding, Fanda Fan, Yao Wang, Ruijie jian, Xiaorui Wang, Luqi Gong, Yishan Jiang, Chunjie Luo, Jianfeng Zhan
Abstract:
Multivariate Time Series Forecasting (MTSF) plays a key role in many applications. Recent works have explored using Large Language Models for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance.
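As a rough illustration of what an explicit trend caption could look like, the toy function below summarizes a 1D series in natural language. The template, field names, and slope-based trend rule are assumptions for illustration, not DualSG's actual Time Series Caption format.

```python
import numpy as np

def time_series_caption(series: np.ndarray, name: str = "series") -> str:
    """Summarize a 1D series as a short natural-language trend caption that an LLM
    can consume as explicit semantic guidance (format is illustrative)."""
    slope = np.polyfit(np.arange(len(series)), series, 1)[0]
    trend = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
    return (f"{name}: {len(series)} steps, min {series.min():.2f}, "
            f"max {series.max():.2f}, overall {trend} trend (slope {slope:.3f}).")

print(time_series_caption(np.cumsum(np.random.randn(48)), name="load"))
```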
Paperid: 462,   https://arxiv.org/pdf/2504.21214    
Authors:Jinzhao Zhou, Zehong Cao, Yiqun Duan, Connor Barkley, Daniel Leong, Xiaowei Jiang, Quoc-Toan Nguyen, Ziyi Zhao, Thomas Do, Yu-Cheng Chang, Sheng-Fu Liang, Chin-teng Lin
Abstract:
This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose Large Brain Language Model (LBLM) pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0% accuracy on semantic-level classification and 39.6% in word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.
Paperid: 463,   https://arxiv.org/pdf/2411.16783    
Authors:Aravindan Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, Srikrishna Karanam
Abstract:
Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art.
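The sketch below illustrates the general shape of such attention-based losses: each subject's (normalized) cross-attention should concentrate inside its own self-attention segment and stay out of other subjects' segments. The function name, normalization, and loss form are assumptions, not the paper's exact formulation.

```python
import torch

def coconolike_losses(cross_attn: torch.Tensor, segments: torch.Tensor):
    """cross_attn: (S, H, W) non-negative cross-attention map per subject token.
    segments:   (S, H, W) binary self-attention segments, segment s assigned to subject s.
    Returns (complete, contrast): the first rewards each subject for fully activating
    its own segment, the second penalizes activation inside other subjects' segments."""
    S = cross_attn.shape[0]
    attn = cross_attn / (cross_attn.flatten(1).sum(-1)[:, None, None] + 1e-8)
    complete, contrast = 0.0, 0.0
    for s in range(S):
        own = (attn[s] * segments[s]).sum()
        other_seg = segments[torch.arange(S) != s].sum(0).clamp(max=1)
        complete = complete + (1.0 - own)
        contrast = contrast + (attn[s] * other_seg).sum()
    return complete / S, contrast / S

attn = torch.rand(3, 16, 16)
seg = torch.zeros(3, 16, 16)
seg[0, :8], seg[1, 8:, :8], seg[2, 8:, 8:] = 1, 1, 1   # three disjoint toy segments
print(coconolike_losses(attn, seg))
```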
Paperid: 464,   https://arxiv.org/pdf/2508.05658    
Authors:Song Yan, Hui Wei, Jinlong Fei, Guoliang Yang, Zhengyu Zhao, Zheng Wang
Abstract:
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both prompt filter and safety checker, our U3-Attack achieves ~4× higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion.
Paperid: 465,   https://arxiv.org/pdf/2507.05731    
Authors:Yuxin Zhang, Jiahao Yang, Zhe Chen, Wenjun Zhu, Jin Zhao, Yue Gao
Abstract:
Recently, large vision-language models (LVLMs) unleash powerful analysis capabilities for low Earth orbit (LEO) satellite Earth observation images in the data center. However, fast satellite motion, brief satellite-ground station (GS) contact windows, and the large size of the images pose a data download challenge. To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we explore how to deploy LVLMs in LEO satellite networks and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. To this end, we first deploy compact LVLMs on satellites for lightweight tasks, whereas regular LVLMs operate on GSs to handle computationally intensive tasks. Then, we propose a computing and communication co-design framework comprising a progressive confidence network and an attention-based multi-scale preprocessing module, which identify data suitable for on-satellite inference and reduce data redundancy before satellite-GS transmission, respectively. We implement and evaluate SpaceVerse on real-world LEO satellite constellations and datasets, achieving a 31.2% average gain in accuracy and a 51.2% reduction in latency compared to state-of-the-art baselines.
Paperid: 466,   https://arxiv.org/pdf/2507.10109    
Authors:Wenjie Tian, Xinfa Zhu, Haohe Liu, Zhixian Zhao, Zihao Chen, Chaofan Ding, Xinhan Di, Junjie Zheng, Lei Xie
Abstract:
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
Paperid: 467,   https://arxiv.org/pdf/2403.05262    
Authors:YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, Rong Jin
Abstract:
In the realms of computer vision and natural language processing, Multimodal Large Language Models (MLLMs) have become indispensable tools, proficient in generating textual responses based on visual inputs. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. Empirical experiments underscore the persistence of this bias, as MLLMs often provide confident answers even in the absence of relevant images or given incongruent visual inputs. To rectify these biases and redirect the model's focus toward visual information, we propose two simple, training-free strategies. First, for tasks such as classification or multi-choice question answering, we introduce a "Post-Hoc Debias" method using an affine calibration step to adjust the output distribution. This approach ensures uniform answer scores when the image is absent, acting as an effective regularization technique to alleviate the influence of LLM priors. For more intricate open-ended generation tasks, we extend this method to "Visual Debias Decoding", which mitigates bias by contrasting token log-probabilities conditioned on a correct image versus a meaningless one. Additionally, our investigation sheds light on the instability of MLLMs across various decoding configurations. Through systematic exploration of different settings, we achieve significant performance improvements--surpassing previously reported results--and raise concerns about the fairness of current evaluation practices. Comprehensive experiments substantiate the effectiveness of our proposed strategies in mitigating biases. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations.
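A minimal sketch of the contrastive-decoding idea follows, assuming access to next-token logits computed twice, once with the real image and once with a meaningless one; the exact combination rule and the `alpha` weight are illustrative, not the paper's formulation.

```python
import torch

def visual_debias_logits(logits_with_image: torch.Tensor,
                         logits_with_blank: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Contrast next-token log-probabilities conditioned on the real image against
    those conditioned on a meaningless image, down-weighting tokens the LLM prior
    would emit regardless of the visual input (illustrative combination rule)."""
    logp_img = torch.log_softmax(logits_with_image, dim=-1)
    logp_blank = torch.log_softmax(logits_with_blank, dim=-1)
    return logp_img + alpha * (logp_img - logp_blank)

# toy usage with a random vocabulary of 32k tokens
scores = visual_debias_logits(torch.randn(32000), torch.randn(32000))
print(int(scores.argmax()))
```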
Paperid: 468,   https://arxiv.org/pdf/2504.14200    
Authors:Huiyi Chen, Jiawei Peng, Kaihua Tang, Xin Geng, Xu Yang
Abstract:
In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset in text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as the anchor for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset under low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance for the image classification task, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.
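The following sketch shows one plausible form of such a key update: an untapped sample's feature nudges the nearest same-class key via an exponential moving average. The matching rule and update rate are assumptions, not KeCO's actual procedure.

```python
import numpy as np

def update_coreset_keys(keys, labels, feat, label, lr=0.2):
    """Nudge the nearest same-class key toward an untapped sample's feature,
    so the randomly initialized coreset gradually becomes more informative
    (matching rule and update rate are illustrative)."""
    same = np.where(labels == label)[0]
    if same.size == 0:
        return keys
    sims = keys[same] @ feat / (
        np.linalg.norm(keys[same], axis=1) * np.linalg.norm(feat) + 1e-8)
    k = same[np.argmax(sims)]
    keys[k] = (1 - lr) * keys[k] + lr * feat
    return keys

keys = np.random.randn(10, 8)                # 10 coreset samples, 8-d visual keys
labels = np.random.randint(0, 3, size=10)    # their class labels
keys = update_coreset_keys(keys, labels, np.random.randn(8), label=1)
```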
Paperid: 469,   https://arxiv.org/pdf/2408.02306    
Authors:Changtao Miao, Qi Chu, Tao Gong, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Man Luo, Honggang Hu, Nenghai Yu
Abstract:
With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization and the proposed MoNFAP achieves significant performance. The codes will be made available.
Paperid: 470,   https://arxiv.org/pdf/2504.10905    
Authors:Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions-such as hand-face interactions-largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
Paperid: 471,   https://arxiv.org/pdf/2504.09479    
Authors:Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, Zhenglong Ding
Abstract:
Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.
Paperid: 472,   https://arxiv.org/pdf/2507.21177    
Authors:Xinhai Yan, Libing Wu, Zhuangzhuang Zhang, Bingyi Liu, Lijuan Huo, Jing Wang
Abstract:
Federated Learning (FL) enables collaborative model training while preserving data privacy, but it is highly vulnerable to backdoor attacks. Most existing defense methods in FL have limited effectiveness due to their neglect of the model's over-reliance on backdoor triggers, particularly as the proportion of malicious clients increases. In this paper, we propose FedBAP, a novel defense framework for mitigating backdoor attacks in FL by reducing the model's reliance on backdoor triggers. Specifically, first, we propose a perturbed trigger generation mechanism that creates perturbation triggers precisely matching backdoor triggers in location and size, ensuring strong influence on model outputs. Second, we utilize these perturbation triggers to generate benign adversarial perturbations that disrupt the model's dependence on backdoor triggers while forcing it to learn more robust decision boundaries. Finally, we design an adaptive scaling mechanism to dynamically adjust perturbation intensity, effectively balancing defense strength and model performance. The experimental results demonstrate that FedBAP reduces the attack success rates by 0.22%-5.34%, 0.48%-6.34%, and 97.22%-97.6% under three types of backdoor attacks, respectively. In particular, FedBAP demonstrates outstanding performance against novel backdoor attacks.
Paperid: 473,   https://arxiv.org/pdf/2507.16596    
Authors:Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang
Abstract:
Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem; such formulations are usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It could identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss has been proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.
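The loss sketched below conveys the stated intent, enlarging the deviation between adjacent segment features for forged videos and shrinking it for genuine ones, under an assumed L2 deviation and hinge margin; it is not the paper's exact extensible deviation perceiving loss.

```python
import torch

def deviation_perceiving_loss(segment_feats: torch.Tensor,
                              is_forged: bool,
                              margin: float = 1.0) -> torch.Tensor:
    """Enlarge the feature deviation between adjacent segments for forged videos
    (hinge with a margin) and reduce it for genuine ones (illustrative form)."""
    dev = (segment_feats[1:] - segment_feats[:-1]).norm(dim=-1)   # (T-1,) deviations
    if is_forged:
        return torch.relu(margin - dev).mean()   # push adjacent deviations above margin
    return dev.mean()                            # pull adjacent segments together

print(deviation_perceiving_loss(torch.randn(8, 128), is_forged=True))
```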
Paperid: 474,   https://arxiv.org/pdf/2511.04281    
Authors:Yujie Yang, Shuang Li, Jun Ye, Neng Dong, Fan Li, Huafeng Li
Abstract:
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
Paperid: 475,   https://arxiv.org/pdf/2508.06136    
Authors:YoungChan Choi, HengFei Wang, YiHua Cheng, Boeun Kim, Hyung Jin Chang, YoungGeun Choi, Sang-Il Choi
Abstract:
We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.
Paperid: 476,   https://arxiv.org/pdf/2505.06020    
Authors:Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Abstract:
Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.
Paperid: 477,   https://arxiv.org/pdf/2508.13756    
Authors:Ruonan Chai, Yixiang Zhu, Xinjiao Li, Jiawei Li, Zili Meng, Dirk Kutscher
Abstract:
Real-time streaming of point cloud video, characterized by massive data volumes and high sensitivity to packet loss, remains a key challenge for immersive applications under dynamic network conditions. While connection-oriented protocols such as TCP and more modern alternatives like QUIC alleviate some transport-layer inefficiencies, including head-of-line blocking, they still retain a coarse-grained, segment-based delivery model and a centralized control loop that limit fine-grained adaptation and effective caching. We introduce INDS (Incremental Named Data Streaming), an adaptive streaming framework based on Information-Centric Networking (ICN) that rethinks delivery for hierarchical, layered media. INDS leverages the Octree structure of point cloud video and expressive content naming to support progressive, partial retrieval of enhancement layers based on consumer bandwidth and decoding capability. By combining time-windows with Group-of-Frames (GoF), INDS's naming scheme supports fine-grained in-network caching and facilitates efficient multi-user data reuse. INDS can be deployed as an overlay, remaining compatible with QUIC-based transport infrastructure as well as future Media-over-QUIC (MoQ) architectures, without requiring changes to underlying IP networks. Our prototype implementation shows up to 80% lower delay, 15-50% higher throughput, and 20-30% increased cache hit rates compared to state-of-the-art DASH-style systems. Together, these results establish INDS as a scalable, cache-friendly solution for real-time point cloud streaming under variable and lossy conditions, while its compatibility with MoQ overlays further positions it as a practical, forward-compatible architecture for emerging immersive media systems.
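A hypothetical name layout combining time window, Group-of-Frames index, and octree layer might look like the helper below; the component order and prefixes are assumptions, not INDS's actual naming scheme.

```python
def build_interest_name(stream: str, window: int, gof: int, layer: int) -> str:
    """Hypothetical hierarchical content name for a point-cloud video segment:
    time window -> Group-of-Frames -> octree enhancement layer."""
    return f"/pcv/{stream}/win{window:05d}/gof{gof:03d}/layer{layer}"

# A consumer with more bandwidth simply requests deeper layers under the same prefix,
# and in-network caches can serve the shared base layers to multiple users.
print(build_interest_name("stadium-cam1", 42, 7, 0))   # base layer
print(build_interest_name("stadium-cam1", 42, 7, 3))   # finer enhancement layer
```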
Paperid: 478,   https://arxiv.org/pdf/2507.06959    
Authors:Xiao Liang, Jiawei Hu, Di Wang, Zhi Ma, Lin Zhao, Ronghan Li, Bo Wan, Quan Wang
Abstract:
Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.
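As a rough sketch of token-level confidence mining, the check below flags an SFT response as a hard example when the mean log-probability of its own answer tokens is low; the aggregation and threshold are assumptions, not CheXPO's exact rule.

```python
import torch

def is_hard_example(answer_token_logprobs: torch.Tensor,
                    threshold: float = -1.0) -> bool:
    """Flag an SFT response as a hard example when the model's mean log-probability
    over its own answer tokens is low (aggregation and threshold are illustrative)."""
    return float(answer_token_logprobs.mean()) < threshold

# toy usage: log-probabilities of 12 answer tokens
print(is_hard_example(torch.log(torch.rand(12))))
```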
Paperid: 479,   https://arxiv.org/pdf/2408.01343    
Authors:Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Abstract:
Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.
Paperid: 480,   https://arxiv.org/pdf/2504.10352    
Authors:Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen
Abstract:
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.
Paperid: 481,   https://arxiv.org/pdf/2508.00308    
Authors:Chunyan She, Fujun Han, Chengyu Fang, Shukai Duan, Lidan Wang
Abstract:
The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.
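For readers unfamiliar with the Fourier-space view, the snippet below shows the generic amplitude-phase decomposition such a visibility-restoration stage would operate on; the entanglement network itself is not reproduced here.

```python
import torch

def amplitude_phase(image: torch.Tensor):
    """Split an image into its Fourier amplitude and phase components."""
    spec = torch.fft.fft2(image)          # complex spectrum over the last two dims
    return spec.abs(), spec.angle()

def recombine(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Rebuild the spatial image from (possibly modified) amplitude and phase."""
    return torch.fft.ifft2(amplitude * torch.exp(1j * phase)).real

img = torch.rand(3, 64, 64)
amp, pha = amplitude_phase(img)
print(torch.allclose(recombine(amp, pha), img, atol=1e-5))   # lossless round trip
```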
Paperid: 482,   https://arxiv.org/pdf/2507.22501    
Authors:Chang Huang, Jiahang Cao, Jun Ma, Kieren Yu, Cong Li, Huayong Yang, Kaishun Wu
Abstract:
Underwater images typically suffer from severe colour distortions, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adaptively handle diverse degradation conditions and fail to leverage underwater-specific physical priors effectively. In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. Given a degraded underwater image as input, we first predict its degradation level using a lightweight dual-stream convolutional network, generating a continuous degradation score as semantic guidance. Based on this score, we introduce a novel conditional diffusion-based restoration network with a Swin UNet backbone, enabling adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss function that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with SOTA approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments.
Paperid: 483,   https://arxiv.org/pdf/2508.00782    
Authors:Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen
Abstract:
Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
Paperid: 484,   https://arxiv.org/pdf/2503.18377    
Authors:Chang Gao, Kang Zhao, Runqi Wang, Jianfei Chen, Liping Jing
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search, which can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning-metric dependency, and a uniform layerwise redundancy level in the pruned model. To this end, we propose Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes the most redundant layers (i.e., those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
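A minimal sketch of such an iterative, redundancy-guided pruning step is given below, using a z-score-based non-outlier ratio and magnitude pruning as stand-ins; the metric details and step size are assumptions, not the official MRP algorithm.

```python
import numpy as np

def maximum_redundancy_step(weights: dict, target_sparsity: float,
                            step: float = 0.05, outlier_z: float = 3.0) -> str:
    """One iteration of redundancy-guided pruning: pick the layer with the highest
    non-outlier ratio (fewest large-magnitude weights) and zero out its smallest
    remaining weights until its sparsity rises by `step` (capped at the target)."""
    def non_outlier_ratio(w):
        nz = w[w != 0]
        if nz.size == 0:
            return 0.0
        z = np.abs((nz - nz.mean()) / (nz.std() + 1e-8))
        return float((z < outlier_z).mean())

    name = max(weights, key=lambda n: non_outlier_ratio(weights[n]))
    w = weights[name]
    current = float((w == 0).mean())
    k = int(min(target_sparsity, current + step) * w.size)
    w[np.argsort(np.abs(w))[:k]] = 0.0      # magnitude pruning within the chosen layer
    return name

layers = {f"layer{i}": np.random.randn(1024) for i in range(4)}
print(maximum_redundancy_step(layers, target_sparsity=0.5))
```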
Paperid: 485,   https://arxiv.org/pdf/2510.20285    
Authors:Jiayi Zou, Chaofan Chen, Bing-Kun Bao, Changsheng Xu
Abstract:
Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC^3) framework, which contains an egocentric VideoQA baseline, a counterfactual sample construction module and a counterfactual sample-involved contrastive optimization module. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply a contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, reaching state-of-the-art performance.
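An InfoNCE-style objective of the kind described might look as follows, pulling the original sample toward its counterfactual positive and away from counterfactual negatives; the temperature and similarity choice are assumptions rather than DMC^3's exact loss.

```python
import torch
import torch.nn.functional as F

def counterfactual_contrastive_loss(anchor: torch.Tensor,
                                    positive: torch.Tensor,
                                    negatives: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull the original sample toward its counterfactual positive
    and push it away from counterfactual negatives (temperature is illustrative)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a * p).sum(-1, keepdim=True), n @ a]) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

print(counterfactual_contrastive_loss(torch.randn(256), torch.randn(256),
                                      torch.randn(8, 256)))
```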
Paperid: 486,   https://arxiv.org/pdf/2403.09559    
Authors:Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
Abstract:
Visual instruction tuning is the key to building large vision-language models (LVLMs), which can greatly improve the task generalization and solving capabilities by learning a mixture of instruction data from diverse visual tasks. Previous work mostly collects multiple existing visual instruction datasets via heuristic ways for training (even more than a million instructions), which may introduce data redundancy and enlarge the training cost. To investigate this issue, we conduct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the number of instructions for several tasks does not even affect the performance. Based on the findings, we propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, we first estimate the instance influence score on its corresponding task, and the task difficulty score, based on the gradient-based influence functions. Then, we leverage the two kinds of scores to determine the task proportion within the selected visual instruction subset, and select high-value instances for each task, respectively. Experiments on various LVLMs show that our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.
Paperid: 487,   https://arxiv.org/pdf/2508.01525    
Authors:Kuo Shi, Jie Lu, Shanshan Ye, Guangquan Zhang, Zhen Fang
Abstract:
Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.
Paperid: 488,   https://arxiv.org/pdf/2504.16722    
Authors:Yingjie Xi, Jian Jun Zhang, Xiaosong Yang
Abstract:
In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor poses-guided methods are typically confined to synthesize only simple motion patterns. To generate more controllable and precise human motions, we propose ProMoGen (Progressive Motion Generation), a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions only deliver precise action guidance without displacement. This decoupling enables independent refinement of both aspects, resulting in a more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, recognizing that direct learning from sparse motions is inherently unstable, we introduce SAP-CL (Sparse Anchor Posture Curriculum Learning), a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectory and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.
Paperid: 489,   https://arxiv.org/pdf/2405.00435    
Authors:Wei Zhang, Wong Kam-Kwai, Biying Xu, Yiwen Ren, Yuhuai Li, Minfeng Zhu, Yingchaojie Feng, Wei Chen
Abstract:
The integration of new technology with cultural studies enhances our understanding of cultural heritage but often struggles to connect with diverse audiences. It is challenging to align personal interpretations with the intended meanings across different cultures. Our study investigates the important factors in appreciating art from a cross-cultural perspective. We explore the application of Large Language Models (LLMs) to bridge the cultural and language barriers in understanding Traditional Chinese Paintings (TCPs). We present CultiVerse, a visual analytics system that utilizes LLMs within a mixed-initiative framework, enhancing interpretative appreciation of TCP in a cross-cultural dialogue. CultiVerse addresses the challenge of translating the nuanced symbolism in art, which involves interpreting complex cultural contexts, aligning cross-cultural symbols, and validating cultural acceptance. CultiVerse integrates an interactive interface with the analytical capability of LLMs to explore a curated TCP dataset, facilitating the analysis of multifaceted symbolic meanings and the exploration of cross-cultural serendipitous discoveries. Empirical evaluations affirm that CultiVerse significantly improves cross-cultural understanding, offering deeper insights and engaging art appreciation.
Paperid: 490,   https://arxiv.org/pdf/2507.12461    
Authors:Trong-Thang Pham, Anh Nguyen, Zhigang Deng, Carol C. Wu, Hien Van Nguyen, Ngan Le
Abstract:
Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists' goals. To capture the nuances of radiologists' varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent's ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.
Paperid: 491,   https://arxiv.org/pdf/2509.00859    
Authors:Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li, Zhi Wang, Wenwu Zhu
Abstract:
Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (ID) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiments, showing that QAT can lead to significant OOD generalization performance degradation. Further, we find that the contradiction between the view that a flat loss landscape gives rise to superior OOD generalization and the phenomenon that QAT leads to a sharp loss landscape can explain the above problem. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict between the dual optimization objectives (i.e., vanilla QAT and flatness); ii) FQAT proposes a disorder-guided adaptive freezing algorithm that dynamically determines which layers to freeze at each training step, effectively addressing the interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on influential OOD benchmarks demonstrate the superiority of our method over state-of-the-art baselines on both ID and OOD image classification tasks.
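The abstract does not define the gradient disorder metric, so the sketch below uses a sign-flip rate between consecutive gradients as an illustrative stand-in, together with a simple threshold-based layer freeze; both the metric and the threshold value are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def gradient_disorder(prev_grads: dict, model: nn.Module) -> dict:
    """Illustrative stand-in for a gradient disorder metric: the fraction of
    entries whose gradient sign flipped since the previous step. A high value
    suggests the parameter group is oscillating, i.e., unstable."""
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name in prev_grads:
            flips = (torch.sign(p.grad) != torch.sign(prev_grads[name])).float().mean()
            scores[name] = flips.item()
        prev_grads[name] = p.grad.clone()
    return scores

def freeze_unstable_layers(model: nn.Module, scores: dict, threshold: float = 0.4) -> None:
    """Stop updating parameters whose disorder exceeds a (hypothetical) threshold,
    so the two objectives (vanilla QAT and flatness) stop conflicting there."""
    for name, p in model.named_parameters():
        if scores.get(name, 0.0) > threshold:
            p.requires_grad_(False)
```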
Paperid: 492,   https://arxiv.org/pdf/2504.17343    
Authors:Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Abstract:
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) while maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
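A toy sketch of the Differential Token Drop idea: keep all patch tokens of the first frame and, for later frames, keep only the patches whose features changed noticeably since the previous frame. The L2-difference criterion and the threshold value are assumptions, not the paper's exact rule.

```python
import torch

def differential_token_drop(tokens: torch.Tensor, threshold: float = 0.1):
    """tokens: (T, N, D) patch tokens for T frames, N patches per frame.
    Returns a list of (frame_index, kept_patch_indices, kept_tokens)."""
    kept = [(0, torch.arange(tokens.shape[1]), tokens[0])]
    for t in range(1, tokens.shape[0]):
        delta = (tokens[t] - tokens[t - 1]).norm(dim=-1)    # (N,) per-patch change
        idx = (delta > threshold).nonzero(as_tuple=True)[0]  # patches that actually changed
        kept.append((t, idx, tokens[t, idx]))
    return kept

# Example: 8 frames, 196 patches, 768-dim tokens; mostly static content.
toks = torch.randn(1, 196, 768).repeat(8, 1, 1)
toks[4, :10] += 1.0                                          # a few patches change at frame 4
for t, idx, _ in differential_token_drop(toks):
    print(f"frame {t}: kept {len(idx)} / 196 tokens")
```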
Paperid: 493,   https://arxiv.org/pdf/2412.07161    
Authors:Yun Li, Zhe Liu, Lina Yao
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of seen attributes and objects. Current CLIP-based methods in CZSL, despite their advancements, often fail to effectively understand and link the attributes and objects due to inherent limitations in CLIP's pretraining mechanisms. To address these shortcomings, this paper introduces a novel framework, Understanding and Linking Attributes and Objects (ULAO) in CZSL, which comprises two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding by sequential primitive prediction and leveraging recognized objects as contextual hints for attribute classification. Concurrently, the Linking Attributes and Objects (LAO) module improves the attribute-object linkage understanding through a new contrastive learning strategy that incorporates tailored hard negative generation and adaptive loss adjustments. We demonstrate our model's superiority by showcasing its state-of-the-art performance across three benchmark datasets in both Closed-World (CW) and Open-World (OW) scenarios.
Paperid: 494,   https://arxiv.org/pdf/2512.11301    
Authors:Bate Li, Houqiang Zhong, Zhengxue Cheng, Qiang Hu, Qiang Wang, Li Song, Wenjun Zhang
Abstract:
Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.
Paperid: 495,   https://arxiv.org/pdf/2408.07543    
Authors:Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang, Mingan Lin, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang
Abstract:
With the rapid progress of Multimodal LLMs, evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM's ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark focused on assessing MLLMs' reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even SOTA models struggle with real-world math tasks, lagging behind human performance -- highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee similar effectiveness on real-world tasks. This underscores the necessity of MathScape in the next stage of multimodal mathematical reasoning.
Paperid: 496,   https://arxiv.org/pdf/2407.12274    
Authors:Cong Cai, Shan Liang, Xuefei Liu, Kang Zhu, Zhengqi Wen, Jianhua Tao, Heng Xie, Jizhou Cui, Yiming Ma, Zhenhua Cheng, Hanzhe Xu, Ruibo Fu, Bin Liu, Yongwei Li
Abstract:
Deception detection has garnered increasing attention in recent years due to the significant growth of digital media and heightened ethical and security concerns. It has been extensively studied using multimodal methods, including video, audio, and text. In addition, individual differences in deception production and detection are believed to play a crucial role. Although some studies have utilized individual information such as personality traits to enhance the performance of deception detection, current systems remain limited, partly due to a lack of sufficient datasets for evaluating performance. To address this issue, we introduce MDPE, a multimodal deception dataset. Besides deception features, this dataset also includes information on individual differences in personality and emotional expression characteristics, enabling exploration of the impact of individual differences on deception behavior. It comprises over 104 hours of deception and emotional videos from 193 subjects. Furthermore, we conducted numerous experiments to provide valuable insights for future deception detection research. MDPE not only supports deception detection, but also provides conditions for tasks such as personality recognition and emotion recognition, and even enables studying the relationships between them. We believe that MDPE will become a valuable resource for promoting research in the field of affective computing.
Paperid: 497,   https://arxiv.org/pdf/2501.13368    
Authors:Yuzhuo Li, Di Zhao, Tingrui Qiao, Yihao Wu, Bo Pang, Yun Sing Koh
Abstract:
Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. Moreover, the emergence of multimodal models capable of jointly processing visual and textual data presents new opportunities for Animal ReID, but existing datasets fail to leverage these models' text-processing capabilities, limiting their full potential. Additionally, to facilitate the use of metadata in existing ReID methods, we propose the Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model (VLM)-based Animal ReID methods, allowing ReID models to leverage both environmental metadata and visual information to improve ReID performance. Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone, validating the effectiveness of incorporating metadata in re-identification. We hope that our proposed dataset can inspire further exploration of multimodal approaches for Animal ReID.
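A minimal sketch of how a lightweight metadata adapter might plug into a ReID pipeline: a small MLP embeds environmental metadata and gates a frozen visual embedding. The dimensions, the metadata fields, and the residual gating form are assumptions for illustration, not the paper's MFA design.

```python
import torch
import torch.nn as nn

class MetaFeatureAdapter(nn.Module):
    """Toy metadata adapter: embed environmental metadata (e.g., temperature,
    time-of-day encodings) and use it to modulate the visual embedding."""

    def __init__(self, meta_dim: int, embed_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, visual_embed: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.mlp(metadata))           # (B, embed_dim) in (0, 1)
        return visual_embed + visual_embed * gate          # residual, metadata-modulated

adapter = MetaFeatureAdapter(meta_dim=4, embed_dim=512)
fused = adapter(torch.randn(8, 512), torch.randn(8, 4))    # e.g., [temp, hour_sin, hour_cos, season]
print(fused.shape)
```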
Paperid: 498,   https://arxiv.org/pdf/2506.17939    
Authors:Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
Abstract:
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
Paperid: 499,   https://arxiv.org/pdf/2506.01478    
Authors:Tung-Lam Ngo, Ba-Hoang Tran, Duy-Cat Can, Trung-Hieu Do, Oliver Y. Chén, Hoang-Quynh Le
Abstract:
Understanding the interaction between different drugs (drug-drug interaction or DDI) is critical for ensuring patient safety and optimizing therapeutic outcomes. Existing DDI datasets primarily focus on textual information, overlooking multimodal data that reflect complex drug mechanisms. In this paper, we (1) introduce MUDI, a large-scale Multimodal biomedical dataset for Understanding pharmacodynamic Drug-drug Interactions, and (2) benchmark learning methods to study it. In brief, MUDI provides a comprehensive multimodal representation of drugs by combining pharmacological text, chemical formulas, molecular structure graphs, and images across 310,532 annotated drug pairs labeled as Synergism, Antagonism, or New Effect. Crucially, to effectively evaluate machine-learning based generalization, MUDI consists of unseen drug pairs in the test set. We evaluate benchmark models using both late fusion voting and intermediate fusion strategies. All data, annotations, evaluation scripts, and baselines are released under an open research license.
Paperid: 500,   https://arxiv.org/pdf/2508.09584    
Authors:Bei Yan, Zhiyuan Chen, Yuecong Min, Jie Zhang, Jiahao Wang, Xiaozhen Wang, Shiguang Shan
Abstract:
Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a rather coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks often rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.
Paperid: 501,   https://arxiv.org/pdf/2507.20913    
Authors:Jialei Cui, Jianwei Du, Yanzhe Li, Lei Gao, Hui Jiang, Chenfu Bao
Abstract:
The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP's original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.
Paperid: 502,   https://arxiv.org/pdf/2507.19836    
Authors:Xuanchen Wang, Heng Wang, Weidong Cai
Abstract:
Modern artistic productions increasingly demand automated choreography generation that adapts to diverse musical styles and individual dancer characteristics. Existing approaches often fail to produce high-quality dance videos that harmonize with both musical rhythm and user-defined choreography styles, limiting their applicability in real-world creative contexts. To address this gap, we introduce ChoreoMuse, a diffusion-based framework that uses SMPL-format parameters and a variant of them as intermediaries between music and video generation, thereby overcoming the usual constraints imposed by video resolution. Critically, ChoreoMuse supports style-controllable, high-fidelity dance video generation across diverse musical genres and individual dancer characteristics, including the flexibility to handle any reference individual at any resolution. Our method employs a novel music encoder, MotionTune, to capture motion cues from audio, ensuring that the generated choreography closely follows the beat and expressive qualities of the input music. To quantitatively evaluate how well the generated dances match both musical and choreographic styles, we introduce two new metrics that measure alignment with the intended stylistic cues. Extensive experiments confirm that ChoreoMuse achieves state-of-the-art performance across multiple dimensions, including video quality, beat alignment, dance diversity, and style adherence, demonstrating its potential as a robust solution for a wide range of creative applications. Video results can be found on our project page: https://choreomuse.github.io.
Paperid: 503,   https://arxiv.org/pdf/2507.04680    
Authors:Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose SElf-Evolving Distillation (SEED), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
Paperid: 504,   https://arxiv.org/pdf/2504.09828    
Authors:Hezhao Liu, Yang Lu, Mengke Li, Yiqun Zhang, Shreyank N Gowda, Chen Gong, Hanzi Wang
Abstract:
Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario in which labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core is a two-stage prompt-tuning paradigm: FATE exploits unlabeled data to compensate for scarce supervision signals and then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at https://anonymous.4open.science/r/Semi-supervised-learning-BA72.
Paperid: 505,   https://arxiv.org/pdf/2510.03038    
Authors:Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu
Abstract:
With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device models for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
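The 2-bit-per-channel strategy encoding can be illustrated with a small packing routine. Assuming a four-entry codebook of candidate channel precisions (the codebook values here are hypothetical), each channel's choice fits in 2 bits, i.e., four channels per byte.

```python
import numpy as np

# Hypothetical 4-entry codebook of candidate channel precisions; 2 bits index it.
BITWIDTH_CODEBOOK = np.array([2, 4, 8, 16], dtype=np.uint8)

def pack_strategy(bitwidths: np.ndarray) -> bytes:
    """Encode a per-channel bit-width choice as 2 bits per channel (4 channels/byte)."""
    codes = np.searchsorted(BITWIDTH_CODEBOOK, bitwidths).astype(np.uint8)
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    packed = (codes[0::4] | (codes[1::4] << 2) | (codes[2::4] << 4) | (codes[3::4] << 6)).astype(np.uint8)
    return packed.tobytes()

def unpack_strategy(blob: bytes, n_channels: int) -> np.ndarray:
    """Recover the per-channel bit-widths from the packed 2-bit codes."""
    packed = np.frombuffer(blob, dtype=np.uint8)
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return BITWIDTH_CODEBOOK[codes[:n_channels]]

strategy = np.random.choice(BITWIDTH_CODEBOOK, size=512)       # 512 channels
blob = pack_strategy(strategy)
assert (unpack_strategy(blob, 512) == strategy).all()
print(f"{512 * 4} bytes of 32-bit metadata -> {len(blob)} bytes of 2-bit codes")
```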
Paperid: 506,   https://arxiv.org/pdf/2510.21786    
Authors:Qile Su, Shoutai Zhu, Shuai Zhang, Baoyu Liang, Chao Tong
Abstract:
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention-based video event prediction model, which can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
Paperid: 507,   https://arxiv.org/pdf/2508.01227    
Authors:Zihan Fang, Zhiyong Xu, Lan Du, Shide Du, Zhiling Cai, Shiping Wang
Abstract:
Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
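The HSIC term in the debiasing module can be written down concretely. The sketch below computes the standard biased HSIC estimate with RBF kernels between two batches of representations; using it as a penalty encourages the two representations to become statistically independent. The kernel choice and bandwidth are assumptions, and the variable names are illustrative.

```python
import torch

def rbf_kernel(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    d2 = torch.cdist(x, x).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased empirical HSIC between two batches of representations (n, d_x)
    and (n, d_y): tr(K H L H) / (n - 1)^2, with H the centering matrix."""
    n = x.shape[0]
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# Example penalty: push view-specific "ambiguous" features toward independence
# from view-consistent features (names are illustrative).
view_specific = torch.randn(128, 64)
view_consistent = torch.randn(128, 64)
debias_loss = hsic(view_specific, view_consistent)
print(debias_loss.item())
```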
Paperid: 508,   https://arxiv.org/pdf/2505.14035    
Authors:Shiyao Cui, Qinglin Zhang, Xuan Ouyang, Renmiao Chen, Zhexin Zhang, Yida Lu, Hongning Wang, Han Qiu, Minlie Huang
Abstract:
Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys hazard when combined. Multimodal implicit toxicity appears not only as formal statements on social platforms but also as prompts that can lead to toxic dialogs from Large Vision-Language Models (LVLMs). Despite the success in unimodal text or image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we comprehensively build a taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model which identifies implicit toxicity in multimodal statements, prompts and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.
Paperid: 509,   https://arxiv.org/pdf/2508.16859    
Authors:Jinpeng Hu, Hongchang Shi, Chongyuan Dai, Zhuo Li, Peipei Song, Meng Wang
Abstract:
Multimodal large language models (MLLMs) have been widely applied across various fields due to their powerful perceptual and reasoning capabilities. In the realm of psychology, these models hold promise for a deeper understanding of human emotions and behaviors. However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving the substantial potential of emotion reasoning, which is crucial for improving the naturalness and effectiveness of human-machine interactions, largely untapped. Therefore, in this paper, we introduce a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 videos from real-life scenarios, along with 5,101 progressive questions. These questions cover various aspects, including emotion recognition, potential causes of emotions, future action prediction, etc. Besides, we propose a multi-agent framework, where each agent specializes in a specific aspect, such as background context, character dynamics, and event details, to improve the system's reasoning capabilities. Furthermore, we conduct experiments with existing MLLMs and our agent-based method on the proposed benchmark, revealing that most models face significant challenges with this task.
Paperid: 510,   https://arxiv.org/pdf/2506.10006    
Authors:Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li, Yunyue Pan, Chengchang Pan, Honggang Qi
Abstract:
In breast cancer HER2 assessment, clinical evaluation relies on combined H&E and IHC images, yet acquiring both modalities is often hindered by clinical constraints and cost. We propose an adaptive bimodal prediction framework that flexibly supports single- or dual-modality inputs through two core innovations: a dynamic branch selector activating modality completion or joint inference based on input availability, and a cross-modal GAN (CM-GAN) enabling feature-space reconstruction of missing modalities. This design dramatically improves H&E-only accuracy from 71.44% to 94.25%, achieves 95.09% with full dual-modality inputs, and maintains 90.28% reliability under single-modality conditions. The "dual-modality preferred, single-modality compatible" architecture delivers near-dual-modality accuracy without mandatory synchronized acquisition, offering a cost-effective solution for resource-limited regions and significantly improving HER2 assessment accessibility.
Paperid: 511,   https://arxiv.org/pdf/2504.11301    
Authors:Yangyang Zhuang, Wenjia Jiang, Jiayu Zhang, Ze Yang, Joey Tianyi Zhou, Chi Zhang
Abstract:
Large Language Model (LLM)-based agents have demonstrated strong capabilities across a wide range of tasks, and their application in the medical domain holds particular promise due to the demand for high generalizability and reliance on interdisciplinary knowledge. However, existing medical agent systems often rely on static, manually crafted workflows that lack the flexibility to accommodate diverse diagnostic requirements and adapt to emerging clinical scenarios. Motivated by the success of automated machine learning (AutoML), this paper introduces a novel framework for the automated design of medical agent architectures. Specifically, we define a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. Our framework conceptualizes medical agents as graph-based architectures composed of diverse, functional node types and supports iterative self-improvement guided by diagnostic feedback. Experimental results on skin disease diagnosis tasks demonstrate that the proposed method effectively evolves workflow structures and significantly enhances diagnostic accuracy over time. This work represents the first fully automated framework for medical agent architecture design and offers a scalable, adaptable foundation for deploying intelligent agents in real-world clinical environments.
Paperid: 512,   https://arxiv.org/pdf/2409.20018    
Authors:Hongchen Wei, Zhenzhong Chen
Abstract:
Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves the performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
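As a rough sketch of progressive pooling, the snippet below average-pools each frame's patch grid to a resolution set by a per-frame importance score, shrinking the visual-token budget. The scoring rule and the pooling ratios are assumptions, not the paper's strategy.

```python
import torch
import torch.nn.functional as F

def progressive_pooling(frame_tokens: torch.Tensor, importance: torch.Tensor,
                        keep_ratio_hi: float = 1.0, keep_ratio_lo: float = 0.25):
    """frame_tokens: (T, H, W, D) patch embeddings per frame.
    importance:   (T,) scores in [0, 1]; how they are computed is an assumption.
    Less important frames are pooled to coarser grids, reducing token count."""
    T, H, W, D = frame_tokens.shape
    out = []
    for t in range(T):
        ratio = keep_ratio_lo + importance[t].item() * (keep_ratio_hi - keep_ratio_lo)
        h = max(1, int(round(H * ratio ** 0.5)))
        w = max(1, int(round(W * ratio ** 0.5)))
        x = frame_tokens[t].permute(2, 0, 1).unsqueeze(0)    # (1, D, H, W)
        x = F.adaptive_avg_pool2d(x, (h, w))                 # spatially pooled
        out.append(x.squeeze(0).permute(1, 2, 0).reshape(-1, D))
    return out                                               # list of (h*w, D)

tokens = torch.randn(16, 24, 24, 1024)
scores = torch.linspace(1.0, 0.0, 16)          # e.g., keep recent frames sharper
pooled = progressive_pooling(tokens, scores)
print(sum(p.shape[0] for p in pooled), "tokens instead of", 16 * 24 * 24)
```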
Paperid: 513,   https://arxiv.org/pdf/2510.19451    
Authors:Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
Paperid: 514,   https://arxiv.org/pdf/2510.11830    
Authors:Yuyang Jiang, Binzhu Xie, Lina Xu, Xiaokang Lei, Shi Qiu, Luwen Yu, Pan Hui
Abstract:
Mindfulness meditation has seen increasing applications in diverse domains as an effective practice to improve mental health. However, the standardized frameworks adopted by most applications often fail to cater to users with various psychological states and health conditions. This limitation arises primarily from the lack of personalization and adaptive content design. To address this, we propose MindfulVerse, an AI-Generated Content (AIGC)-driven application to create personalized and immersive mindfulness experiences. By developing a novel agent, the system can dynamically adjust the meditation content based on the ideas of individual users. Furthermore, we conducted exploratory user studies and comparative evaluations to assess the application scenarios and performance of our novel generative meditation tool in VR environments. The results of this user study indicate that generative meditation improves neural activation in self-regulation and shows a positive impact on emotional regulation and participation. Our approach offers a generative meditation procedure that provides users with an application that better suits their preferences and states.
Paperid: 515,   https://arxiv.org/pdf/2502.08297    
Authors:Yu Hong, Yize Wu, Zhehao Shen, Chengcheng Guo, Yuheng Jiang, Yingliang Zhang, Jingyi Yu, Lan Xu
Abstract:
Volumetric video enables immersive experiences by capturing dynamic 3D scenes, enabling diverse applications for virtual reality, education, and telepresence. However, traditional methods struggle with fixed lighting conditions, while neural approaches face trade-offs in efficiency, quality, or adaptability for relightable scenarios. To address these limitations, we present BEAM, a novel pipeline that bridges 4D Gaussian representations with physically-based rendering (PBR) to produce high-quality, relightable volumetric videos from multi-view RGB footage. BEAM recovers detailed geometry and PBR properties via a series of available Gaussian-based techniques. It first combines Gaussian-based human performance tracking with geometry-aware rasterization in a coarse-to-fine optimization framework to recover spatially and temporally consistent geometries. We further enhance Gaussian attributes by incorporating PBR properties step by step. We generate roughness via a multi-view-conditioned diffusion model, and then derive AO and base color using a 2D-to-3D strategy, incorporating a tailored Gaussian-based ray tracer for efficient visibility computation. Once recovered, these dynamic, relightable assets integrate seamlessly into traditional CG pipelines, supporting real-time rendering with deferred shading and offline rendering with ray tracing. By offering realistic, lifelike visualizations under diverse lighting conditions, BEAM opens new possibilities for interactive entertainment, storytelling, and creative visualization.
Paperid: 516,   https://arxiv.org/pdf/2502.15278    
Authors:Shunchang Liu, Zhuan Shi, Lingjuan Lyu, Yaochu Jin, Boi Faltings
Abstract:
Assessing whether AI-generated images are substantially similar to source works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, a novel automated infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework based on the multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on these judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. Furthermore, assuming the input noise is controllable, our approach can be enhanced by iteratively exploring non-infringing noise vectors within the diffusion latent space, even without modifying the original prompts. Experimental results show that our automated identification method achieves comparable state-of-the-art performance, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method more effectively mitigates memorization and IP infringement with a high degree of alignment to the original non-infringing expressions.
Paperid: 517,   https://arxiv.org/pdf/2412.16039    
Authors:Jiadong Pan, Liang Li, Hongcheng Gao, Zheng-Jun Zha, Qingming Huang, Jiebo Luo
Abstract:
Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks, leading to their widespread use. With the introduction of classifier-free guidance (CFG), the quality of images generated by DMs is significantly improved. However, one can use DMs to generate more harmful images by maliciously guiding the image generation process through CFG. Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG to adaptively control harmful features with dynamic safe guidance by modulating the CFG generation process. It dynamically guides the CFG generation process based on the harmfulness of the prompts, inducing significant deviations only in harmful CFG generations, achieving high quality and safety generation. SafeCFG can simultaneously modulate different harmful CFG generation processes, so it could eliminate harmful elements while preserving high-quality generation. Additionally, SafeCFG provides the ability to detect image harmfulness, allowing unsupervised safe alignment on DMs without pre-defined clean or harmful labels. Experimental results show that images generated by SafeCFG achieve both high quality and safety, and safe DMs trained in our unsupervised manner also exhibit good safety performance.
Paperid: 518,   https://arxiv.org/pdf/2504.18039    
Authors:Zheng Zhang, Nuoqian Xiao, Qi Chai, Deheng Ye, Hao Wang
Abstract:
Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players' identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player's suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent-versus-agent simulations and studies with human players, we demonstrate MultiMind's superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains.
Paperid: 519,   https://arxiv.org/pdf/2504.10540    
Authors:Zichao Yu, Zhen Zou, Guojiang Shao, Chengwei Zhang, Shengze Xu, Jie Huang, Feng Zhao, Xiaodong Cun, Wenyi Zhang
Abstract:
Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack a theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, extending the Adams-Bashforth method to higher orders, we propose a novel caching-based acceleration approach for diffusion models that, instead of directly reusing cached results, achieves a truncation error bound of only \(O(h^k)\), where \(h\) is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method's effectiveness in achieving nearly 3× speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.
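The core idea, combining cached outputs with Adams-Bashforth weights rather than reusing them verbatim, can be sketched in a toy sampler loop. The (3/2, -1/2) weights below are the standard second-order coefficients under a uniform step size; the Euler-style update and the refresh schedule are placeholders, not the paper's scheduler.

```python
import torch

def ab2_combination(f_prev: torch.Tensor, f_prev2: torch.Tensor) -> torch.Tensor:
    """Second-order Adams-Bashforth weights (3/2, -1/2), assuming uniform steps."""
    return 1.5 * f_prev - 0.5 * f_prev2

def cached_sampler(model, x, sigmas, refresh=3):
    """Toy sampler: run the full network only every `refresh` steps; on skipped
    steps, drive the update with the AB2 combination of the two cached outputs."""
    cache = []
    for i in range(len(sigmas) - 1):
        h = sigmas[i + 1] - sigmas[i]              # (signed) step size
        if i % refresh == 0 or len(cache) < 2:
            f = model(x, sigmas[i])                # expensive network evaluation
            cache.append(f)
        else:
            f = ab2_combination(cache[-1], cache[-2])
        x = x + h * f                              # simple Euler-style update
    return x

# Stand-in "network" just to exercise the loop shapes.
x0 = cached_sampler(lambda x, s: -x, torch.randn(1, 4, 8, 8), torch.linspace(1.0, 0.0, 21))
print(x0.shape)
```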
Paperid: 520,   https://arxiv.org/pdf/2507.03936    
Authors:Chen Pang, Xuequan Lu, Qianyu Zhou, Lei Lyu
Abstract:
Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
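A toy rendition of the adaptive node-amplitude idea: per-joint motion magnitude averaged with learnable temporal weights, then compared to a learnable threshold (the threshold regularization mentioned in the abstract is omitted). The exact formulation is an assumption.

```python
import torch
import torch.nn as nn

class NodeActivity(nn.Module):
    """Toy node-activity scorer: per-joint motion magnitude, averaged over time
    with learnable (softmax-normalized) weights, then thresholded to select
    'active' joints for interaction modeling."""

    def __init__(self, num_frames: int):
        super().__init__()
        self.time_logits = nn.Parameter(torch.zeros(num_frames - 1))
        self.threshold = nn.Parameter(torch.tensor(0.5))   # regularization omitted here

    def forward(self, joints: torch.Tensor):
        # joints: (B, T, J, 3) joint positions
        motion = (joints[:, 1:] - joints[:, :-1]).norm(dim=-1)   # (B, T-1, J)
        w = torch.softmax(self.time_logits, dim=0)               # adaptive temporal weights
        activity = (motion * w[None, :, None]).sum(dim=1)        # (B, J)
        active_mask = activity > self.threshold                  # joints kept as "active"
        return activity, active_mask

scorer = NodeActivity(num_frames=64)
acts, mask = scorer(torch.randn(2, 64, 25, 3))
print(acts.shape, mask.sum(dim=1))
```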
Paperid: 521,   https://arxiv.org/pdf/2508.13921    
Authors:Ziang Wang, Xiaoqin Wang, Dingyi Wang, Qiang Li, Shushan Qiao
Abstract:
Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
Paperid: 522,   https://arxiv.org/pdf/2508.00726    
Authors:Jiale Li, Mingrui Wu, Zixiang Jin, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji
Abstract:
Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
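One way to picture balancing attention across images: redistribute the total attention mass over image tokens equally among the images while keeping each image's internal distribution proportional to the original weights. The sketch below is an illustrative rebalancer, not the paper's mechanism.

```python
import torch

def balance_inter_image_attention(attn: torch.Tensor, image_spans: list):
    """attn: (num_queries, num_keys) attention map whose keys include tokens from
    several images; image_spans lists (start, end) key ranges per image. The
    overall visual attention mass is preserved but split equally across images."""
    balanced = attn.clone()
    masses = [attn[:, s:e].sum(dim=-1, keepdim=True) for s, e in image_spans]   # per-image mass
    total = torch.stack(masses, dim=0).sum(dim=0)                               # overall visual mass
    target = total / len(image_spans)                                           # equal share per image
    for (s, e), m in zip(image_spans, masses):
        balanced[:, s:e] = attn[:, s:e] / m.clamp_min(1e-8) * target
    return balanced

attn = torch.softmax(torch.randn(4, 30), dim=-1)             # 4 queries, 30 keys
spans = [(0, 10), (10, 20), (20, 30)]                        # three images
out = balance_inter_image_attention(attn, spans)
print([out[:, s:e].sum(dim=-1) for s, e in spans])           # now equal per image
```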
Paperid: 523,   https://arxiv.org/pdf/2508.20530    
Authors:Mingqian Ji, Jian Yang, Shanshan Zhang
Abstract:
Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Yet, such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level-fusion-based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained with our method significantly outperforms those trained with previous state-of-the-art methods, achieving 28.4% mAP on the nuScenes validation benchmark.
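The "2D labels to real points" direction of the bi-directional fusion amounts to a standard camera projection: each LiDAR point inherits the category of the pixel it lands on. A minimal NumPy sketch, assuming a pinhole model with known intrinsics K and a LiDAR-to-camera transform (function and argument names are illustrative):

```python
import numpy as np

def project_points_to_image(points_lidar: np.ndarray, T_cam_from_lidar: np.ndarray,
                            K: np.ndarray, seg_map: np.ndarray) -> np.ndarray:
    """Assign each LiDAR point the category id of the pixel it projects to
    (-1 if it falls outside the image or behind the camera)."""
    N = points_lidar.shape[0]
    pts_h = np.concatenate([points_lidar, np.ones((N, 1))], axis=1)     # homogeneous (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                      # into camera frame
    labels = np.full(N, -1, dtype=np.int64)
    valid = pts_cam[:, 2] > 0.1                                          # in front of the camera
    uvw = (K @ pts_cam[valid].T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)                  # pixel coordinates
    H, W = seg_map.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    idx = np.flatnonzero(valid)[inside]
    labels[idx] = seg_map[uv[inside, 1], uv[inside, 0]]                  # row = v, col = u
    return labels
```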
Paperid: 524,   https://arxiv.org/pdf/2508.21539    
Authors:Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
Abstract:
Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
Paperid: 525,   https://arxiv.org/pdf/2504.09535    
Authors:Yuting Zhao, Yuheng Ji, Xiaoshuai Hao, Shuxiao Li
Abstract:
Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird's Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0x speedup over stereo methods on the RSRD dataset. The source code will be released.
Paperid: 526,   https://arxiv.org/pdf/2412.05808    
Authors:Shuzhao Xie, Jiahang Liu, Weixiang Zhang, Shijia Ge, Sicheng Pan, Chen Tang, Yunpeng Bai, Cong Zhang, Xiaoyi Fan, Zhi Wang
Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have greatly improved 3D reconstruction. However, its substantial data size poses a significant challenge for transmission and storage. While many compression techniques have been proposed, they fail to efficiently adapt to fluctuating network bandwidth, leading to resource wastage. We address this issue from the perspective of size-aware compression, where we aim to compress 3DGS to a desired size by quickly searching for suitable hyperparameters. Through a measurement study, we identify key hyperparameters that affect the size -- namely, the reserve ratio of Gaussians and bit-width settings for Gaussian attributes. Then, we formulate this hyperparameter optimization problem as a mixed-integer nonlinear programming (MINLP) problem, with the goal of maximizing visual quality while respecting the size budget constraint. To solve the MINLP, we decouple this problem into two parts: discretely sampling the reserve ratio and determining the bit-width settings using integer linear programming (ILP). To solve the ILP more quickly and accurately, we design a quality loss estimator and a calibrated size estimator, as well as implement a CUDA kernel. Extensive experiments on multiple 3DGS variants demonstrate that our method achieves state-of-the-art performance in post-training compression. Furthermore, our method can achieve comparable quality to leading training-required methods after fine-tuning.
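As a hedged stand-in for the bit-width ILP, the sketch below greedily upgrades, under a byte budget, the attribute whose estimated quality gain per extra byte is largest. The attribute names, counts, and quality-loss table are illustrative, and a real solver (the ILP with calibrated estimators described above) would replace the greedy loop.

```python
# Greedy bit-width allocation under a size budget (illustrative stand-in for an ILP).
CANDIDATE_BITS = [4, 6, 8, 16]

def greedy_bit_allocation(attrs, budget_bytes):
    """attrs: dict name -> (num_values, {bits: estimated_quality_loss})."""
    choice = {name: CANDIDATE_BITS[0] for name in attrs}            # start at the cheapest setting

    def size(ch):
        return sum(attrs[n][0] * b for n, b in ch.items()) / 8.0    # total bytes

    improved = True
    while improved:
        improved = False
        best = None
        for name, (count, loss) in attrs.items():
            b = choice[name]
            higher = [x for x in CANDIDATE_BITS if x > b]
            if not higher:
                continue
            nb = higher[0]
            extra = count * (nb - b) / 8.0                          # extra bytes for the upgrade
            gain = (loss[b] - loss[nb]) / extra                     # estimated quality per byte
            if size(choice) + extra <= budget_bytes and (best is None or gain > best[0]):
                best = (gain, name, nb)
        if best:
            choice[best[1]] = best[2]
            improved = True
    return choice

attrs = {
    "opacity":  (1_000_000, {4: 0.30, 6: 0.12, 8: 0.05, 16: 0.01}),
    "scale":    (3_000_000, {4: 0.50, 6: 0.20, 8: 0.08, 16: 0.02}),
    "rotation": (4_000_000, {4: 0.40, 6: 0.18, 8: 0.07, 16: 0.02}),
}
print(greedy_bit_allocation(attrs, budget_bytes=6_000_000))
```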
Paperid: 527,   https://arxiv.org/pdf/2507.05887    
Authors:Xianzhi Ma, Jianhui Li, Changhua Pei, Hao Liu
Abstract:
The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.
Paperid: 528,   https://arxiv.org/pdf/2509.23612    
Authors:Xinhao Cai, Minghang Zheng, Xin Jin, Yang Liu
Abstract:
We propose a novel task of text-controlled human object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically only consider interactions with static objects (i.e., object positions do not change), and the collection of such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human object interaction data with scene contexts, featuring three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications (including same-category distractors requiring spatial and 3D scene context understanding), 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.), and 3) physically plausible object manipulation trajectories. With the introduction of various movable objects, this task becomes more challenging, as the model needs to accurately identify the objects to be interacted with, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle such challenges, we propose a novel pipeline solution. We first use 3D visual grounding models to identify the interaction object. Then, we propose hand-object joint affordance learning to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. Finally, we optimize interactions with local-scene modeling and collision avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method's superiority in generating physically plausible, text-compliant interactions compared to existing approaches.
Paperid: 529,   https://arxiv.org/pdf/2505.22067    
Authors:Xinyu Xia, Xingjun Ma, Yunfeng Hu, Ting Qu, Hong Chen, Xun Gong
Abstract:
Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.
Paperid: 530,   https://arxiv.org/pdf/2504.09555    
Authors:Jinhao Li, Zijian Chen, Runze Jiang, Tingzhu Chen, Changbo Wang, Guangtao Zhai
Abstract:
Oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.
Paperid: 531,   https://arxiv.org/pdf/2506.03589    
Authors:Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le
Abstract:
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
Paperid: 532,   https://arxiv.org/pdf/2501.15775    
Authors:Yunbo Lyu, Zhou Yang, Yuqing Niu, Jing Jiang, David Lo
Abstract:
Text-to-Image (T2I) models have recently gained significant attention due to their ability to generate high-quality images and are consequently used in a wide range of applications. However, there are concerns about the gender bias of these models. Previous studies have shown that T2I models can perpetuate or even amplify gender stereotypes when provided with neutral text prompts. Researchers have proposed automated gender bias uncovering detectors for T2I models, but a crucial gap exists: no existing work comprehensively compares the various detectors and understands how the gender bias detected by them deviates from the actual situation. This study addresses this gap by validating previous gender bias detectors using a manually labeled dataset and comparing how the bias identified by various detectors deviates from the actual bias in T2I models, as verified by manual confirmation. We create a dataset consisting of 6,000 images generated from three cutting-edge T2I models: Stable Diffusion XL, Stable Diffusion 3, and Dreamlike Photoreal 2.0. During the human-labeling process, we find that all three T2I models generate a portion (12.48% on average) of low-quality images (e.g., generate images with no face present), where human annotators cannot determine the gender of the person. Our analysis reveals that all three T2I models show a preference for generating male images, with SDXL being the most biased. Additionally, images generated using prompts containing professional descriptions (e.g., lawyer or doctor) show the most bias. We evaluate seven gender bias detectors and find that none fully capture the actual level of bias in T2I models, with some detectors overestimating bias by up to 26.95%. We further investigate the causes of inaccurate estimations, highlighting the limitations of detectors in dealing with low-quality images. Based on our findings, we propose an enhanced detector...
Paperid: 533,   https://arxiv.org/pdf/2508.09009    
Authors:Luyang Cao, Han Xu, Jian Zhang, Lei Qi, Jiayi Ma, Yinghuan Shi, Yang Gao
Abstract:
In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In fact, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals as inter-component residuals (ICR), which have been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrate that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.
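A hedged sketch of reducing inter-component residuals (ICR) by penalizing feature similarity between the illumination and reflectance branches; the squared cosine-similarity penalty is an illustrative choice, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def icr_penalty(illum_feat, refl_feat):
    """illum_feat, refl_feat: (B, C, H, W) decomposed feature maps."""
    i = F.normalize(illum_feat.flatten(1), dim=1)
    r = F.normalize(refl_feat.flatten(1), dim=1)
    # Push the two components apart: minimize their squared cosine similarity.
    return (i * r).sum(dim=1).pow(2).mean()

loss = icr_penalty(torch.randn(2, 16, 32, 32), torch.randn(2, 16, 32, 32))
print(loss.item())
```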
Paperid: 534,   https://arxiv.org/pdf/2509.04051    
Authors:Yaojun Wu, Chaoyi Lin, Yiming Wang, Semih Esenlik, Zhaobin Zhang, Kai Zhang, Li Zhang
Abstract:
This paper explores the application of enhancement filtering techniques in neural video compression. Specifically, we categorize these techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement based on whether the enhanced representation affects the subsequent coding loop. In-loop contextual filtering refines the temporal context by mitigating error propagation during frame-by-frame encoding. However, its influence on both the current and subsequent frames poses challenges in adaptively applying filtering throughout the sequence. To address this, we introduce an adaptive coding decision strategy that dynamically determines filtering application during encoding. Additionally, out-of-loop reconstruction enhancement is employed to refine the quality of reconstructed frames, providing a simple yet effective improvement in coding efficiency. To the best of our knowledge, this work presents the first systematic study of enhancement filtering in the context of conditional-based neural video compression. Extensive experiments demonstrate a 7.71% reduction in bit rate compared to state-of-the-art neural video codecs, validating the effectiveness of the proposed approach.
Paperid: 535,   https://arxiv.org/pdf/2509.01214    
Authors:Yizhe Yuan, Bingsen Xue, Bangzheng Pu, Chengxiang Wang, Cheng Jin
Abstract:
Tumor spatial heterogeneity analysis requires precise correlation between Hematoxylin and Eosin (H&E) morphology and immunohistochemical (IHC) biomarker expression, yet current methods suffer from spatial misalignment in consecutive sections, severely compromising in situ pathological interpretation. In order to obtain a more accurate virtual staining pattern, we propose PRINTER, a weakly-supervised framework that integrates PRototype-drIven content and staiNing patTERn decoupling and deformation-aware adversarial learning strategies designed to accurately learn IHC staining patterns while preserving H&E staining details. Our approach introduces three key innovations: (1) a prototype-driven staining pattern transfer with explicit content-style decoupling; (2) a cyclic registration-synthesis framework, GapBridge, that bridges the H&E and IHC domains through deformable structural alignment, where registered features guide cross-modal style transfer while synthesized outputs iteratively refine the registration; and (3) deformation-aware adversarial learning, a training framework in which a generator and a deformation-aware registration network are jointly optimized adversarially against a style-focused discriminator. Extensive experiments demonstrate that PRINTER achieves superior performance in preserving H&E staining details and virtual staining fidelity, outperforming state-of-the-art methods. Our work provides a robust and scalable solution for virtual staining, advancing the field of computational pathology.
Paperid: 536,   https://arxiv.org/pdf/2508.04028    
Authors:Yifan Wang, Tao Wang, Chenwei Tang, Caiyang Yu, Zhengqing Zang, Mengmi Zhang, Shudong Huang, Jiancheng Lv
Abstract:
Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.
Paperid: 537,   https://arxiv.org/pdf/2509.26008    
Authors:Zhiwei Zhang, Ruikai Xu, Weijian Zhang, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma
Abstract:
In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
Paperid: 538,   https://arxiv.org/pdf/2504.17365    
Authors:Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang
Abstract:
Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, yet soccer commentary generation requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on temporal priors for caption generation, so they cannot process soccer videos end-to-end. Some traditional approaches instead follow a two-step paradigm that is complex and fails to capture the global context, resulting in suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
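A rough sketch of training-free, motion-aware frame selection in the spirit of MoFA-Select: coarsely sample frames, score motion by inter-frame difference, then keep the highest-scoring ones. The stride, budget, and scoring function are illustrative assumptions.

```python
import numpy as np

def select_frames(frames, budget=64, coarse_stride=8):
    """frames: (T, H, W, 3) uint8 video; returns sorted indices of kept frames."""
    coarse = np.arange(0, len(frames), coarse_stride)          # coarse pass over the video
    diffs = np.array([
        np.abs(frames[i].astype(np.float32)
               - frames[max(i - coarse_stride, 0)].astype(np.float32)).mean()
        for i in coarse
    ])
    # Fine pass: keep the coarse frames with the largest motion scores.
    keep = coarse[np.argsort(diffs)[::-1][:budget]]
    return np.sort(keep)

video = np.random.randint(0, 255, (512, 36, 64, 3), dtype=np.uint8)
print(select_frames(video)[:10])
```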
Paperid: 539,   https://arxiv.org/pdf/2507.22481    
Authors:Tianyi Liu, Kejun Wu, Chen Cai, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Abstract:
Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.
Paperid: 540,   https://arxiv.org/pdf/2508.19165    
Authors:Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang
Abstract:
Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length given in meters to the equivalent value in decimeters or centimeters leads to severe performance degradation, even though the physical length remains the same. This observation reveals the weak 3D comprehension of the pre-trained language model, which generates misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
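A sketch of the 3D-text Enhancement (3DTE) idea: augment distance descriptors in text queries with equivalent values in other units so the text encoder sees consistent mappings between units. The regex and unit table are assumptions for illustration.

```python
import re

UNIT_TO_M = {"meters": 1.0, "decimeters": 0.1, "centimeters": 0.01}

def augment_units(query):
    """Return the query rewritten with the same distance expressed in each supported unit."""
    m = re.search(r"([\d.]+)\s*(meters|decimeters|centimeters)", query)
    if not m:
        return [query]
    value, unit = float(m.group(1)), m.group(2)
    meters = value * UNIT_TO_M[unit]
    variants = []
    for u, scale in UNIT_TO_M.items():
        v = meters / scale
        variants.append(query[:m.start()] + f"{v:g} {u}" + query[m.end():])
    return variants

for q in augment_units("the car about 3.5 meters away on the left"):
    print(q)
```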
Paperid: 541,   https://arxiv.org/pdf/2501.16714    
Authors:Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang
Abstract:
Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. To realize this goal, the adaptation of the DM should capture the specified motion concept without compromising the ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance in the adaptation process of the DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance in the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as the basic components needed to produce a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.
Paperid: 542,   https://arxiv.org/pdf/2502.13053    
Authors:Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, Shengyu Zhang
Abstract:
As researchers continue to optimize AI agents for more effective task execution within operating systems, they often overlook a critical security concern: the ability of these agents to detect "impostors" within their environment. Through an analysis of the agents' operational context, we identify a significant threat: attackers can disguise malicious attacks as environmental elements, injecting active disturbances into the agents' execution processes to manipulate their decision-making. We define this novel threat as the Active Environment Injection Attack (AEIA). Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision-making; and (2) Reasoning gap vulnerabilities in the agent's task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM-based agents. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% on the AndroidWorld benchmark by combining two vulnerabilities.
Paperid: 543,   https://arxiv.org/pdf/2409.10570    
Authors:Cong Kong, Rui Xu, Weixi Chen, Jiawei Chen, Zhaoxia Yin
Abstract:
With the advancement of intelligent healthcare, medical pre-trained language models (Med-PLMs) have emerged and demonstrated significant effectiveness in downstream medical tasks. While these models are valuable assets, they are vulnerable to misuse and theft, requiring copyright protection. However, existing watermarking methods for pre-trained language models (PLMs) cannot be directly applied to Med-PLMs due to domain-task mismatch and inefficient watermark embedding. To fill this gap, we propose the first training-free backdoor model watermarking for Med-PLMs. Our method employs low-frequency words as triggers, embedding the watermark by replacing their embeddings in the model's word embedding layer with those of specific medical terms. The watermarked Med-PLMs produce the same output for triggers as for the corresponding specified medical terms. We leverage this unique mapping to design tailored watermark extraction schemes for different downstream tasks, thereby addressing the challenge of domain-task mismatch in previous methods. Experiments demonstrate the superior effectiveness of our watermarking method across medical downstream tasks. Moreover, the method exhibits robustness against model extraction, pruning, and fusion-based backdoor removal attacks, while maintaining high efficiency with a 10-second watermark embedding.
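A minimal sketch of the training-free embedding-replacement watermark described above: the rows of the word-embedding matrix for low-frequency trigger tokens are overwritten with the embeddings of chosen medical terms. The token ids and embedding sizes here are toy values.

```python
import torch
import torch.nn as nn

def embed_watermark(embedding: nn.Embedding, trigger_ids, target_ids):
    """Map each trigger token onto the embedding of a specific medical term, so the model
    treats triggers and targets identically (the basis for later watermark extraction)."""
    with torch.no_grad():
        for t, m in zip(trigger_ids, target_ids):
            embedding.weight[t].copy_(embedding.weight[m])

emb = nn.Embedding(30522, 768)            # toy vocabulary and hidden size
embed_watermark(emb, trigger_ids=[28990, 29871], target_ids=[2512, 7953])
assert torch.equal(emb.weight[28990], emb.weight[2512])
```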
Paperid: 544,   https://arxiv.org/pdf/2411.18375    
Authors:Yiming Wu, Zhenghao Chen, Huan Wang, Dong Xu
Abstract:
The high computational cost and slow inference time are major obstacles to deploying Video Diffusion Models (VDMs). To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics (e.g., coherence of the entire video), while shallower layers are more focused on individual content (e.g., individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Moreover, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss so that VDMini attains generation performance comparable to the larger VDM. In particular, we first use the Individual Content Distillation (ICD) Loss to preserve the consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5×, 1.4×, and 1.25× speedup for the I2V method SF-V, the T2V method T2V-Turbo-v2, and the T2V method HunyuanVideo, while maintaining the quality of the generated videos on several benchmarks including UCF101, VBench-T2V, and VBench-I2V.
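An illustrative sketch of the Individual Content Distillation (ICD) idea: match per-frame features between the teacher VDM and the pruned student, frame by frame. The feature granularity and the MSE distance are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def icd_loss(student_feats, teacher_feats):
    """student_feats, teacher_feats: (B, T, C, H, W) per-frame features."""
    B, T = student_feats.shape[:2]
    # Per-frame consistency: average the MSE over every generated frame.
    return F.mse_loss(student_feats.reshape(B * T, -1),
                      teacher_feats.reshape(B * T, -1))

loss = icd_loss(torch.randn(2, 8, 4, 16, 16), torch.randn(2, 8, 4, 16, 16))
print(loss.item())
```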
Paperid: 545,   https://arxiv.org/pdf/2410.06795    
Authors:Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
Abstract:
Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not present in the visual input, which impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to effectively extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by the findings of our preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, utilizing adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field.
Paperid: 546,   https://arxiv.org/pdf/2507.20650    
Authors:Zhicheng Zhang, Peizhuo Lv, Mengke Wan, Jiang Fang, Diandian Guo, Yezeng Chen, Yinlong Liu, Wei Ma, Jiyan Sun, Liru Geng
Abstract:
Recently, Deep Learning (DL) models have been increasingly deployed on end-user devices as On-Device AI, offering improved efficiency and privacy. However, this deployment trend poses more serious Intellectual Property (IP) risks, as models are distributed on numerous local devices, making them vulnerable to theft and redistribution. Most existing ownership protection solutions (e.g., backdoor-based watermarking) are designed for cloud-based AI-as-a-Service (AIaaS) and are not directly applicable to large-scale distribution scenarios, where each user-specific model instance must carry a unique watermark. These methods typically embed a fixed watermark, and modifying the embedded watermark requires retraining the model. To address these challenges, we propose Hot-Swap MarkBoard, an efficient watermarking method. It encodes user-specific n-bit binary signatures by independently embedding multiple watermarks into a multi-branch Low-Rank Adaptation (LoRA) module, enabling efficient watermark customization without retraining through branch swapping. A parameter obfuscation mechanism further entangles the watermark weights with those of the base model, preventing removal without degrading model performance. The method supports black-box verification and is compatible with various model architectures and DL tasks, including classification, image generation, and text generation. Extensive experiments across three types of tasks and six backbone models demonstrate our method's superior efficiency and adaptability compared to existing approaches, achieving 100% verification accuracy.
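A hedged sketch of encoding an n-bit signature with a multi-branch LoRA module: each bit position has two pre-trained branches (for values 0 and 1), and a user-specific signature is realized by swapping in the matching branches without retraining. The dimensions and branch structure are toy assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiBranchLoRA(nn.Module):
    def __init__(self, dim=64, rank=4, n_bits=8):
        super().__init__()
        # branches[b][v]: low-rank pair (A, B) for bit b taking value v in {0, 1}.
        self.branches = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(dim, rank, bias=False),
                                         nn.Linear(rank, dim, bias=False))
                           for _ in range(2)])
            for _ in range(n_bits)
        ])
        self.signature = [0] * n_bits

    def hot_swap(self, signature):
        # Changing the watermark is just selecting different branches; no retraining.
        self.signature = list(signature)

    def forward(self, x):
        delta = sum(self.branches[b][v](x) for b, v in enumerate(self.signature))
        return x + delta

lora = MultiBranchLoRA()
lora.hot_swap([1, 0, 1, 1, 0, 0, 1, 0])       # embed one user's 8-bit signature
print(lora(torch.randn(2, 64)).shape)
```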
Paperid: 547,   https://arxiv.org/pdf/2508.03038    
Authors:Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, Qing Li
Abstract:
Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis tasks in the real world. This is mainly because they lack sufficient reasoning depth, which causes information loss or logical jumps when processing large amounts of specialized medical data and leads to diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving the clinical reasoning ability of the multi-agent system in complex medical scenarios. Experimental results on real-world medical data show that our framework can achieve better performance than existing baseline methods.
Paperid: 548,   https://arxiv.org/pdf/2507.06821    
Authors:Chuhang Zheng, Chunwei Tian, Jie Wen, Daoqiang Zhang, Qi Zhu
Abstract:
Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixture of basic emotions has gradually emerged as a trend. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
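A minimal sketch of learnable label embeddings optimized by correlation-matrix alignment: the cosine-correlation matrix of the embeddings is pulled toward a target emotion correlation matrix (e.g., one estimated from label distributions). All values here are toy placeholders.

```python
import torch
import torch.nn.functional as F

def correlation_alignment_loss(label_emb, target_corr):
    """label_emb: (K, D) learnable embeddings for K basic emotions
       target_corr: (K, K) reference correlation between emotions."""
    e = F.normalize(label_emb, dim=-1)
    corr = e @ e.t()                             # cosine correlation of the embeddings
    return F.mse_loss(corr, target_corr)

K, D = 6, 128
label_emb = torch.randn(K, D, requires_grad=True)
target = torch.eye(K) * 0.7 + 0.3               # toy target: 1.0 diagonal, 0.3 elsewhere
loss = correlation_alignment_loss(label_emb, target)
loss.backward()
print(loss.item())
```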
Paperid: 549,   https://arxiv.org/pdf/2507.14921    
Authors:Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan
Abstract:
Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose \method, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, \method provides an efficient, scalable solution for real-world 3D content generation.
Paperid: 550,   https://arxiv.org/pdf/2504.20466    
Authors:Woo Yi Yang, Jiarui Wang, Sijing Wu, Huiyu Duan, Yuxin Zhu, Liu Yang, Kang Fu, Guangtao Zhai, Xiongkuo Min
Abstract:
The rapid advancement of generative artificial intelligence has enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, as well as 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF, capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and LMME3DHF will be released upon publication.
Paperid: 551,   https://arxiv.org/pdf/2510.03089    
Authors:Naresh Kumar Devulapally, Shruti Agarwal, Tejas Gokhale, Vishnu Suresh Lokhande
Abstract:
Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has led to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of generating "unlearnable" training samples utilizing image poisoning techniques has emerged. Existing methods for this have limited imperceptibility as they operate in the pixel space, which results in images with visible noise and artifacts. In this work, we propose a novel model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the diffusion model's denoising trajectory. This trajectory-shifted sampling ensures that the perturbed images maintain high visual fidelity to the original inputs while being resistant to inversion and personalization by downstream generative models. This approach integrates unlearnability into the framework of Latent Diffusion Models (LDMs), enabling a practical and imperceptible defense against unauthorized model adaptation. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks. Results demonstrate that our method achieves significant improvements in imperceptibility (~8%-10% on perceptual metrics including PSNR, SSIM, and FID) and robustness (~10% on average across five adversarial settings), highlighting its effectiveness in safeguarding sensitive data.
Paperid: 552,   https://arxiv.org/pdf/2512.03540    
Authors:Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng
Abstract:
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
Paperid: 553,   https://arxiv.org/pdf/2507.22828    
Authors:Kedong Xiu, Sai Qian Zhang
Abstract:
As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations--with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud--there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs.
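A sketch of the training-free protection described above: random noise is added to the intermediate features at one layer and removed at the next using a shared seed, so an adversary intercepting the transmitted features only sees the noised tensor. The shared-seed mechanism and tensor shapes are illustrative assumptions.

```python
import torch

def protect(features, seed):
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(features.shape, generator=g)
    return features + noise                       # what leaves the current layer/device

def recover(noised_features, seed):
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(noised_features.shape, generator=g)
    return noised_features - noise                # the next layer removes the same noise

x = torch.randn(1, 256, 14, 14)
assert torch.allclose(recover(protect(x, seed=42), seed=42), x, atol=1e-6)
```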
Paperid: 554,   https://arxiv.org/pdf/2507.05227    
Authors:Qucheng Peng, Chen Bai, Guoxiang Zhang, Bo Xu, Xiaotong Liu, Xiaoyin Zheng, Chen Chen, Cheng Lu
Abstract:
Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.
Paperid: 555,   https://arxiv.org/pdf/2412.04955    
Authors:Peng Chen, Xiaobao Wei, Qingpo Wuwu, Xinyi Wang, Xingyu Xiao, Ming Lu
Abstract:
Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We use a unified mixed Gaussian representation to integrate the two modalities of 2D image and 3D mesh. Furthermore, the comprehensive experiments demonstrate the superiority of MixedGaussianAvatar. The code will be released.
Paperid: 556,   https://arxiv.org/pdf/2507.21195    
Authors:Po-Yuan Mao, Cheng-Chang Tsai, Chun-Shien Lu
Abstract:
The great success of the diffusion model in image synthesis led to the release of gigantic commercial models, raising issues of copyright protection and inappropriate content generation. Training-free diffusion watermarking provides a low-cost solution for these issues. However, prior works remain vulnerable to rotation, scaling, and translation (RST) attacks. Although some methods employ meticulously designed patterns to mitigate this issue, they often reduce watermark capacity, which can result in identity (ID) collusion. To address these problems, we propose MaXsive, a training-free diffusion model generative watermarking technique that has high capacity and robustness. MaXsive makes full use of the initial noise to watermark the diffusion model. Moreover, instead of using a meticulously repetitive ring pattern, we propose injecting an X-shape template to recover the RST distortions. This design significantly increases robustness without losing any capacity, making ID collusion less likely to happen. The effectiveness of MaXsive has been verified on two well-known watermarking benchmarks under the scenarios of verification and identification.
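An illustrative sketch of injecting an X-shaped template into the initial latent noise of a diffusion model so that RST distortions can later be re-estimated from the recovered noise. The template strength and placement are assumptions, not the paper's exact parameters.

```python
import torch

def inject_x_template(init_noise, strength=0.5):
    """init_noise: (C, H, W) initial Gaussian latent."""
    C, H, W = init_noise.shape
    template = torch.zeros(H, W)
    idx = torch.arange(min(H, W))
    template[idx, idx] = 1.0                      # main diagonal
    template[idx, W - 1 - idx] = 1.0              # anti-diagonal -> X shape
    return init_noise + strength * template       # broadcast over channels

noise = torch.randn(4, 64, 64)
marked = inject_x_template(noise)
print((marked - noise).abs().sum().item() > 0)
```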
Paperid: 557,   https://arxiv.org/pdf/2505.05509    
Authors:Yi Liu, Xinyi Liu, Yi Wan, Panwang Xia, Qiong Wu, Yongjun Zhang
Abstract:
Stereo image super-resolution (SSR) aims to enhance high-resolution details by leveraging information from stereo image pairs. However, existing SSR upsampling methods (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolution to independently process deep features of different views, lacking cross-view and non-local information perception, making it difficult to select beneficial information from multi-view scenes adaptively. In this work, we propose Stereo Implicit Neural Representation (StereoINR), which innovatively models stereo image pairs as continuous implicit representations. This continuous representation breaks through the scale limitations, providing a unified solution for arbitrary-scale stereo super-resolution reconstruction of left-right views. Furthermore, by incorporating spatial warping and cross-attention mechanisms, StereoINR enables effective cross-view information fusion and achieves significant improvements in pixel-level geometric consistency. Extensive experiments across multiple datasets show that StereoINR outperforms state-of-the-art SSR methods on out-of-training-distribution scale upsampling and matches them within training-distribution scales.
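A toy sketch of arbitrary-scale decoding with an implicit neural representation: an MLP maps (interpolated feature, continuous coordinate) pairs to RGB, so the output grid can be any resolution. The cross-view fusion and warping from the paper are omitted, and the network sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, feat_map, out_h, out_w):
        """feat_map: (1, C, h, w) low-resolution features -> (1, 3, out_h, out_w) image."""
        ys = torch.linspace(-1, 1, out_h)
        xs = torch.linspace(-1, 1, out_w)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (out_h, out_w, 2)
        coords = grid.view(1, out_h, out_w, 2).flip(-1)                     # grid_sample wants (x, y)
        sampled = F.grid_sample(feat_map, coords, align_corners=True)       # (1, C, out_h, out_w)
        inp = torch.cat([sampled.permute(0, 2, 3, 1), grid.unsqueeze(0)], dim=-1)
        rgb = self.mlp(inp)                                                  # (1, out_h, out_w, 3)
        return rgb.permute(0, 3, 1, 2)

dec = ImplicitDecoder()
print(dec(torch.randn(1, 64, 24, 24), out_h=90, out_w=120).shape)  # arbitrary scale: 3.75x by 5x
```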
Paperid: 558,   https://arxiv.org/pdf/2508.00288    
Authors:Jianqiang Xiao, Yuexuan Sun, Yixin Shao, Boxi Gan, Rongqiang Liu, Yanjing Wu, Weili Guan, Xiang Deng
Abstract:
Aerial navigation is a fundamental yet underexplored capability in embodied intelligence, enabling agents to operate in large-scale, unstructured environments where traditional navigation paradigms fall short. However, most existing research follows the Vision-and-Language Navigation (VLN) paradigm, which heavily depends on sequential linguistic instructions, limiting its scalability and autonomy. To address this gap, we introduce UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents operate based on high-level semantic goals without relying on detailed instructional guidance as in VLN. UAV-ON comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts, covering urban, natural, and mixed-use settings. It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors, allowing grounded reasoning. These instructions serve as semantic goals, introducing realistic ambiguity and complex reasoning challenges for aerial agents. To evaluate the benchmark, we implement several baseline methods, including Aerial ObjectNav Agent (AOA), a modular policy that integrates instruction semantics with egocentric observations for long-horizon, goal-directed exploration. Empirical results show that all baselines struggle in this setting, highlighting the compounded challenges of aerial navigation and semantic goal grounding. UAV-ON aims to advance research on scalable UAV autonomy driven by semantic goal descriptions in complex real-world environments.
Paperid: 559,   https://arxiv.org/pdf/2403.01422    
Authors:Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Shengji Tang, Jiayuan Fan, Tao Chen
Abstract:
Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned with various videos scraped from online platforms. Existing datasets, such as ActivityNet, require considerable human labor for structuring and annotation before being effectively utilized for tuning LVLMs. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. DreamFrame generates datasets in a movie-like manner. First, we utilize an LLM to generate structured movie plots including movie prior information (like overview and style), frame descriptions and plot-related QA pairs, with a story expansion strategy to mitigate context length limitations. Then, to ensure visual consistency across generated frames, we design a Style Immobilization Process which maintains consistent style through an embedding learning strategy. Finally, frame descriptions and style embeddings are integrated to produce coherent keyframes. Using DreamFrame, we construct a dataset comprising approximately 1k stylized keyframe-like videos and 100k diverse QA pairs. Extensive fine-tuning experiments on various LVLM architectures demonstrate the effectiveness of the proposed dataset. Furthermore, based on the proposed dataset, we fine-tune a new LVLM named DreamFrame-7B, which significantly surpasses the previous similar-sized LVLMs across different benchmarks.
Paperid: 560,   https://arxiv.org/pdf/2512.14574    
Authors:Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata
Abstract:
Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompanies each image. Unlike conventional datasets, where a predefined class set guides web-based image collection, our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.
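A minimal sketch of loading the dataset from the Hugging Face Hub id given above with the `datasets` library; the split name and the field names are assumptions and should be checked against the dataset card.

```python
from datasets import load_dataset

# The dataset id comes from the URL above; the split name is an assumption.
ds = load_dataset("FoodLog/FoodLogAthl-218", split="train")
print(ds)                      # inspect the available columns (image, boxes, metadata, ...)
example = ds[0]
print(sorted(example.keys()))  # verify the actual field names on the dataset card
```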
Paperid: 561,   https://arxiv.org/pdf/2509.08538    
Authors:Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng
Abstract:
Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
Paperid: 562,   https://arxiv.org/pdf/2504.11781    
Authors:Guanchun Wang, Xiangrong Zhang, Yifei Zhang, Zelin Peng, Tianyang Zhang, Xu Tang, Licheng Jiao
Abstract:
Unsupervised anomaly detection in hyperspectral images (HSI), aiming to detect unknown targets from backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high-dimensional property of HSI and dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating a faster speed and stronger performance over the state-of-the-art.
Paperid: 563,   https://arxiv.org/pdf/2411.11435    
Authors:Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Chenyang Li, Hanyuan Chen, Jin-Peng Lan, Bin Luo, Yifeng Geng
Abstract:
Text logo design heavily relies on the creativity and expertise of professional designers, in which arranging element layouts is one of the most important procedures. However, this specific task has received limited attention, often overshadowed by broader layout generation tasks such as document or poster design. In this paper, we propose a Vision-Language Model (VLM)-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user-defined constraints, enabling more flexible and robust layout generation for real-world applications. We introduce two model techniques that reduce the computational cost for processing multiple glyph images simultaneously, without compromising performance. To support instruction tuning of our model, we construct two extensive text logo datasets that are five times larger than existing public datasets. In addition to geometric annotations (e.g., text masks and character recognition), our datasets include detailed layout descriptions in natural language, enabling the model to reason more effectively in handling complex designs and custom user inputs. Experimental results demonstrate the effectiveness of our proposed framework and datasets, outperforming existing methods on various benchmarks that assess geometric aesthetics and human preferences.
Paperid: 564,   https://arxiv.org/pdf/2502.06392    
Authors:Pengyu Long, Zijun Zhao, Min Ouyang, Qingcheng Zhao, Qixuan Zhang, Wei Yang, Lan Xu, Jingyi Yu
Abstract:
Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present TANGLED, a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. TANGLED employs a three-step pipeline. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.
Paperid: 565,   https://arxiv.org/pdf/2505.14535    
Authors:Jiangrong Shen, Yulin Xie, Qi Xu, Gang Pan, Huajin Tang, Badong Chen
Abstract:
Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: 1) The Temporal Attention-guided Adaptive Fusion (TAAF) module that dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; 2) The temporal adaptive balanced fusion loss that modulates learning rates per modality based on the above attention scores, preventing dominant modalities from monopolizing optimization. The proposed framework implements adaptive fusion, especially in the temporal dimension, and alleviates the modality imbalance during multimodal learning, mimicking cortical multisensory integration principles. Evaluations on CREMA-D, AVE, and EAD datasets demonstrate state-of-the-art performance (77.55%, 70.65%, and 97.5% accuracy, respectively) with energy efficiency. The system resolves temporal misalignment through learnable time-warping operations and faster modality convergence coordination than baseline SNNs. This work establishes a new paradigm for temporally coherent multimodal learning in neuromorphic systems, bridging the gap between biological sensory processing and efficient machine intelligence.
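The per-timestep weighting idea can be pictured with a minimal sketch: a small scorer assigns an attention weight to the fused spiking features at each timestep and aggregates them over time. Module and variable names below are hypothetical, and the sketch omits the balanced-loss modulation; it is not the paper's TAAF implementation.

```python
# Minimal sketch of temporal attention-guided fusion over spiking features (illustrative only).
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Scores the fused feature at each timestep with a small MLP.
        self.scorer = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim),
                                    nn.ReLU(),
                                    nn.Linear(feat_dim, 1))

    def forward(self, audio_spikes, visual_spikes):
        # audio_spikes, visual_spikes: (T, B, feat_dim) spiking feature tensors.
        fused = torch.cat([audio_spikes, visual_spikes], dim=-1)   # (T, B, 2D)
        scores = self.scorer(fused).squeeze(-1)                    # (T, B)
        weights = torch.softmax(scores, dim=0)                     # attention over timesteps
        # Weighted temporal aggregation of the fused features.
        return (weights.unsqueeze(-1) * fused).sum(dim=0)          # (B, 2D)

# Usage: out = TemporalAttentionFusion(128)(torch.rand(10, 4, 128), torch.rand(10, 4, 128))
```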
Paperid: 566,   https://arxiv.org/pdf/2402.16267    
Authors:Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Lei Zhang, Yajun Qiao
Abstract:
Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly-complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning based methods heavily depends on the loss functions defined mathematically. As it is hard to mathematically define the fused image well without ground truth, the performance of existing fusion methods is limited. In this paper, we first propose to use natural language to express the objective of IVIF, which can avoid the explicit mathematical modeling of fusion output in current losses, and make full use of the advantage of language expression to improve the fusion performance. For this purpose, we present a comprehensive language-expressed fusion objective, and encode relevant texts into the multi-modal embedding space using CLIP. A language-driven fusion model is then constructed in the embedding space, by establishing the relationship among the embedded vectors to represent the fusion objective and input image modalities. Finally, a language-driven loss is derived to make the actual IVIF aligned with the embedded language-driven fusion model via supervised training. Experiments show that our method can obtain much better fusion results than existing techniques.
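As an illustration only, a language-driven objective in a shared image-text embedding space could be sketched as a cosine-alignment loss. The way the target direction is formed, the encoders, and all names below are assumptions rather than the paper's actual formulation.

```python
# Hedged sketch of a language-driven fusion loss in a shared image-text embedding space.
import torch
import torch.nn.functional as F

def language_driven_loss(fused_emb, ir_emb, vis_emb, text_obj_emb):
    """All inputs: (B, D) embeddings from a shared space (e.g., CLIP-like encoders).
    text_obj_emb encodes the language-expressed fusion objective."""
    fused = F.normalize(fused_emb, dim=-1)
    # Assumed combination: the target direction mixes both modalities and the text objective.
    target = F.normalize(ir_emb + vis_emb + text_obj_emb, dim=-1)
    # Pull the fused-image embedding toward the language-driven target direction.
    return (1.0 - (fused * target).sum(dim=-1)).mean()
```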
Paperid: 567,   https://arxiv.org/pdf/2509.18682    
Authors:Beibei Zhang, Yanan Lu, Ruobing Xie, Zongyi Li, Siyuan Xing, Tongwei Ren, Fen Lin
Abstract:
Personalized product search (PPS) aims to retrieve products relevant to the given query considering user preferences within their purchase histories. Since large language models (LLM) exhibit impressive potential in content understanding and reasoning, current methods explore leveraging LLMs to comprehend the complicated relationships among user, query and product to improve the search performance of PPS. Despite the progress, LLM-based PPS solutions merely take textual contents into consideration, neglecting multimodal contents which play a critical role in product search. Motivated by this, we propose a novel framework, HMPPS, for Harnessing Multimodal large language models (MLLM) to deal with Personalized Product Search based on multimodal contents. Nevertheless, the redundancy and noise in PPS inputs pose a great challenge to applying MLLMs to PPS, as they not only mislead the MLLM into generating inaccurate search results but also increase the computational expense of the MLLM. To deal with this problem, we additionally design two query-aware refinement modules for HMPPS: 1) a perspective-guided summarization module that generates refined product descriptions around core perspectives relevant to search query, reducing noise and redundancy within textual contents; and 2) a two-stage training paradigm that introduces search query for user history filtering based on multimodal representations, capturing precise user preferences and decreasing the inference cost. Extensive experiments are conducted on four public datasets to demonstrate the effectiveness of HMPPS. Furthermore, HMPPS is deployed on an online search system with billion-level daily active users and achieves an evident gain in A/B testing.
Paperid: 568,   https://arxiv.org/pdf/2508.06917    
Authors:Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu
Abstract:
Recent advances in molecular science have been propelled significantly by large language models (LLMs). However, their effectiveness is limited when relying solely on molecular sequences, which fail to capture the complex structures of molecules. Beyond sequence representation, molecules exhibit two complementary structural views: the first focuses on the topological relationships between atoms, as exemplified by the graph view; and the second emphasizes the spatial configuration of molecules, as represented by the image view. The two types of views provide unique insights into molecular structures. To leverage these views collaboratively, we propose the CROss-view Prefixes (CROP) to enhance LLMs' molecular understanding through efficient multi-view integration. CROP possesses two advantages: (i) efficiency: by jointly resampling multiple structural views into fixed-length prefixes, it avoids excessive consumption of the LLM's limited context length and allows easy expansion to more views; (ii) effectiveness: by utilizing the LLM's self-encoded molecular sequences to guide the resampling process, it boosts the quality of the generated prefixes. Specifically, our framework features a carefully designed SMILES Guided Resampler for view resampling, and a Structural Embedding Gate for converting the resulting embeddings into LLM's prefixes. Extensive experiments demonstrate the superiority of CROP in tasks including molecule captioning, IUPAC name prediction and molecule property prediction.
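A rough sketch of the joint-resampling idea follows: a fixed number of learnable prefix queries cross-attend to the concatenated graph- and image-view tokens, with the LLM's self-encoded SMILES embedding added to the queries as guidance. The module names, dimensions, and the guidance mechanism are assumptions, not the paper's SMILES Guided Resampler.

```python
# Illustrative cross-view resampler producing fixed-length prefixes (not CROP's exact module).
import torch
import torch.nn as nn

class CrossViewResampler(nn.Module):
    def __init__(self, dim: int, num_prefix: int = 8, num_heads: int = 8):
        super().__init__()
        self.prefix_queries = nn.Parameter(torch.randn(num_prefix, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, graph_tokens, image_tokens, smiles_emb):
        # graph_tokens: (B, Ng, D), image_tokens: (B, Ni, D), smiles_emb: (B, D)
        views = torch.cat([graph_tokens, image_tokens], dim=1)          # (B, Ng+Ni, D)
        q = self.prefix_queries.unsqueeze(0) + smiles_emb.unsqueeze(1)  # condition queries, (B, P, D)
        prefixes, _ = self.attn(q, views, views)                        # cross-attend to both views
        return prefixes                                                 # (B, num_prefix, D)

# Usage: CrossViewResampler(128)(torch.rand(2, 16, 128), torch.rand(2, 32, 128), torch.rand(2, 128))
```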
Paperid: 569,   https://arxiv.org/pdf/2411.18823    
Authors:Jingdong Zhang, Hanrong Ye, Xin Li, Wenping Wang, Dan Xu
Abstract:
In recent years, simultaneous learning of multiple dense prediction tasks with partially annotated label data has emerged as an important research area. Previous works primarily focus on leveraging cross-task relations or conducting adversarial training for extra regularization, which achieve promising performance improvements, while still suffering from the lack of direct pixel-wise supervision and extra training of heavy mapping networks. To effectively tackle this challenge, we propose a novel approach to optimize a set of compact learnable hierarchical task tokens, including global and fine-grained ones, to discover consistent pixel-wise supervision signals in both feature and prediction levels. Specifically, the global task tokens are designed for effective cross-task feature interactions in a global context. Then, a group of fine-grained task-specific spatial tokens for each task is learned from the corresponding global task tokens. These tokens are embedded to interact densely with each task-specific feature map. The learned global and local fine-grained task tokens are further used to discover pseudo task-specific dense labels at different levels of granularity, and they can be utilized to directly supervise the learning of the multi-task dense prediction framework. Extensive experimental results on challenging NYUD-v2, Cityscapes, and PASCAL Context datasets demonstrate significant improvements over existing state-of-the-art methods for partially annotated multi-task dense prediction.
Paperid: 570,   https://arxiv.org/pdf/2505.01322    
Authors:Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie
Abstract:
Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.
Paperid: 571,   https://arxiv.org/pdf/2507.09876    
Authors:Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin
Abstract:
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.
Paperid: 572,   https://arxiv.org/pdf/2504.14548    
Authors:Lifeng Lin, Rongfeng Lu, Quan Chen, Haofan Ren, Ming Lu, Yaoqi Sun, Chenggang Yan, Anke Xue
Abstract:
Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine the optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussian points. This reduction lowers storage demands and accelerates both training and rendering. The code will be released.
Paperid: 573,   https://arxiv.org/pdf/2507.07733    
Authors:Yongyang Zhou, Fang-Lue Zhang, Zichen Wang, Lei Zhang
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities in novel view synthesis. However, rendering reflective objects remains a significant challenge, particularly in inverse rendering and relighting. We introduce RTR-GS, a novel inverse rendering framework capable of robustly rendering objects with arbitrary reflectance properties, decomposing BRDF and lighting, and delivering credible relighting results. Given a collection of multi-view images, our method effectively recovers geometric structure through a hybrid rendering model that combines forward rendering for radiance transfer with deferred rendering for reflections. This approach successfully separates high-frequency and low-frequency appearances, mitigating floating artifacts caused by spherical harmonic overfitting when handling high-frequency details. We further refine BRDF and lighting decomposition using an additional physically-based deferred rendering branch. Experimental results show that our method enhances novel view synthesis, normal estimation, decomposition, and relighting while maintaining an efficient training and inference process.
Paperid: 574,   https://arxiv.org/pdf/2507.08064    
Authors:Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
Abstract:
As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
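The modality-adaptive idea of scaling harder intra-modality negatives with a sharper temperature than easier inter-modality ones can be sketched as below. The exact grouping and temperature strategy of MAC-Loss are not spelled out in the abstract, so this is a hedged approximation with hypothetical names.

```python
# Hedged sketch of a modality-adaptive contrastive loss with per-group temperatures.
import torch
import torch.nn.functional as F

def mac_style_loss(query, target, target_modality, tau_intra=0.05, tau_inter=0.1):
    """query, target: (B, D) L2-normalized embeddings; positives sit on the diagonal.
    target_modality: (B,) integer modality ids of the in-batch targets."""
    sim = query @ target.t()                                           # (B, B) cosine similarities
    same = target_modality.unsqueeze(0) == target_modality.unsqueeze(1)
    # Harder intra-modality pairs get a sharper temperature than inter-modality pairs.
    logits = torch.where(same, sim / tau_intra, sim / tau_inter)
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)
```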
Paperid: 575,   https://arxiv.org/pdf/2509.10869    
Authors:Mingkang Li, Xuexiong Luo, Yue Zhang, Yaoyang Li, Fu Lin
Abstract:
Anomaly detection in graph-structured data is an inherently challenging problem, as it requires the identification of rare nodes that deviate from the majority in both their structural and behavioral characteristics. Existing methods, such as those based on graph convolutional networks (GCNs), often suffer from over-smoothing, which causes the learned node representations to become indistinguishable. Furthermore, graph reconstruction-based approaches are vulnerable to anomalous node interference during the reconstruction process, leading to inaccurate anomaly detection. In this work, we propose a novel and holistic anomaly evaluation framework that integrates three key components: a local-global Transformer encoder, a memory-guided reconstruction mechanism, and a multi-scale representation matching strategy. These components work synergistically to enhance the model's ability to capture both local and global structural dependencies, suppress the influence of anomalous nodes, and assess anomalies from multiple levels of granularity. Anomaly scores are computed by combining reconstruction errors and memory matching signals, resulting in a more robust evaluation. Extensive experiments on seven benchmark datasets demonstrate that our method outperforms existing state-of-the-art approaches, offering a comprehensive and generalizable solution for anomaly detection across various graph domains.
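The final scoring step described above, combining reconstruction error with a memory-matching signal, might look like the following sketch; the weighting scheme and the memory design are assumptions, not the paper's exact formulation.

```python
# Illustrative node anomaly score from reconstruction error plus prototype-memory distance.
import torch
import torch.nn.functional as F

def anomaly_scores(x, x_recon, node_emb, memory, alpha=0.5):
    """x, x_recon: (N, F) node attributes and their reconstruction.
    node_emb: (N, D) encoder outputs; memory: (M, D) learned prototypes of normal patterns."""
    recon_err = (x - x_recon).pow(2).mean(dim=-1)                       # per-node reconstruction error
    sim = F.normalize(node_emb, dim=-1) @ F.normalize(memory, dim=-1).t()
    memory_gap = 1.0 - sim.max(dim=-1).values                           # distance to nearest prototype
    return alpha * recon_err + (1 - alpha) * memory_gap                 # higher = more anomalous
```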
Paperid: 576,   https://arxiv.org/pdf/2504.10146    
Authors:Jo-Ku Cheng, Zeren Zhang, Ran Chen, Jingyang Deng, Ziran Qin, Jinwen Ma
Abstract:
We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. Traditionally, solving geometry problems and generating diagrams have been treated as separate tasks in machine learning, with no models successfully integrating both to support problem creation. However, we believe that mastery in geometry requires frictionless integration of all of these skills, from solving problems to visualizing geometric relationships, and finally, crafting tailored problems. Our extensive experiments demonstrate that GeoUni, with only 1.5B parameters, achieves performance comparable to larger models such as DeepSeek-R1 with 671B parameters in geometric reasoning tasks. GeoUni also excels in generating precise geometric diagrams, surpassing both text-to-image models and unified models, including the GPT-4o image generation. Most importantly, GeoUni is the only model capable of successfully generating textual problems with matching diagrams based on specific knowledge points, thus offering a wider range of capabilities that extend beyond current models.
Paperid: 577,   https://arxiv.org/pdf/2507.09647    
Authors:Peican Zhu, Yubo Jing, Le Cheng, Keke Tang, Yangming Guo
Abstract:
In recent years, the rampant spread of misinformation on social media has made accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood the semantics of images, and models struggle to discern news authenticity with limited textual information. Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage LVLM's powerful semantic understanding and extensive world knowledge. For images, the generated captions provide a comprehensive understanding of image content and scenes, while for text, the retrieved evidence helps break the information silos caused by the closed and limited text and context. On the other hand, we consider inter-class differences between different emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional types and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.
Paperid: 578,   https://arxiv.org/pdf/2504.20835    
Authors:Hongfei Xue, Yufeng Tang, Hexin Liu, Jun Zhang, Xuelong Geng, Lei Xie
Abstract:
Large language models have been extended to the speech domain, leading to the development of speech large language models (SLLMs). While existing SLLMs demonstrate strong performance in speech instruction-following for core languages (e.g., English), they often struggle with non-core languages due to the scarcity of paired speech-text data and limited multilingual semantic reasoning capabilities. To address this, we propose the semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework, which integrates speech-to-text translation into the reasoning process of SLLMs. The XS-CoT generates four types of tokens: instruction and response tokens in both core and non-core languages, enabling cross-lingual transfer of reasoning capabilities. To mitigate inference latency in generating target non-core response tokens, we incorporate a semi-implicit CoT scheme into XS-CoT, which progressively compresses the first three types of intermediate reasoning tokens while retaining global reasoning logic during training. By leveraging the robust reasoning capabilities of the core language, XS-CoT improves responses for non-core languages by up to 45% in GPT-4 score when compared to direct supervised fine-tuning on two representative SLLMs, Qwen2-Audio and SALMONN. Moreover, the semi-implicit XS-CoT reduces token delay by more than 50% with a slight drop in GPT-4 scores. Importantly, XS-CoT requires only a small amount of high-quality training data for non-core languages by leveraging the reasoning capabilities of core languages. To support training, we also develop a data pipeline and open-source speech instruction-following datasets in Japanese, German, and French.
Paperid: 579,   https://arxiv.org/pdf/2507.19209    
Authors:Xiaoyu Zhang, Zhifeng Bao, Hai Dong, Ziwei Wang, Jiajun Liu
Abstract:
Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.
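Counting by detecting object centers rather than precise boxes can be illustrated with a toy heatmap peak extractor. The threshold and kernel size here are illustrative, and CounterNet's overlapping-region partitioning and per-frame model selection are omitted.

```python
# Toy sketch: count objects per class by extracting local maxima from center heatmaps.
import torch
import torch.nn.functional as F

def count_from_heatmap(heatmap, thresh=0.3, kernel=3):
    """heatmap: (B, C, H, W) per-class center heatmaps with values in [0, 1]."""
    pooled = F.max_pool2d(heatmap, kernel, stride=1, padding=kernel // 2)
    peaks = (heatmap == pooled) & (heatmap > thresh)   # keep local maxima above the threshold
    return peaks.flatten(2).sum(dim=-1)                # (B, C) estimated object counts per class
```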
Paperid: 580,   https://arxiv.org/pdf/2507.20738    
Authors:Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Haoze Zhu, Jeff Z. Pan, Xiaojie Yuan
Abstract:
The multimodal knowledge graph reasoning (MKGR) task aims to predict the missing facts in the incomplete MKGs by leveraging auxiliary images and descriptions of entities. Existing approaches are trained with single-target objectives, which neglect the probabilistic correlations of entity labels, especially in non-target entities. Moreover, previous studies incorporate all modalities statically or adaptively, overlooking the negative impacts of irrelevant or misleading information in the incompetent modalities. To address these issues, we introduce a novel Reinforced Multimodal Distillation framework, exploiting the Dark Side of Modalities (DSoM) from two perspectives: (1) Dark knowledge from non-target entities: We propose to train a unimodal KGR model through logit distillation to mimic the multimodal soft labels provided by pre-trained multimodal teacher models. The multimodal soft labels could provide rich supervision signals with subtle correlations among both target and non-target entities from multiple perspectives. We further decouple the logits into neighbor entities and non-neighbor entities, dividing the correlations into two types. (2) Dark side in unhelpful modalities: To exclude the adverse effects of unhelpful modalities, we introduce a reinforced teacher combination mechanism that dynamically selects the optimal set of multimodal teachers for each triple. The agent is trained to maximize the rewards, which are only assigned to the beneficial multimodal combination strategies for the student model. Comprehensive experiments demonstrate the effectiveness of the DSoM framework on 5 MKGR datasets. Codes are available at github.com/OreOZhao/DSoM.
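The logit-distillation component can be illustrated with standard temperature-scaled knowledge distillation over entity scores. The neighbor/non-neighbor decoupling and the reinforced teacher selection are left out, so this is only a hedged sketch.

```python
# Standard temperature-scaled logit distillation (illustrative stand-in for DSoM's KD step).
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Both tensors: (B, num_entities) scores over candidate entities."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    # Match the teacher's soft entity distribution; the tau^2 factor keeps gradient scale stable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```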
Paperid: 581,   https://arxiv.org/pdf/2411.06459    
Authors:Nian Liu, Libin Liu, Zilong Zhang, Zi Wang, Hongzhao Xie, Tengyu Liu, Xinyi Tong, Yaodong Yang, Zhaofeng He
Abstract:
Learning natural and diverse behaviors from human motion datasets remains challenging in physics-based character control. Existing conditional adversarial models often suffer from tight and biased embedding distributions where embeddings from the same motion are closely grouped in a small area and shorter motions occupy even less space. Our empirical observations indicate this limits the representational capacity and diversity under each skill. An ideal latent space should be maximally packed by all motions' embedding clusters. In this paper, we propose a skill-conditioned controller that learns diverse skills with expressive variations. Our approach leverages the Neural Collapse phenomenon, a natural outcome of the classification-based encoder, to uniformly distribute cluster centers. We additionally propose a novel Embedding Expansion technique to form stylistic embedding clusters for diverse skills that are uniformly distributed on a hypersphere, maximizing the representational area occupied by each skill and minimizing unmapped regions. This maximally packed and uniformly distributed embedding space ensures that embeddings within the same cluster generate behaviors conforming to the characteristics of the corresponding motion clips, yet exhibiting noticeable variations within each cluster. Compared to existing methods, our controller not only generates high-quality, diverse motions covering the entire dataset but also achieves superior controllability, motion coverage, and diversity under each skill. Both qualitative and quantitative results confirm these traits, enabling our controller to be applied to a wide range of downstream tasks and serving as a cornerstone for diverse applications.
Paperid: 582,   https://arxiv.org/pdf/2507.09619    
Authors:Yilin Lu, Jianghang Lin, Linhuang Xie, Kai Zhao, Yansong Qu, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Abstract:
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.
Paperid: 583,   https://arxiv.org/pdf/2509.17769    
Authors:Yang Li, Xinyi Zeng, Zhe Xue, Pinxian Zeng, Zikai Zhang, Yan Wang
Abstract:
As the third generation of neural networks, spiking neural networks (SNNs) have recently gained widespread attention for their biological plausibility, energy efficiency, and effectiveness in processing neuromorphic datasets. To better emulate biological neurons, various models such as Integrate-and-Fire (IF) and Leaky Integrate-and-Fire (LIF) have been widely adopted in SNNs. However, these neuron models overlook the refractory period, a fundamental characteristic of biological neurons. Research on excitable neurons reveals that after firing, neurons enter a refractory period during which they are temporarily unresponsive to subsequent stimuli. This mechanism is critical for preventing over-excitation and mitigating interference from aberrant signals. Therefore, we propose a simple yet effective method to incorporate the refractory period into spiking LIF neurons through spike-triggered threshold dynamics, termed RPLIF. Our method ensures that each spike accurately encodes neural information, effectively preventing neuron over-excitation under continuous inputs and interference from anomalous inputs. Incorporating the refractory period into LIF neurons is seamless and computationally efficient, enhancing robustness and efficiency while yielding better performance with negligible overhead. To the best of our knowledge, RPLIF achieves state-of-the-art performance on Cifar10-DVS (82.40%) and N-Caltech101 (83.35%) with fewer timesteps and demonstrates superior performance on DVS128 Gesture (97.22%) at low latency.
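A spike-triggered threshold mechanism as a stand-in for a refractory period can be shown in a few lines: after each spike the firing threshold jumps and then decays back toward its resting value, so the neuron is briefly unresponsive. The constants and exact dynamics below are illustrative, not the paper's RPLIF neuron.

```python
# Toy LIF neuron with a spike-triggered adaptive threshold emulating a refractory period.
def lif_with_refractory(inputs, tau_mem=0.9, tau_thr=0.8, v_th0=1.0, thr_jump=2.0):
    v, thr, spikes = 0.0, v_th0, []
    for x in inputs:                      # inputs: iterable of input currents per timestep
        v = tau_mem * v + x               # leaky integration of the membrane potential
        s = 1.0 if v >= thr else 0.0
        spikes.append(s)
        v = v * (1.0 - s)                 # hard reset on spike
        # Spike-triggered threshold increase, then exponential decay back toward v_th0.
        thr = v_th0 + tau_thr * (thr - v_th0) + thr_jump * s
    return spikes

# Example: lif_with_refractory([1.2, 1.2, 1.2, 0.1, 1.2]) -> [1.0, 0.0, 0.0, 0.0, 1.0]
```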
Paperid: 584,   https://arxiv.org/pdf/2507.06444    
Authors:Jiaxun Zhang, Haicheng Liao, Yumu Xie, Chengyue Wang, Yanchen Guan, Bin Rao, Zhenning Li
Abstract:
Accurate accident anticipation remains challenging when driver cognition and dynamic road conditions are underrepresented in predictive models. In this paper, we propose CAMERA (Context-Aware Multi-modal Enhanced Risk Anticipation), a multi-modal framework integrating dashcam video, textual annotations, and driver attention maps for robust accident anticipation. Unlike existing methods that rely on static or environment-centric thresholds, CAMERA employs an adaptive mechanism guided by scene complexity and gaze entropy, reducing false alarms while maintaining high recall in dynamic, multi-agent traffic scenarios. A hierarchical fusion pipeline with Bi-GRU (Bidirectional GRU) captures spatio-temporal dependencies, while a Geo-Context Vision-Language module translates 3D spatial relationships into interpretable, human-centric alerts. Evaluations on the DADA-2000 and related benchmarks show that CAMERA achieves state-of-the-art performance, improving accuracy and lead time. These results demonstrate the effectiveness of modeling driver attention, contextual description, and adaptive risk thresholds to enable more reliable accident anticipation.
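The adaptive alerting described above could be sketched as a threshold that is lowered when driver gaze is scattered (high entropy) or the scene is complex. The mapping, constants, and function names are hypothetical and only illustrate the idea.

```python
# Illustrative adaptive risk threshold driven by gaze entropy and scene complexity.
import numpy as np

def gaze_entropy(attention_map, eps=1e-8):
    p = attention_map.flatten().astype(float)
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def adaptive_threshold(attention_map, scene_complexity, base=0.5, k_gaze=0.05, k_scene=0.1):
    # Scattered attention (high entropy) and denser scenes lower the alert threshold.
    h = gaze_entropy(attention_map)
    return float(np.clip(base - k_gaze * h - k_scene * scene_complexity, 0.1, 0.9))

# Usage: adaptive_threshold(np.random.rand(24, 32), scene_complexity=0.6)
```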
Paperid: 585,   https://arxiv.org/pdf/2507.17394    
Authors:Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin
Abstract:
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during MLLM reasoning. Then a lightweight anomaly scorer and temporal localization module efficiently detects anomalies using these extracted hidden states and finally generates explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities in different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
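The lightweight scorer over extracted intermediate hidden states might be as simple as a pooled linear probe, as in the sketch below. The layer-selection logic of DLSP and the temporal localization module are not shown, and all names are placeholders.

```python
# Minimal linear-probe anomaly scorer over intermediate MLLM hidden states (illustrative).
import torch
import torch.nn as nn

class HiddenStateAnomalyScorer(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):
        # hidden_states: (B, T, D) features taken from a chosen intermediate layer,
        # mean-pooled over tokens before scoring each clip.
        pooled = hidden_states.mean(dim=1)
        return torch.sigmoid(self.probe(pooled)).squeeze(-1)   # anomaly score per clip in [0, 1]
```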
Paperid: 586,   https://arxiv.org/pdf/2409.14878    
Authors:Zhiyuan Zhou, Jilong Liu, Sanwang Wang, Shijie Hao, Yanrong Guo, Richang Hong
Abstract:
Depression poses significant challenges to patients and healthcare organizations, necessitating efficient assessment methods. Existing paradigms typically focus on a patient-doctor setting that overlooks multi-role interactions, such as family involvement in the evaluation and caregiving process. Moreover, current automatic depression detection (ADD) methods usually model depression detection as a classification or regression task, lacking interpretability for the decision-making process. To address these issues, we developed InterMind, a doctor-patient-family interactive depression assessment system empowered by large language models (LLMs). Our system enables patients and families to contribute descriptions, generates assistive diagnostic reports for doctors, and provides actionable insights, improving diagnostic precision and efficiency. To enhance LLMs' performance in psychological counseling and diagnostic interpretability, we integrate retrieval-augmented generation (RAG) and chain-of-thought (CoT) techniques for data augmentation, which mitigates the hallucination issue of LLMs in specific scenarios after instruction fine-tuning. Quantitative experiments and professional assessments by clinicians validate the effectiveness of our system.
Paperid: 587,   https://arxiv.org/pdf/2404.12966    
Authors:Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Tianwen Qian, Bin Zhu, Na Zhao, Yu-Gang Jiang
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. Interestingly, we find that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question, whereas such presuppositions appear naive to human reasoning. Besides, we also propose a simple yet effective method, Active Deduction (AD), a novel reinforcement learning paradigm to encourage the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning abilities without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and private MLLMs on MARS-Bench, along with experimental analyses of the AD method.
Paperid: 588,   https://arxiv.org/pdf/2505.12339    
Authors:Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo
Abstract:
With the development of generative artificial intelligence, new forgery methods are rapidly emerging. Social platforms are flooded with vast amounts of unlabeled synthetic data and authentic data, making it increasingly challenging to distinguish real from fake. Due to the lack of labels, existing supervised detection methods struggle to effectively address the detection of unknown deepfake methods. Moreover, in open-world scenarios, the amount of unlabeled data greatly exceeds that of labeled data. Therefore, we define a new deepfake detection generalization task which focuses on how to achieve efficient detection of large amounts of unlabeled data based on limited labeled data to simulate an open-world scenario. To solve the above-mentioned task, we propose a novel Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) to improve the generalization ability of existing methods. Our approach aims to transfer deepfake detection knowledge from a small amount of labeled source domain data to large-scale unlabeled target domain data. Specifically, we introduce the Domain Distance Optimization (DDO) module to align different domain features by optimizing both inter-domain and intra-domain distances. Additionally, the Similarity-based Class Boundary Separation (SCBS) module is used to enhance the aggregation of similar samples to ensure clearer class boundaries, while an adversarial training mechanism is adopted to learn the domain-invariant features. Extensive experiments show that the proposed deepfake detection generalization enhancement training strategy excels in cross-method and cross-dataset scenarios, improving the model's generalization.
Paperid: 589,   https://arxiv.org/pdf/2507.12932    
Authors:Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji
Abstract:
The rapid advancement of voice deepfake technologies has raised serious concerns about user audio privacy, as attackers increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud, and misinformation campaigns. While existing defense methods offer partial protection, they face critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge, and high computational and temporal costs during the encryption process. To address these challenges and defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small amount of user data. These highly malleable frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance to voice deepfake attacks, all while preserving perceptual quality and speech intelligibility. Notably, Enkidu achieves over 50 to 200 times processing memory efficiency (as low as 0.004 gigabytes) and 3 to 7000 times runtime efficiency (real-time coefficient as low as 0.004) compared to six state-of-the-art countermeasures. Extensive experiments across six mainstream text-to-speech models and five cutting-edge automated speaker verification models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against both vanilla and adaptive voice deepfake attacks. Our code is currently available.
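A toy version of applying a universal frequency-domain perturbation to audio of arbitrary length is sketched below: a fixed complex patch is added to the spectrum of each frame. The frame size, scaling factor, and the patch-learning procedure are assumptions and not Enkidu's actual pipeline.

```python
# Toy sketch: inject a fixed frequency-domain perturbation patch into each audio frame.
import numpy as np

def apply_frequential_perturbation(audio, freq_patch, frame=1024, eps=0.02):
    """audio: 1-D float array; freq_patch: complex array of length frame // 2 + 1."""
    out = audio.copy()
    for start in range(0, len(audio) - frame + 1, frame):     # non-overlapping frames
        spec = np.fft.rfft(audio[start:start + frame]) + eps * freq_patch  # add the learned patch
        out[start:start + frame] = np.fft.irfft(spec, n=frame)
    return out

# Usage: apply_frequential_perturbation(np.random.randn(16000), np.zeros(513, dtype=complex))
```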
Paperid: 590,   https://arxiv.org/pdf/2507.20326    
Authors:Jiaxi Wang, Yaosen Min, Xun Zhu, Miao Li, Ji Wu
Abstract:
Polymers, composed of repeating structural units called monomers, are fundamental materials in daily life and industry. Accurate property prediction for polymers is essential for their design, development, and application. However, existing modeling approaches, which typically represent polymers by the constituent monomers, struggle to capture the overall properties of polymers, since these properties change during the polymerization process. In this study, we propose a Multimodal Infinite Polymer Sequence (MIPS) pre-training framework, which represents polymers as infinite sequences of monomers and integrates both topological and spatial information for comprehensive modeling. From the topological perspective, we generalize the message passing mechanism (MPM) and graph attention mechanism (GAM) to infinite polymer sequences. For MPM, we demonstrate that applying MPM to infinite polymer sequences is equivalent to applying MPM on the induced star-linking graph of monomers. For GAM, we propose to further replace global graph attention with localized graph attention (LGA). Moreover, we show the robustness of the "star linking" strategy through the Repeat and Shift Invariance Test (RSIT). Despite its robustness, the "star linking" strategy exhibits limitations when monomer side chains contain ring structures, a common characteristic of polymers, as it fails the Weisfeiler-Lehman (WL) test. To overcome this issue, we propose backbone embedding to enhance the capability of MPM and LGA on infinite polymer sequences. From the spatial perspective, we extract 3D descriptors of repeating monomers to capture spatial information. Finally, we design a cross-modal fusion mechanism to unify the topological and spatial information. Experimental validation across eight diverse polymer property prediction tasks reveals that MIPS achieves state-of-the-art performance.
Paperid: 591,   https://arxiv.org/pdf/2502.19455    
Authors:Lingzhou Mu, Baiji Liu, Ruonan Zhang, Guiming Mo, Jiawei Jin, Kai Zhang, Haozhi Huang
Abstract:
Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, the decoupling of head pose and facial expression allows for independent control of each, offering precise manipulation of both the avatar's pose and facial expressions. We also demonstrate its flexibility in integrating with existing 3D head generation methods, bridging the gap between 3D model-based approaches and end-to-end diffusion techniques. Extensive experiments show that our method outperforms recent audio-driven portrait video models in both naturalness and controllability.
Paperid: 592,   https://arxiv.org/pdf/2508.02050    
Authors:Yuli Liu, Wenjun Kong, Cheng Luo, Weizhi Ma
Abstract:
Sequential Recommendation (SR) focuses on personalizing user experiences by predicting future preferences based on historical interactions. Transformer models, with their attention mechanisms, have become the dominant architecture in SR tasks due to their ability to capture dependencies in user behavior sequences. However, traditional attention mechanisms, where attention weights are computed through query-key transformations, are inherently linear and deterministic. This fixed approach limits their ability to account for the dynamic and non-linear nature of user preferences, leading to challenges in capturing evolving interests and subtle behavioral patterns. Given that generative models excel at capturing non-linearity and probabilistic variability, we argue that generating attention distributions offers a more flexible and expressive alternative compared to traditional attention mechanisms. To support this claim, we present a theoretical proof demonstrating that generative attention mechanisms offer greater expressiveness and stochasticity than traditional deterministic approaches. Building upon this theoretical foundation, we introduce two generative attention models for SR, grounded in the principles of Variational Autoencoders (VAEs) and Diffusion Models (DMs), respectively. These models are designed specifically to generate adaptive attention distributions that better align with variable user preferences. Extensive experiments on real-world datasets show our models significantly outperform state-of-the-art methods in both accuracy and diversity.
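One way to picture attention weights generated from a latent variable rather than a deterministic query-key product is the VAE-style sketch below. The architecture and loss weighting are assumptions, and the diffusion-based variant is not shown.

```python
# Hedged sketch of VAE-style generative attention over a user's interaction sequence.
import torch
import torch.nn as nn

class GenerativeAttention(nn.Module):
    def __init__(self, dim: int, latent: int = 16):
        super().__init__()
        self.to_mu = nn.Linear(dim, latent)
        self.to_logvar = nn.Linear(dim, latent)
        self.to_score = nn.Linear(latent, 1)

    def forward(self, seq_emb):
        # seq_emb: (B, L, D) embeddings of a user's interaction sequence.
        mu, logvar = self.to_mu(seq_emb), self.to_logvar(seq_emb)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization trick
        attn = torch.softmax(self.to_score(z).squeeze(-1), dim=-1)     # (B, L) stochastic attention
        context = (attn.unsqueeze(-1) * seq_emb).sum(dim=1)            # (B, D) aggregated context
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()     # KL regularizer for training
        return context, attn, kl
```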
Paperid: 593,   https://arxiv.org/pdf/2504.09069    
Authors:Shuning Sun, Yu Zhang, Chen Wu, Dianjie Lu, Guijuan Zhan, Yang Weng, Zhuoran Zheng
Abstract:
Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers state-of-the-art performance with strong generalization and efficiency. Quantitative results demonstrate that UniFlowRestore achieves state-of-the-art performance, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task, while maintaining top or second-best scores across all evaluated tasks.
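The time-continuous view can be illustrated with a fixed-step (Euler-style) integration of a velocity driven by a decaying physics gradient and a prompt-derived force. The callables and constants below are placeholders, not PhysicsUNet or PromptGenerator, and the update rule is only one plausible reading of the described Hamiltonian system.

```python
# Schematic fixed-step integration of a prompt-guided, physics-informed vector field.
import torch

def fixed_step_restore(x, momentum, physics_grad, prompt_force, steps=8, dt=0.125, gamma=0.9):
    """x: degraded video tensor; physics_grad / prompt_force: callables returning tensors like x."""
    v = momentum
    for k in range(steps):
        decay = gamma ** k                                   # decaying physical gradient term
        v = v - dt * decay * physics_grad(x) + dt * prompt_force(x)
        x = x + dt * v                                       # position update driven by momentum
    return x

# Usage sketch:
# x0 = torch.rand(1, 3, 8, 64, 64)
# fixed_step_restore(x0, torch.zeros_like(x0), lambda x: x - x.mean(), lambda x: 0 * x)
```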
Paperid: 594,   https://arxiv.org/pdf/2508.04549    
Authors:Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung
Abstract:
Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.
Paperid: 595,   https://arxiv.org/pdf/2506.08817    
Authors:Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, Shanghang Zhang
Abstract:
Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/.
Paperid: 596,   https://arxiv.org/pdf/2407.20756    
Authors:Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui
Abstract:
Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various vision question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities.
Paperid: 597,   https://arxiv.org/pdf/2508.07918    
Authors:Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
Abstract:
Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision-Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
Paperid: 598,   https://arxiv.org/pdf/2506.14805    
Authors:Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang
Abstract:
As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.
Paperid: 599,   https://arxiv.org/pdf/2507.18616    
Authors:Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim
Abstract:
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
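To make the cycle-consistency idea in the SynC abstract above concrete, the sketch below reassigns each caption to the candidate image that best retrieves that caption back via image-to-text similarity. It is a minimal stand-in, not the authors' implementation: it assumes precomputed, L2-normalized caption and image embeddings (e.g., from a CLIP-style encoder), and the function name `reassign_captions`, the candidate count `k`, and the ranking rule are all illustrative choices.

```python
import numpy as np

def reassign_captions(cap_emb, img_emb, k=5):
    """Illustrative caption-to-image reassignment with a cycle-consistency check.

    cap_emb: (C, d) L2-normalized caption embeddings
    img_emb: (I, d) L2-normalized image embeddings
    Returns: index of the selected image for each caption.
    """
    sim = cap_emb @ img_emb.T                      # caption-to-image similarities
    topk = np.argsort(-sim, axis=1)[:, :k]         # k candidate images per caption
    chosen = np.empty(len(cap_emb), dtype=int)
    for c, candidates in enumerate(topk):
        # Cycle consistency: for each candidate image, count how many captions
        # score above caption c, and prefer the image where c ranks highest.
        back_sim = img_emb[candidates] @ cap_emb.T        # (k, C)
        rank_of_c = (back_sim > back_sim[:, [c]]).sum(1)  # captions scoring above c
        chosen[c] = candidates[np.argmin(rank_of_c)]
    return chosen

# toy usage with random embeddings
rng = np.random.default_rng(0)
caps = rng.normal(size=(8, 32)); caps /= np.linalg.norm(caps, axis=1, keepdims=True)
imgs = rng.normal(size=(20, 32)); imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
print(reassign_captions(caps, imgs, k=5))
```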
Paperid: 600,   https://arxiv.org/pdf/2507.18881    
Authors:Bolei Chen, Jiaxu Kang, Haonan Yang, Ping Zhong, Jianxin Wang
Abstract:
Since a building's floorplans are easily accessible, consistent over time, and inherently robust to changes in visual appearance, self-localization within the floorplan has attracted researchers' interest. However, since floorplans are minimalist representations of a building's structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. While existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance, they fail to address the localization errors caused by frequent visual changes and view occlusions due to variously shaped 3D objects. To tackle these issues, this paper views the 2D Floorplan Localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm. For the 3D geometric prior modeling, we first model geometrically aware view invariance using multi-view constraints, i.e., leveraging imaging geometric principles to provide matching constraints between multiple images that see the same points. Then, we further model the view-scene aligned geometric priors, enhancing the cross-modal geometry-color correspondences by associating the scene's surface reconstruction with the RGB frames of the sequence. Both 3D priors are modeled through self-supervised contrastive learning, thus no additional geometric or semantic annotations are required. These 3D priors summarized in extensive realistic scenes bridge the modal gap while improving localization success without increasing the computational burden on the FLoc algorithm. Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy. All data and code will be released after the anonymous review.
Paperid: 601,   https://arxiv.org/pdf/2507.14904    
Authors:Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang
Abstract:
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.
Paperid: 602,   https://arxiv.org/pdf/2508.20513    
Authors:Yongqi Shao, Binxin Mei, Cong Tan, Hong Huo, Tao Fang
Abstract:
Early screening for Alzheimer's Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. MoTAS leverages Text-to-Speech (TTS) augmentation to increase data volume and employs a Mixture of Experts (MoE) mechanism to improve multimodal feature selection, jointly enhancing model generalization. The process begins with automatic speech recognition (ASR) to obtain accurate transcriptions. TTS is then used to synthesize speech that enriches the dataset. After extracting acoustic and text embeddings, the MoE mechanism dynamically selects the most informative features, optimizing feature fusion for improved classification. Evaluated on the ADReSSo dataset, MoTAS achieves a leading accuracy of 85.71%, outperforming existing baselines. Ablation studies further validate the individual contributions of TTS augmentation and MoE in boosting classification performance. These findings highlight the practical value of MoTAS in real-world AD screening scenarios, particularly in data-limited settings.
Paperid: 603,   https://arxiv.org/pdf/2511.05034    
Authors:Jing Jin, Xu Liu, Te Gao, Zhihong Shi, Yixiong Liang, Ruiqing Zheng, Hulin Kuang, Min Zeng, Shichao Kan
Abstract:
Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments conducted over cancer subtyping, cancer recognition, and mutation prediction tasks proved the effectiveness of the proposed DRE-SLCL method.
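A toy PyTorch approximation of the memory-bank-plus-residual-encoding recipe described above is sketched below; it is not the authors' code. The `ResidualEncoder`, the random tensors standing in for cached tile features, and the InfoNCE-style loss pairing slide representations with report embeddings are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class ResidualEncoder(torch.nn.Module):
    """Toy residual encoding: aggregate tile features as residuals to learned codewords."""
    def __init__(self, dim, n_codes=16):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, tiles):                                     # tiles: (N, dim)
        # soft-assign each tile to codewords, then sum assignment-weighted residuals
        assign = F.softmax(tiles @ self.codebook.T, dim=1)        # (N, K)
        resid = tiles.unsqueeze(1) - self.codebook.unsqueeze(0)   # (N, K, dim)
        slide = (assign.unsqueeze(-1) * resid).sum(dim=(0, 1))    # (dim,)
        return F.normalize(slide, dim=0)

def slide_contrastive_loss(slide_reprs, report_embs, tau=0.07):
    """InfoNCE-style loss pairing each WSI representation with its report embedding."""
    logits = F.normalize(slide_reprs, dim=1) @ F.normalize(report_embs, dim=1).T / tau
    targets = torch.arange(len(slide_reprs))
    return F.cross_entropy(logits, targets)

# toy mini-batch: 4 WSIs; sampled tile features are mixed with memory-bank features
dim, enc = 64, ResidualEncoder(64)
memory_bank = {i: torch.randn(200, dim) for i in range(4)}        # cached tile features
slides = []
for i in range(4):
    sampled = torch.randn(32, dim)                                # freshly encoded tiles
    cached = memory_bank[i][torch.randint(200, (64,))]            # extra tiles from memory
    slides.append(enc(torch.cat([sampled, cached])))
loss = slide_contrastive_loss(torch.stack(slides), torch.randn(4, dim))
loss.backward()
print(float(loss))
```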
Paperid: 604,   https://arxiv.org/pdf/2512.22602    
Authors:Bin Wang, Yang Xu, Huan Zhao, Hao Zhang, Zixing Zhang
Abstract:
Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.
Paperid: 605,   https://arxiv.org/pdf/2508.01236    
Authors:Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Abstract:
Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. Firstly, the caption model generates question-related descriptions under the guidance of the user instruction. Then, the selector further identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
Paperid: 606,   https://arxiv.org/pdf/2508.04524    
Authors:Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu, Baoyuan Wu, Guangliang Cheng
Abstract:
The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake images, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.
Paperid: 607,   https://arxiv.org/pdf/2507.17232    
Authors:Mashiro Toyooka, Kiyoharu Aizawa, Yoko Yamakata
Abstract:
Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model's understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs. The dataset is publicly available at: https://huggingface.co/datasets/mashi6n/nhkrecipe-100-anno-1
Paperid: 608,   https://arxiv.org/pdf/2507.06719    
Authors:Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu
Abstract:
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair". This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
Paperid: 609,   https://arxiv.org/pdf/2412.03910    
Authors:Xuesong Li, Jinguang Tong, Jie Hong, Vivien Rolland, Lars Petersson
Abstract:
Dynamic scene reconstruction from monocular video is essential for real-world applications. We introduce DGNS, a hybrid framework integrating Deformable Gaussian Splatting and Dynamic Neural Surfaces, effectively addressing dynamic novel-view synthesis and 3D geometry reconstruction simultaneously. During training, depth maps generated by the deformable Gaussian splatting module guide the ray sampling for faster processing and provide depth supervision within the dynamic neural surface module to improve geometry reconstruction. Conversely, the dynamic neural surface directs the distribution of Gaussian primitives around the surface, enhancing rendering quality. In addition, we propose a depth-filtering approach to further refine depth supervision. Extensive experiments conducted on public datasets demonstrate that DGNS achieves state-of-the-art performance in 3D reconstruction, along with competitive results in novel-view synthesis.
Paperid: 610,   https://arxiv.org/pdf/2504.09255    
Authors:Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai
Abstract:
Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
Paperid: 611,   https://arxiv.org/pdf/2508.00507    
Authors:Yiming Xu, Jiarun Chen, Zhen Peng, Zihan Chen, Qika Lin, Lan Ma, Bin Shi, Bo Dong
Abstract:
The natural combination of intricate topological structures and rich textual information in text-attributed graphs (TAGs) opens up a novel perspective for graph anomaly detection (GAD). However, existing GAD methods primarily focus on designing complex optimization objectives within the graph domain, overlooking the complementary value of the textual modality, whose features are often encoded by shallow embedding techniques, such as bag-of-words or skip-gram, so that semantic context related to anomalies may be missed. To unleash the enormous potential of textual modality, large language models (LLMs) have emerged as promising alternatives due to their strong semantic understanding and reasoning capabilities. Nevertheless, their application to TAG anomaly detection remains nascent, and they struggle to encode high-order structural information inherent in graphs due to input length constraints. For high-quality anomaly detection in TAGs, we propose CoLL, a novel framework that combines LLMs and graph neural networks (GNNs) to leverage their complementary strengths. CoLL employs multi-LLM collaboration for evidence-augmented generation to capture anomaly-relevant contexts while delivering human-readable rationales for detected anomalies. Moreover, CoLL integrates a GNN equipped with a gating mechanism to adaptively fuse textual features with evidence while preserving high-order topological information. Extensive experiments demonstrate the superiority of CoLL, achieving an average improvement of 13.37% in AP. This study opens a new avenue for incorporating LLMs in advancing GAD.
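The gated fusion of node text features with LLM-derived evidence described in the CoLL abstract above can be illustrated with a minimal sketch. The module below is an assumption-laden toy, not the paper's architecture: the per-dimension sigmoid gate, the single mean-aggregation GNN layer, and the per-node anomaly score head are simplifications chosen only to show how a gate can adaptively mix the two feature sources while neighborhood structure is preserved.

```python
import torch
import torch.nn.functional as F

class GatedFusionGNN(torch.nn.Module):
    """Toy gated fusion of node text features with LLM-derived evidence features,
    followed by one round of mean-neighbor aggregation (a minimal GNN layer)."""
    def __init__(self, d_text, d_evi, d_hid):
        super().__init__()
        self.proj_t = torch.nn.Linear(d_text, d_hid)
        self.proj_e = torch.nn.Linear(d_evi, d_hid)
        self.gate = torch.nn.Linear(2 * d_hid, d_hid)
        self.msg = torch.nn.Linear(d_hid, d_hid)
        self.score = torch.nn.Linear(d_hid, 1)          # anomaly score per node

    def forward(self, x_text, x_evi, adj):
        t, e = self.proj_t(x_text), self.proj_e(x_evi)
        g = torch.sigmoid(self.gate(torch.cat([t, e], dim=1)))   # per-dimension gate
        h = g * t + (1 - g) * e                                   # adaptive fusion
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        h = F.relu(self.msg(adj @ h / deg) + h)                   # aggregate neighbors
        return self.score(h).squeeze(-1)

# toy graph with 6 nodes and random symmetric adjacency
adj = (torch.rand(6, 6) > 0.6).float()
adj = ((adj + adj.T) > 0).float()
model = GatedFusionGNN(d_text=16, d_evi=8, d_hid=32)
scores = model(torch.randn(6, 16), torch.randn(6, 8), adj)
print(scores.shape)  # torch.Size([6])
```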
Paperid: 612,   https://arxiv.org/pdf/2507.21637    
Authors:Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, Mingang Chen
Abstract:
Large vision-language models (LVLMs) are vulnerable to harmful input compared to their language-only backbones. We investigated this vulnerability by exploring LVLMs' internal dynamics, framing their inherent safety understanding in terms of three key capabilities. Specifically, we define these capabilities as safety perception, semantic understanding, and alignment for linguistic expression, and experimentally pinpointed their primary locations within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, leading to a reduction in safety. Motivated by these findings, we propose Self-Aware Safety Augmentation (SASA), a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model's inherent semantic understanding to enhance safety recognition without fine-tuning. Then, we employ linear probing to articulate the model's internal semantic comprehension to detect the risk before the generation process. Extensive experiments on various datasets and tasks demonstrate that SASA significantly improves the safety of LVLMs, with minimal impact on utility.
Paperid: 613,   https://arxiv.org/pdf/2508.19688    
Authors:Gangjian Zhang, Jian Shu, Nanjie Yao, Hao Wang
Abstract:
Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. However, the geometric ambiguity inherent in a single 2D image and the scarcity of 3D human training data are the main obstacles limiting progress in this field. To address these issues, current methods employ prior geometric estimation networks to derive various human geometric forms, such as the SMPL model and normal maps. However, they struggle to integrate these modalities effectively, leading to view inconsistencies, such as facial distortions. To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. To further facilitate geometry learning, we introduce a Supervisor Feature Regularization module. By employing a multi-view network with the same structure to provide intermediate features as training supervision, these varied geometric priors can be better fused. To tackle data scarcity and further improve reconstruction quality, we also propose an Online Animation Augmentation module. By building a one-feed-forward animation network, we augment a massive number of samples from the original 3D human data online for model training. Extensive experiments on two benchmarks show the superiority of our approach compared to state-of-the-art methods.
Paperid: 614,   https://arxiv.org/pdf/2507.19062    
Authors:Zhaoxi Mu, Rilin Chen, Andong Li, Meng Yu, Xinyu Yang, Dong Yu
Abstract:
This paper introduces OmniGSE, a novel general speech enhancement (GSE) framework designed to mitigate the diverse distortions that speech signals encounter in real-world scenarios. These distortions include background noise, reverberation, bandwidth limitations, signal clipping, and network packet loss. Existing methods typically focus on optimizing for a single type of distortion, often struggling to effectively handle the simultaneous presence of multiple distortions in complex scenarios. OmniGSE bridges this gap by integrating the strengths of discriminative and generative approaches through a two-stage architecture that enables cross-domain collaborative optimization. In the first stage, continuous features are enhanced using a lightweight channel-split NAC-RoFormer. In the second stage, discrete tokens are generated to reconstruct high-quality speech through language models. Specifically, we designed a hierarchical language model structure consisting of a RootLM and multiple BranchLMs. The RootLM models general acoustic features across codebook layers, while the BranchLMs explicitly capture the progressive relationships between different codebook levels. Experimental results demonstrate that OmniGSE surpasses existing models across multiple benchmarks, particularly excelling in scenarios involving compound distortions. These findings underscore the framework's potential for robust and versatile speech enhancement in real-world applications.
Paperid: 615,   https://arxiv.org/pdf/2508.01574    
Authors:Pengfei Gu, Hongxiao Wang, Yejia Zhang, Huimin Li, Chaoli Wang, Danny Chen
Abstract:
Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding local topology of patches. In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. Our main objective is to capture topological information in local patches of an input image into a vectorized form. Specifically, we first compute persistence diagrams (PDs) of the patches, and then vectorize and arrange these PDs into long vectors for pixels of the patches. The resulting multi-channel image-form representation is called a TopoImage. TopoImages offers a new perspective for data analysis. To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement. Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.
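The vectorization step above (persistence diagrams per patch, arranged into a multi-channel image) can be sketched as follows. This is a minimal illustration assuming the diagrams have already been computed with any persistent-homology library; the grid size, persistence weighting, and function names are invented for the example and are not the paper's construction.

```python
import numpy as np

def vectorize_diagram(dgm, grid=8, bounds=(0.0, 1.0)):
    """Turn one persistence diagram (array of (birth, death) pairs) into a
    persistence-weighted 2D histogram over (birth, persistence), flattened."""
    hist = np.zeros((grid, grid))
    if len(dgm) == 0:
        return hist.ravel()
    birth = dgm[:, 0]
    pers = dgm[:, 1] - dgm[:, 0]
    lo, hi = bounds
    bx = np.clip(((birth - lo) / (hi - lo) * grid).astype(int), 0, grid - 1)
    by = np.clip(((pers - lo) / (hi - lo) * grid).astype(int), 0, grid - 1)
    np.add.at(hist, (bx, by), pers)              # weight each point by its persistence
    return hist.ravel()

def topo_image(patch_diagrams, H, W, grid=8):
    """Stack per-patch vectors into an (H, W, grid*grid) multi-channel image,
    assuming one diagram per patch laid out in row-major order."""
    vecs = [vectorize_diagram(d, grid) for d in patch_diagrams]
    return np.stack(vecs).reshape(H, W, grid * grid)

# toy example: a 4x4 grid of patches with random stand-in "diagrams"
rng = np.random.default_rng(0)
dgms = []
for _ in range(16):
    b = rng.uniform(0, 0.5, size=5)
    dgms.append(np.stack([b, b + rng.uniform(0, 0.5, size=5)], axis=1))
print(topo_image(dgms, 4, 4).shape)   # (4, 4, 64)
```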
Paperid: 616,   https://arxiv.org/pdf/2504.09967    
Authors:Xun Zhu, Fanbin Mo, Zheng Zhang, Jiaxi Wang, Yiming Shi, Ming Wu, Chuang Zhang, Miao Li, Ji Wu
Abstract:
The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX features the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.
Paperid: 617,   https://arxiv.org/pdf/2504.16405    
Authors:Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, Xiongkuo Min
Abstract:
The rapid development of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
Paperid: 618,   https://arxiv.org/pdf/2507.23253    
Authors:Mingyang Yu, Xiahui Guo, Peng chen, Zhenkai Li, Yang Shu
Abstract:
Time Series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
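The first-order-difference and frequency-domain terms of the Shape-Aware Temporal Loss described above are directly codeable; the sketch below combines them with point-wise MSE. It is an illustrative approximation only: the weights are arbitrary, and the perceptual-feature term, which needs the paper's pre-trained temporal feature extractor and time-series image autoencoder, is omitted.

```python
import torch
import torch.nn.functional as F

def shape_aware_temporal_loss(pred, target, w_diff=1.0, w_freq=1.0):
    """Illustrative combination of a first-order-difference loss and a frequency
    loss on top of point-wise MSE (the perceptual-feature term of SATL is omitted)."""
    mse = F.mse_loss(pred, target)
    # structural consistency: MSE between first-order differences
    diff = F.mse_loss(pred[..., 1:] - pred[..., :-1],
                      target[..., 1:] - target[..., :-1])
    # periodic structure: compare FFT magnitudes of prediction and target
    freq = F.mse_loss(torch.fft.rfft(pred, dim=-1).abs(),
                      torch.fft.rfft(target, dim=-1).abs())
    return mse + w_diff * diff + w_freq * freq

# toy usage: batch of 8 series, 96 forecast steps
pred = torch.randn(8, 96, requires_grad=True)
target = torch.sin(torch.linspace(0, 12.56, 96)).repeat(8, 1)
loss = shape_aware_temporal_loss(pred, target)
loss.backward()
print(float(loss))
```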
Paperid: 619,   https://arxiv.org/pdf/2507.16238    
Authors:Xin Xu, Chaoyue Ren, Wei Liu, Wenke Huang, Bin Yang, Zhixi Yu, Kui Jiang
Abstract:
The Federated Domain Generalization for Person re-identification (FedDG-ReID) aims to learn a global server model that can be effectively generalized to source and target domains through distributed source domain data. Existing methods mainly improve the diversity of samples through style transformation, which to some extent enhances the generalization performance of the model. However, we discover that not all styles contribute to the generalization performance. Therefore, we define styles that are beneficial or harmful to the model's generalization performance as positive or negative styles. Based on this, a new issue arises: how to effectively screen and continuously utilize positive styles. To solve this problem, we propose a Style Screening and Continuous Utilization (SSCU) framework. Firstly, we design a Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client model to screen and accumulate generated positive styles. Meanwhile, we propose a style memory recognition loss to fully leverage the positive styles memorized by the memory. Furthermore, we propose a Collaborative Style Training (CST) strategy to make full use of positive styles. Unlike traditional learning strategies, our approach leverages both newly generated styles and the accumulated positive styles stored in memory to train client models on two distinct branches. This training strategy is designed to effectively promote the rapid acquisition of new styles by the client models, and guarantees the continuous and thorough utilization of positive styles, which is highly beneficial for the model's generalization performance. Extensive experimental results demonstrate that our method outperforms existing methods in both the source domain and the target domain.
Paperid: 620,   https://arxiv.org/pdf/2507.14686    
Authors:Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew
Abstract:
Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
Paperid: 621,   https://arxiv.org/pdf/2507.20623    
Authors:Yang Zhao, Shusheng Li, Xueshang Feng
Abstract:
With the development of lightweight deep learning algorithms, various deep neural network (DNN) models have been proposed for the remote sensing scene classification (RSSC) application. However, it is still challenging for these RSSC models to achieve an optimal balance among model accuracy, inference latency, and energy consumption on resource-constrained edge devices. In this paper, we propose a lightweight RSSC framework, which includes a distilled global filter network (GFNet) model and an early-exit mechanism designed for edge devices to achieve state-of-the-art performance. Specifically, we first apply frequency domain distillation on the GFNet model to reduce model size. Then we design a dynamic early-exit model tailored for DNN models on edge devices to further improve model inference efficiency. We evaluate our E3C model on three edge devices across four datasets. Extensive experimental results show that it achieves an average of 1.3x speedup on model inference and over 40% improvement on energy efficiency, while maintaining high classification accuracy.
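The early-exit mechanism mentioned above is, at its core, a confidence-threshold test attached to intermediate classifier heads. The toy model below shows that control flow only; it stands in for no particular backbone (the distilled GFNet and edge-device specifics are not reproduced), and the layer sizes, threshold, and class names are arbitrary.

```python
import torch

class EarlyExitClassifier(torch.nn.Module):
    """Toy dynamic early-exit model: each block has its own classifier head, and
    inference stops as soon as the softmax confidence clears a threshold."""
    def __init__(self, dim=64, n_classes=10, n_blocks=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
             for _ in range(n_blocks)])
        self.heads = torch.nn.ModuleList(
            [torch.nn.Linear(dim, n_classes) for _ in range(n_blocks)])

    @torch.no_grad()
    def forward(self, x, threshold=0.8):
        for i, (block, head) in enumerate(zip(self.blocks, self.heads)):
            x = block(x)
            probs = torch.softmax(head(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:              # confident enough: exit early
                return pred.item(), i
        return pred.item(), len(self.blocks) - 1      # fall through to the last exit

model = EarlyExitClassifier()
label, exit_idx = model(torch.randn(1, 64))
print(label, "exited at block", exit_idx)
```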
Paperid: 622,   https://arxiv.org/pdf/2507.22037    
Authors:Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
Abstract:
The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs--jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). During the iterative process, the attacker identifies security vulnerabilities in the defense model and expands jailbreak data. The expanded data are then used to train the defender, enabling it to address identified security vulnerabilities. We also design reward mechanisms used for GRPO to simplify the use of response labels, reducing dependence on complex generative labels and enabling the efficient use of synthetic data. Additionally, a quality monitoring mechanism is used to mitigate the defender's over-refusal of harmless inputs and ensure the diversity of the jailbreak data generated by the attacker. Experimental results on safety-specific and general benchmarks demonstrate that SecTOW significantly improves security while preserving general performance.
Paperid: 623,   https://arxiv.org/pdf/2510.22571    
Authors:Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue
Abstract:
Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
Paperid: 624,   https://arxiv.org/pdf/2505.01263    
Authors:Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang
Abstract:
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning, and achieves better acoustic quality than previous works via the proposed voice-enhanced flow matching. First, we introduce Qwen2.5 as the backbone of the LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects: it introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses an affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks.
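For readers unfamiliar with flow matching, the sketch below shows the generic, rectified-flow-style objective that such mel-spectrogram generators build on: interpolate between noise and data, and regress a conditional vector field onto the straight-line velocity. It is a bare-bones illustration under stated assumptions, not FlowDubber itself; the LLM conditioning, dual contrastive aligning, and style prior are omitted, and `TinyVectorField` and all dimensions are hypothetical.

```python
import torch

def flow_matching_loss(vector_field, x1, cond):
    """Minimal (rectified-flow style) flow matching step: interpolate between noise
    x0 and data x1, and regress the network onto the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.size(0), 1, 1)                # one time per example
    xt = (1 - t) * x0 + t * x1                      # point on the straight path
    v_target = x1 - x0                              # ground-truth velocity
    v_pred = vector_field(xt, t, cond)
    return torch.mean((v_pred - v_target) ** 2)

class TinyVectorField(torch.nn.Module):
    """Hypothetical conditional velocity predictor over mel frames."""
    def __init__(self, mel_dim=80, cond_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(mel_dim + cond_dim + 1, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, mel_dim))

    def forward(self, xt, t, cond):                 # xt: (B, T, mel), cond: (B, T, cond)
        t = t.expand(-1, xt.size(1), 1)             # broadcast time over frames
        return self.net(torch.cat([xt, cond, t], dim=-1))

mels, cond = torch.randn(4, 120, 80), torch.randn(4, 120, 32)
loss = flow_matching_loss(TinyVectorField(), mels, cond)
loss.backward()
print(float(loss))
```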
Paperid: 625,   https://arxiv.org/pdf/2502.09967    
Authors:Zhuming Wang, Yihao Zheng, Jiarui Li, Yaofei Wu, Yan Huang, Zun Li, Lifang Wu, Liang Wang
Abstract:
Existing weakly supervised group activity recognition methods rely on object detectors or attention mechanisms to capture key areas automatically. However, they overlook the semantic information associated with captured areas, which may adversely affect the recognition performance. In this paper, we propose a novel framework named Visual Conceptual Knowledge Guided Action Map (VicKAM) which effectively captures the locations of individual actions and integrates them with action semantics for weakly supervised group activity recognition. It generates individual action prototypes from the training set as visual conceptual knowledge to bridge action semantics and visual representations. Guided by this knowledge, VicKAM produces action maps that indicate the likelihood of each action occurring at various locations, based on the image correlation theorem. It further augments individual action maps using group activity related statistical information, representing individual action distribution under different group activities, to establish connections between action maps and specific group activities. The augmented action map is incorporated with action semantic representations for group activity recognition. Extensive experiments on two public benchmarks, the Volleyball and the NBA datasets, demonstrate the effectiveness of our proposed method, even in cases of limited training data. The code will be released later.
Paperid: 626,   https://arxiv.org/pdf/2504.02060    
Authors:Minh-Quan Ho-Le, Duy-Khang Ho, Van-Tu Ninh, Cathal Gurrin, Minh-Triet Tran
Abstract:
Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at https://bit.ly/lsc-adl-annotations.
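The intra-class clustering step mentioned above is a standard density-based clustering run over frame features. The snippet below is an illustrative use of HDBSCAN on toy embeddings, assuming the `hdbscan` package is installed; the feature source, `min_cluster_size`, and the per-cluster labeling step are placeholders, and the dataset's actual pipeline additionally includes human-in-the-loop verification.

```python
import numpy as np
import hdbscan   # assumes the hdbscan package is installed

# toy frame embeddings for one lifelogger's day (e.g., image features)
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 16))
                    for c in (0.0, 2.0, 5.0)])      # three underlying activities

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(frames)              # -1 marks noise frames

# a human annotator would then assign one ADL label per cluster
for cluster_id in sorted(set(labels) - {-1}):
    idx = np.where(labels == cluster_id)[0]
    print(f"cluster {cluster_id}: {len(idx)} frames, e.g. frame {idx[0]}")
```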
Paperid: 627,   https://arxiv.org/pdf/2504.18576    
Authors:Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Dingkang Liang, Yumeng Zhang, Ji Wan, Jun Wang
Abstract:
This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.
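The trajectory-tokenization idea above (mapping a future trajectory to coarse trend tokens that a language-conditioned generator can ingest) can be illustrated with a small helper. The trend vocabulary, segment length, and angle thresholds below are invented for the example; the paper's actual vocabulary is not specified here.

```python
import math

# hypothetical trend vocabulary; the paper's actual vocabulary is not specified here
TREND_VOCAB = ["<stop>", "<straight>", "<slight_left>", "<left>",
               "<slight_right>", "<right>"]

def trajectory_to_tokens(traj, seg=5):
    """Map a 2D future trajectory (list of (x, y) waypoints in the ego frame,
    x forward, y left) to coarse trend tokens, one per segment of `seg` points."""
    tokens = []
    for i in range(0, len(traj) - seg, seg):
        (x0, y0), (x1, y1) = traj[i], traj[i + seg]
        dist = math.hypot(x1 - x0, y1 - y0)
        heading = math.degrees(math.atan2(y1 - y0, x1 - x0))
        if dist < 0.5:
            tokens.append("<stop>")
        elif heading > 25:
            tokens.append("<left>")
        elif heading > 8:
            tokens.append("<slight_left>")
        elif heading < -25:
            tokens.append("<right>")
        elif heading < -8:
            tokens.append("<slight_right>")
        else:
            tokens.append("<straight>")
    return tokens

# gentle right curve, sampled at roughly 1 m steps
traj = [(i, -0.015 * i ** 2) for i in range(21)]
print(trajectory_to_tokens(traj))
```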
Paperid: 628,   https://arxiv.org/pdf/2508.07596    
Authors:Shahroz Tariq, Simon S. Woo, Priyanka Singh, Irena Irmalasari, Saakshi Gupta, Dev Gupta
Abstract:
The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.
Paperid: 629,   https://arxiv.org/pdf/2504.12680    
Authors:Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu
Abstract:
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
Paperid: 630,   https://arxiv.org/pdf/2508.01680    
Authors:Dong Li, Yichen Niu, Ying Ai, Xiang Zou, Biqing Qi, Jianxing Liu
Abstract:
Large language models (LLMs) have demonstrated strong performance in natural language generation but remain limited in knowledge-intensive tasks due to outdated or incomplete internal knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external retrieval, with GraphRAG further enhancing performance through structured knowledge graphs and multi-hop reasoning. However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. T-GRAG consists of five key components: (1) a Temporal Knowledge Graph Generator that creates time-stamped, evolving graph structures; (2) a Temporal Query Decomposition mechanism that breaks complex temporal queries into manageable sub-queries; (3) a Three-layer Interactive Retriever that progressively filters and refines retrieval across temporal subgraphs; (4) a Source Text Extractor to mitigate noise; and (5) an LLM-based Generator that synthesizes contextually and temporally accurate responses. We also introduce Time-LongQA, a novel benchmark dataset based on real-world corporate annual reports, designed to test temporal reasoning across evolving knowledge. Extensive experiments show that T-GRAG significantly outperforms prior RAG and GraphRAG baselines in both retrieval accuracy and response relevance under temporal constraints, highlighting the necessity of modeling knowledge evolution for robust long-text question answering. Our code is publicly available on the T-GRAG
Paperid: 631,   https://arxiv.org/pdf/2508.01897    
Authors:Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang
Abstract:
Audio deepfake detection (ADD) faces critical generalization challenges due to diverse real-world spoofing attacks and domain variations. However, existing methods primarily rely on Euclidean distances, failing to adequately capture the intrinsic hierarchical structures associated with attack categories and domain factors. To address these issues, we design a novel framework Poin-HierNet to construct domain-invariant hierarchical representations in the Poincaré sphere. Poin-HierNet includes three key components: 1) Poincaré Prototype Learning (PPL) with several data prototypes aligning sample features and capturing multilevel hierarchies beyond human labels; 2) Hierarchical Structure Learning (HSL) leverages top prototypes to establish a tree-like hierarchical structure from data prototypes; and 3) Poincaré Feature Whitening (PFW) enhances domain invariance by applying feature whitening to suppress domain-sensitive features. We evaluate our approach on four datasets: ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-The-Wild. Experimental results demonstrate that Poin-HierNet outperforms state-of-the-art methods in terms of Equal Error Rate.
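Prototype methods in the Poincaré ball, as sketched in the abstract above, rely on the standard hyperbolic geodesic distance rather than Euclidean distance. The snippet below implements that distance and a toy nearest-prototype assignment; it illustrates only the underlying geometry, not the paper's PPL, HSL, or PFW components, and the prototype values are arbitrary.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincare ball: the hyperbolic distance commonly
    used by prototype-based methods operating in this space."""
    uu = (u * u).sum(-1).clamp(max=1 - eps)     # squared norms, kept inside the ball
    vv = (v * v).sum(-1).clamp(max=1 - eps)
    duv = ((u - v) ** 2).sum(-1)
    x = 1 + 2 * duv / ((1 - uu) * (1 - vv))
    return torch.acosh(x)

def nearest_prototype(feat, prototypes):
    """Assign an embedded utterance to its closest Poincare prototype."""
    d = poincare_distance(feat.unsqueeze(0), prototypes)
    return int(d.argmin())

protos = torch.tensor([[0.1, 0.0], [0.0, 0.6], [-0.5, -0.3]])
feat = torch.tensor([0.05, 0.45])
print(nearest_prototype(feat, protos))   # index of the nearest prototype
```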
Paperid: 632,   https://arxiv.org/pdf/2501.12635    
Authors:Dunwei Tu, Huiyu Yi, Yuchi Wang, Baile Xu, Jian Zhao, Furao Shen
Abstract:
Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Experiments show that MQMK enhances the prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. Once this paper is accepted, we will release the code.
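A toy sketch of the multi-query/multi-key matching idea above follows: the test sample yields one task-conditioned query per candidate prompt, each prompt stores several keys summarizing its training-feature distribution at a finer granularity, and the prompt whose keys best match its own query is selected. The shapes, the cosine-similarity scoring, and the max-over-keys selection rule are simplifying assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def select_prompt(sample_queries, prompt_keys):
    """Toy multi-query / multi-key prompt matching.

    sample_queries: (P, d) one task-conditioned query per candidate prompt
    prompt_keys:    (P, K, d) several keys per prompt describing its
                    training-feature distribution at a fine-grained level
    Returns the index of the selected prompt.
    """
    q = F.normalize(sample_queries, dim=-1)          # (P, d)
    k = F.normalize(prompt_keys, dim=-1)             # (P, K, d)
    sim = torch.einsum("pkd,pd->pk", k, q)           # cosine sims per prompt/key
    scores = sim.max(dim=1).values                   # best-matching key per prompt
    return int(scores.argmax())

# 3 candidate prompts, 4 keys each, 16-dim features
torch.manual_seed(0)
queries, keys = torch.randn(3, 16), torch.randn(3, 4, 16)
print("selected prompt:", select_prompt(queries, keys))
```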
Paperid: 633,   https://arxiv.org/pdf/2509.18683    
Authors:Lanhu Wu, Zilin Gao, Hao Fei, Mong-Li Lee, Wynne Hsu
Abstract:
RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, limited by the local receptive fields, or Vision Transformers that suffer from the cost of quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSM), Mamba, have shown great potential for modeling long-range dependency with linear complexity. However, directly applying SSM to RGB-D SOD may lead to deficient local semantics as well as the inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities. 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that the LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method can achieve excellent performance on the RGB-T SOD task, proving a powerful generalization ability.
Paperid: 634,   https://arxiv.org/pdf/2510.16448    
Authors:Yongxiang Hua, Haoyu Cao, Zhou Tao, Bocheng Li, Zihao Wu, Chaohu Liu, Linli Xu
Abstract:
Sparse Mixture of Experts (sMoE) has become a pivotal approach for scaling large vision-language models, offering substantial capacity while maintaining computational efficiency through dynamic, sparse activation of experts. However, existing routing mechanisms, typically based on similarity scoring, struggle to effectively capture the underlying input structure. This limitation leads to a trade-off between expert specialization and balanced computation, hindering both scalability and performance. We propose Input Domain Aware MoE, a novel routing framework that leverages a probabilistic mixture model to better partition the input space. By modeling routing probabilities as a mixture of distributions, our method enables experts to develop clear specialization boundaries while achieving balanced utilization. Unlike conventional approaches, our routing mechanism is trained independently of task-specific objectives, allowing for stable optimization and decisive expert assignments. Empirical results on vision-language tasks demonstrate that our method consistently outperforms existing sMoE approaches, achieving higher task performance and improved expert utilization balance.
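One way to realize "routing probabilities as a mixture of distributions" is to fit a density model over token features and route by mixture responsibilities. The sketch below uses scikit-learn's GaussianMixture purely as a stand-in for whatever mixture model the paper employs, fitted independently of any task loss as the abstract describes; the feature dimensions and expert count are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1000, 16))          # stand-in for input token features

n_experts = 4
router = GaussianMixture(n_components=n_experts, covariance_type="diag", random_state=0)
router.fit(tokens)                            # trained independently of the task objective

# Routing: responsibilities give soft expert assignments; argmax gives sparse (top-1) routing.
resp = router.predict_proba(tokens)           # (N, n_experts)
assignment = resp.argmax(axis=1)
load = np.bincount(assignment, minlength=n_experts) / len(tokens)
print("expert load:", load)                   # expert-utilization balance can be inspected here
```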
Paperid: 635,   https://arxiv.org/pdf/2412.10834    
Authors:Jiaxu Li, Rui Li, Jianyu Qi, Songning Lai, Linpu Lv, Kejia Fan, Jianheng Tang, Yutao Yue, Dongzhan Zhou, Yuanhuai Liu, Huiping Zhuang
Abstract:
2D images and 3D point clouds are foundational data types for multimedia applications, including real-time video analysis, augmented reality (AR), and 3D scene understanding. Class-incremental semantic segmentation (CSS) requires incrementally learning new semantic categories while retaining prior knowledge. Existing methods typically rely on computationally expensive training based on stochastic gradient descent, employing complex regularization or exemplar replay. However, stochastic gradient descent-based approaches inevitably update the model's weights for past knowledge, leading to catastrophic forgetting, a problem exacerbated by pixel/point-level granularity. To address these challenges, we propose CFSSeg, a novel exemplar-free approach that leverages a closed-form solution, offering a practical and theoretically grounded route to continual semantic segmentation. This eliminates the need for iterative gradient-based optimization and storage of past data, requiring only a single pass through new samples per step, which not only enhances computational efficiency but also suits dynamic, privacy-sensitive multimedia environments. Extensive experiments on 2D and 3D benchmark datasets such as Pascal VOC2012, S3DIS, and ScanNet demonstrate CFSSeg's superior performance.
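The closed-form, single-pass flavor of such exemplar-free methods can be illustrated with a recursively updated ridge-regression classifier head, a common analytic-learning recipe; the ClosedFormHead class below is a hypothetical sketch under that assumption, not CFSSeg itself, and the random features stand in for backbone outputs.

```python
import numpy as np

class ClosedFormHead:
    """Incrementally updated ridge-regression head: analytic, gradient-free, replay-free."""
    def __init__(self, feat_dim: int, n_classes: int, reg: float = 1e-3):
        self.G = reg * np.eye(feat_dim)            # accumulated X^T X + reg * I
        self.C = np.zeros((feat_dim, n_classes))   # accumulated X^T Y

    def update(self, feats: np.ndarray, onehot: np.ndarray):
        """One pass over the new step's samples; no past data is revisited."""
        self.G += feats.T @ feats
        self.C += feats.T @ onehot

    @property
    def weights(self) -> np.ndarray:
        """Closed-form solution W = (X^T X + reg*I)^-1 X^T Y over all data seen so far."""
        return np.linalg.solve(self.G, self.C)

rng = np.random.default_rng(0)
head = ClosedFormHead(feat_dim=32, n_classes=5)
for step in range(3):                              # three incremental steps
    x = rng.normal(size=(100, 32))
    y = np.eye(5)[rng.integers(0, 5, size=100)]
    head.update(x, y)
pred = (rng.normal(size=(1, 32)) @ head.weights).argmax()
print("predicted class:", pred)
```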
Paperid: 636,   https://arxiv.org/pdf/2411.01819    
Authors:Bo Gao, Jianhui Wang, Xinyuan Song, Yangfan He, Fangxu Xing, Tianyu Shi
Abstract:
Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotations. However, previous methods have been limited to generating single-instance images, as the generation of multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose Free-Mask, a framework that combines a Diffusion Model for segmentation with advanced image editing capabilities, allowing for the integration of multiple objects into images via text-to-image models. Our method facilitates the creation of highly realistic datasets that closely emulate open-world environments while generating accurate segmentation masks. It reduces the labor associated with manual annotation and ensures precise mask generation. Experimental results demonstrate that synthetic data generated by Free-Mask enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, Free-Mask achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
Paperid: 637,   https://arxiv.org/pdf/2509.10058    
Authors:Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng, Bo-Kai Ruan, Wen-Sheng Lien, Hong-Han Shuai
Abstract:
Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs the LLM to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
Paperid: 638,   https://arxiv.org/pdf/2501.15065    
Authors:Wenju Sun, Qingyong Li, Wen Wang, Yangli-ao Geng, Boyang Li
Abstract:
Multi-task model merging offers an efficient solution for integrating knowledge from multiple fine-tuned models, mitigating the significant computational and storage demands associated with multi-task training. As a key technique in this field, Task Arithmetic (TA) defines task vectors by subtracting the pre-trained model (θ_pre) from the fine-tuned task models in parameter space, then adjusting the weight between these task vectors and θ_pre to balance task-generalized and task-specific knowledge. Despite the promising performance of TA, conflicts can arise among the task vectors, particularly when different tasks require distinct model adaptations. In this paper, we formally define this issue as knowledge conflicts, characterized by the performance degradation of one task after merging with a model fine-tuned for another task. Through in-depth analysis, we show that these conflicts stem primarily from the components of task vectors that align with the gradient of task-specific losses at θ_pre. To address this, we propose Task Arithmetic in Trust Region (TATR), which defines the trust region as dimensions in the model parameter space that cause only small changes (corresponding to the task vector components with gradient orthogonal direction) in the task-specific losses. Restricting parameter merging within this trust region, TATR can effectively alleviate knowledge conflicts. Moreover, TATR serves as both an independent approach and a plug-and-play module compatible with a wide range of TA-based methods. Extensive empirical evaluations on eight distinct datasets robustly demonstrate that TATR improves the multi-task performance of several TA-based model merging methods by an observable margin.
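A toy NumPy sketch of the task-vector arithmetic and a trust-region-style mask follows. The 80th-percentile threshold on the coordinate-wise interaction between one task's vector and the other task's gradient is an arbitrary illustrative choice, and the random vectors stand in for real model parameters and gradients; this is not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
theta_pre = rng.normal(size=d)
theta_a = theta_pre + rng.normal(scale=0.1, size=d)   # fine-tuned on task A
theta_b = theta_pre + rng.normal(scale=0.1, size=d)   # fine-tuned on task B
grad_a_at_pre = rng.normal(size=d)                    # stand-in for task A's gradient at theta_pre

tau_a = theta_a - theta_pre                           # task vectors
tau_b = theta_b - theta_pre

# Trust region (illustrative): keep coordinates of tau_b whose interaction with task A's
# gradient is small, i.e. they barely change task A's loss to first order.
interaction = np.abs(tau_b * grad_a_at_pre)
mask = interaction <= np.quantile(interaction, 0.8)   # keep the 80% least-conflicting dims

lam = 0.5
theta_plain = theta_pre + lam * (tau_a + tau_b)                        # vanilla Task Arithmetic
theta_tatr = theta_pre + lam * (tau_a + np.where(mask, tau_b, 0.0))    # trust-region-restricted merge
print("fraction of tau_b coordinates kept:", mask.mean())
print("parameter difference, TA vs. TATR-style merge:", np.linalg.norm(theta_plain - theta_tatr))
```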
Paperid: 639,   https://arxiv.org/pdf/2404.10838    
Authors:Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue
Abstract:
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a dynamic self-adaptive multiscale distillation framework that transfers knowledge from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss weight adjustment and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments demonstrate that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model uses only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
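A minimal sketch of a "no manual weights" multi-loss balancer is given below: each distillation loss is divided by an exponential moving average of its own magnitude so that all terms stay on a comparable scale. This normalization rule is only an assumed stand-in for the paper's balancer, and the loss names and values are placeholders.

```python
class DynamicLossBalancer:
    """Scales each loss by the inverse of an EMA of its magnitude (illustrative rule only)."""
    def __init__(self, names, momentum: float = 0.9, eps: float = 1e-8):
        self.ema = {n: None for n in names}
        self.momentum, self.eps = momentum, eps

    def combine(self, losses: dict):
        total = 0
        for name, value in losses.items():
            stat = float(value)  # scalar statistic of the current loss (detached if a tensor)
            prev = self.ema[name]
            self.ema[name] = stat if prev is None else self.momentum * prev + (1 - self.momentum) * stat
            total = total + value / (self.ema[name] + self.eps)  # each term pulled toward unit scale
        return total

balancer = DynamicLossBalancer(["feature_kd", "logit_kd", "image_level"])
# In a training loop these would be loss tensors; plain floats stand in here.
print(balancer.combine({"feature_kd": 2.3, "logit_kd": 0.4, "image_level": 10.0}))
```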
Paperid: 640,   https://arxiv.org/pdf/2408.01701    
Authors:Naichuan Zheng, Yuchen Du, Hailun Xia, Zeyu Liang
Abstract:
For multimodal skeleton-based action recognition, Graph Convolutional Networks (GCNs) are effective models; still, their reliance on floating-point computations leads to high energy consumption, limiting their applicability in battery-powered devices. While energy-efficient, Spiking Neural Networks (SNNs) struggle to model skeleton dynamics, leading to suboptimal solutions. We propose Signal-SGN (Spiking Graph Convolutional Network), which utilizes the temporal dimension of skeleton sequences as the spike time steps and represents features as multi-dimensional discrete stochastic signals for temporal-frequency domain feature extraction. It combines a 1D Spiking Graph Convolution (1D-SGC) module and a Frequency Spiking Convolution (FSC) module to extract features from skeletons represented in spiking form. Additionally, a Multi-Scale Wavelet Transform Feature Fusion (MWTF) module is proposed to extract dynamic spiking features and capture frequency-specific characteristics, enhancing classification performance. Experiments across three large-scale datasets show that Signal-SGN exceeds state-of-the-art SNN-based methods in accuracy and computational efficiency while attaining performance comparable to GCN-based methods and significantly reducing theoretical energy consumption.
Paperid: 641,   https://arxiv.org/pdf/2504.10854    
Authors:Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani
Abstract:
Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high-level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine-grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
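The clustering stage of such token pruning can be approximated by k-means over image-token features, keeping only the token closest to each centroid for the coarse reasoning pass. The sketch below (cluster_and_keep, 576 synthetic tokens pruned to 32) uses illustrative numbers and scikit-learn's KMeans; it is not the LVLM_CSP implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_keep(tokens: np.ndarray, n_keep: int) -> np.ndarray:
    """Cluster image-token features and keep the token closest to each centroid.
    Returns indices of the retained tokens (a coarse summary of the image)."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(tokens)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(tokens[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[dists.argmin()])
    return np.array(sorted(kept))

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(576, 64))          # e.g. a 24x24 grid of visual token features
print(cluster_and_keep(image_tokens, n_keep=32))   # ~94% of tokens dropped for the coarse pass
```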
Paperid: 642,   https://arxiv.org/pdf/2507.03903    
Authors:Hanzhe Liang, Jie Zhang, Tao Dai, Linlin Shen, Jinbao Wang, Can Gao
Abstract:
Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.
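Independently of the DUS-Net architecture itself, reconstruction-based 3D anomaly detection ultimately scores each point by its distance to the reconstructed, anomaly-free cloud. The snippet below sketches only that scoring step with a SciPy k-d tree on synthetic data; the max-over-points object-level score is one common convention, not necessarily the paper's.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_anomaly_scores(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Per-point anomaly score = distance from each test point to its nearest neighbor
    in the (assumed anomaly-free) reconstruction; large distances flag anomalies."""
    tree = cKDTree(reconstructed)
    dists, _ = tree.query(original, k=1)
    return dists

rng = np.random.default_rng(0)
recon = rng.normal(size=(2048, 3))                     # stand-in for a reconstructed point cloud
test = np.vstack([recon[:2000] + 0.01 * rng.normal(size=(2000, 3)),
                  rng.normal(loc=5.0, size=(48, 3))])  # 48 displaced "anomalous" points
scores = point_anomaly_scores(test, recon)
object_score = scores.max()                            # object-level score, e.g. max over points
print(f"object-level score: {object_score:.2f}; the largest point scores flag the displaced region")
```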
Paperid: 643,   https://arxiv.org/pdf/2508.01282    
Authors:Jiawei Li, Linjie Qiu, Zhiqing Wu, Qiongyan Chen, Ziyan Wang, Mingming Fan
Abstract:
Older adults tend to encounter challenges when learning to use new smartphone apps due to age-related cognitive and physical changes. Compared to traditional support methods such as video tutorials, trial-and-error allows older adults to learn to use smartphone apps by making and correcting mistakes. However, it remains unknown how trial-and-error should be designed to empower older adults to use smartphone apps and how well it would work for older adults. Informed by the guidelines derived from prior work, we designed and implemented ExplorAR, an AR-based trial-and-error system that offers real-time and situated visual guidance in the augmented space around the smartphone to empower older adults to explore and correct mistakes independently. We conducted a user study with 18 older adults to compare ExplorAR with traditional video tutorials and a simplified version of ExplorAR. Results show that the AR-supported trial-and-error method enhanced older adults' learning experience by fostering deeper cognitive engagement and improving confidence in exploring unknown operations.
Paperid: 644,   https://arxiv.org/pdf/2506.18679    
Authors:Ruicheng Zhang, Yu Sun, Zeyu Zhang, Jinai Li, Xiaofan Liu, Au Hoi Fan, Haowei Guo, Puxin Yan
Abstract:
We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods, which may lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with an Entropy Regularization Adjustment Mechanism (ERAM) that dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential for accurate and robust clinical application.
Paperid: 645,   https://arxiv.org/pdf/2508.17924    
Authors:Konstantin Egorov, Stepan Botman, Pavel Blinov, Galina Zubkova, Anton Ivaschenko, Alexander Kolsanov, Andrey Savchenko
Abstract:
Progress in remote PhotoPlethysmoGraphy (rPPG) is limited by the critical issues of existing publicly available datasets: small size, privacy concerns with facial videos, and lack of diversity in conditions. The paper introduces a novel comprehensive large-scale multi-view video dataset for rPPG and health biomarkers estimation. Our dataset comprises 3600 synchronized video recordings from 600 subjects, captured under varied conditions (resting and post-exercise) using multiple consumer-grade cameras at different angles. To enable multimodal analysis of physiological states, each recording is paired with a 100 Hz PPG signal and extended health metrics, such as electrocardiogram, arterial blood pressure, biomarkers, temperature, oxygen saturation, respiratory rate, and stress level. Using this data, we train an efficient rPPG model and compare its quality with existing approaches in cross-dataset scenarios. The public release of our dataset and model should significantly speed up the progress in the development of AI medical assistants.
Paperid: 646,   https://arxiv.org/pdf/2506.01668    
Authors:Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
Abstract:
Stickers, though small, are a highly condensed form of visual expression, ubiquitous across messaging platforms and embraced by diverse cultures, genders, and age groups. Despite their popularity, sticker retrieval remains an underexplored task due to the significant human effort and subjectivity involved in constructing high-quality sticker query datasets. Although large language models (LLMs) excel at general NLP tasks, they falter when confronted with the nuanced, intangible, and highly specific nature of sticker query generation. To address this challenge, we propose a threefold solution. First, we introduce Sticktionary, a gamified annotation framework designed to gather diverse, high-quality, and contextually resonant sticker queries. Second, we present StickerQueries, a multilingual sticker query dataset containing 1,115 English and 615 Chinese queries, annotated by over 60 contributors across 60+ hours. Lastly, through extensive quantitative and qualitative evaluation, we demonstrate that our approach significantly enhances query generation quality, retrieval accuracy, and semantic understanding in the sticker domain. To support future research, we publicly release our multilingual dataset along with two fine-tuned query generation models.
Paperid: 647,   https://arxiv.org/pdf/2508.03397    
Authors:Xinzhu Li, Juepeng Zheng, Yikun Chen, Xudong Mao, Guanghui Yue, Wei Zhou, Chenlei Lv, Ruomei Wang, Fan Zhou, Baoquan Zhao
Abstract:
Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues to handle viewpoint variations and to capture the finer, meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains impressive mean rank-1 accuracy on challenging datasets.
Paperid: 648,   https://arxiv.org/pdf/2507.18489    
Authors:Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Edith C. H. Ngai
Abstract:
The efficiency and scalability of graph convolution networks (GCNs) in training recommender systems remain critical challenges, hindering their practical deployment in real-world scenarios. In the multimodal recommendation (MMRec) field, training GCNs incurs substantial time and space costs and exacerbates the gap between different modalities, resulting in sub-optimal recommendation accuracy. This paper critically points out the inherent challenges associated with adopting GCNs during the training phase in MMRec, revealing that GCNs inevitably create unhelpful and even harmful pairs during model optimization and isolate different modalities. To this end, we propose FastMMRec, a highly efficient multimodal recommendation framework that deploys graph convolutions exclusively during the testing phase, bypassing their use in training. We demonstrate that adopting GCNs solely in the testing phase significantly improves the model's efficiency and scalability while alleviating the modality isolation problem often caused by using GCNs during the training phase. We conduct extensive experiments on three public datasets, consistently demonstrating the performance superiority of FastMMRec over competitive baselines while achieving efficiency and scalability.
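The "graph convolution only at test time" idea can be sketched as follows: embeddings are trained without any propagation, and at inference they are smoothed over the normalized user-item graph before scoring. LightGCN-style layer averaging is used here as an assumed stand-in for FastMMRec's exact propagation, and the tiny interaction matrix is synthetic.

```python
import numpy as np

def normalize_adj(A: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^-1/2 A D^-1/2 of a (users+items) adjacency matrix."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def test_time_propagate(emb: np.ndarray, A_norm: np.ndarray, layers: int = 2) -> np.ndarray:
    """LightGCN-style propagation applied only at inference, averaged over layers."""
    out, h = emb.copy(), emb
    for _ in range(layers):
        h = A_norm @ h
        out += h
    return out / (layers + 1)

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 7, 8
emb = rng.normal(size=(n_users + n_items, d))        # embeddings trained WITHOUT graph convolution
R = (rng.random((n_users, n_items)) < 0.3).astype(float)
A = np.block([[np.zeros((n_users, n_users)), R],
              [R.T, np.zeros((n_items, n_items))]])
final = test_time_propagate(emb, normalize_adj(A))
scores = final[:n_users] @ final[n_users:].T         # user-item scores for recommendation
print(scores.shape)
```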
Paperid: 649,   https://arxiv.org/pdf/2505.19569    
Authors:Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
Abstract:
Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2, 28.2, 15.4, 59.2, 18.7, and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.
Paperid: 650,   https://arxiv.org/pdf/2507.23382    
Authors:Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che
Abstract:
Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
Paperid: 651,   https://arxiv.org/pdf/2508.09057    
Authors:Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
Abstract:
Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five task types: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
Paperid: 652,   https://arxiv.org/pdf/2505.19139    
Authors:Feiran Liu, Yuzhe Zhang, Xinyi Huang, Yinan Peng, Xinfeng Li, Lixu Wang, Yutong Shen, Ranjie Duan, Simeng Qin, Xiaojun Jia, Qingsong Wen, Wei Dong
Abstract:
Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.
Paperid: 653,   https://arxiv.org/pdf/2507.16257    
Authors:Futa Waseda, Saku Sugawara, Isao Echizen
Abstract:
Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. This enables the visual encoder to robustly recognize a broader range of image features even under adversarial noise, thereby enhancing robustness across diverse downstream tasks. QT-AFT overcomes the key weaknesses of prior methods -- overfitting in supervised AT and lack of semantic awareness in unsupervised AT -- achieving state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets. Furthermore, our comprehensive study uncovers several key insights into the role of language in enhancing vision robustness; for example, describing object properties in addition to object names further enhances zero-shot robustness. Our findings point to an urgent direction for future work -- centering high-quality linguistic supervision in robust visual representation learning.
Paperid: 654,   https://arxiv.org/pdf/2505.19498    
Authors:Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang
Abstract:
Large Vision-Language Models (LVLMs) often generate texts that are contextually coherent but do not match the visual input. Such a hallucination issue hinders LVLMs' applicability in the real world. The key to solving hallucination in LVLMs is to make text generation rely more on the visual content. Most previous works choose to enhance/adjust the features/output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLMs, which does not explicitly or systematically enhance visual reliance. In this paper, we comprehensively investigate the factors which may degenerate the visual reliance in text generation of LVLMs from a Bayesian perspective. Based on our observations, we propose to mitigate hallucination in LVLMs from three aspects. Firstly, we observe that not all visual tokens are informative in generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, an LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution which does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks including POPE, CHAIR, and MME demonstrate that our method can consistently mitigate the hallucination issue of LVLMs and performs favorably against previous state-of-the-art methods.
Paperid: 655,   https://arxiv.org/pdf/2405.20090    
Authors:Hao Cheng, Erjia Xiao, Jiayan Yang, Jinhao Duan, Yichi Wang, Jiahang Cao, Qiang Zhang, Le Yang, Kaidi Xu, Jindong Gu, Renjing Xu
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate exceptional performance in cross-modality interaction, yet they also suffer from adversarial vulnerabilities. In particular, the transferability of adversarial examples remains an ongoing challenge. In this paper, we specifically analyze the manifestation of adversarial transferability among MLLMs and identify the key factors that influence this characteristic. We discover that the transferability of MLLMs exists in cross-LLM scenarios with the same vision encoder and indicate two key factors that may influence transferability. We provide two semantic-level data augmentation methods, Adding Image Patch (AIP) and Typography Augment Transferability Method (TATM), which boost the transferability of adversarial examples across MLLMs. To explore the potential impact in the real world, we utilize two tasks that can have both negative and positive societal impacts: (1) Harmful Content Insertion and (2) Information Protection.
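Typography-style augmentation of the TATM kind amounts to overlaying text onto the input image. A minimal PIL sketch is shown below, with arbitrary text, position, and color rather than the paper's settings; it only illustrates the data-augmentation primitive, not the full attack pipeline.

```python
from PIL import Image, ImageDraw

def typography_augment(img: Image.Image, text: str, xy=(10, 10), fill=(255, 0, 0)) -> Image.Image:
    """Overlay a text string on the image (typography-style augmentation); position,
    color, and the default bitmap font are arbitrary illustrative choices."""
    out = img.copy()
    ImageDraw.Draw(out).text(xy, text, fill=fill)
    return out

base = Image.new("RGB", (224, 224), color=(200, 200, 200))   # placeholder input image
aug = typography_augment(base, "a photo of a dog")
aug.save("typography_augmented.png")
```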
Paperid: 656,   https://arxiv.org/pdf/2504.09354    
Authors:Duy-Cat Can, Quang-Huy Tang, Huong Ha, Binh T. Nguyen, Oliver Y. Chén
Abstract:
Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.
Paperid: 657,   https://arxiv.org/pdf/2409.02834    
Authors:Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He
Abstract:
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.
Paperid: 658,   https://arxiv.org/pdf/2508.07730    
Authors:Mingyang Su, Chao Liu, Jingling Zhang, WU Shuang, Mingming Fan
Abstract:
Offering diverse perspectives on a museum artifact can deepen visitors' understanding and help avoid the cognitive limitations of a single narrative, ultimately enhancing their overall experience. Physical museums promote diversity through visitor interactions. However, it remains a challenge to present multiple voices appropriately while attracting and sustaining a visitor's attention in the virtual museum. Inspired by recent studies that show the effectiveness of LLM-powered multi-agents in presenting different opinions about an event, we propose SimViews, an interactive multi-agent system that simulates visitor-to-visitor conversational patterns to promote the presentation of diverse perspectives. The system employs LLM-powered multi-agents that simulate virtual visitors with different professional identities, providing diverse interpretations of artifacts. Additionally, we constructed 4 conversational patterns between users and agents to simulate visitor interactions. We conducted a within-subject study with 20 participants, comparing SimViews to a traditional single-agent condition. Our results show that SimViews effectively facilitates the presentation of diverse perspectives through conversations, enhancing participants' understanding of viewpoints and engagement within the virtual museum.
Paperid: 659,   https://arxiv.org/pdf/2503.06260    
Authors:Muzhi Dai, Jiashuo Sun, Zhiyuan Zhao, Shixuan Liu, Rui Li, Junyu Gao, Xuelong Li
Abstract:
Aligning large vision-language models (LVLMs) with human preferences is challenging due to the scarcity of fine-grained, high-quality, and multimodal preference data without human annotations. Existing methods relying on direct distillation often struggle with low-confidence data, leading to suboptimal performance. To address this, we propose CAREVL, a novel method for preference reward modeling by reliably using both high- and low-confidence data. First, a cluster of auxiliary expert models (textual reward models) innovatively leverages image captions as weak supervision signals to filter high-confidence data. The high-confidence data are then used to fine-tune the LVLM. Second, low-confidence data are used to generate diverse preference samples using the fine-tuned LVLM. These samples are then scored and selected to construct reliable chosen-rejected pairs for further training. CAREVL achieves performance improvements over traditional distillation-based methods on the VL-RewardBench and MLLM-as-a-Judge benchmarks, demonstrating its effectiveness. The code will be released soon.
Paperid: 660,   
Authors: Bocheng Pan, Hailong Shi, Xingyu Gao
Title: DR-VQA: Decompose-then-Reconstruct for Visual Question Answering in BLV Assistance
Paperid: 661,   
Authors: Qian Sun, Chengzhuo Lu, Wenyu Chen, Wenjie Wei, Jingya Wang, Jieyuan Zhang, Xiaoli Liu, Yalan Ye, Yang Yang, Malu Zhang
Title: Temporal-coded Spiking Transformer
Paperid: 662,   
Authors: Yuhang Lan, Shilin Xu, Chao Su, Run Ye, Dezhong Peng, Yuan Sun
Title: Multi-view Hashing Classification
Paperid: 663,   
Authors: Junwei Zhu, Wei Li, honghui xu, jiawei jiang, Zhi Liu, Jianwei Zheng
Title: Arbitrary-scale Fusion Neural Operator
Paperid: 664,   
Authors: Jiawei Zhang, Xiaoli Jiang, Hao Wang, Lin Yuan, Xiangyang Luo, Bin Ma, Jinwei Wang
Title: DVW: Diffusion Visible Watermark
Paperid: 665,   
Authors: Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li
Title: UniTalker: Conversational Speech-Visual Synthesis
Paperid: 666,   
Authors: Cheng Peng, Oya Celiktutan
Title: Multi-Task Gaze Communication Understanding
Paperid: 667,   
Authors: Yingbing Liu, Fei Ma, Yanan Wu, Xinxin Zuo, Fan Zhang, Yang Wang
Title: Collaborative Cloud-edge Generalized Category Discovery
Paperid: 668,   
Authors: Na Zhao, Kejiang Chen, Yuang Qi, Kai Zeng, Weiming Zhang, Nenghai Yu
Title: Merging-Resistant Watermarking for LoRA Modules
Paperid: 669,   
Authors: Quanmin Liang, Jinyi Lu, Qiang Li, Shuai Liu, Zhihao Zhao, Yinzheng Zhao, Wei Zhang, Kai Huang, Yonghong Tian
Title: ESOD: Event-Based Small Object Detection
Paperid: 670,   
Authors: Zhihong Zheng, Yang Cao, Junlong Gao, Hanzi Wang
Title: OV-VOD: Open-Vocabulary Video Object Detection
Paperid: 671,   
Authors: Ziyu Wang, Yiming Du, Rui Ning, Lusi Li
Title: Energy-based Deep Incomplete Multi-View Clustering
Paperid: 672,   
Authors: Kuan Liu, Ke Wang, Ji Zhang, Gang Zhou
Title: LLM-Grounded Diffusion for Cross-Domain Recommendation
Paperid: 673,   
Authors: Ruocheng Gu, Sen Jia, Yule Ma, Jinqin Zhong, Jenq-Neng Hwang, Lei Li
Title: MoCount: Motion-Based Repetitive Action Counting
Paperid: 674,   
Authors: Xin Zhang, Weiying Xie, Yunsong Li, Xiaoyu Chen, Tianlin Hui, Jitao Ma, Leyuan Fang
Title: TF-ATM: Training-Free Adaptive Token Merging
Paperid: 675,   
Authors: Zebing Yao, Hao Fu, Yuanhang Yang, Guanghua Gu
Title: Dynamic Optimization Noisy Cross-Modal Hashing
Paperid: 676,   
Authors: LiyuanCao LiyuanCao, ZiHang Guo, Huaiwen Zhang
Title: Event Consistency-aware Robust Fake News Detection
Paperid: 677,   
Authors: Wending Xiong, Ruimin Hu, Lingfei Ren, Xixi Li, Dengshi Li
Title: SE2E: Recognizing Emotion Behind Societal Behavior
Paperid: 678,   
Authors: Bingqian Zhou, Zhihao Wu, Yushi Cheng, Wenyuan Xu
Title: AdvPainting: Clean-text Jailbreaking Against Inpainting Models
Paperid: 679,   
Authors: Shubo Liu, Hongsheng Zhang, Qian Qiao, Qi Wu, PENG WANG
Title: VLN-ChEnv: Vision-language Navigation in Changeable Environments
Paperid: 680,   
Authors: Kai Niu, Liucun Shi, Ke Han, Qinzi Zhao, Yue Wu, Yanning Zhang
Title: Test-Time Adaptation for Text-Based Person Search
Paperid: 681,   
Authors: Xiaofeng Liu, Guanchen Meng, Chongyang Feng, Risheng Liu, Zhongxuan Luo, Xin Fan
Title: TNT-GS: Truncated and Tailored Gaussian Splatting
Paperid: 682,   
Authors: Chenbo Zhang, Bing Huangfu, Hongxu Ma, Jihong Guan, Shuigeng Zhou
Title: Multi-modal Prototype Guided Few-shot Object Detection
Paperid: 683,   
Authors: Zikai Zhang, Xu Zhang, ziyi li, Yidong Li, Yuanzhouhan Cao
Title: GMML: Gradient-Modulated Robustness for Imbalance-Aware Multimodal Learning
Paperid: 684,   
Authors: Changzhou Li, Xinyu Yang, Weiguo Yang, Xinyi Li
Title: VaF-LangSplat: Voxel-Aware Fusion Language Gaussian Splatting
Paperid: 685,   
Authors: Guoyi Li, Die Hu, Haozhe Li, Qirui Tang, Xiaomeng Fu, Yulei Wu, Xiaodan Zhang, Honglei Lyu
Title: Zero-Shot Multimodal Fact-Checking with Conceptual Reasoning
Paperid: 686,   
Authors: Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu, Yuming Li, Chenguang Ma
Title: Noise-Optimized Distribution Distillation for Dataset Condensation
Paperid: 687,   
Authors: Jilong Wei, Yangyang Hu, Xiangjuan Wu, Yiqiang Wu, Hao Liu
Title: Appearance Contrasts for Unconstrained Age Estimation
Paperid: 688,   
Authors: Jiajun Han, Xuran Yang, Hui Zhang
Title: Query-Focused Multimodal Summarization with Gate-Guided Mixture-of-Experts
Paperid: 689,   
Authors: Wentao Fan, Chao Zhang, Chunlin Chen, Huaxiong Li
Title: Online Cross-Modal Hashing with Multi-Level Memory
Paperid: 690,   
Authors: Jiajun Zhang, Xin Li, Si Wu, Yong Xu, Yaowei Wang
Title: Prior-Free Augmentation for Cloth-Changing Person Re-Identification
Paperid: 691,   
Authors: Kamakshya Nayak, Kamalakar Thakare, Ashesh Xalxo, Lalit Lohani, Debi Prosad Dogra
Title: Can Person-Level Attributes Improve Group Re-Identification?
Paperid: 692,   
Authors: Zelei Wu, Xulun Ye, Jieyu Zhao
Title: Clustering-Based Tail-class Mitigation for New-class Discovery
Paperid: 693,   
Authors: Ze Huang, Zhongyang Xiao, Mingliang Song, Yu Fang, Hongyuan Yuan, Kevin Sun, Li Zhang
Title: MS-Road: Towards Spatiotemporal-Consistent Large-Scale Road Reconstruction
Paperid: 694,   
Authors: Adhi Widagdo, Teemu Kämäräinen, Ahmad Alhilal, Matti Siekkinen, Cheng-Hsin Hsu
Title: Gaze-Adaptive Foveation for Remote Rendered VR
Paperid: 695,   
Authors: Yao Zhang, Ping Huang, Rui Zhang
Title: Multimodal Dual Population Evolutionary Reinforcement Learning
Paperid: 696,   
Authors: Yanming Chen, Zixin Ma, Chuanguang Yang, Zhulin An, Yiwen Zhang
Title: Accelerating Diffusion Models via Parallel Denoising
Paperid: 697,   
Authors: Guimin Hu, Yi Xin, Lijie Hu, Zhihong Zhu, Hasti Seifi
Title: PgM: Partitioner Guided Modal Learning Framework
Paperid: 698,   
Authors: Lingling Dai, Andong Li, Zhe Han, Chengshi Zheng, Xiaodong Li
Title: BAPEN: Towards Versatile Audio Phase Retrieval
Paperid: 699,   
Authors: Siqi Song, Limin Yu, Jimin XIAO
Title: SDP: Spectral-Decomposed Prompting for Continual Learning
Paperid: 700,   
Authors: Hridayesh Lekhak, Theron Wang, Tuan Dang, Kenny Zhu
Title: DogSpeak: A Canine Vocalization Classification Dataset
Paperid: 701,   
Authors: Nickolay Safonov, Rakhmanov Mikhail, Dmitriy Vatolin
Title: Screen content video dataset and benchmark
Paperid: 702,   
Authors: Yiang Zhu, Haoyue Wang, Zhenxing Qian, Sheng Li, Xinpeng Zhang, Jian liu
Title: Towards Generalized Physical Occlusion Detection On Documents
Paperid: 703,   
Authors: Chenxu Wang, Dong Zhou, Ting Liu, Jianghao Lin, Yongmei Zhou, Aimin Yang
Title: DiffTMR: Diffusion-based Hierarchical Alignment for Text-Molecule Retrieval
Paperid: 704,   
Authors: Jiahao Wang, Fang Liu, Licheng Jiao, Hao Wang, Shuo Li, Lingling Li, Puhua Chen, Xu Liu, Xinyi Wang
Title: FA³T: Feature-Aware Adversarial Attacks for Multi-modal Tracking
Paperid: 705,   
Authors: Zhiyuan Fan, Keyi Liang
Title: Video-to-Image Affordance Grounding via Visual Conceptual Learning
Paperid: 706,   
Authors: Wei Li, Junwei Zhu, honghui xu, jiawei jiang, Jianwei Zheng
Title: SpecSolver: Solving Spatial-Spectral Fusion via Semantic Transformer
Paperid: 707,   
Authors: Yifan Wang, Yuntai Ding, Yiyang Gu, Ziyue Qiao, Chong Chen, Xian-Sheng Hua, Ming Zhang, Wei Ju
Title: Deep Graph Clustering with Disentangled Representation Learning
Paperid: 708,   
Authors: Binrui Wu, Haochen Sui, Jiaye Lin, Jiechao Gao, Ting Xu, Keyan Jin, Xuesong Zhang
Title: Prototype-Guided Representation Projection for Multi-Domain Multi-Task Recommendation
Paperid: 709,   
Authors: Zihao Zhang, Xingjiao Wu, Junjie Xu, Tianlong Ma, Tangren Yao, Wen Wu, Liang He
Title: Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation
Paperid: 710,   
Authors: Shuo Wang, Zhichuan Wang, Yanmin Chen, Mengyao Zhou, Jun Luo
Title: DRMix: Decomposition-Recomposition Data Augmentation with Diffusion Model
Paperid: 711,   
Authors: Tianyi Zhang, Qinglong Lin, Yang Hu, Pengming Feng, Rubo Zhang
Title: Edge-aware Affinity Enhancement for Image Manipulation Localization
Paperid: 712,   
Authors: Yize Song, Yunqing Chen, Zhou Wang, Cheng Chen, Ruoxiu Xiao
Title: Symmetrical Awareness Generation for Pelvic Image Segmentation
Paperid: 713,   
Authors: Haoyu Shi, Huaiwen Zhang
Title: Sequence-Event Semantic Consistent Learning for Text-to-Motion Retrieval
Paperid: 714,   
Authors: Peiqi Jiang, Bohan Lei, Yuhao Sun, Lingyun Yu, Zhineng Chen, Hongtao Xie, Yongdong Zhang
Title: Proactive Deepfake Detection via Self-Verifiable Semantic Watermarking
Paperid: 715,   
Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Mike Zheng Shou
Title: GUI-Narrator: Detecting and Captioning Computer GUI Actions
Paperid: 716,   
Authors: Wang Runjie, Kemi Chen, Shuijie Li, Mingkai chen, Tiesong Zhao
Title: Efficient Semantic Codec for Real-time Vibrotactile Transmission
Paperid: 717,   
Authors: Renjie Lin, Jiacheng Li, Shide Du, Shiping Wang, Le Zhang
Title: OIMGC-Net: Optimization-inspired Interpretable Multi-view Graph Clustering Network
Paperid: 718,   
Authors: Zhan Yang, Binghong Chen, Jiajun Tang, Yinan Li
Title: Unsupervised Similarity-Fusion Transformer Hashing for Multimodal Retrieval
Paperid: 719,   
Authors: Nhu-Thuat Tran, Hady Lauw
Title: Parameter-Efficient Variational AutoEncoder for Multimodal Multi-Interest Recommendation
Paperid: 720,   
Authors: Jun Yang, MAOYU MAO
Title: DiffuSeg: Diffusion-Enhanced Cross-Modal Semantic Segmentation for RGB-D
Paperid: 721,   
Authors: Binbin Zheng, Aiqiu Wu, Kai Fan, Ao Li, Minghui Wang
Title: Domain-Specific Interactive Prompting for Generalized Nuclei Classification
Paperid: 722,   
Authors: Weicheng Xie, Chunlin Yan, Siyang Song, Zitong YU, Linlin Shen, Laizhong Cui
Title: Smooth Online Multiple Appropriate Facial Reaction Generation
Paperid: 723,   
Authors: Xinyao Li, Dan Zhang, Zhekai Du, Lei Zhu, Zhi Chen, Jingjing Li
Title: PatAug: Augmentation of Augmentation for Test-Time Adaptation
Paperid: 724,   
Authors: Fan Qi, Zhan Wang, Changsheng Xu, Huaiwen Zhang
Title: Fine-tuning Bias Neurons for Fair Text-to-Image Generation
Paperid: 725,   
Authors: Zixuan Wan, Jiqing Zhang, Yushan Wang, Hu Lin, Yafei Wang, Zetian Mi, Xin Yang, Xianping Fu, Huibing Wang
Title: Eye-based Emotion Recognition via Event-Driven Sparse Transformers
Paperid: 726,   
Authors: Shuyong Gao, Qianyu Guo, Yuang Feng, Chunyuan Chen, Xujun Wei, Yan Wang, Wenqiang Zhang
Title: Progressive Representation Learning for Weakly-Supervised Camouflaged Detection
Paperid: 727,   
Authors: Xu Shaowu, Xibin Jia, Junyu Gao, Qianmei Sun, Jing Chang, Chao Fan
Title: Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
Paperid: 728,   
Authors: Yichen Bao, Yuxuan Liu, Yu Duan, Jing Li, Quanxue Gao
Title: Multi-view Clustering Based on Probabilistic Tensor Regression
Paperid: 729,   
Authors: Zhijie Rao, Jingcai Guo
Title: Balancing Cross-Modal Attention for Generalized Zero-Shot Learning
Paperid: 730,   
Authors: Sidun Liu, Wenyu Li, Peng Qiao, Yong Dou
Title: Regist3R: Incremental Registration with Stereo Foundation Model
Paperid: 731,   
Authors: Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
Title: K-Space Bispectrum Steganography for Robust Unlearnable Data
Paperid: 732,   
Authors: Kai Li, Wenqi Ren, Wei Wang, Linchao Zhang, Xiaochun Cao
Title: Detecting Synthetic Image by Cross-Modal Commonality Interaction
Paperid: 733,   
Authors: Wei Chen, Jianwei Niu, Xuefeng Liu, Xinghao Wu
Title: Decoupling Dense Video Captioning via Task-specific Prompts
Paperid: 734,   
Authors: Xuedong He, Huiying Xu, Xinzhong Zhu, Hongbo Li
Title: High-Performance Discriminative Tracking with Spatio-Temporal Template Fusion
Paperid: 735,   
Authors: Le Han, Kaixuan Chen, Minchen Ye, Nenggan Zheng
Title: Hi-Motion: Hierarchical Intention Guided Conditional Motion Synthesis
Paperid: 736,   
Authors: Fan Li, Jiazhen Huang, Shisong Tang, Bing Han, Huafeng Cao, Haochen Sui, Ting Xu, Kangxiaoyu Kangxiaoyu
Title: Contrastive Prototype Framework for Calibrating Video Recommendation
Paperid: 737,   
Authors: Yalan Qin, Nan Pu, Hanzhou Wu, Zhaoxin Fan
Title: Flexible Multi-view Clustering with Dynamic Views Generation
Paperid: 738,   
Authors: Haolun Li, Weihuang Liu, JiaTeng Liu, Zhenhua Tang, Chi-Man Pun, Qiguang Miao, Feng Xu, Hao Gao
Title: MotionRefineNet: Fine-Grained Pose Sequence Smoothing and Refinement
Paperid: 739,   
Authors: Taichun Zhou, Zhibin Dong, Siwei Wang, KE LIANG, Miaomiao Li, Xinwang Liu, En Zhu, Xiangjun Dong
Title: DPFMVC: Dynamic Progressive Fusion for Multi-view Clustering
Paperid: 740,   
Authors: Jiaxin Peng, Siwang Zhou, Chengqing Li, Yucheng Li, Dunyun Chen
Title: Mitigating Delivery Artifacts in Real-World Video Super-Resolution
Paperid: 741,   
Authors: Pengsheng Liu, Zhaojie Chu, Xiaofen Xing, Xiangmin Xu
Title: SemGesture: Synthesizing Semantically Enhanced and Coherent Gestures
Paperid: 742,   
Authors: Zheyun Qin, Deng Yu, Yang Shi, Qiangchang Wang, Zhumin Chen
Title: Video Instance Segmentation by Weighted Structure Inference
Paperid: 743,   
Authors: Vitalii Emelianov, Niki Martinel
Title: Neural Additive Adapters for Interpretable Nutrition Prediction
Paperid: 744,   
Authors: Qiyuan Zhu, Lujun Li, Dezhi Li, Jiacheng Liu, Pengyu Cheng, Yucheng Xu, Sirui Han, Yike Guo
Title: Outlier-Aware Model Merging for Efficient Multitask Inference
Paperid: 745,   
Authors: wangjiawen wangjiawen, Jianjun Li, Zhiyuan Ma, Bairuixia Bairuixia
Title: SAKR-Edit: Scene-Aware Knowledge Reasoning for Text-to-Image Editing
Paperid: 746,   
Authors: Da Zhang, Feiyu Wang, Bingyu Li, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Title: KAID: Knowledge-Aware Interactive Distillation for Vision-Language Models
Paperid: 747,   
Authors: Gang Pan, Meihua Liu, Lei Zhou, Jiahao Wang, Di Sun
Title: Image Retargeting based on Text Region Awareness
Paperid: 748,   
Authors: Cai Xu, Ziqi Wen, Jie Zhao, Wanqing Zhao, Jinlong Yu, Haishun Chen, Ziyu Guan, Wei Zhao
Title: Beyond Equal Views: Strength-Adaptive Evidential Multi-View Learning
Paperid: 749,   
Authors: Xiaoyu Chen, Yigang Cen, Wanru Xu, Yue Zhang, Yi Jin, Yidong Li, Linna Zhang
Title: Hierarchical Meta-prototypes Network for Few-shot Action Recognition
Paperid: 750,   
Authors: Ao Yang, Yanglin Feng, Yuan Sun, Dezhong Peng, Guiduo Duan, Yang Qin
Title: Noise-Robust Cross-modal Learning for Reliable 2D-3D Retrieval
Paperid: 751,   
Authors: Chenda Wei, Haoyue Wang, Zhenxing Qian, Sheng Li, Xinpeng Zhang, Jian liu
Title: Learning Discrepant Transformations for Face Privacy Protection
Paperid: 752,   
Authors: Weitao You, Heda Zuo, Junxian Wu, Dengming Zhang, Zhibin ZHOU, Lingyun Sun
Title: Spatial-Temporal Decomposition and Alignment in Controllable Video-to-Music Generation
Paperid: 753,   
Authors: Yawei Chen, Huibing Wang, Mingze Yao, Jinjia Peng, Guangqi Jiang, Jiqing Zhang
Title: Scalable Multi-view Clustering based on Tight Anchor Distribution
Paperid: 754,   
Authors: Ruiqi Dong, Wenjing Pang, ChenJie Pan, Heng-yang Lu, Chenyou Fan
Title: StoryCrafter: Instance-Aligned Multi-Character Storytelling with Diffusion Policy Learning
Paperid: 755,   
Authors: Qi Chen, Zhuoya Yao, Haiguang Wang, Gaowei Wu, Bihui Yu, Siyuan Li, Jingxuan Wei, Cheng Tan
Title: ResearchPulse: Building Method–Experiment Chains through Multi-Document Scientific Inference
Paperid: 756,   
Authors: Jianqiao Cui, Bingyao Yu, Qihao Wang, Fei Meng, Jiwen Lu
Title: WhiADD: Semantic-Acoustic Fusion for Robust Audio Deepfake Detection
Paperid: 757,   
Authors: Rouqi Zhang, Chengdi Lu, Hancheng Lu, Yang Cao, Tiesong Zhao
Title: RobustVisH: Robust Visual-Haptic Cross-Modal Recognition Under Transmission Interference
Paperid: 758,   
Authors: Wenlan Chen, Lu Gao, Cheng Liang, Fei Guo
Title: Deep Variational Incomplete Multi-View Clustering with Information-Theoretic Guidance
Paperid: 759,   
Authors: Yizhou Lin, Nisha Huang, Kaer Huang, Henglin Liu, Yiqiang Yan, Jie Guo, Tong-Yee Lee, Xiu Li
Title: ICE: Intercede Concept Erasure in Text-to-Image Diffusion Models
Paperid: 760,   
Authors: Chengcheng Xing, Yanyu Xu, Yonghui Xu, Lizhen Cui
Title: Learning Invariant Discriminative Patterns for Unified Anomaly Detection
Paperid: 761,   
Authors: Nan He, Yiming Chen, Zheng Jiang, Song Yang, Lifeng Sun
Title: DynFed: Adaptive Federated Learning via Quantization-Aware Knowledge Distillation
Paperid: 762,   
Authors: Liuyi Li, Feng Shi, Jian Wang, Jinjing Zhu, Wenze Shao
Title: An Event-tailored State-Space based Model for Pedestrian Detection
Paperid: 763,   
Authors: Chen Gao, Youfang Lin, Wenbin Wang, Shuo Zhang
Title: Epipolar Consistency-based Network for Structure-Aware LF Semantic Segmentation
Paperid: 764,   
Authors: Kyungjune Lee, Seongjean Kim, Hoseok Tong, Hyucksang Lee, Seongmin Lee, Weisi Lin, Ping An, Sanghoon Lee
Title: Domain crossover Non-Rigid Registration for 3D Human Meshes
Paperid: 765,   
Authors: Xinyu Xiao, Peixi Peng, Qiang Wang, XingChao XingChao, Shuhan Qi
Title: Multi-faceted Complementary Learning for Incomplete Multi-view Multi-label Classification
Paperid: 766,   
Authors: Hang Lv, Zixuan Guo, Zijie Wu, Yanchao Tan, Guofang Ma, zhigang lin, Xiping Chen, Hong Cheng, Carl Yang
Title: MedAlign: Enhancing Combinatorial Medication Recommendation with Multi-modality Alignment
Paperid: 767,   
Authors: Yu Tong, Weihai Lu, Xiaoxi Cui, Yifan Mao, Zhejun Zhao
Title: DAPT: Domain-Aware Prompt-Tuning for Multimodal Fake News Detection
Paperid: 768,   
Authors: Yufan Hu, Kunlin Yang, Junyu Gao, Bin Fan, Hongmin Liu
Title: Learning Evidential Delta Denoising Scores for Video Editing
Paperid: 769,   
Authors: Donglin Zhang, Boyuan Ma, Xiaojun Wu, Josef Kittler
Title: Ingredients-Guided and Nutrients-Prompted Network for Food Nutrition Estimation
Paperid: 770,   
Authors: Juan Zhao, Yudao Sun, Zhihai Yang, Cai Xu, Hongji Chen, Fan Zhang, Jianxin Li
Title: Cross-Model Watermarking via Discriminative Samples for Secure Authentication
Paperid: 771,   
Authors: Xiaohan Yu, Zicheng Pan, Yang Zhao, Qin Zhang, Yongsheng Gao
Title: Contrastive Lie Algebra Learning for Ultra-Fine-Grained Visual Categorization
Paperid: 772,   
Authors: Zhang Haofan, Shangfei Wang
Title: EmIT: Emotional Interaction control in Text-to-image diffusion models
Paperid: 773,   
Authors: Haoxiang Cao, Chaoqun Wang, Yongwen Lai, Shaobo Min, Xuejin Chen
Title: CausalCtrl: Causality-Aware Control Framework for Text-Guided Visual Editing
Paperid: 774,   
Authors: Chong Wu, Maolin Che, Renjie Xu, Zhuoheng Ran, Hong Yan
Title: ELFATT: Efficient Linear Fast Attention for Vision Transformers
Paperid: 775,   
Authors: Xuan Zhang, Sinchee Chin, Jing-Hao Xue, Xiaochen Yang, Wenming Yang
Title: DARL: Mitigating Gradient Conflicts in Long-Tailed Out-of-Distribution Learning
Paperid: 776,   
Authors: Zhongyun Bao, Jhon Jhon, Jianchi Sun, Jing Zhou, Ziqi Yu, Chunxia Xiao
Title: I2HDiffuser: Image Illumination Harmonization Meets the Diffusion Model
Paperid: 777,   
Authors: Wenming Wu, Tianlei Sheng, Gaofeng Zhang, Liping Zheng
Title: FloorplanSBS: Synthesizing Vector Floorplans by Patch-Based Floorplan Segmentation
Paperid: 778,   
Authors: Hanyuan Liu, Minshan Xie, Jinbo Xing, Chengze Li, Chi LEUNG, Tien-Tsin Wong
Title: ColorDiffuser: Video Colorization with Pretrained Text-to-Image Diffusion Models
Paperid: 779,   
Authors: Zeyu Xia, Canqun Yang, Haoang Chi, Tao Tang, weiming xiang, Yingbo Cui
Title: MMF-SV: A Multi-Modal Feature Fusion-Based Structural Variant Caller
Paperid: 780,   
Authors: Jiayi Gao, Huaiwen Zhang
Title: Evaluating and Mitigating Sycophancy in Large Vision-Language Models
Paperid: 781,   
Authors: Zhen Wang, Dongyuan Li, Yaozu Wu, Peide Zhu, Shiyin Tan, Renhe Jiang
Title: Video-based Transparent Object Segmentation via Temporal Feature Aggregation
Paperid: 782,   
Authors: Fengxin Li, Zhiqian Yin, Hongyan Liu, Jingcai Guo, Jun He, Yi LI, CHAO ZHOU, Jun Zhang, Haijie Gu
Title: Topic Guided Multi-faceted Semantic Disentanglement for CTR prediction
Paperid: 783,   
Authors: Biao Chen, Kunbin He, Zhikun Zheng, Mengmeng Jing, Lin Zuo
Title: Chain-of-Thought Guided Semantic Debiasing for Low-Shot Vision-Language Tasks
Paperid: 784,   
Authors: Hanmo Chen, Chenghao Xu, Jiexi Yan, Cheng Deng
Title: AStF: Motion Style Transfer via Adaptive Statistics Fusor
Paperid: 785,   
Authors: Ting Li, Songtao Li, Shuaifeng Li, Xiaolin Qin, Maoyuan Zhao, Luping Ji, Mao Ye
Title: SAM-Guided Semantic Knowledge Fusion for Visible-Infrared Object Detection
Paperid: 786,   
Authors: Mingle Zhou, Jiahui Liu, Jin Wan, Gang Li, Min Li
Title: Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection
Paperid: 787,   
Authors: Mengling Xu, Ming Tao, Bingkun BAO
Title: Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance
Paperid: 788,   
Authors: Hang Xiong, Runmin Cong, Jinpeng Chen, Chen Zhang, Feng Li, Huihui Bai, Sam Kwong
Title: MM-Prompt: Multi-modality and Multi-granularity Prompts for Few-Shot Segmentation
Paperid: 789,   
Authors: Ruonan Wei, Yuntao Wang, Siyan Fang, Yuehuan Wang
Title: End-to-End Multiple Object Tracking with Dynamic Scene Perception
Paperid: 790,   
Authors: Ziqi Yuan, Jun Li, Yanghao Li, Yuxiang Huang, Chi Chen, Shuo Wang, Zhinan Gou
Title: CITR: Efficient Long Video Understanding Needs Causal Importance
Paperid: 791,   
Authors: Zhaoqi chen, Wanni Xu, Yunfeng Zhang, Yawei Hou, Zhenyu Wen, Cong Wang
Title: DeCoRec: Decoupled Collaborative Refinement for Multi-Modal Sequential Recommendations
Paperid: 792,   
Authors: Hang Yu, Yimin Wen, Xiongjian Lv
Title: DiffuFuse: Diffusion-Driven Dual-Stream Fusion Framework for Multimodal Sentiment Analysis
Paperid: 793,   
Authors: Pengyuan Li, Man Liu, Dongxia Chang, Yiming Wang, Zisen Kong, Yao Zhao
Title: AEMVC: Mitigate Imbalanced Embedding Space in Multi-view Clustering
Paperid: 794,   
Authors: Fei Ye, Adrian Bors
Title: Online Continual Learning via Dynamic Expandable Recursive Model
Paperid: 795,   
Authors: Xinbiao Gan, Qiang Zhang, Tiejun Li, Chunye Gong, Kai Lu
Title: GraphWorld: Ultra-fast Graph Engine for World-Wide Web Searching
Paperid: 796,   
Authors: Yufeng Chen, Umakant Kulkarni, Voicu Popescu, Sonia Fahmy
Title: RUN: A Case for Cross-Layer Networked Virtual Reality
Paperid: 797,   
Authors: Dongjian Yu, Weiqing Min, Xin Jin, Qian Jiang, Shuqiang Jiang
Title: Spatial-Aware Multi-Modal Information Fusion for Food Nutrition Estimation
Paperid: 798,   
Authors: Rui Shang, Min Liu, Xueping Wang, Yuan Bian, Yaonan Wang
Title: Decoupled Identity and Attribute Tokenization for Person Re-Identification
Paperid: 799,   
Authors: Disen Hu, Xun Jiang, Sun Zhe, Hao Yang, Chong Peng, Peng Yan, Heng Tao Shen, Xing Xu
Title: Geometric Gradient Divergence Modulation for Imbalanced Multimodal Learning
Paperid: 800,   
Authors: Ruian He, Zixian Zhang, Ri Cheng, Weimin Tan, Bo Yan
Title: Efficient Trajectory Space-Time Super-Resolution for Fast Live-cell Imaging
Paperid: 801,   
Authors: Tung-I Chen, Dae Lee, Guan-Ming Su, Mohammad Hajiesmaili, Ramesh Sitaraman
Title: NIVM: Real-time View Morphing via Neural Implicit Function
Paperid: 802,   
Authors: Zhicheng Dong, Xiaodong Yue, Yufei Chen, Yuxian Zhou
Title: Trusted Open-World Multi-View Classification with Dynamic Opinion Aggregation
Paperid: 803,   
Authors: Penglei Wang, Ziming Quan, Danyang Wu, Jin Xu
Title: Cluster-Aware Contrastive Multi-View Clustering Based on Masked Views
Paperid: 804,   
Authors: Kaixiang Wang, Xiaojian Ding, Wanqi Yang, Ming Yang
Title: Label-Semantics-Guided Multi-View Multi-Label Learning via High-Order Semantic Fusion
Paperid: 805,   
Authors: Rui Wang, Yuxuan Liu, Guangyu Yang, Quanxue Gao, Cheng Deng
Title: Bi-Orthogonal Non-negative Tensor tri-Factorization for Tensorized Label Learning
Paperid: 806,   
Authors: Yuwu Lu, Haoyu Huang, Xue Hu
Title: Domain-aware Visual Context Prompt for Multi-Source Domain Adaptation
Paperid: 807,   
Authors: Chengzhou Li, Xiaokang Liu, Qi Jia, Jinyuan Liu, Zhiying Jiang, Longhan Feng, Yu Liu, Zhongxuan Luo, Xin Fan
Title: Physics-Guided Sonar Image Fine-grained Recognition under Scarce Annotations
Paperid: 808,   
Authors: Songtao Zhou, Xiaoyu Qin, Yixuan Zhou, Qixin Wang, Zeyu Jin, Zixuan Wang, Zhiyong Wu, Jia Jia
Title: HarmoniVox: Painting Voices to Match the Avatar’s Soul
Paperid: 809,   
Authors: Chen Feng, Nicu Sebe, Georgios Tzimiropoulos, Miguel Rodrigues, Ioannis Patras
Title: Unveiling Open-set Noise: Theoretical Insights into Label Noise
Paperid: 810,   
Authors: Lin Peng, Cong Wan, Shaokun Wang, Xiang Song, Yuhang He, Yihong Gong
Title: CIA: Class- and Instance-aware Adaptation for Vision-Language Models
Paperid: 811,   
Authors: Jiaqi Cui, Yilun Li, Xi Wu, Jiliu Zhou, Yan Wang
Title: PREMISE: Individual Preference-aware Multi-modal Cooperation for Survival Prediction
Paperid: 812,   
Authors: Hongyu Liu, Hongwei Ge, Yuxuan Liu, Yaqing Hou
Title: Dialogue-Driven Interactive Dynamic Learning for Text-to-Image Person Retrieval
Paperid: 813,   
Authors: Hongyang Lin, Kuixiang Shao, Peijun Xu, Zhuoyang Bu, Yuyang Jiao, Ziyuan Tang, Chenxi Xiao, Jingyi Yu
Title: HandCraft: Tactile-Informed Hand-Object Dynamics Capture and Realistic Rendering
Paperid: 814,   
Authors: Fan Qi, Ao Liu, Zixin Zhang, Changsheng Xu
Title: FORGET ME: Federated Unlearning for Face Generation Models
Paperid: 815,   
Authors: Kailong Yu, Liyuan Pan, Liu Liu, Wei Liang
Title: Enhanced Dual-Pixel Image Reflection Removal via Gaussian Splatting
Paperid: 816,   
Authors: Jiawei Gu, Ziyue Qiao, Zechao Li
Title: Activation Shape Matters: OOD Detection with Norm-Entropy Fusion
Paperid: 817,   
Authors: dengwen wang, Guanyu Xing, Yanli Liu
Title: Low-light Invariant Representation Learning for Visible-Infrared Person Re-identification
Paperid: 818,   
Authors: Shuo Li, Xingchen Liu, Fang Liu, Licheng Jiao, Jiahao Wang, Xinyan Huang, Yanbiao Ma, Puhua Chen, Lingling Li, Xu Liu, Xuejian Gou
Title: Imagining Vision From Language for Few-Shot Class-Incremental Learning
Paperid: 819,   
Authors: Cheng Luo, Siyang Song, Siyuan Yan, zhen yu, Zongyuan Ge
Title: ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model
Paperid: 820,   
Authors: Huadai Liu, Jialei Wang, Xiangtai Li, Wen Wang, Qian Chen, Rongjie Huang, Yang Liu, Jiayang Xu, Wei Xue, Zhou Zhao
Title: MelodyEdit: Zero-shot Music Editing with Disentangled Inversion Control
Paperid: 821,   
Authors: Qixun Zeng
Title: Retrieval Augmented 3D Garment Generation from Single Image
Paperid: 822,   
Authors: Jingjun Yi, Qi Bi, Hao Zheng, Huimin Huang, Haolan Zhan, Yixian Shen, Wei Ji, Yawen Huang, Yuexiang Li, Xian Wu, Yefeng Zheng
Title: AtlantisGS: Underwater Sparse-View Scene Reconstruction via Gaussian Splatting
Paperid: 823,   
Authors: Yonghyeon Jo, JangHyun Kim, Jinsun Park
Title: BAC-GCN: Background-Aware CLIP-GCN Framework for Unsupervised Multi-Label Classification
Paperid: 824,   
Authors: Lizhi Xiong, Peipeng Yu, Yue Wu
Title: MADPHash: Manipulation-Aware Deep Perceptual Hashing using Feature Consistency
Paperid: 825,   
Authors: Quanhong Peng, Dan Zhang, Dong Zhao, Jianpeng Zhang, Meihua Song, Chenlei Lv
Title: Cam-Bench: A Benchmark for Image-based Camera Parameter Estimation
Paperid: 826,   
Authors: Zihan Wang, Yunhang Shen, Yuan Fang, Zuwei Long, Ke Li, Xing Sun, Jiao Xie, Shaohui Lin
Title: Towards Universal Perception through Language-Guided Open-World Object Detection
Paperid: 827,   
Authors: Zihang Zhang, Shoulong Zhang, Yan Wang, Shuai Li
Title: Reactffusion: Physical Contact-guided Diffusion Model for Reaction Generation
Paperid: 828,   
Authors: Xinlong Zhang, Zejian Li, Wei Li, Xiaoyu Zhang, Jia Wei, Chengyu Lin, Yongchuan Tang
Title: ObjCtrl: Object-based Control Relaxation for Conditional Text-to-image Generation
Paperid: 829,   
Authors: Long Chen, De Cheng, Shizhou Zhang, Yinghui Xing, Di Xu, Yanning Zhang
Title: Amplitude-aware Domain Style Replay for Lifelong Person Re-identification
Paperid: 830,   
Authors: Fujian Ren, Wenlan Chen, Lu Gao, Fei Guo, Cheng Liang
Title: Dual-level Distribution Alignment for Deep Incomplete Multi-view Clustering
Paperid: 831,   
Authors: Xinchen Ye, Aokai Zhang, Rui Xu
Title: Semantics-Driven Contrastive Learning for Real-World Depth Super Resolution
Paperid: 832,   
Authors: Junyu Chen, Jiawei Peng, Yuan Sun, Jian Dai, Xingfeng Li, Zhenwen Ren
Title: Scalable Unpaired Multi-View Clustering via Anchor-Driven High-Throughput Encoding
Paperid: 833,   
Authors: Zefan Zhang, Weiqi Zhang, Kailong Suo, yanhui li, Tian Bai
Title: Video-Level Multimodal Relation Extraction with Event-Entity Semantic Consistency
Paperid: 834,   
Authors: Wanying Zhou, Yuqi Sun, Yu Ling, Zhen Xing, Chenxi Ma, Weimin Tan, Bo Yan
Title: TabiMed: Tabularizing Medical Images for Few-Shot In-Context Diagnosis
Paperid: 835,   
Authors: Liang Yao, Fan Liu, Shengxiang Xu, Chuanyi Zhang, Shimin Di, Xing Ma, Jianyu Jiang, Zequan Wang, Jun Zhou
Title: UEMM-Air: Enable UAVs to Undertake More Multi-modal Tasks
Paperid: 836,   
Authors: Tan Yue, xuzhao Shi, Rui Mao, Zilong Song, Zonghai Hu, Dongyan Zhao
Title: AnaFig: A Human-Aligned Dataset for Scientific Figure Analysis
Paperid: 837,   
Authors: SungHyun Moon, Aidyn Zhakatayev, SeungJae Lee
Title: HAN: Korean Heritage Augmented Narrative Visual-Language Description Dataset
Paperid: 838,   
Authors: Maksim Golyadkin, Innokentiy Humonen, Valeria Rubanova, Danil Kalin, Ianis Plevokas, Dmitry Nikolotov, Aleksandr Utkov, Nikita Sidelnikov, Petr Ivanov, Bureeva Ekaterina, Ekaterina Alexandrova, Ilya Makarov
Title: MuMMy: Multimodal Dataset supporting VLM-based Egyptology Research Assistant
Paperid: 839,   
Authors: Xuanliu Zhu, Yiqiao Chai, Runnan Li, Mingying Lan, Li Gao
Title: CrossMind-VL: Multi-Subject Mind-to-Video Decoding with Multimodal LLM Semantic Grounding
Paperid: 840,   
Authors: Inzamamul Alam, Md Islam, Simon Woo
Title: SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection
Paperid: 841,   
Authors: Tianming Xu, Tiantian Guo, Youdan Feng, Zihan Chen, Qiaoyi Xue, Lingzhi Hu, Yuhang SHI
Title: Anatomical Region-Guided 3D PET/MR Tumor Segmentation via Medical Record
Paperid: 842,   
Authors: Shengjiu Dai, Xiujian Liang, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Title: Safe-BVAR: Text-to-Image Generative Watermarking for Bitwise Visual AutoRegressive Model
Paperid: 843,   
Authors: Min Li, Jinghui He, Jiachen Li, Delong Han, Jin Wan, Gang Li
Title: HGCF: Hierarchical Geometry-Color Fusion for Multimodal Industrial Anomaly Detection
Paperid: 844,   
Authors: Andong Zhu, Sheng Zhang, Xiaohang Shi, Hesheng Sun, Yu Liang, Zhuzhong Qian, Han Zheng, Xiaokun Wang, Ning Jiang
Title: VidIQ: Inference-Aware Neural Codecs for Quality-Enhanced, Real-Time Video Analytics
Paperid: 845,   
Authors: Cheng Ye, Weidong Chen, Peipei Song, Xinyan Liu, Lei Zhang, Zhendong Mao
Title: Multi-round Mutual Emotion-Cause Pair Extraction for Emotion-Attributed Video Captioning
Paperid: 846,   
Authors: Zhenxi Wang, Zongyao Yin, Yujie Hou, Xianchuan Yu
Title: Robust Multi-view Clustering via Pseudo Label Guided Universum Learning
Paperid: 847,   
Authors: Hao Wang, Hanxiao Li, Li Xu
Title: CrosST: Cross Swin 4D Transformer for Multi-Modal Alzheimer’s Detection
Paperid: 848,   
Authors: Shengqian Zhu, yu chengrong, Wenbo Qi, Jiafei Wu, Ying Song, Guangjun Li, Zhang Yi, Xiaogang Xu, Junjie Hu
Title: PRIME: Prototype-Driven Class Incremental Learning for Medical Image Segmentation
Paperid: 849,   
Authors: Luyan Cui, Huibing Wang, Yawei Chen, Mingze Yao, Xianping Fu, Jiqing Zhang
Title: Dual-Constraint Multi-view Fuzzy Clustering with Scalable Anchor Graph Learning
Paperid: 850,   
Authors: Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, Yuanyuan Liu
Title: Consistency of Local and Global Flatness for Federated Learning
Paperid: 851,   
Authors: Xingchen Li, Wuyang Zhang, Guoliang You, Xiaomeng Chu, Wenhao Yu, Yifan Duan, Yuxuan Xiao, Yanyong Zhang
Title: CalibWorkflow: A General MLLM-Guided Workflow for Centimeter-Level Cross-Sensor Calibration
Paperid: 852,   
Authors: Shifeng Bao, Zhe Xue, Qi Chen, Shilong Ou, Amin Beheshti, Quan Sheng, Anton van den Hengel, Yuankai Qi
Title: CausalMVC: Causal Content-Style Representation Learning for Deep Multi-View Clustering
Paperid: 853,   
Authors: Zhou Tan, De Li, Yirui Huang, Jia-Li Yin, Ximeng Liu
Title: FeatShield: Isolating Malicious Feature Extractors for Backdoor-Robust Federated Learning
Paperid: 854,   
Authors: Jinghan Liu, Xingmei Wang, Jiaxiang Meng
Title: Adaspeaker: Learning Discriminative Speaker Representations with Gradient-Aware Adaptive Scaling
Paperid: 855,   
Authors: Jingrou Wu, Haoxian Liu, Jin Zhang, Dan Wang, Jing Jiang
Title: P²VS: Progressive Partition-based Volumetric Video Streaming under Network Dynamics
Paperid: 856,   
Authors: Meng Chu, Yicong Li, Tat-Seng Chua
Title: GraphVideoAgent: Understanding Long Videos via LLM-Powered Entity Relation Graphs
Paperid: 857,   
Authors: Yawen Cui, Wenbin Zou, Huiping Zhuang, Yi Wang, Lap-Pui Chau
Title: Probabilistic Mixture of Hyperbolic Mamba for Few-Shot Class-Incremental Learning
Paperid: 858,   
Authors: Peng Zhao, Zhiguang Cao, Di Wang, Wen Song, Wei Pang, You Zhou, Yuan Jiang
Title: Visual-Enhanced Multimodal Framework for Flexible Job Shop Scheduling Problem
Paperid: 859,   
Authors: Tao Ling, Siping SHI, Dan Wang
Title: Accelerating Long Video Understanding via Compressed Scene Graph-Enabled Chain-of-Thought
Paperid: 860,   
Authors: Wenxiang Liu, Yongkang Liu, Weiliang Meng, Gaoqi He, Jianhua Li
Title: D³L: Curvature-Constrained Denoising Diffusion Model for 3D Lane Detection
Paperid: 861,   
Authors: Xiaobo Liu, Henglu Wei, Chuxi Yang, Wei Yu, Xudong Zhao, Xiangyang Ji
Title: Camera-Specific Imaging Simulation for Raw Domain Image Super Resolution
Paperid: 862,   
Authors: Amruta Muthal, Varghese Kuruvilla, Ravi Kiran Sarvadevabhatla
Title: PLATO: Generating Objects from Part Lists via Synthesized Layouts
Paperid: 863,   
Authors: Jiawei Zhang, Haonan Zhang, Weitao Zhang, Liang Pu, Zesen Feng, Jie Guo
Title: Decoupled Motion Prediction for Real-time G-buffer Free Frame Extrapolation
Paperid: 864,   
Authors: Yuzhen Niu, Siling Chen, Yuzhong Chen, Fusheng Li, Rui Xu, Hui Da
Title: CoFiVLA: Synergistic Coarse-Fine Vision-Language Alignment for Image Aesthetic Assessment
Paperid: 865,   
Authors: Bowen Guo, Shiwei Gan, Yafeng Yin, Xiao Liu, Zhiwei Jiang, Shunmei Meng
Title: Sentence-level Segmentation for Long Sign Language Videos with Captions
Paperid: 866,   
Authors: Lehao Lin, Baohua Fang, Ziheng Sun, Ke Wang, Hong Kang, Wei Cai
Title: BS3: Bézier Slicing Middleware for 3D Mesh LOD Optimization
Paperid: 867,   
Authors: Ziming Quan, Penglei Wang, Danyang Wu, Jin Xu
Title: Unsupervised Cross-view Message Passing Method for Multi-view Graph Clustering
Paperid: 868,   
Authors: Zongxin Liu, Yishu Liu, Guangming Lu, Xiaoling Luo, Bingzhi Chen
Title: CauRDG: Enhancing Domain Generalization with Causal-Driven Semantic Consistency Reasoning
Paperid: 869,   
Authors: Haolin Wang, Yafei Ou, Prasoon Ambalathankandy, Gen Ota, Pengyu Dai, Masayuki Ikebe, Kenji Suzuki, Tamotsu Kamishima
Title: Layer Separation: Towards Adjustable Joint Space Width Images Synthesis
Paperid: 870,   
Authors: Demin Yu, Wenchuan Du, Kenghong Lin, Xutao Li, Yunming Ye, Luo Chuyao, Chenxunlai Chenxunlai
Title: PiMMNet: Introducing Multi-Modal Precipitation Nowcasting via a Physics-informed Perspective
Paperid: 871,   
Authors: Weihai Lu, Li Yin
Title: DMMD4SR: Diffusion Model-based Multi-level Multimodal Denoising for Sequential Recommendation
Paperid: 872,   
Authors: Shu-Xun Yang, Xian-Ling Mao, Heyan Huang
Title: ESTJ: Enhancing Structured Tendency Judgment in Hybrid-Modal Table Understanding
Paperid: 873,   
Authors: Yimou Guo, Yaochen Li, Jingze Liu, Jiahui Feng, Haoyi Lou, Zhimin Chen, Yuan Gao, Yuanqi Su
Title: Image Captioning with Multimodal Guidance and Search Space Optimization
Paperid: 874,   
Authors: Hengnian Gu, Zhifu Chen, Jin Peng Zhou, Dongdai Zhou
Title: Hierarchical Disentanglement of Cognitive States for Enhanced Cognitive Diagnosis
Paperid: 875,   
Authors: Yi Dai, Yang Ding, Kaisheng Zeng
Title: Bridging Domains in Mental Stress Assessment via Retrieval-Augmented Reasoning
Paperid: 876,   
Authors: Jinjia Peng, Tianhang Cheng, Guangqi Jiang, Huibing Wang
Title: Prior-oriented Anchor Learning with Coalesced Semantics for Multi-View Clustering
Paperid: 877,   
Authors: Yusen Wang, Huan Zhou, Yu Jiang, Chunxia Xiao
Title: Robust Gaussian Surface Reconstruction with Semantic Aware Progressive Propagation
Paperid: 878,   
Authors: Qiyin Zhong, Xianglin Qiu, Xiaolei Wang, Zhen Zhang, Gang Liu, Jimin XIAO
Title: FAMRD: Frequency-Aware Multimodal Reverse Distillation for Industrial Anomaly Detection
Paperid: 879,   
Authors: Xiao Hu, Heiko Neumann, Jochen Lang
Title: A Filtering Framework for Semi-online Referring Video Object Segmentation
Paperid: 880,   
Authors: Weiqi Liu, Yongshan Zhang, Xinxin Wang, Lefei Zhang
Title: Deep Multi-Level Contrastive Clustering for Multi-Modal Remote Sensing Images
Paperid: 881,   
Authors: Xueyu Yuan, Jiarui Zhang, Jiangqi Song, Liu Liu, Li Zhang, Dan Guo, Richang Hong, Meng Wang
Title: DFGAP: Towards Depth-Free Cross-Category GAParts Perception via Uncertainty-Quantified Modeling
Paperid: 882,   
Authors: Weiwu Pang, Rajrup Ghosh, Jiawei Yang, Ziyu Wei, Branden Leong, Yue Wang, Ramesh Govindan
Title: SplatPose: On-Device Outdoor AR Pose Estimation Using Gaussian Splatting
Paperid: 883,   
Authors: Yuxiang Zhao, Wei Huang, Haipeng Zeng, Huan Zhao, Yujie Song
Title: Cross Time Domain Intention Interaction for Conditional Trajectory Prediction
Paperid: 884,   
Authors: Xuandong Huang, Yuzhe Zhou, Jiashu Li, Shiqian Lu, Shangfei Wang
Title: EmoDETective: Detecting, Exploring, and Thinking Emotional Cause in Videos
Paperid: 885,   
Authors: Shengzhe You, Libo Weng, Fei Gao
Title: ViTraj: Learning Dual-Side Representations for Vehicle-Infrastructure Cooperative Trajectory Prediction
Paperid: 886,   
Authors: Ruoxuan Li, Xiangyu Wu, Yang Yang
Title: Noise Self-Correction via Relation Propagation for Robust Cross-Modal Retrieval
Paperid: 887,   
Authors: Yue Sun, Xinqi Liu, Zhiliang He, Jialu Zhang, Chenming Wu, Guodong Lu, Jituo Li
Title: DAFU-CAD: Depth-assisted Feature Unraveling for Sketch-based Robust CAD Modeling
Paperid: 888,   
Authors: Xiaodong Zhu, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Title: Query-Based Audio-Visual Temporal Forgery Localization with Register-Enhanced Representation Learning
Paperid: 889,   
Authors: Hongda Qin, Xiao Lu, Zhiyong Wei, Ningjiang Chen
Title: Object-Preserving Counterfactual Diffusion Augmentation for Single-Domain Generalized Object Detection
Paperid: 890,   
Authors: Haosheng Cai, Yang Xue
Title: G2LFormer: Global-to-Local Query Enhancement for Robust Table Structure Recognition
Paperid: 891,   
Authors: Haochen Yang, Lei Li, Jiacheng Guo, Baolu Li, Minghai Qin, Hongkai Yu, Tianyun Zhang
Title: DA3D: Domain-Aware Dynamic Adaptation for All-Weather Multimodal 3D Detection
Paperid: 892,   
Authors: Kewei Zhao, Xiaowei Hu, Qinya Li
Title: Device-Cloud Collaborative Learning Framework for Efficient Unknown Object Detection
Paperid: 893,   
Authors: Mianzimei Yang, Zhipeng Zhou, Jin Zhang, Yuanhao Pu, Hong Xie, Defu Lian
Title: Conflict-Buffering Optimization by Symmetry Teleportation for Deep Long-Tailed Recognition
Paperid: 894,   
Authors: Tianzhong Lan, Zhang Yi, Xiuyuan Xu, Min Zhu
Title: LooBox: Loose-box-supervised 3D Tumor Segmentation with Self-correcting Bidirectional Learning
Paperid: 895,   
Authors: Jiaxing Qi, Yifan Xu, Zhifei Yang, Ruifei Ma, Chao Zhang, Kuifei Yu
Title: BridgeGLM: Bridging Graph and Language Spaces for Domain Generalization
Paperid: 896,   
Authors: Mengzu Liu, Junwei Xu, Tao Huang, Fangfang Wu, Le Dong, Xin Li, Weisheng Dong
Title: Exploring Global Correlations via Polarity Memory for Multispectral Demosaicing
Paperid: 897,   
Authors: Junzhe Zhang, Chengfeng Han, Dandan Ding, Zhan Ma
Title: GeoQE: Enhancing Quality of Experience in Point Cloud Streaming
Paperid: 898,   
Authors: Xiang Huang, Ao Luo, Xiao Wu, Zhaoquan Yuan
Title: Latent Interactiveness Field for Non-Contact Human Object Interaction Detection
Paperid: 899,   
Authors: Yixuan Zhou, Yulu Tian, Wenliang Zhong, Xingbin Yu, Heng Tao Shen, Xing Xu
Title: SaP-Bot: A Multimodal Large-Language Model for End-to-End Same-Product Identification
Paperid: 900,   
Authors: Tianjiao Xu, Hao Fu, Suiyang Zhang, Jianhua Yin, Tian Gan, Liqiang Nie
Title: Enhancing Democratic Mediation through Norm-Awareness in Generative Agent Societies
Paperid: 901,   
Authors: Yongxin Li, Ying Cheng, Yaning Pan, Wen He, Qing Wang, Rui Feng, Xiaobo Zhang
Title: Semantic-Aware Hard Negative Mining for Medical Vision-Language Contrastive Pretraining
Paperid: 902,   
Authors: Yuanyi Duan, Wei Xu, Qinlong Wu, Guo-Sen Xie, Fang Zhao, Caifeng Shan
Title: AnomalyControl: Highly-Aligned Anomalous Image Generation with Controlled Diffusion Model
Paperid: 903,   
Authors: Jiawei Meng, Zhengmao Yang, Zhiqiang Liu, Shaokai Chen, Zhizhen Liu, Wen Zhang, Huajun Chen
Title: Text-to-Image Generation with Multi-modal Knowledge Graph Construction and Retrieval
Paperid: 904,   
Authors: Jing Ma, Haochen Sun, Zeyuan Zang, Fangxiang Feng, Caixia Yuan, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang
Title: VL-DynaRefine: A Vision-Language Dynamic Refinement Approach for Visual Reasoning
Paperid: 905,   
Authors: Haitao Wang, Sijia Wen, Bo Guo
Title: Polarimetric Monocular Gaussian Splatting SLAM for Dense Surface Reconstruction
Paperid: 906,   
Authors: Gang Pan, Hongen Liu, Di Sun
Title: Formula Spotting Based on Synergy Perception and Representation Mining
Paperid: 907,   
Authors: Guyue Jin, Tianming Zhao, Jiacan Yan, Tian Tian
Title: Contextually-Guided State Space Fusion for Misaligned Multi-Spectral Object Detection
Paperid: 908,   
Authors: Ziming Zhao, Zhaoxuan Li, Tingting Li, Fan Zhang
Title: Stealthy-AE: Generating Stealthy Adversarial Examples through Online Social Networks
Paperid: 909,   
Authors: Jinghan Yang, Zhenbo Xu, Dehua Ma, Liu Liu, Fei Liu, Gong Huang, Zhaofeng He
Title: RecipeRAG: Advancing Recipe Generation with Reinforced Retrieval Augmented Generation
Paperid: 910,   
Authors: Zhilin Huang, Chujun Qin, Yifei Xing, Wenming Yang
Title: Enhanced Motion-aware Latent Diffusion Models for Video Frame Interpolation
Paperid: 911,   
Authors: Yang Zhou, Jin Wang, Yuxiao Zhang, Kaixiang Huang, Guodong Lu, Jingru Yang, Shengfeng He
Title: Art4Math: Handwritten Mathematical Expression Recognition via Multimodal Sketch Grounding
Paperid: 912,   
Authors: Fan Yang, Ling Deng, Zhiyong Gan, Qisheng He, Yuanbo Fang, Xiangmin Xu, Shuangping Huang, Tianshui Chen
Title: Optimal Feature Embedding for Document Large Visual Language Model
Paperid: 913,   
Authors: Sifan Zuo, Youfa Liu, Bo Du
Title: CSDN: CLIP-Driven Similarity-Aligned Distillation Network for Weakly-Supervised Object Localization
Paperid: 914,   
Authors: Mingyang Ding, Zhan Wang, Jiachen Wang, Tingting Han, Xinyuan Hu, Jiajun Ding, Min Tan, Zhenzhong Kuang
Title: FutureGS: Structured Gaussian Fields for Future-Aware Dynamic Scene Modeling
Paperid: 915,   
Authors: Maksim Golyadkin, Valeria Rubanova, Aleksandr Utkov, Dmitry Nikolotov, Ilya Makarov
Title: Evaluation of Egyptian Hieroglyph Classification Across Diverse Writing Styles
Paperid: 916,   
Authors: Shiying Lin, Rong Hu, Zuoyong Li, Qinghua Lin, Jiawei Wu, Changqing Zhang
Title: Gradient-Aware Revitalization of Non-Effective Samples in Medical Image Segmentation
Paperid: 917,   
Authors: Ran Chen, Taiyi Su, Hanli Wang
Title: WaveCL: Wavelet Calibration Learning for Referring Video Object Segmentation
Paperid: 918,   
Authors: Feng Chen, Jielong He, Yang Liu, Heng Liu, Zhe Chen, Yaxiong Wang
Title: Unsupervised Cross-Modal Person Search via Progressive Diverse Text Generation
Paperid: 919,   
Authors: Mingjie Li, Junhao Lin, Dian Ouyang, Ying Zhang, Wei Wang
Title: Graph-based Approximate Nearest Neighbor Search by Deep Reinforcement Routing
Paperid: 920,   
Authors: Shehzad Ali, Md Islam, IK HYUN LEE, Mingfu Xiong, Minh-Son Dao, Saeed Anwar, Sambit Bakshi, Khan Muhammad
Title: Towards Hazardous Activity Recognition for A Novel Real-World Dataset
Paperid: 921,   
Authors: Wenxi Huang, Xiaojun Chen, Qin Zhang, Ting Wan, Ziqi Liu, Liang-Jie Zhang
Title: MRBench: A Multi-Image Reasoning Benchmark with Adaptive Knowledge Retrieval
Paperid: 922,   
Authors: Yongan Guo, Zhongyan Zhou, Yuao Wang, Na Zhu, Xuyun Zhang, Hongwang Xiao, Yuan Miao, Bo Li
Title: RSFormer: Time Series Transformer for Robust Sports Action Recognition
Paperid: 923,   
Authors: Liang Xu, Cathal Gurrin, Songkai Jia, Monica Ward, Allie Tran
Title: Through Someone Else’s Eyes: Lifelogging Meets Narrative Virtual Reality
Paperid: 924,   
Authors: Yuzhen Li, Yuehui Han, Jianjun Qian, Jian Yang
Title: Self-Supervised Vision Graph Neural Networks Based on Contrastive Learning
Paperid: 925,   
Authors: Zijun Xu, Jiahao Guo, Chunjie Zhang, Zhongyuan Wang, Chunxia Xiao, Chao Liang
Title: Quantum Interference-Inspired Who-What-Where Composite-Semantics Instance Search for Story Videos
Paperid: 926,   
Authors: Ru Jia, Xiaoqian Liang, Xubin Duan, Jianji Wang, Nanning Zheng
Title: HybridPlane: A General 4D Representation for dynamic scene reconstruction
Paperid: 927,   
Authors: Fan Wang, Zhangjie Fu, Xiang Zhang, Ziqiang Li, Ziwen He, Manyu Wang
Title: Pair-wise Confidence Difference-based Pseudo-Label Selection for Universal Mismatched Steganalysis
Paperid: 928,   
Authors: Xiaoyan Yuan, Wei Wang, Junxin Chen, Xiping Hu
Title: Reading Between the Channels: Knowledge-Augmented Medical Time Series Classification
Paperid: 929,   
Authors: Yufei Zheng, Jiawei Liu, Bingyu Hu, Zikun Wei, Yong Wu, Zheng-Jun Zha
Title: Dual Uncertainty-Guided Feature Alignment Learning for Text-Based Person Retrieval
Paperid: 930,   
Authors: Benlong Wu, Yuang Qi, Xiuwei Shang, Weiming Zhang, Nenghai Yu, Kejiang Chen
Title: MMPro: A Decoupled Perception-Thinking-Action Framework for Secure GUI Agent
Paperid: 931,   
Authors: Daoxu Sheng, Qi Qi, Jing-Yu Wang, Jianxin Liao
Title: Watch, Skip, Repeat: Hotspot-Aware Joint Optimization for Video Streaming
Paperid: 932,   
Authors: Shilin Liu, Kyohei Kamikawa, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Context-aware Image-to-Music Generation via Bridging Modalities through Musical Captions
Paperid: 933,   
Authors: Shanding Diao, Yang Zhao, Yuan Chen, Zhao Zhang, Wei Jia, Ronggang Wang
Title: Multi-Layer Gaussian Splatting for Single-Image Feed-Forward Spatial Scene Reconstruction
Paperid: 934,   
Authors: Mingle Zhou, Xingli Wang, Jiachen Li, Delong Han, Gang Li
Title: Unsupervised Dual-Domain Memory Model for Time Series Anomaly Detection
Paperid: 935,   
Authors: Ziyun Qian, Xiao-Zeyu Xiao-Zeyu, Xingliang Jin, Dingkang Yang, Mingcheng Li, Zhenyi Wu, Dongliang Kou, Peng Zhai, Lihua Zhang
Title: UMSD: High Realism Motion Style Transfer via Unified Mamba-based Diffusion
Paperid: 936,   
Authors: Xinyi Wang, Pengfei Ren, Haoyang Zhang, Xin Sheng, Da Li, Liang Xie, Yue Gao, Erwei Yin
Title: VIHand: Enhancing 3D Hand Pose Estimation with Visual-Inertial Benchmark
Paperid: 937,   
Authors: Stefan Arzberger, Paul Raith, Werner Bailer, Marion Jaks
Title: A Dataset and Metric for Textual Video Content Description
Paperid: 938,   
Authors: Minyi Zhao, YI LIU, wensong he, Bingzhe Yu, Yuxi Mi, Shuigeng Zhou
Title: Towards High Robust Vision-Language Large Models: Benchmark and Method
Paperid: 939,   
Authors: Ines Riahi, Abduljalil Radman, Zixin Guo, Rachid Hedjam, Jorma Laaksonen
Title: Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark
Paperid: 940,   
Authors: Alessandro Ragano, Carl Tolentino, Kata Szita, Dan Barry, Davoud Panah, Niall Murray, Andrew Hines
Title: EgoMusic: An Egocentric Augmented Reality Glasses Dataset for Music
Paperid: 941,   
Authors: Guillaume Gautier, Xuemei Zhou, Nguyen Thong, Jack Jansen, Louis Fréneau, Marko Viitanen, Uyen Phan, Jani Käpylä, Irene Viola, Alexandre Mercat, Pablo Cesar, Jarno Vanne
Title: UVG-CWI-DQPC: Dual-Quality Point Cloud Dataset for Volumetric Video Applications
Paperid: 942,   
Authors: Debora Russo, Nicola Mazzocca, Valeria Vittorini
Title: UR-MAT: A Multimodal, Material-Aware Synthetic Dataset of Urban Scenarios
Paperid: 943,   
Authors: NEGIN GHAMSARIAN, Raphael Sznitman, Klaus Schoeffmann, Jens Kowal
Title: WetCat: Automating Skill Assessment in WetLab Cataract Surgery Videos
Paperid: 944,   
Authors: Fanshen Meng, Zhenhua Meng, Ru Jin, Yuli Chen, Rongheng Lin, Budan Wu
Title: TAMER: Interest Tree Augmented Modality Graph Recommender for Multimodal Recommendation
Paperid: 945,   
Authors: Xu Chen, Yang Li, Yahong Han, Jialie Shen
Title: Ex Pede Herculem, Predicting Global Actionness Curve from Local Clips
Paperid: 946,   
Authors: Bohao Zhang, Haoxin Xu, Jingzhong Lin, Changbo Wang, Gaoqi He
Title: Regulatory Focus Theory Induced Micro-Expression Analysis with Structured Representation Learning
Paperid: 947,   
Authors: Sifan Zhou, Ziyu Zhao, Jiahao Nie, Yichao Cao, Xiaobo Lu
Title: FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking
Paperid: 948,   
Authors: Yuxuan Zhang, Bo Wang, Yu Du, Yangfu Zhu, Haorui Wang, Guangyao Su, Tao Zhou, Bin Wu
Title: Cause and Effect: Video Social Relationship Recognition from Causal Perspective
Paperid: 949,   
Authors: Yixuan Gao, Xiongkuo Min, Jinliang Han, Yuqin Cao, Sijing Wu, Yunze Dou, Guangtao Zhai
Title: Multi-Dimensional Text-to-Face Image Quality Assessment Using LLM: Database and Method
Paperid: 950,   
Authors: Xudong Wang, Lei Tan, Pingyang Dai, Liujuan Cao, Rongrong Ji
Title: GPT-ReID: Learning Fine-grained Representation with GPT for Text-based Person Retrieval
Paperid: 951,   
Authors: Shuoshuo Li, Shuli Cheng, Liejun Wang
Title: Entity-Level Alignment with Prompt-Guided Adapter for Remote Sensing Image-Text Retrieval
Paperid: 952,   
Authors: Jiale Yu, Baopeng Zhang, Zhu Teng, Jianping Fan
Title: OV-DAVEL: Towards Open-Vocabulary Dense Audio-Visual Event Localization in Untrimmed Videos
Paperid: 953,   
Authors: Yongjie Hu, Yifan Jiang, Ziyun Li, Fei Gao, Henrik Boström, Nannan Wang
Title: CADQ: Attribute-Consistent Face Cartoonization with Cross-modal Aligned and Deformable Quantization
Paperid: 954,   
Authors: Yang Deng, Yu-Kun Lai, Paul Rosin
Title: CCDb+: Enhanced Annotations and Multi-Modal Benchmark for Natural Dyadic Conversations
Paperid: 955,   
Authors: Yuxiong Xu, Bin Li, Weixiang Li, Sara Mandelli, Viola Negroni, Sheng Li
Title: ALDEN: Dual-Level Disentanglement with Meta-learning for Generalizable Audio Deepfake Detection
Paperid: 956,   
Authors: Runwei Situ, Yi Cai, Yong Xu, Jiexin Wang
Title: Ground and Reconstruct: Entity-Region Bidirectional Alignment Pre-Training for Low-Resource GMNER
Paperid: 957,   
Authors: Guoqiang Liang, Chuan Qin, De Cheng, Shizhou Zhang, Yanning Zhang
Title: Boosting Multi-Modal Alignment: Geometric Feature Separation for Class Incremental Learning
Paperid: 958,   
Authors: Pingting Hao, Huijie Zhang, Yongshan Zhang
Title: Tensor-based Opposing yet Complementary Learning for Multi-view Multi-label Feature Selection
Paperid: 959,   
Authors: Hongzhao Li, Hualei Wan, Liangzhi Zhang, Mingyuan Jiu, Shupan Li, Mingliang Xu, Muhammad Haris Khan
Title: Towards Robust Multimodal Domain Generalization via Modality-Domain Joint Adversarial Training
Paperid: 960,   
Authors: Zhiyu Ye, Guowen Li, Haoyuan Liang, Zixi Wang, Shilei Cao, Yushan Lai, Juepeng Zheng
Title: Quantifying Samples with Invariance for Source-Free Class Incremental Domain Adaptation
Paperid: 961,   
Authors: Huilin Chen, Miaomiao Cai, Fan Liu, Zhiyong Cheng, Richang Hong, Meng Wang
Title: I³-MRec: Invariant Learning with Information Bottleneck for Incomplete Modality Recommendation
Paperid: 962,   
Authors: Yongquan Xue, Zhaoru Guo, Zhaozhao Su, Chong Peng, Jun Feng, Pan Zhou, Marcin Pietron, Xiyuan Wang, Liejun Wang, Panpan Zheng
Title: RoDeCon-Net: Medical Image Segmentation via Robust Decoupling and Contrast-Enhanced Fusion
Paperid: 963,   
Authors: Liangyu Fu, Junbo Wang, Yuke Li, Qiangguo Jin, Hongsong Wang, Ya Jing, Linjiang Huang, Liang Yao, Jiangbin Zheng, Xuecheng Wu, Zhiyong Wang
Title: DSACap: Enhancing Visual-Semantic Alignment with Diffusion-based Framework for Image Captioning
Paperid: 964,   
Authors: Haizhou Wang, Guobing Zou, Fei Xu, Yangguang Cui, Tongquan Wei
Title: Multi-Width Neural Network-Assisted Hierarchical Federated Learning in Heterogeneous Cloud-Edge-Device Computing
Paperid: 965,   
Authors: Yue Pan, Cunbo Li, Peiyang Li, Fali Li, Feng Wan, Dezhong Yao, Zehong Cao, Peng Xu
Title: Real-Time EEG Emotion Recognition from Dynamic Mixed Spatiotemporal Graph Learning
Paperid: 966,   
Authors: Chuanwei Huang, Zexi Jia, Hongyan Fei, Yeshuang Zhu, Yuan Zhiqiang, Jinchao Zhang, Jie Zhou
Title: ArtFRD: A Fisher-Rao Mixture Metric for Generative Model Aesthetic Evaluation
Paperid: 967,   
Authors: Xueyi Zhang, Peiyin Zhu, Jinping Sui, Xiaoda Yang, Jiahe Tian, Mingrui Lao, Siqi Cai, Yanming Guo, Jun Tang
Title: Choose Your Expert: Uncertainty-Guided Expert Selection for Continual Deepfake Detection
Paperid: 968,   
Authors: Luyao Ren, Wenxin Yu, Zhiqiang Zhang, Chang Liu
Title: EMIFS: Efficient Multi-scale Information Fusion Self-supervision for Medical Image Segmentation
Paperid: 969,   
Authors: Man Xiao, Jianbin Ye, Bo Liu, Zijian Gao, Kele Xu, Xiaodong Wang
Title: Analytic Synaptic Dynamic Scaling Balancer for Multimodal Deepfake Continual Detection
Paperid: 970,   
Authors: Zixin Tang, Haihui Fan, Jinchao Zhang, Hui Ma, Xiaoyan Gu, Bo Li, Weiping Wang
Title: ShieldIR: Privacy-Preserving Unsupervised Cross-Domain Image Retrieval via Dual Protection Transformation
Paperid: 971,   
Authors: Bo Wang, Jin Liu, Huiyuan Fu, Xin Wang, Heng Zhang, Huadong Ma
Title: Severe Light, Textureless Sight: A Benchmark for Extreme Exposure Correction
Paperid: 972,   
Authors: Zihao Mo, Junye Chen, Chaowei Fang, Guanbin Li
Title: PatchWiper: Leveraging Dynamic Patch-Wise Parameters for Real-World Visible Watermark Removal
Paperid: 973,   
Authors: Gang Pan, Liming Pan, Hongze Mi, Rongyu Xiong, Jiahao Wang, Di Sun
Title: AFFIR: Dual-Modal Attention Feature Fusion for Scene Text Image Retargeting
Paperid: 974,   
Authors: Yijie Yang, Lianyong Qi, Weiming Liu, Fan Wang, Jing Du, Yuwen Liu, Xiaolong Xu, Qiang Ni, Wanchun Dou, Xiaokang Zhou
Title: Joint Test-time Adaptation with Refined Pseudo-labels and Latent Score Matching
Paperid: 975,   
Authors: Zuona Chen, James She
Title: Infusing AI Art with Cultural Authenticity Through the Culture-Specific LoRA
Paperid: 976,   
Authors: Yujiang Li, Zhili Zhou, Ruohan Meng, Baowei Wang, Xiaojuan Wang, Cheng Qiao, Jiantao Zhou
Title: Zero Matrix guided Adaptive Image Vaccine against Diffusion Model-based Multitask
Paperid: 977,   
Authors: Lingbo Zhang, Bingqian Sun, Linghan Cai, Yifeng Wang, Ye Zhang, Songhan Jiang, Kai Zhang, Yongbing Zhang
Title: Counting by Points: Density-Guided Weakly-Supervised Nuclei Segmentation in Histopathological Images
Paperid: 978,   
Authors: Lei Chen
Title: Graph-Perceptron with Semantic Fidelity for No-Reference Super-Resolution Image Quality Assessment
Paperid: 979,   
Authors: Zhihao Wang, Shiyu Liu, Zhiwei He, Kangjie Zheng, Liangying Shao, Junfeng Yao, Jinsong Su
Title: Gloss Matters: Unlocking the Potential of Non-Autoregressive Sign Language Translation
Paperid: 980,   
Authors: Xiaodi Xu, Lijie Li, Ye Wang, Tao Ren, Tian Qiao
Title: WFF: Wavelet-based Information Fusion for Multimodal Knowledge Graph Link Prediction
Paperid: 981,   
Authors: Yongji Li, Luping Wang
Title: Spatial-Frequency Mamba Collaborative Learning Network for Infrared Small Target Detection
Paperid: 982,   
Authors: shao jiang, Xinbo Zhao, XiaoChun Zou, XiaoLin Ye
Title: EgoHierMask: Hierarchical Semantic-Prior Guided Masked Autoencoder for Egocentric Action Recognition
Paperid: 983,   
Authors: Hongyu Jiang, Yuxin Huo, Sirou Sheng, Hong Tao, Chenping Hou
Title: Scalable One-step Unaligned Multi-view Clustering via Joint High-Order Correlation Learning
Paperid: 984,   
Authors: Xiangyu Shan, Heng Song, Junwu Zhu
Title: DFCNet: Dual-Factor Compensatory Clustering Network for Modality-Imbalanced Generalized Zero-Shot Learning
Paperid: 985,   
Authors: Bingshuai Liu, Ante Wang, Zijun Min, Chenyang Lyu, Longyue Wang, Zhihao Wang, Xu Han, Peng Li, Jinsong Su
Title: EditEval: Towards Comprehensive and Automatic Evaluation for Text-guided Video Editing
Paperid: 986,   
Authors: Xueheng Li, Xuanhua He, Tao Hu, Jie Zhang, Man Zhou, Chengjun Xie, Yingying Wang, Bo Huang
Title: Freq-RWKV: Granularity-Aware Spatial-Frequency Synergy via Dual-Domain Recurrent Scanning for Pan-sharpening
Paperid: 987,   
Authors: Tianyi Ma, Maoying Qiao
Title: EBaR: Efficient Buffer and Resetting for Single-Sample Continual Test-Time Adaptation
Paperid: 988,   
Authors: Yan Chen, Bingbing Jiang, Peng Zhou, Lei Duan, Yuhua Qian, Liang Du
Title: Balanced Multiple Kernel Clustering with Discrete Partition Entropy Auto Regularization
Paperid: 989,   
Authors: Feiyu Peng, Chaobo He, Junwei Cheng, Huijuan Hu, Wenkai Zhang, Youda Mo
Title: Frequency-refined Graph Convolution Network with Cross-modal Wavelet Denoising for Recommendation
Paperid: 990,   
Authors: Xiaokun Wang, Yuting Yan, Sheng Zhang, Andong Zhu, Ning Chen, Yu Chen, Zhuzhong Qian, Sanglu Lu, Yu Liang
Title: Decode-What-Matters: Frame-Level Parallel Generative Decoding to Accelerate Large-Scale Video Analytics
Paperid: 991,   
Authors: Yamiao Ding, Tianrui Liu, Zhizhou Lu, Jun-Jie Huang, zhao wentao, Xinwang Liu, Meng Wang
Title: VSumMamba: Mamba Empowered Efficient Video Summarization with Multi-Scale Spatial-Temporal Modeling
Paperid: 992,   
Authors: Yiyang Gu, Taian Guo, Hang Zhou, Zihao Chen, Zhiping Xiao, Yifang Qin, Xiao Luo, Wei Ju, Yifan Wang, Ming Zhang
Title: CODE: Towards Partial Label Graph Learning via Coupled Dual Separation
Paperid: 993,   
Authors: Chunshi Wang, Hongxing Li, Yawei Luo
Title: SonicGauss: Interactive Position-aware Impact Audio Synthesis for 3D Gaussian Splatting
Paperid: 994,   
Authors: Tengyu Ma, Jiafa Ruan, Yuetong Wang, Guangchao Han, Zhu Liu, Long Ma, Risheng Liu
Title: Degradation-Aware One-Step Diffusion Model for Content-Sensitive Super-Resolution in the Dark
Paperid: 995,   
Authors: Domenic Zingsheim, Markus Plack, Hannah Dröge, Janelle Pfeifer, Patrick Stotko, Matthias Hullin, Reinhard Klein
Title: RIFTCast: A Template-Free End-to-End Multi-View Live Telepresence Framework and Benchmark
Paperid: 996,   
Authors: Hui Wu, Haoquan Zhai, Yuchen Li, Hengyi Cai, Peirong Zhang, Yidan Zhang, Lei Wang, Chunle Wang, Yingyan Hou, Shuaiqiang Wang, Dawei Yin
Title: MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering
Paperid: 997,   
Authors: Yunlong Zhao, Xiaoheng Deng, Zhuohua Qiu, Feng Yang, Chang Xu, Shan You, Xiangjian He, Xiu Su
Title: CaDGS: Modeling Inter-Gaussian Mutual Information for Dynamic Novel View Synthesis
Paperid: 998,   
Authors: Wenli Zheng, Huiyuan Fu, Xicong Wang, Hao kang, Chuanming Wang, Jin Liu, Zekai Xu, Heng Zhang, Huadong Ma
Title: EvRAW: Event-guided Structural and Color Modeling for RAW-to-sRGB Image Reconstruction
Paperid: 999,   
Authors: Si Chen, Yujia Chen, Xiaotian Yin, Xin Liu, Huakai Lai, Tianzhu Zhang
Title: PAF: Prototype Adaptive Fusion for Test-Time Adaptation of Vision-Language Models
Paperid: 1000,   
Authors: Chi Huang, Qi Zhang, Qian Zhang, Nan Li, Yipu Gong, Xiaowei Wang, Wei Feng
Title: TriGS: Tri-consistency 3D Gaussian Splatting from Sparse and Unposed Views
Paperid: 1001,   
Authors: Mingrui Li, Shuhao Zhai, Zibing Zhao, Luyue Sun, Xinxiao Wang, Dong Li, Shuhong Liu, Hongyu Wang
Title: Wild3A: Novel View Synthesis from Any Dynamic Images in Seconds
Paperid: 1002,   
Authors: Peng Ying, Zhongnian Li, Meng Wei, Xinzheng Xu
Title: Reversible Privacy Preserving on Vision-Language Models via Adversarial Multimodal Key
Paperid: 1003,   
Authors: Zhaoyun Jiang, Jiaqi Guo, Shakie Liu, Chao Han, Ting Liu, Jian-Guang Lou, Dongmei Zhang
Title: Illustration Layout Generation for Slide Enhancement with Pixel-based Diffusion Model
Paperid: 1004,   
Authors: Qiuna Tan, Runqi Qiao, Guanting Dong, YiFan Zhang, MinhuiWu MinhuiWu, Jiapeng Wang, Miaoxuan Zhang, Yida Xu, Chong Sun, Chen Li, Honggang Zhang
Title: OCR-Critic: Aligning Multimodal Large Language Models’ Perception through Critical Feedback
Paperid: 1005,   
Authors: Yongzheng Liu, Siru Zhong, Gefeng Luo, Weilin Ruan, Yuxuan Liang
Title: Towards Multi-Scenario Forecasting of Building Electricity Loads with Multimodal Data
Paperid: 1006,   
Authors: Xingbo Yao, XuanminWang XuanminWang, Hui Xiong
Title: CitySculpt: 3D City Generation from Satellite Imagery with UV Diffusion
Paperid: 1007,   
Authors: Haichuan Fang, Haoran Zhang, Yulin Du, Qiang Guo, Zhen Tian, Youwei Wang, Yangdong Ye
Title: CDIB: Consistency Discovery-guided Information Bottleneck for Multi-modal Knowledge Graph Reasoning
Paperid: 1008,   
Authors: Wei Li, Yizhao Wan, Xiao Wu, Jianshuai Wang, Penglin Dai, Zhaoquan Yuan
Title: HOPNet: Learning Hand-Object-Person Interaction Network for Hand Contact State Detection
Paperid: 1009,   
Authors: Yunyu Zou, Yishu Liu, Jun Liang, Bingzhi Chen
Title: SG-FSL: Cross-Domain Few-Shot Learning with Style-Decoupled Augmentation and Gradient-Conflict Adjustment
Paperid: 1010,   
Authors: Zhenxuan Fang, Shuaibo Wang, Weisheng Dong, Junwei Xu, Fangfang Wu, Xin Li, Guangming Shi
Title: Beyond Visual Quality: Fidelity-Oriented Diffusion Model for Real-world Image Super-Resolution
Paperid: 1011,   
Authors: Xuyao Liu, Jiahui Qu, Wenqian Dong
Title: Breaking the Spatial-Temporal Consistency Constraint: Towards Reference-Based Hyperspectral Image Super-Resolution
Paperid: 1012,   
Authors: Xiangfei Sheng, Pangu Xie, Weidong Zou, Pengfei Chen, Tong Zhu, Leida Li
Title: InstructCrop: Teaching Multimodal Large Language Models to Crop Aesthetic Images
Paperid: 1013,   
Authors: Qingtian Bian, Tieying Li, Marcus De Carvalho, Jiaxing Xu, Hui Fang, Yiping Ke
Title: Multi-Domain Enhancement via Residual Interwoven Transfer in Cross-Domain Sequential Recommendation
Paperid: 1014,   
Authors: Jian Zhou, Yingjie Xie, Cunhang Fan, Huabin Wang, Zhao Lv, Liang Tao
Title: DHGCN: Dual HyperGraph Convolutional Network for EEG-Based Auditory Attention Detection
Paperid: 1015,   
Authors: Xueyi Zhang, Jialu Sun, Chengwei Zhang, Xianghu Yue, Tianfang Xiao, Siqi Cai, Mingrui Lao, Haizhou Li
Title: EventLip: Enhancing Event-Based Lip Reading via Frequency-Aware Spatiotemporal Hypergraph Modeling
Paperid: 1016,   
Authors: teng jin, Ziwen He, Zhangjie Fu, Songping Wang, Yueming Lyu, yufei shi
Title: Frequency Domain Distributed Perturbations: Towards Query-Efficient Black-Box Adversarial Video Attack
Paperid: 1017,   
Authors: Yixin Xu, Hao Wu, Jingzhou Zhu, Fengyuan Xu, Sheng Zhong
Title: PriCAF: Privacy-Preserving Contribution Assessment in Federated Learning Before Model Training
Paperid: 1018,   
Authors: Zhaohui Jiang, Xuening Feng, Tianchi Huang, Ruixiao Zhang, Paul Weng, Yifei Zhu
Title: Progressive Learning with Human Feedback for Personalized Adaptive Video Streaming
Paperid: 1019,   
Authors: Shaohua Liu, Ning Gao, Zuoya Gu, Hongkun Dou, Yue Deng, Hongjue Li
Title: Spatiotemporal Degradation-Aware 3D Gaussian Splatting for Realistic Underwater Scene Reconstruction
Paperid: 1020,   
Authors: YuXin Xie, Dongyue Chen, Yue Zhu, Tong Jia, Shizhuo Deng
Title: Noise-Aware Decoding with Salient Region Enhancing for Zero-Shot Image Captioning
Paperid: 1021,   
Authors: Linxin Xiao, Xin Wang, Zeyang Zhang, Yang Yao, Wenwu Zhu
Title: DyNAS-DDI: Dynamic Pairwise Architecture Search for Generalizable Drug-Drug Interaction LLM
Paperid: 1022,   
Authors: Ting Xiao, Minqian Sun, Yiqing Xia, Zhe Wang
Title: Dual-Prototype Learning in Multiple Instance Learning for Histopathology Image Classification
Paperid: 1023,   
Authors: Hyungjun Doh, Dong Lee, Seunggeun Chi, Pin-Hao Huang, Kwonjoon Lee, Sangpil Kim, Karthik Ramani
Title: Occlusion-Aware and Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction
Paperid: 1024,   
Authors: Hang Yang, Le Hui, Jianjun Qian, Jian Yang, Yigong Zhang, Jin Xie
Title: Cross-View Geometric Collaboration for Generalizable Sparse View Neural Surface Reconstruction
Paperid: 1025,   
Authors: Zhishuo Zhao, Yi Lin, Dongyue Guo, Junyu Fan
Title: AV-RISE: Hierarchical Cross-Modal Denoising for Learning Robust Audio-Visual Speech Representation
Paperid: 1026,   
Authors: Xuan Hai, Xin Liu, Zihao Zhang, Ziyao Yu, Kong Xiangzhen, Song Li, Weina Niu, Rui Zhou, Qingguo Zhou
Title: SiFMimicEvader: Evading Fake Voice Detection with Adversarial Neural Mimicry Attacks
Paperid: 1027,   
Authors: Hong Gao, Xiangkai Xu, Tianqi Zhu, Xiugang Dong, Yiming Bao, Min-Ling Zhang
Title: Radar-Mamba: 4D Millimeter-Wave Point Cloud Enhancement via State Space Models
Paperid: 1028,   
Authors: Min Dang, Gang Liu, Jingqi Zhao, Adams Kong, Nan Luo, Di Wang
Title: DDFD: Diffusion-Based Denoising Fusion for Object Detection in Infrared-Visible Images
Paperid: 1029,   
Authors: Wenpeng Lang, Saihui Hou, Yongzhen Huang
Title: Beyond Sparse Keypoints: Dense Pose Modeling for Robust Gait Recognition
Paperid: 1030,   
Authors: Xiangui Huang, Taotao Lai, Yizhang Liu, Shuyuan Lin, Zuoyong Li
Title: Two-View Correspondence Pruning via Channel-Spatial Interaction and Bidirectional Consensus Interaction
Paperid: 1031,   
Authors: Yadong Huo, Qibing Qin, Wenfeng Zhang, Lei Huang, Jie Nie
Title: Factorized Transformer Hashing with Adaptive Routing for Large-scale Image Retrieval
Paperid: 1032,   
Authors: Guipeng Xv, Xinyu Li, Yi Liu, Chen Lin, Xiaoli Wang
Title: Unveiling the Impact of Multi-modal Content in Multi-modal Recommender Systems
Paperid: 1033,   
Authors: Mingrui Li, Dong Li, Sijia Hu, Kangxu Wang, Zhenjun Zhao, Hongyu Wang
Title: SLAM-X: Generalizable Dynamic Removal for NeRF and Gaussian Splatting SLAM
Paperid: 1034,   
Authors: Yuntian Xiao, Shoulong Zhang, Zihang Zhang, Jiahao Cui, Yan Wang, Shuai Li
Title: Phys4DRT: Physics-based 4D Generation for Real-Time Interaction with Time-Frequency Supervision
Paperid: 1035,   
Authors: Na Li, Zihao Li, Zuoli Tang, Yuqing Yu, Lixin Zou, Chenliang Li
Title: Bridging the Gap: Consistent Image Outpainting via Training-Free Noise Optimization
Paperid: 1036,   
Authors: Jiahuan Long, Wen Yao, Tingsong Jiang, Jiacheng Hou, Shuai Jia, Junqi Wu, xiaoya zhang, Xiaohu Zheng, Chao Ma
Title: CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors
Paperid: 1037,   
Authors: Yingzhen Zhang, Jimin Dai, Qianliang Wu, Jian Yang, lei luo
Title: DNCOT: Diffusion-Cascaded Neural Optimal Transport for Scalable Multi-Domain Image-to-Image Translation
Paperid: 1038,   
Authors: Xiaodong Wang, Hongmin Hu, Fei Yan, Lu junwen, Zhiqiang Zeng, Weidong Hong, Zhedong Zheng
Title: UniAD: Integrating Geometric and Semantic Cues for Unified Anomaly Detection
Paperid: 1039,   
Authors: Mufan Liu, Wu Ran, Zhiquan He, Zuojie Xie, Hong Lu, Peirong Ma
Title: Implicit Retinex Decomposition with Chromaticity Disentanglement for Low-Light Image Enhancement
Paperid: 1040,   
Authors: Quangui He, Jiahui Qu, Wenqian Dong, Song Xiao, Qinghao Gao
Title: Cycle-Consistent Mamba-Based Registration-Fusion Joint Network for Unregistered Hyperspectral Image Super-Resolution
Paperid: 1041,   
Authors: Zhihao Jia, Meiyan Xu, Jingyuan Wang, Ziyu Jia, Yong Li, Xinliang Zhou, Chenyu Liu, Junfeng Yao, Yi Ding
Title: Sera: Separated Coarse-to-fine Representation Alignment for Cross-subject EEG-based Emotion Recognition
Paperid: 1042,   
Authors: Wenrui Liu, Qian Chen, Wen Wang, Guanrou Yang, Weiqin Li, Minghui Fang, Jialong Zuo, Xiaoda Yang, Tao Jin, Jin Xu, Zemin Liu, Yafeng Chen, Bai Jionghao, Zhifang Guo
Title: Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
Paperid: 1043,   
Authors: Yan Zhang, Shiwen He, Lin Yuan, Jiaxu Leng, Xinbo Gao
Title: DichotomyIR: Universal Image Reconstruction via Dichotomy Classification and Uncertainty Elimination
Paperid: 1044,   
Authors: Xiong Li, Yikang Yan, Zhenyu Wen, Qin Yuan, Fangda Guo, Zhen Hong, Ye Yuan
Title: Open3DSearch: Zero-Shot Precise Retrieval of 3D Shapes Using Text Descriptions
Paperid: 1045,   
Authors: Hui Zhang, Yiteng Xu, Yonglin Tian, Yidong Li, Tiago Falk, Fei-Yue Wang
Title: Selective Shift: Towards Personalized Domain Adaptation in Multi-Agent Collaborative Perception
Paperid: 1046,   
Authors: Yiqiang Guo, Lei Zhong, Bin Chen, Jia-Li Yin, Xiaolei Liu, Shouling Ji
Title: Focus on Generalization: Improving Adversarial Transferability via Bi-Level Bias Mitigation
Paperid: 1047,   
Authors: Lei Liu, Xiangdong Su, Guanglai Gao
Title: Fourier Self-Adaptation for Transferring General Pretrained Models to Specific Domains
Paperid: 1048,   
Authors: Xiaorui Ding, Huan Ma, Changqing Zhang
Title: A Theoretical Proof of Dynamic Multimodal Fusion Exacerbates Modality Greedy
Paperid: 1049,   
Authors: He Wang, Longquan Dai, Shihao Pu, Shaomeng Wang, Jinhui Tang
Title: Generative Semantic Probing for Vision-Language Models via Hierarchical Feature Optimization
Paperid: 1050,   
Authors: Yi Han, Yaochen Li, Peijun Chen, Wenlong Zhou, Jinhuo Yang, Jintao Chang
Title: SVDGNet: Shapley Value-Based Weight Adjustment for Unsupervised Image Style Transfer
Paperid: 1051,   
Authors: Zizhuo Li, Chunbao Su, Fan Fan, Jun Huang, Jiayi Ma
Title: CorrNeXt: Making the ConvNet-Style Correspondence Pruner Stronger for Two-View Geometry
Paperid: 1052,   
Authors: Yijun Wang, Siying Wu, Lubin Gan, Zheyu Zhang, Jing Zhang, Zhangchi Hu, Huyue Zhu, Peixi Wu, Xiaoyan Sun
Title: MeDKCoOp: Dual Knowledge-guided Graph Prompt Learning for Biomedical Vision-Language Models
Paperid: 1053,   
Authors: SiyuanHuang SiyuanHuang, Jiahui Jin, Xin Lin, Xigang Sun, Yukun Ban
Title: IM-POI: Bridging ID and Multi-modal Gaps in Next POI Recommendation
Paperid: 1054,   
Authors: Jiaqi Hou, Kewei Zhang, Tianyu Yang, Chengyu Jia, Qiqi Lin, Hui Wei, Zheng Wang
Title: FAB-Attack: Fabric-Aware Adversarial Attacks on Person Detectors under Motion Blur
Paperid: 1055,   
Authors: Jin Han, Yixin Yang, Zhan Zhan, Boxin Shi, Imari Sato
Title: EDeF-Net: Spatio-temporal Association Network for Flicker Removal in Event Streams
Paperid: 1056,   
Authors: Yanwei Xie, Weizhi Nie, Lanjun Wang, Hongshuo Tian, Changtai Shi, Anan Liu
Title: When Headlines Meet Minds: Empowering News Recommendations with Social Simulator
Paperid: 1057,   
Authors: Abel Chai, Kelly Jee, Sue Han Lee, Tay Siang, Jules Vandeputte, Hervé Goeau, Pierre Bonnet, Alexis Joly
Title: Deep-Plant-Disease Dataset Is All You Need for Plant Disease Identification
Paperid: 1058,   
Authors: Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han
Title: VER-Bench: Evaluating Multimodal LLM on Reasoning with Fine-Grained Visual Evidence
Paperid: 1059,   
Authors: Zihao Wang, Shulei Ji, Le Ma, Yuhang Jin, Shun Lei, Jianyi Chen, Haoying Fu, Roger Dannenberg, Kejun Zhang
Title: Multi-Accent Mandarin Dry-Vocal Singing Dataset: Benchmark for Singing Accent Recognition
Paperid: 1060,   
Authors: Zhihua Wang, Weixia Zhang, Wei Zhou, Xiaohong Liu, Guangtao Zhai, Patrick Callet
Title: Evaluating Perceptual Color Preferences in Smartphone Photography: Dataset and Challenges
Paperid: 1061,   
Authors: Liang Cheng, Hao Wang, Chenwei Wu, Haochen You, Xianhao Wu
Title: Unlocking Joint Image Deraining and Low-Light Enhancement: Benchmark and Baseline
Paperid: 1062,   
Authors: Felix Immohr, Gareth Rendle, Annika Neidhardt, Anton Lammert, Bernd Froehlich, Alexander Raake
Title: ICS-MR: Interactive Conversation Scenarios for Assessment of Mixed Reality Communication
Paperid: 1063,   
Authors: Mohammad Ghasempour, Hadi Amirpour, Christian Timmerer
Title: Nature-1k: The Raw Beauty of Nature in 4K at 60FPS
Paperid: 1064,   
Authors: Sowmya Vijayakumar, Tong Xue, Abdallah Ali, Irene Viola, Ronan Flynn, Peter Corcoran, Pablo Cesar, Niall Murray
Title: RCQoEA-360VR: Real-time Continuous QoE scores for HMD-based 360° VR dataset
Paperid: 1065,   
Authors: Tariq Shoura, Ali Dehaghi, Reza Razavi, Mohammad Moshirpour
Title: VIDEA-8K-60FPS Dataset: 8K 60FPS Video Sequences for Analysis and Development
Paperid: 1066,   
Authors: Linlin Zong, Shilin Sui, Wenjun Liang, Wanyu song, LINLIN TIAN, Xinyue Liu, Xianchao Zhang, Bo Xu
Title: CH-SV: A Benchmark for Multi-Type Chinese Harmful Short Video Detection
Paperid: 1067,   
Authors: Chenxi Wang, Yusheng Dai, Lei Sun, Jun Du, Jianqing Gao
Title: AudioAtlas: A Comprehensive and Balanced Benchmark Towards Movie-Oriented Text-to-Audio Generation
Paperid: 1068,   
Authors: Sijing Wu, Yunhao Li, Huiyu Duan, Yanwei Jiang, Yucheng Zhu, Guangtao Zhai
Title: HVEval: Towards Unified Evaluation of Human-Centric Video Generation and Understanding
Paperid: 1069,   
Authors: Ming Li, Yupeng Hu, Yinwei Wei, Hao Liu, Haocong Wang, Weili Guan
Title: DCount: Decoupled Spatial Perception and Attribute Discrimination for Referring Expression Counting
Paperid: 1070,   
Authors: Zeyan Li, Cankun Guo, Yin Tang
Title: Modal Symbiosis: Variational Alignment Unveils New Horizons in Multimodal Representation Learning
Paperid: 1071,   
Authors: Yue He, Jingxi Xie, Fengling Li, Lei Zhu, Jingjing Li
Title: Flip is Better than Noise: Unbiased Interest Generation for Multimedia Recommendation
Paperid: 1072,   
Authors: Wenyu Yin, Shuyuan Lin, David Suter, Hanzi Wang
Title: Adaptive Graph Attention-Guided Parallel Sampling and Embedded Selection for Multi-Model Fitting
Paperid: 1073,   
Authors: Renxiang Guan, Junhong Li, Siwei Wang, Wenxuan Tu, Miaomiao Li, En Zhu, Xinwang Liu, Chenping
Title: Multi-view Graph Clustering with Dual Relation Optimization for Remote Sensing Data
Paperid: 1074,   
Authors: Ren Wang, Xin Wang, Tongtong Feng, Xinyue Gong, Guangyao Li, Yu-Wei Zhan, Qing Li, Wenwu Zhu
Title: Improving Compositional Generalization in Cross-Embodiment Learning via Mixture of Disentangled Prototypes
Paperid: 1075,   
Authors: Liqi Yan, Xuebin Li, Jianhui Zhang, Fangli Guan, Kanglei Peng, Pan Li
Title: F-DDIM: A Featurized Denoising Diffusion Implicit Model for Facial Image Steganography
Paperid: 1076,   
Authors: Min Tan, Guanhao Liu, Huijing Zhan, Yuyu Yin, Zhou Yu, Jiajun Ding, Yinfu FENG
Title: DiSCo: Disentangled Attribute Manipulation Retrieval via Semantic Reconstruction and Consistency Regularization
Paperid: 1077,   
Authors: De Li, LQY, Zhou Tan, Zeming Gan, Tiange Xia, Xianxian LI, Jinyan Wang
Title: FedRog: Robust Federated Graph Classification for Strong Heterogeneity and High-Noise Scenarios
Paperid: 1078,   
Authors: KAI ZHU, Jun Yin
Title: Neighbor Contrastive Learning with Weakened Consensus Graph for Deep Multi-View Clustering
Paperid: 1079,   
Authors: Te Song, Lianyong Qi, Weiming Liu, Fan Wang, Xiaolong Xu, Hongsheng Hu, Yang Cao, Xuyun Zhang, Amin Beheshti
Title: Boosting Guided Diffusion with Large Language Models for Multimodal Sequential Recommendation
Paperid: 1080,   
Authors: Hongtao Wu, Yifeng Wu, Jia-Xuan Jiang, Wu Chengyu, Hong Wang, Yefeng Zheng
Title: SAMVSR: Leveraging Semantic Priors to Zone-Focused Mamba for Video Snow Removal
Paperid: 1081,   
Authors: Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang WU, Ji Liu, Andre Freitas, Qifan Wang, Zenglin Xu, Rongjunchen Zhang, Yong Dai
Title: NEXUS-O: An Omni-Perceptive and -Interactive Model for Language, Audio, and Vision
Paperid: 1082,   
Authors: Daixun Li, Sibo He, Jiayun Tian, Yusi Zhang, Weiying Xie, Mingxiang Cao, donglai Liu, Zirui Li, Tianlin Hui, Rui Huang, Yunsong Li
Title: Uni-Sight: An E2E Vision-Language-Action System Unifying Multi-View Alignment and Multi-Modal Fusion
Paperid: 1083,   
Authors: Na Jiang, Wenhui Zheng, Xuqian Gu, Jingjing Wang
Title: OmniDoctor: Towards LLM-centric Lifelong Learning for New Emerging Medical VQA Tasks
Paperid: 1084,   
Authors: Eungi Lee, Jae Yoon, Seok Bong Yoo
Title: SCOL: Style Code Orchestration in Latent Space for Proactive Face-Swapping Defense
Paperid: 1085,   
Authors: Jielong Lu, Zhihao Wu, Jiajun Yu, QIANQIAN SHEN, Jiajun Bu, Haishuai Wang
Title: Where Views Meet Curves: Virtual Anchors for Hyperbolic Multi-View Graph Diffusion
Paperid: 1086,   
Authors: Qi He, Xiao Wu, Jun-Yan He, Wei Li, Zhaoquan Yuan
Title: DualEnhance: External Multimodal Foundation Models Guidance and Internal Fast-Slow Teacher Regulation
Paperid: 1087,   
Authors: Shuang Hao, Pengfei Ren, Lei Zhang, Haifeng Sun, Pan Ting, Menghao Zhang, Cong Liu, Qi Qi, Jianxin Liao, Jing-Yu Wang
Title: A Dual-Branch 3D Spatial-Aware Latent Diffusion for Realistic Depth Image Synthesis
Paperid: 1088,   
Authors: Zhuo Su, Jufeng Li, Yan Zhang, Xin Li, Fuwei Zhang, Yuxin Feng, Fan Zhou
Title: Breaking the Synthetic Barrier: Towards Stable and Generalizable Real-World Image Dehazing
Paperid: 1089,   
Authors: Linxuan Luo, Pan Mu, Cong Bai
Title: Physics-Coupled Frequency Dynamic Adaptation Network for Domain Generalized Underwater Object Detection
Paperid: 1090,   
Authors: Yujia Zhu, Hao Yang, Yibo Zhao, Chunjie Ma, Weili Guan, Zan Gao
Title: Lightweight Relational Proposal Network with Dual-Branch Distillation for Video Moment Retrieval
Paperid: 1091,   
Authors: Yiliang Zhu, Dayan Wu, Qinghang Su, Zexian Yang, Zheng Lin, Weiping Wang
Title: Mitigating the Evolving Semantic Entanglement in Continual Learning of Vision-Language Models
Paperid: 1092,   
Authors: Fangli Ying, Zhihong Zhang, Liting Zhou, Cathal Gurrin, Jinhai Wang
Title: Identity-Preserving Facial Aesthetic Enhancement via Hierarchical Prompt Learning and Pivotal Tuning
Paperid: 1093,   
Authors: Tong Chen, Bowen Du, Jiejie Zhao, Hanyang Xia, Haiquan Wang, Jiakai Wang
Title: BadMDA: Towards Backdoor Injection during Domain Adaptation to Collapse Multi-Agent Perception
Paperid: 1094,   
Authors: Yiru Li, Yingying Zhu
Title: PLGeo: A Patch-level Framework to Overcome Orientation Discrepancies in Cross-view Geo-localization
Paperid: 1095,   
Authors: Jinbao Wei, Yuhang Chen, Zhijie Wang, Gang Yang, Shimin Tao, Jian Gao, Aiping Liu, Xun Chen
Title: Rethinking Diffusion Bridge Model with Dual Alignments for Medical Image Synthesis
Paperid: 1096,   
Authors: Wangsheng He, Wanru Xu, Ping Guo, Zhenjiang Miao, Yi Tian
Title: InstructStep: Fine-Grained Localization of Step Content and Relation in Instructional Video
Paperid: 1097,   
Authors: Yuqi Chen, Xiubo Liang, Yu Zhao, Hongzhi Wang, Weidong Geng
Title: S²-Edit3DV: Diffusion-Guided Style Meets Structure for Consistent Multi-View 3D Video Generation
Paperid: 1098,   
Authors: Xiubo Liang, Hongzhi Wang, Zigen Li, Jinxing Han, Yu Zhao, Weidong Geng
Title: SGM-Transformer: Rethinking Gradient Information Loss and Compensation in Spiking Neural Networks
Paperid: 1099,   
Authors: Hongchen Wei, Zhenzhong Chen
Title: RealVG: Unleashing MLLMs for Training-Free Spatio-Temporal Video Grounding in the Wild
Paperid: 1100,   
Authors: Minho Park, Young Jo, Jae-Hyeok Lee, Ji Lee, Dong-oh Kang, Yong Man Ro
Title: Focus Where It Matters: LLM-Guided Regional Identification for Instruction-based Image Editing
Paperid: 1101,   
Authors: Xiaohang Zhang, Hui Gao, Bo Zhang, Xiao Chen, Kun Niu, Tan Yang, Wufan Wang, Wendong Wang
Title: Monocular Vision-based Fast 3D Mapping and Multi-Agent-Assisted Trajectory Planning for UAVs
Paperid: 1102,   
Authors: Zeyang Bai, Yunbiao Wang, Dongbo Yu, Jun Xiao, Lupeng Liu
Title: GraphSplat: Sparse-View Generalizable 3D Gaussian Splatting is Worth Graph of Nodes
Paperid: 1103,   
Authors: Yuxuan Xiong, Ye Chen, Yue Shi, Zhangli Hu, Bingbing Ni
Title: Rig-Reconstruct-Render (R³3D): Collaborative Representation for Editable and Skeleton-Drivable 3D Asset Generation
Paperid: 1104,   
Authors: Jiale Zou, Yan Chen, Bingbing Jiang, Peng Zhou, Liang Du, Lei Duan, Yuhua Qian
Title: Robust Tensor Learning with Graph Diffusion for Scalable Multi-view Graph Clustering
Paperid: 1105,   
Authors: Shengze Shi, Tao Ren, Guoliang Zhu, Guandong Feng, JUN HU
Title: Closing the Feedback Loop in Text2Vis: Refining Visualization with Vision-Language Models
Paperid: 1106,   
Authors: Zhibing Zhang, Jiantao Lin, Cangqi Zhou, Rui Xia
Title: MPPR: Memory-Prior-based Prompt Refinement in Continuous Space for Advanced Text-to-Image Generation
Paperid: 1107,   
Authors: Seung-gyeom Kim, Areum Kim, Eunchae Kim, Minho Chung, Yongjae Yoo
Title: Automatic Accessible Multimodal Translation of Graphics Using A Refreshable Pin Array
Paperid: 1108,   
Authors: Jieyi Ge, Zhaodong Sun, Wei Peng, Chenhang Ying, Yuwei Chen, Kui Ren, Xiaobai Li
Title: Evidential Remote Physiological Measurement via Uncertainty-aware Fusion of Video and RF
Paperid: 1109,   
Authors: Runze Zhao, Fuqing Zhu, Jizhong Han, Songlin Hu
Title: Visual Perception Uncertainty Learning for Hallucination Detection in Large Vision-Language Models
Paperid: 1110,   
Authors: Yan Wang, Qindong Sun, Dongzhu Rong
Title: Audio-Visual Asynchrony Mitigation: Cross-Modal Alignment and Feature Reconstruction for Deepfake Detection
Paperid: 1111,   
Authors: Jiayi Zeng, Tao Ren, Changhu Wang, Yifan Wang, Wei Ju, Zhipeng Sun, Xiao Luo
Title: DATE: Dual Prompt Learning with Information Bottleneck for Graph Out-of-distribution Generalization
Paperid: 1112,   
Authors: Songning Lai, Ninghui Feng, Jiechao Gao, Hao Wang, Haochen Sui, Xin Zou, Jiayu Yang, Wenshuo Chen, Lijie Hu, Hang Zhao, Xuming Hu, Yutao Yue
Title: From Guesswork to Guarantee: Towards Faithful Multimedia Web Forecasting with TimeSieve
Paperid: 1113,   
Authors: Lingren Wang, Wenxuan Tu, Jieren Cheng, Jianan Wang, Xiangyan Tang, Chenchen Wang
Title: Discovering Maximum Frequency Consensus: Lightweight Federated Learning for Medical Image Segmentation
Paperid: 1114,   
Authors: Hongyan Xu, Zhongze Wu, Ang He, Xi Lin, Yi Chen, Xiu Su
Title: Addressing Granularity-induced Semantic Drift in OvOD via Graph-guided semantically consistent representation
Paperid: 1115,   
Authors: Jiaming Liang, Chi-Man Pun
Title: I-C Attack: In-place and Cross-pixel Augmentations for Highly Transferable Transformation-based Attacks
Paperid: 1116,   
Authors: Xin Peng, Bowen Liu, Renxiang Guan, Wenxuan Tu
Title: Multi-view Graph Clustering with Dual Structure Awareness for Remote Sensing Data
Paperid: 1117,   
Authors: Lihong Qiao, ShiYi Gao, Yucheng Shu, Bin Xiao, Weisheng Li, Xinbo Gao
Title: Pathology-Aware Reconstruction with Discriminative Knowledge Boosting Alignment for Chest X-ray Vision-Language Pre-training
Paperid: 1118,   
Authors: Yiming Li, Peng Zhou, Xiaokang Qin, Hongwei Hu, Jun Sun, Yi Xu
Title: Position-LoRA: Enhanced Relation Customization through Structural Prior in Initial Latent Noise
Paperid: 1119,   
Authors: Yuhan Jing, Bo He, Haifeng Sun, Qi Qi, Zirui Zhuang, Lei Zhang, Jianxin Liao, Jing-Yu Wang
Title: Foresail: LLM Sensor Knowledge Empowered Status-guided Network for Multivariate Time-series Classification
Paperid: 1120,   
Authors: Jipeng Liu, Haichao Shi, Yaru Zhang, Xiao-Yu Zhang
Title: Knowledge Negative Distillation: Circumventing Overfitting to Unlock More Generalizable Deepfake Detection
Paperid: 1121,   
Authors: Chengzhe Wang, Wenqing Ji, Chenyang Li, Tongjie Pan, Yalan Ye
Title: Toward Reliable Emotion Recognition: Alleviating Label Noise and Reducing Uncertain Prediction
Paperid: 1122,   
Authors: Changjuan Ran, FANG LIU, Runqi Fang, Xiangyu Meng, Shenglan Cui, Yunfan Ye
Title: Where Watermark Meets Beauty: Expert-Guided Aesthetic Visible Watermarking for Digital Artworks
Paperid: 1123,   
Authors: Yifei Deng, Chenglong Li, Futian WANG, Jin Tang
Title: Learning Hierarchical Cross-modal Association with Intra-modal Context for Text-Image Person Retrieval
Paperid: 1124,   
Authors: Mingliang Zhai, Yiheng Wang, Haidong Hu, Chi-Man Pun, Hao Gao
Title: FGRFlow: Learning Fine-Grained Rigidity Scene Flow from 4D Radar Point Cloud
Paperid: 1125,   
Authors: Hanzhe Yu, Yun Ye, Jintao Rong, Qi Xuan, Chen Ma
Title: RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images
Paperid: 1126,   
Authors: Shan Wang, Weisi Lin, Yun Liu, Libao Zhang
Title: CLIP-HNet: Hybrid Network with Cross-Modal Guidance for Self-Supervised Remote Sensing Dehazing
Paperid: 1127,   
Authors: Kun Cheng, Qibing Qin, Wenfeng Zhang, Lei Huang, Jie Nie
Title: Deep Probabilistic Binary Embedding via Learning Reliable Uncertainty for Cross-Modal Retrieval
Paperid: 1128,   
Authors: Gwonjung Kim, Duyeol Lee, Jaehong Yang, Chae Eun Rhee
Title: See Through the Occlusions: Amodal Gaussian Splatting for Few-Shot 3D Reconstruction
Paperid: 1129,   
Authors: Xinyu Zhang, Lingling Zhang, Yanrui Wu, Muye Huang, Jun Liu
Title: Cognitive Predictive Coding Network: Rethinking the Generalization in Raven’s Progressive Matrices
Paperid: 1130,   
Authors: Haonan Cheng, Junwei Zhang, Hengyan Huang, Long Ye
Title: FG-Midiformer: A Symbolic Music Understanding Model towards Fine-Grained Learning of Multi-Attributes
Paperid: 1131,   
Authors: Dongdong Hu, Yang Zhou, Xiaofeng Huang, Haibing Yin, Zhu Li
Title: Sparse4DGS: Flow-Geometry Assisted 4D Gaussian Splatting for Dynamic Sparse View Synthesis
Paperid: 1132,   
Authors: Zhe Sun, Qiang Xu, Qi Zhang, Shan Liu, Ge Li
Title: Overfitted Point Cloud Attribute Codec Using Sparse Hierarchical Implicit Neural Representations
Paperid: 1133,   
Authors: Liang Yue, Shao-Kui Zhang, Lin Yuan, Yi-Tao Chen, Zirui Zhou, Song-Hai Zhang
Title: Synthesizing 3D Scenes via Diffusion Model that Incorporates Indoor Scene Characteristics
Paperid: 1134,   
Authors: Beijing Chen, Yuting Hong, Ziqiang Li, Zhangjie Fu
Title: DFPD: Dual-Forgery Proactive Defense against Both Deepfakes and Traditional Image Manipulations
Paperid: 1135,   
Authors: Hua Li, Gaowei Lin, Zhiyuan Li, Sam Kwong, Runmin Cong
Title: FSCDiff: Frequency-Spatial Entangled Conditional Diffusion model for Underwater Salient Object Detection
Paperid: 1136,   
Authors: Liang Zhao, Shubin Ma, Bo Xu, Qingchen Zhang
Title: Dual-Learning Based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data
Paperid: 1137,   
Authors: Hanling Wang, Qing Li, Li Chen, Haidong Kang, Fei Ma, Yong Jiang
Title: HoloTrace: LLM-based Bidirectional Causal Knowledge Graph for Edge-Cloud Video Anomaly Detection
Paperid: 1138,   
Authors: Naisong Luo, Yuan Wang, Yuwen Pan, Rui Sun
Title: Focus on the Object: Gradient-based Feature Modulation for Camouflaged Object Segmentation
Paperid: 1139,   
Authors: Zhaolin Wei, Xiuwen Shi, Dengpan Ye, Yuhan Lin, Zhigang Wang, JiaCheng Deng, Ziyi Liu, Long Tang
Title: PhonoFence: A Cross-Task Defense Framework for DeepFake via Phoneme-Level Adversarial Perturbations
Paperid: 1140,   
Authors: Ziying Tan, Linbo Luo, Haiyan Yin, Yew-Soon Ong, Wentong Cai
Title: Crowd Dynamics Demand Adaptivity: Self-Adaptive Physics-Informed Neural Network for Crowd Simulation
Paperid: 1141,   
Authors: Sensen Wang, Yuehu Liu, Chi Zhang
Title: BiOMamba: Mamba-based Forward-Then-Backward Temporal Modeling for Online Action Detection and Anticipation
Paperid: 1142,   
Authors: Shuai Huang, Yongxiong Wang, Huan Luo, Haodong Jing, Chendong Qin, Jingqun Tang
Title: MINDEV: Multi-modal Integrated Diffusion Framework for Video Reconstruction from EEG Signals
Paperid: 1143,   
Authors: Jiateng Liu, Hengcan Shi, Haiwen Liang, Xiaolin Xu, Yuan Zong, Yaonan Wang, Wenming ZHENG
Title: NaME: A Natural Micro-expression Dataset for Micro-expression Recognition in the Wild
Paperid: 1144,   
Authors: Wenpeng Mu, Zheng Li, Qiang Xu, Xinghao Jiang, Tanfeng Sun
Title: ExDA: Towards Universal Detection and Plug-and-Play Attribution of AI-Generated Ex-Regulatory Images
Paperid: 1145,   
Authors: Fenghua Yu, Jianwen Sun, Qian Wan, Meicheng Chen, Xiaoxuan Shen, Qing Li
Title: DiffuQKT: A Diffusion-Based Approach for Improved Question Representation in Knowledge Tracing
Paperid: 1146,   
Authors: Shalayiding Sirejiding, Yue Ding, Yuxiang Lu, Xinyi Hou, Shaokai Wu, Qichen He, Chunlin Wang, Wenqiang GUO, Hongtao Lu
Title: CLIP-MT: Multi-Modal Knowledge-Driven Adaptive Scale Feature Allocation for Multi-Task Dense Prediction
Paperid: 1147,   
Authors: Ruiqi Li, Yiu-ming Cheung
Title: Modeling and Identifying Distractors with Curriculum for Robust 3D Gaussian Splatting
Paperid: 1148,   
Authors: Haotian Gan, Yudong Li, Wanyue Li, Weidong Tang
Title: Aligned or Apart? Multi-Agent Insights into Consumer and Brand Messaging Discrepancies
Paperid: 1149,   
Authors: Songze Li, Yunfei Guo, Shen Chen, Bin Li, Kaiqing Lin, Changsheng Chen, Haodong Li, Taiping Yao, Shouhong Ding
Title: DITL²: Dual-Stage Invariance Transfer Learning for Generalizable Document Image Tampering Localization
Paperid: 1150,   
Authors: Mengzhen Wang, Xunbin Huang, Jiayuan Xie, Shukai Ma, Jiale Men, DaYong Liang, Yi Cai
Title: From Model Diagram to Code: A Benchmark Dataset and Multi-Agent Framework
Paperid: 1151,   
Authors: Rongqiang Fang, Yongqi Sun, Jidong Yuan, hongbo cao, Jinkun Dong
Title: A Language-Assisted Semantic-Aware Disentangled Method for Link Prediction on Heterogeneous Graphs
Paperid: 1152,   
Authors: Nokap Park
Title: M2PE-Diff: Music-to-Pose Encoder for Dance Video Generation Leveraging Latent Diffusion Framework
Paperid: 1153,   
Authors: Hancong Wang, Yue Yu, Hairong Zheng, Tong Zhang
Title: Test-Time Adaptation of Medical Vision-Language Models with Mixture of Modality Experts
Paperid: 1154,   
Authors: Haichao Sha, Yuncheng Wu, Ruixuan Liu, Yang Cao, Hong Chen
Title: Differentially Private Visual Learning with Public Subspace Augmented by Synthetic Data
Paperid: 1155,   
Authors: Yu Liu, Kun Sun, Chang Tang, Yuhua Qian, Xin Li
Title: TPDepth: Leveraging Text Prompts with ControlNet to Boost Diffusion-based Depth Estimation
Paperid: 1156,   
Authors: Yidong Chen, Qi Li, Yuyang Yang, Wen Li, Sheng Ao, Cheng Wang
Title: Unleashing the Power of Data Generation in One-Pass Outdoor LiDAR Localization
Paperid: 1157,   
Authors: Yunqiang Pei, Hongrong yang, Kaiyue Zhang, Guoqing Wang, Peng Wang, Chaoning Zhang, Yang Yang, Heng Tao Shen
Title: InteractGuide: LLM-Enhanced Multimodal Reasoning for User-Centric Interaction Recommendations in AR-HRI Authoring
Paperid: 1158,   
Authors: Xuanming Jiang, Baoyi An, Zhengwei Zou, DingYu Nie, Jialie Shen, Xueming Qian, Guoshuai Zhao
Title: Ear with Eye: Lightweight Multimodal Audio-Visual Network Inspired by Bionic Structures
Paperid: 1159,   
Authors: Guoyi Li, Die Hu, Xiaomeng Fu, Qirui Tang, Yulei Wu, Xiaodan Zhang, Honglei Lyu
Title: Entity Graph Alignment and Visual Reasoning for Multimodal Fake News Detection
Paperid: 1160,   
Authors: Le Liu, Shizhou Zhang, Di Xu
Title: SUVIS: A Depth- and Motion-Encoded Stereoscopic System for Communicating Forecast Uncertainty
Paperid: 1161,   
Authors: Jingyuan Fang, Yang Ning, Xiushan Nie, Xinfeng Liu, Zhiyong Cheng
Title: VLHP: Learning Discriminative Vision-Language Hybrid Prototypes for Weakly Supervised Semantic Segmentation
Paperid: 1162,   
Authors: Xiangping Zheng, Xuan Feng, Bo Wu, Bin Ren, Wei Li, Xiuxin Hao, Xun Liang, Bin Tang, Zhiwen Yu
Title: Breaking Semantic Barriers: A Zero-Shot Generalized Framework for Graph Anomaly Detection
Paperid: 1163,   
Authors: Zhenyu Xu, Junjie Wu, Zhiyan Piao, Xiaoqi Sheng, Yu Xiao, Xinyu Zhang
Title: AnyStyleDiffusion: Flexible Style Transfer with Consistent Content Adaptation Across Diffusion Models
Paperid: 1164,   
Authors: Baoquan Zhao, Xiaofan Ma, Qianshi Pang, Ruomei Wang, Fan Zhou, Shujin Lin
Title: VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations
Paperid: 1165,   
Authors: Haifeng Zhao, Shuo Xu, Leilei Ma, Yufei Zhang, Lei Wang, Dengdi Sun
Title: Towards Space and Semantics: Object-Purified Representation Learning for Multi-Label Image Classification
Paperid: 1166,   
Authors: Songpei Xu, Xuri Ge, Chaitanya Kaul, Roderick Murray-Smith
Title: HandSolo: A Mid-Air Hand Pose Interaction Method Based on Disentangled Degrees-of-Hand-Freedom
Paperid: 1167,   
Authors: Guangfei Li, Quanxue Gao, Yu Lei, Yichen Bao, Qianqian Wang
Title: Multi-view Collaborative Representation Learning from Noisy Labels for VHR Imagery Classification
Paperid: 1168,   
Authors: Mingsong Yang, Xinhong Hei, Kehai Chen, Haining Meng, HaoYang Dong, Qin zhao
Title: BIMCompNet: Multimodal Dataset for Geometric Deep Learning in Building Information Model
Paperid: 1169,   
Authors: Zhixia Zhao, Qiyue Li, Jie Li, Richang Hong, Zhi Liu
Title: ViewGauss: A Head Movement Dataset for 6DoF Gaussian Splatting Video Viewing
Paperid: 1170,   
Authors: Peirong Zhang, Yidan Zhang, Hanru Shi, Dianyu Wang, Xiaoxuan Liu, Lei Wang
Title: Referring Multi-Object Tracking in Satellite Videos: A New Benchmark and Baseline
Paperid: 1171,   
Authors: Shifu Xiong, HangChen, Shi Cheng, Kai Shen, Hengshun Zhou, Genshun Wan, Chenyue Zhang, Kewei Li, Jun Du, Lirong Dai
Title: MISP-QEKS: A Large-Scale Dataset with Multimodal Cues for Query-by-Example Keyword Spotting
Paperid: 1172,   
Authors: Keyue Shi, QIANQIAN SHEN, Zhaoming Ye, liangjun jiang, Jiajun Bu, Haishuai Wang
Title: LUMOS Dataset: Lumbar Multimodal Osteoporosis Screening with X-ray and CT images
Paperid: 1173,   
Authors: Wenxu Gao, Liang Xie, Kangli Wang, Jingxuan Su, Changhao Peng, Wei Gao
Title: DPCSet: A Large-scale Dynamic Point Cloud Dataset for Compression and Perception
Paperid: 1174,   
Authors: Zihou Zhang, Hao Li, Zhengwei Yang, Zechao Hu, Liang Li, Zheng Wang
Title: From Language to Instance: Generative Visual Prompting for Zero-shot Camouflaged Object Detection
Paperid: 1175,   
Authors: Dongyang Ma, Zhengyu Ma, Wei Zhang, Yonghong Tian
Title: DSF-Net: Dynamic Sparse Fusion of Event-RGB via Spike-Triggered Attention for High-Speed Detection
Paperid: 1176,   
Authors: Lin Wu, Wei Wei, Peizhuo Yu, Jianglin Lan
Title: Open-Vocabulary 3D Affordance Understanding via Functional Text Enhancement and Multilevel Representation Alignment
Paperid: 1177,   
Authors: Shanghui Deng, Xiao Zheng, Chang Tang, Kun Sun, Yuanyuan Liu, Xinwang Liu
Title: Find True Collaborators: Banzhaf Index-based Cross View Alignment for Partially View-aligned Clustering
Paperid: 1178,   
Authors: Jiehua Zhang, Liang Li, Chenggang Yan, Wei Ke, Yihong Gong
Title: Frequency-aware Correlation Discovering and Spatial Forgery Clue Distilling for Synthetic Image Detection
Paperid: 1179,   
Authors: Chenglong Sun, Shijie Pang, Yuzheng Wang, Lizhe Qi
Title: RWKV3D: An RWKV-Based Model with Multiple Training Strategies for Point Cloud Analysis
Paperid: 1180,   
Authors: Longquan Dai, He Wang, Xiaolu Wei, Shaomeng Wang, Jinhui Tang
Title: Conducting Conditional Diffusion by Estimating the Mean Vector of von Mises-Fisher Distribution
Paperid: 1181,   
Authors: Yang Yu, Meiyu Liang, Wei Huang, Juncheng Zheng, Kangkang Lu, Yawen Li, Junping Du, Zhe Xue, Wu Liu
Title: Asymmetric Pre-aligned Anchor Contrastive Enhanced Diffusion Hashing Model for Incomplete Multimodal Retrieval
Paperid: 1182,   
Authors: Pengfei Ren, Jing-Yu Wang, Haifeng Sun, Qi Qi, Jing Wang, Jianxin Liao
Title: Rule Meets Learning: Confidence-Aware Multi-View Fusion for Self-Supervised 3D Hand Pose Estimation
Paperid: 1183,   
Authors: Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Wenzhen Yuan, Ting Liu, yuzhuo fu
Title: MPI-CD: Multi-Path Information Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Paperid: 1184,   
Authors: Qian Li, Siyuan Liang, Yuzheng Zhang, Cheng Ji, Zongyu Chang, Shangguang Wang
Title: Meta-Knowledge Path Augmentation for Multi-Hop Reasoning on Satellite Commonsense Multi-Modal Knowledge Graphs
Paperid: 1185,   
Authors: Fangxin Liu, Junjie Wang, Ning Yang, Zongwu Wang, Junping Zhao, Li Jiang, Haibing Guan
Title: ASTER: Adaptive Dynamic Layer-Skipping for Efficient Transformer Inference via Markov Decision Process
Paperid: 1186,   
Authors: Chang Su, Beihong Jin, Fusang Zhang, Siheng Li, Zhi Wang
Title: Self-Supervised Human Mesh Recovery from Partial Point Cloud via a Self-Improving Loop
Paperid: 1187,   
Authors: Runqi Wang, Caoyuan Ma, Jian Zhao, Hanrui Xu, Dongfang Sun, Haoyang Chen, Lin Xiong, Zheng Wang, Xuelong Li
Title: Leader is Guided: Interactive Motion Generation via Lead-Follow Paradigm and Trajectory Guidance
Paperid: 1188,   
Authors: Chuan Zeng, Zhao Zhang, Wei Huang, Lei Zhang, Le Yi, kefu zhao
Title: DC²-SR: A Dual-Consistency Guided Curriculum Learning method for Thick-Slice Fetal MRI Super-Resolution
Paperid: 1189,   
Authors: Bingfeng Liu, Songwei Pei, Shuhuai Wang, Wenzheng Yang, Qian Li, Shangguang Wang
Title: Prior-Constrained Relevant Feature driven Image Fusion with Hybrid Feature via Mode Decomposition
Paperid: 1190,   
Authors: Dahao Fu, Jiangqun Ni, Jian Zhang
Title: JPEG-RAE: Reversible Adversarial Example for Privacy and Copyright Protection of JPEG Images
Paperid: 1191,   
Authors: Ziqiang Shi, Rujie Liu, Jun Takahashi, Shan Jiang
Title: TrueCount: Improving Open-World Object Counting with Visual-Language Models and Dynamic Multi-Modal Inputs
Paperid: 1192,   
Authors: Yan-Kai Liu, Shunyang Yao, Tao Xi, Bao-liang Lu, Wei-Long Zheng
Title: Human vs AI: How Digital Human News Anchors Affect Our Cognitive Processes?
Paperid: 1193,   
Authors: Shibei Meng, Saihui Hou, Yang Fu, Xuecai Hu, Junzhou Huang, Yongzhen Huang
Title: Seeing from Magic Mirror: Contrastive Learning from Reconstruction for Pose-based Gait Recognition
Paperid: 1194,   
Authors: Zhiqian Xia, Haifeng Xia, Shichao Jin, Wei Wang, Zhengming Ding, Xiaochun Cao
Title: DSPF: Dual-Stage Preservation and Fusion for Source-free Domain Adaptive Point Cloud Completion
Paperid: 1195,   
Authors: Ziyi Li, Wei-Long Zheng, Bao-liang Lu
Title: Multimodal Emotion Recognition with Missing Modality via a Unified Multi-task Pre-training Framework
Paperid: 1196,   
Authors: Dexuan Xu, Yanyuan Chen, Yu Huang, Shihao E, Yiwei Lou, Yongzhi Cao, Hanpin Wang, Meikang Qiu
Title: Medical Vision-Language Pre-training with Multimodal Variational Masked Autoencoder for Robust Medical VQA
Paperid: 1197,   
Authors: Yue Hou, Yingke Su, Junran Wu, Ke Xu
Title: Test-time Graph OOD Detection via Dynamic Dictionary Expansion and OOD Score Calibration
Paperid: 1198,   
Authors: Junpu Zhang, Shengju Yu, Suyuan Liu, Siwei Wang, Miaomiao Li, Xinwang Liu, En Zhu, Kunlun He
Title: Learning the Anchors with Similar Distributions to Original Data for Multi-view Clustering
Paperid: 1199,   
Authors: Xubo Liu, wenya guo, Ruxue Yan, Xumeng Liu, Ying Zhang, Ru Zhou
Title: Rethinking the Reliability of Evidence in End-to-End Fact-Checking from the Causal Perspective
Paperid: 1200,   
Authors: Youchen Xie, Chen Li, Sheng Qiu, ZhiJun Wang, Chenhui Li, Yibo Zhao, Zan Gao, Changbo Wang
Title: FluidGS: Physics Informed Gaussian Splatting for Dynamic Fluid Reconstruction from Sparse Views
Paperid: 1201,   
Authors: Yiqing Hao, Yangru Huang, Yi Jin, Tao Wang, Yidong Li, Yigang Cen
Title: Tree of Prompts: Aligning Hierarchical Visual Prior for Continual Generalized Category Discovery
Paperid: 1202,   
Authors: Jintian Ji, Songhe Feng
Title: Anchors Bring Stability and Efficiency: Fast Tensorial Multi-view Clustering on Shuffled Datasets
Paperid: 1203,   
Authors: Duolin Wang, Guanyu Xing, Yanli Liu
Title: FlowTrack: Integrating Adjacent-Frame Motion Tracking and Adaptive Prediction for Robust Semi-Supervised VOS
Paperid: 1204,   
Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Siyu Yang, Sheng Han, Shuo Wang, Kai Lv
Title: From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training
Paperid: 1205,   
Authors: KunSheng Ma, Fan Qi, Changsheng Xu
Title: Granular Music Attribute Transformation with Proximal Policy Optimization Adapters for Diffusion Model
Paperid: 1206,   
Authors: Yan Zhong, Xinping Zhao, Li Zhang, Xinyuan Song, Tingting Jiang
Title: Adaptive Prompt Learning for Blind Image Quality Assessment with Multi-modal Mixed-datasets Training
Paperid: 1207,   
Authors: Xiaolei Bo, Feiyang Yang, Feilong Xu, Xiaoli Zhang
Title: Cross-Counter-Repeat Attention for Enhanced Understanding of Visual Semantics in Radiology Report Generation
Paperid: 1208,   
Authors: Junlei Zhou, Jiashi Gao, Xinwei Guo, Haiyan Wu, Quanying Liu, Xiangyu Zhao, Hongxin Wei, Xin Yao, Xuetao Wei
Title: Mitigating Stereotypes in Text-to-Image Generation: A Novel Perspective of Selective Neural Suppression
Paperid: 1209,   
Authors: Qi Li, Yucan Zhou, Jiang Zhou, Xingyou Yang, Xiaoyan Gu
Title: Diverse and Public Features Cooperation via Gradient Rectification for Federated Prompt Learning
Paperid: 1210,   
Authors: Qiuyu Liang, Yongqiang Zhang
Title: SAM based Region-Word Clustering and Inference Score Adjusting for Open-Vocabulary Object Detection
Paperid: 1211,   
Authors: Dawei Lin, Meng Yuan, Ziming Wang, Tieru Wu, Yuanning Liu
Title: FreeCAD: A Multimodal Framework for 3D CAD Model Generation from Free-Form Prompts
Paperid: 1212,   
Authors: Changshuo Wang, Shuting He, Xiang Fang, Fangzhe Nan, Prayag Tiwari
Title: Seeing the Overlooked: Bio-Visual Inspired Weak Saliency Feedback Transformer for Person Re-identification
Paperid: 1213,   
Authors: Zihou Liu, Dongming Zhang, Jing Zhang, Jun Li, Yongdong Zhang
Title: RealText: Realistic Text Image Generation based on Glyph and Scene Aware Inpainting
Paperid: 1214,   
Authors: Michael Kohl, Tobias Wursthorn, Christof Weiss
Title: Cross-Modal Metrics for Capturing Correspondences between Music Audio and Stage Lighting Signals
Paperid: 1215,   
Authors: Xinbo Geng, Fan Shi, Xu Cheng, Chen Jia, Meng Zhao, Shengyong Chen
Title: LFMamba: Focal Stack-aware State Space Modeling for Light Field Salient Object Detection
Paperid: 1216,   
Authors: Zhongfan Sun, Kan Guo, Yongli Hu, Daxin Tian, Qingqing Gao, Jiapu Wang, Junbin Gao, Yanfeng Sun, Baocai Yin
Title: Large-Small Model Synergy with Multimodal Fine-Grained Heuristics for Knowledge-Based Visual Question Answering
Paperid: 1217,   
Authors: Wenzheng Yang, Songwei Pei, Bingfeng Liu, Qian Li, Shangguang Wang
Title: OGDepth: Leveraging Object Guidance in Diffusion Models for Enhanced Monocular Depth Estimation
Paperid: 1218,   
Authors: Giovanni Zanin, Ritujoy Biswas, Pietro Morerio, Sylvio Barbon Junior, Alberto Carini, Alessio Del Bue, Vittorio Murino
Title: Direction-Aware Room Impulse Response Estimation for Immersive Audio Rendering in Real Environments
Paperid: 1219,   
Authors: Ziwei Niu, Shiao Xie, Ziyue Wang, Yen Chen, Yueming Jin, Lanfen Lin
Title: EIR-SDG: Explore Invariant Representation for Single-source Domain Generalization in Medical Image Segmentation
Paperid: 1220,   
Authors: Zezhou Chen, Ping Chen, Huan Hu, Xiang Liu, Zipeng Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Title: CP3: Customizable 3D Pop-Out Effect Creation for Immersive Content Using Multimodal Models
Paperid: 1221,   
Authors: Yichi Zhang, Zhuo Chen, Lingbing Guo, yajing Xu, Lei Liang, Wen Zhang, Huajun Chen
Title: Client-Server Co-design with Multi-modal Codebooks Makes Better and Faster Federated Knowledge Sharing
Paperid: 1222,   
Authors: Ruilin Yao, Yi Rong, Tianyu Zou, Bo Zhang, Jian Li, Shengwu Xiong, Shili Xiong
Title: MAP: Parameter-Efficient Tuning for Referring Expression Comprehension via Multi-Modal Adaptive Positional Encoding
Paperid: 1223,   
Authors: Xiaoxuan Mu, Haoyu Tang, Han Jiang, Tianyuan Liang, Qinghai Zheng, Jihua Zhu
Title: FACE: A Dual-Template and Adaptive Curriculum Framework for Unsupervised Text-Based Person Search
Paperid: 1224,   
Authors: Jiaye Zhang, Hongyi Wang, Peiru Yang, Zili Meng, Mingwei Xu
Title: Configuring Dynamic Multi-Stage Serverless Pipelines for Video Processing with Minimal Profiling Overhead
Paperid: 1225,   
Authors: Dan Wu, Xincheng Ju, Dong Zhang, Shoushan Li, Erik Cambria, Guodong Zhou
Title: Emotion across Modalities and Cultures: Multilingual Multimodal Emotion-Cause Analysis with Memory-inspired Framework
Paperid: 1226,   
Authors: Jie Fu, Bingkun BAO
Title: Retaining Temporal Semantics and Relation Topologies for Continual Weakly-Supervised Audio-Visual Video Parsing
Paperid: 1227,   
Authors: Sujuan Hou, Zhihui Feng, Hao Xiong, Weiqing Min, Peng Li, Shuqiang Jiang
Title: DSDGF-Nutri: A Decoupled Self-Distillation Network with Gating Fusion For Food Nutritional Assessment
Paperid: 1228,   
Authors: Youbo Mao, Ziyang Kang, Peng Li, Jiyao Chen, Zenglin Yang, Zhijun Li
Title: FCG: High-Throughput JPEG Heterogeneous Inference with Hybrid Parallel Pipeline on Mobile Devices
Paperid: 1229,   
Authors: Feng-Kai Huang, Bo-Lun Huang, Li-Wu Tsao, Jhih-Ciang Wu, Hong-Han Shuai, Wen-Huang Cheng
Title: Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting
Paperid: 1230,   
Authors: Haiyang Mei, Difei Gao, Xiaopeng Wei, Xin Yang, Mike Zheng Shou
Title: Can I Trust You? Advancing GUI Task Automation with Action Trust Score
Paperid: 1231,   
Authors: Biao Dong, Lei Zhang
Title: Talking Head Generation via Viewpoint and Lighting Simulation Based on Global Representation
Paperid: 1232,   
Authors: Jiahui Zhang, Mengtian Li, Jiewei Tang, Junyu Deng, Siyu Tian, Xiang Liu, Meng Zhang, Guangnan Ye, Yu-Gang Jiang
Title: EditMaster: Bridging Text instruction and Visual Example for Multimodal guided Image Editing
Paperid: 1233,   
Authors: mo yang, luo chen, Jiali zhou
Title: Change-UP: Advancing Visualization and Inference Capability for Multi-level Remote Sensing Change Interpretation
Paperid: 1234,   
Authors: Leyuan Liu, Shen Chen, Jingying Chen
Title: HumanPrinter: Reconstructing 3D Human from a Single Image Like a 3D Printer
Paperid: 1235,   
Authors: Chuan Zhang, Zihan Li, ZiHao Xu, Xuhao Ren, Liehuang Zhu
Title: SepVAMark: Deep Separable Visual-Audio Fusion Watermarking for Source Tracing and Deepfake Detection
Paperid: 1236,   
Authors: Junyi Wang, Yue Qi
Title: Visual Localization using Hybrid Feature Grid and Learned Weighted Global Point Cloud
Paperid: 1237,   
Authors: Chuhang Ma, Shuai Tan, Junjie Wei, Ye Pan
Title: GOES: 3D Gaussian-based One-shot Head Animation with Any Emotion and Any Style
Paperid: 1238,   
Authors: Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian
Title: AV-DiT: Taming Image Diffusion Transformers for Efficient Joint Audio and Video Generation
Paperid: 1239,   
Authors: Mingjie Wei, Weinan Zhang, Chen Zhang, Yifeng Ding, Donglin Di, Lei Ren, Chen Wei, Ting Liu
Title: PRISM: A Benchmark for Unveiling Cross-modal Knowledge Inconsistency in Large Vision-Language Models
Paperid: 1240,   
Authors: Jiankun Zhu, Sicheng Zhao, Lulu Tian, Jing Jiang, Xi Chen, Hongxun Yao
Title: Emotion in a Bottle: Information Bottleneck Guided Disentanglement for Emotion Domain Adaptation
Paperid: 1241,   
Authors: Jingxing Guo, Guilian Chen, Yimu Sun, Huisi Wu, Jing Qin
Title: Hierarchical Spatiotemporal Context Aggregation and Speckle-aware Deformable Convolution for Echocardiography Video Segmentation
Paperid: 1242,   
Authors: Fenghao Tian, Mingtao Feng, Jianqiao Luo, Zijie Wu, Longlong Mei, Lijie Yang, Weisheng Dong, Yaonan Wang
Title: Generalizing to New Area: Self-Distillation Curriculum Learning for Fine-Grained Cross View Localization
Paperid: 1243,   
Authors: Jiahua Bao, Siyao Cheng, Jiaxing Du, Changjiang He, Zeming Lang, Hao Zhang, Jie Liu
Title: BOLT: Fewer Tokens but More Performance Retention for Efficient Vision-Language Models Inference
Paperid: 1244,   
Authors: Liu Yu, Jiajun Sun, PING KUANG, Rui Zhou, Fan Zhou, Zhikun Feng
Title: Bimodal Debiasing for Text-to-Image Diffusion: Adaptive Guidance in Textual and Visual Spaces
Paperid: 1245,   
Authors: Chenyang Zhou, Menghejiya, TangChao, Licheng Wu
Title: UniMTR: Unified Recognition of Dual-style Traditional Mongolian Scripts via Contrastive Representation Alignment
Paperid: 1246,   
Authors: Han Hu, WenLi Du, Bing Wang
Title: Efficient Video Anomaly Detection via Scene-Dependent Memory Assisted Inter-Frame RGB Difference Reconstruction
Paperid: 1247,   
Authors: Xiongwei Dang, Wenxuan Liu, Xian Zhong, Zheng Wang
Title: SegTraj: A Segmented Trajectory-aware Spatio-Temporal Graph Convolutional Network for Social Group Detection
Paperid: 1248,   
Authors: Xiao Fu, Pengyu Wang, Wei Xi, Kun Zhao, Jiadong Feng, Jizhong Zhao
Title: LES-CLIP: A Lightweight Emotion-Sensitive Adaptation of CLIP for Precise Similar Emotion Discrimination
Paperid: 1249,   
Authors: Tiancheng LIU, JIAYI YE, Shumeng Zhang, Kang Zhang, Chen Liang
Title: Quantifying Structural Aesthetic Features and Personality Trait Preferences in $\textit{Kai Shu}$ Calligraphy
Paperid: 1250,   
Authors: LiYan, Xingchen Hu, Jiyuan Liu, Zhong Liu
Title: Federated Incomplete Multi-view Clustering with Individual Structure Preservation and Central Representation Tensorization
Paperid: 1251,   
Authors: Hanyu Guo, Suzhou Que, Junlong Gao, Hanzi Wang
Title: TFPA: Text Features Guided Dynamic Parameter Adjustment for Few Shot Action Recognition
Paperid: 1252,   
Authors: Feng-Kai Huang, Hong-Wei Xu, Chu-Chuan Lee, Hong-Yi Tu, Hong-Han Shuai, Wen-Huang Cheng
Title: OinkTrack: An Ultra-Long-Term Dataset for Multi-Object Tracking and Re-Identification of Group-Housed Pigs
Paperid: 1253,   
Authors: Wuxia Zhang, Yang Xin, Shibo Lv, Xin Zhang, Xiang Zhong, Jianmin Jiang
Title: EEG-Face: A Facial-Image Stimulated EEG Data-Set for Analysis of Brain Perceived Multimedia
Paperid: 1254,   
Authors: Yuchen Zhang, Tailin Chen, Jiangbei Yue, Yueming Sun, Rahul Singh, Jianbo Jiao, ZEYU FU
Title: DeHate: A Holistic Hateful Video Dataset for Explicit and Implicit Hate Detection
Paperid: 1255,   
Authors: Tuan Dang, Theron Wang, Hridayesh Lekhak, Kenny Zhu
Title: EmotionalCanines: A Dataset for Analysis of Arousal and Valence in Dog Vocalization
Paperid: 1256,   
Authors: Chunyi Li, Bo Hu, Taiyang Chen, Leida Li, Lihuo He, Xinbo Gao
Title: Low-light Image Enhancement Quality Assessment: A Real-World Dataset and An Objective Method
Paperid: 1257,   
Authors: Haohui Li, Bowen Qu, Wei Gao
Title: T23D-QA: An Open Dataset and Benchmark for Text-driven 3D Generation Quality Assessment
Paperid: 1258,   
Authors: Jinsheng Wei, Jialiang Sun, Guanming Lu, Jingjie Yan, Dong Zhang
Title: Multi-information Hierarchical Fusion Transformer with Local Alignment and Global Correlation for Micro-Expression Recognition
Paperid: 1259,   
Authors: Xianrun Xu, Baoyao Yang, Wanyun Li, Jingsong Lin, Yufei Xu
Title: Simple but Effective: Sub-Volume Contrastive Learning for Class-Imbalanced Semi-Supervised 3D Medical Image Segmentation
Paperid: 1260,   
Authors: Pengyu Zeng, Jun Yin, Haoyuan Sun, Yuqin Dai, Maowei Jiang, Miao Zhang, Shuai Lu
Title: MRED-14: A Benchmark for Low-Energy Residential Floor Plan Generation with 14 Flexible Inputs
Paperid: 1261,   
Authors: Jiawei Ge, Xinyu Zhang, Jiuxin Cao, Xuelin Zhu, Weijia Liu, Qingqing Gao, Biwei Cao, Kun Wang, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras
Title: Gen4Track: A Tuning-free Data Augmentation Framework via Self-correcting Diffusion Model for Vision-Language Tracking
Paperid: 1262,   
Authors: Xiangyu Zheng, Songcheng He, Wanyun Li, Xiaoqiang Li, Wei Zhang
Title: Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
Paperid: 1263,   
Authors: Lizhi Xiong, Linsen Ding, Ziqiang Li
Title: Detecting Forged HEVC Videos via Anomalous Bitrate-Compressed Traces: A Frame-Level Bitrate Analysis Framework
Paperid: 1264,   
Authors: Kaihang Jiang, Waikeung Wong, Jianyang Qin, Xiaozhao Fang, Jie Wen, Bingzhi Chen, Hongbo Gao
Title: Label Prediction Inherited Hashing for Cross-Modal Retrieval: Applying Supervised Hashing to Unsupervised Tasks
Paperid: 1265,   
Authors: Lin Zuo, Kunshan Yang, Mengmeng Jing, Xiangxu Zhao, Jiaqiao Chen
Title: Bridging Inter-Class Ambiguity and Spatial Variability in Flexible Object Recognition via Graph Distillation
Paperid: 1266,   
Authors: Liqian Zhang, Feng Yuan, Haoran Xie, Fu Lee Wang, Zhaoqing Pan
Title: Evaluating Visual Quality of Autostereoscopic 3D Displays via a Multi-Modal Parameter Perception Network
Paperid: 1267,   
Authors: Yanfeng Liu, Lefei Zhang
Title: Multimodal Decomposed Distillation with Instance Alignment and Uncertainty Compensation for Thermal Object Detection
Paperid: 1268,   
Authors: Siyi Qian, Jian Fang, Yuzhou Mao, Yayun Zou, Wentao Zhang, Haiwei Xue
Title: Human Motion Generation in 3D Scenes from Open-Ended Textual Instructions with MLLM Planning
Paperid: 1269,   
Authors: Wanyi Zhuang, Qi Chu, Tao Gong, Changtao Miao, Nenghai Yu
Title: Towards Good Generalizations for Diffusion Generated Image Detection Using Multiple Reconstruction Contrastive Learning
Paperid: 1270,   
Authors: Yuezhou Li, Yuzhen Niu, Huangbiao Xu, Hui Da, Rui Xu, Wenxi Liu
Title: IPCMoE: Integrating Perceptual Cues with Mixture-of-Experts for Joint Low-Light Image Enhancement and Deblurring
Paperid: 1271,   
Authors: Hua Wang, Hong Liu, Jiale Ren, Mingxin Tan, Zhongzien Jiang
Title: CLIP-6D: Empowering CLIP as a Zero-Shot 6D Pose Estimator Through Generalizable Object-Specific Representations
Paperid: 1272,   
Authors: Changyu Rao, Gaozhi Liu, Sheng Li, Xinpeng Zhang, Zhenxing Qian
Title: DynMark: A Robust Watermarking Solution for Dynamic Screen Content with Small-size Screenshot Support
Paperid: 1273,   
Authors: Chengpei Xu, Wenhao Zhou, Long Ma, Weimin Wang, Feng Xia, Binghao Li, Wenjie Zhang
Title: Bright to Dark: Stage-wise Bilevel Knowledge Transfer for Seeing Text in the Dark
Paperid: 1274,   
Authors: Yuwei Zhou, Xin Wang, Hong Chen, Yipeng Zhang, Zeyang Zhang, Wenwu Zhu
Title: ModuleTeam: Open-Set Multi-Conditional Image Generation with Training-Free Latent Mixture of Any Control Module
Paperid: 1275,   
Authors: Wanting Zhang, Jingxuan Zhang, Libao Zhang
Title: Saliency-Guided Adaptive Random Diffusion for Remote Sensing Images Restoration with Cloud and Haze
Paperid: 1276,   
Authors: Yifan Zeng, Fangzhou Dong, Jian Zhao, Peijia Zheng, jian li, Huiyu Zhou
Title: Towards Culturally Fair Multimodal Generation: Quantifying and Mitigating Orientalist Biases in Text-to-Visual Models
Paperid: 1277,   
Authors: Wei Jia, Li Jin, Kaiwen Wei, Yuying Shang, Nayu Liu, Zhicong Lu, Qing Liu, Linhao Zhang, Jiang Zhong, Yanfeng Hu
Title: U-MERE: Unconstrained Multimodal Entity and Relation Extraction with Collaborative Modeling and Order-Sensitive Optimization
Paperid: 1278,   
Authors: Xingke Song, Jianxu Shangguan, Yiran Li, Jialu ZHANG, Jianfeng Ren, Ruibin Bai, Xin Chen, Xudong Jiang
Title: CEARI: Co-Evolutionary Agents for Reassembling and Inpainting Puzzles with Gaps and Missing Pieces
Paperid: 1279,   
Authors: Swarna Chakraborty, Mylene Farias
Title: MT-DPCQA: A Multimodal Time-aware Learning Approach for No-Reference Dynamic Point Cloud Quality Assessment
Paperid: 1280,   
Authors: Guitao Xu, Ziqi Yi, Peirong Zhang, Jiahuan Cao, Shihang Wu, Lianwen Jin
Title: From Pixels to Semantics: a Novel MLLM-Driven Approach for Explainable Tampered Text Detection
Paperid: 1281,   
Authors: Yihang Liu, Ying Wen, Longzhen Yang, Lianghua He, Heng Tao Shen
Title: RadLAS: A Foundation Model for Interpretable Radiography Image Analysis with Lesion-Aware Self-Supervised Pre-training
Paperid: 1282,   
Authors: Wenhui Wu, Guanqi Wen, Le Ou-Yang, Ran Wang, Sam Kwong
Title: DUIMC: Deep Unbalanced Incomplete Multi-View Clustering via Graph Constrained Imputation and Contrastive Learning
Paperid: 1283,   
Authors: Seungkyu Leem, Seokhyun Jeong, Yeonho Cho, Yoonjae Lee, Jungjin Lee
Title: VRMusicStage: A System for Converting Fixed-Camera Music Stage Videos into Immersive VR Content
Paperid: 1284,   
Authors: Jiaqing Fan, Hanwen Qian, Mengjuan Jiang, Fanzhang Li
Title: PeriodVOS: Learning Periodic Patterns for Unsupervised Video Object Segmentation via Adaptive Contextual Coupling
Paperid: 1285,   
Authors: Jinxu Zhang, QiyuanFan, Yongqi Yu, Yu Zhang
Title: DREAM: Integrating Hierarchical Multimodal Retrieval with Multi-page Multimodal Language Model for Documents VQA
Paperid: 1286,   
Authors: Shanshan Li, Jiawei Hou, Da Huang, Yanwei Fu, Xiangyang Xue
Title: Ali-UI: Enhancing Complex Vision-Language Navigation with Alignment of Unified Map and Instruction Parsing
Paperid: 1287,   
Authors: jie yu, Songping Mai, Peng Zhang, Yucheng Jiang, Jian Cheng
Title: Activation and Weight Distribution Balancing for Optimal Post-Training Quantization in Learned Image Compression
Paperid: 1288,   
Authors: Bo Xu, Jie Wei, Hongya Wang, Ming Du, Hui Song, Yanghua Xiao
Title: Bridging the Unseen Gap: Label-Enhanced Information Bottleneck Distillation for Multimodal Named Entity Recognition
Paperid: 1289,   
Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen
Title: $\mathcal{A}LLM4ADD$: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
Paperid: 1290,   
Authors: Junwei Zhao, Qianchun Luo, Shiliang Zhang, Shen Gao, Jie Wu
Title: HDCFN: Haze Distribution-aware Cross-modal Fusion Network for Infrared-guided Dense Haze Removal in UAVs
Paperid: 1291,   
Authors: Qin Li, CongCongXiao, Limei Liu, Han Peng, Junfeng Yang
Title: Skeleton Compression and Complementary Enhanced Fusion Under Branch-Stage Supervision for Human Action Recognition
Paperid: 1292,   
Authors: Cheng Peng, Zhen Wang
Title: Method and Applications of Solid-State Lidar Modeling for X-in-the-Loop Testing of Autonomous Vehicles
Paperid: 1293,   
Authors: Jinming Zhang, Yunlian Sun, Hongwen Zhang, Jinhui Tang
Title: EDMG: Towards Efficient Long Dance Motion Generation with Fundamental Movements from Dance Genres
Paperid: 1294,   
Authors: Richen Liu, Lingyu Sun, Xuefeng Huang, Yiran Li, Jiang Zhang, Siru Chen, Zhouhao Wu, Ayush Kumar, Chufan Lai
Title: Meta-Illustrator: Transferring Illustrations from 2D Interactive Image Space to 3D Immersive Exploration Space
Paperid: 1295,   
Authors: Yucheng Shu, Yaohui Wang, Lihong Qiao, Feiyan Li, Bin Xiao, Weisheng Li, Xinbo Gao
Title: The Overlooked Matters: Revisiting Background, Prototype, and Activation in Few-Shot Medical Image Segmentation
Paperid: 1296,   
Authors: Sizhe Zhao, Chenyang Wang, Weiyu Zhao, Zonglin Li, Ming Li, Shengping Zhang
Title: REA-Listener: Real-Time Listening Head Generation with Dynamic Emotion Modeling and Flexible Modality Adaptation
Paperid: 1297,   
Authors: Nan Gao, Junchao Zhu, YILONG ZHANG, Ronghua Liang, Guodao Sun, Peng Chen
Title: Dual Teacher with Dempster-Shafer Guidance for Decision Making in Semi-Supervised Small Object Detection
Paperid: 1298,   
Authors: Shikun Sun, Chengrui Wang, Min Zhou, Zixuan Wang, Xiaoyu Qin, Tiezheng Ge, Bo Zheng, Jia Jia
Title: DEPO: Enhancing E-commerce Image Background Generation with Short Trajectory Direct Expected Preference Optimization
Paperid: 1299,   
Authors: Zeyu Zhu, KE LIANG, Lingyuan Meng, Xingchen Hu, Xinwang Liu, Wanwei Liu, Kunlun He
Title: SALVG: Latent Variable Gene Augmented Graph Learning for Multi-View Clustering in Spatial Transcriptomics
Paperid: 1300,   
Authors: Ana Rebelo, Pedro Ferreira, André Ribeiro, Rui Nóbrega
Title: Walking vs. Teleport in VR: Why Walking and Portals Matter in Small Spaces
Paperid: 1301,   
Authors: Chunpeng Wang, Wenlong Ma, Li Zou, Zhiqiu Xia, Qi Li, Bin Ma, Yunan Liu
Title: Toward Robust Deepfake Detection: A Proactive Method Based on Watermarking and Knowledge Distillation
Paperid: 1302,   
Authors: Huanqi Wu, Huangbiao Xu, Xiao Ke
Title: The Devil in the Stego Image: Far from Being Usable in Real-World Scenarios
Paperid: 1303,   
Authors: Yijie Zhu, Yibo Lyu, Zitong YU, Rui Shao, Kaiyang Zhou, Liqiang Nie
Title: EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation via Latent Reasoning
Paperid: 1304,   
Authors: Zhaoyu Chen, Qian Huang, Xing Li, Yunfei Zhang, Shihao Han, Ge Gao, Yirui Wu, Xin Li, Ziyang Yin
Title: Geo-CF2Net: Geometry-Prior Cross-Frequency Interactive Fusion Network for Point Cloud-based 3D Human Action Recognition
Paperid: 1305,   
Authors: Huabin Wang, Yingfan Cheng, Wu Zheng, Jiayuan Cheng, Xin Li, Min Li, Fei Liu
Title: A Multi-illumination Dataset and an Illumination Domain Adaptation Network for Finger Vein Identification
Paperid: 1306,   
Authors: Shenjie Jiang, Zhuoyu Wang, Xuecheng Wu, Hongru Ji, Mingxin Li, Xianghua Li, Chao Gao
Title: DDSE: A Decoupled Dual-Stream Enhanced Framework for Multimodal Sentiment Analysis with Text-Centric SSM
Paperid: 1307,   
Authors: Yishu Liu, Zhiming Chen, Desen Wang, Xiaoling Luo, Bingzhi Chen, Guangming Lu
Title: PET-GPRA: Rethinking PET with Gradient-Aware Prompting and Router-Free Adapters for Few-shot Class-Incremental Learning
Paperid: 1308,   
Authors: Xueyi Zhang, Peiyin Zhu, Yuan Liao, Xiyu Wang, Mingrui Lao, Siqi Cai, Yanming Guo, Haizhou Li
Title: TrustCLIP: Learning from Noisy Labels via Semantic Label Verification and Trust-aligned Gradient Projection
Paperid: 1309,   
Authors: Yechao Xu, Zhengxing Sun, Qian Li, Yunhan Sun
Title: Text Prompted Spatiotemporal Sequence Prediction with Text-Vision Prompt Refiner and Masked Diffusion Transformers
Paperid: 1310,   
Authors: Yanting Pei, Fan Yang
Title: Adaptive Neighbors and Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation with Noisy Labels
Paperid: 1311,   
Authors: Yang Liu, Zhang Zhiyong
Title: DSP: Dense-Sparse Parallel Networks for Self-supervised 3D Multi-person Pose Estimation from Multiple Views
Paperid: 1312,   
Authors: Jianxiang Xie, Yao Wu, Yachao Zhang, Xiaopei Zhang, Yuan Xie, Yanyun Qu
Title: PLATO-TTA: Prototype-Guided Pseudo-Labeling and Adaptive Tuning for Multi-Modal Test-Time Adaptation of 3D Segmentation
Paperid: 1313,   
Authors: Jiahuan Cao, Yang Liu, Peirong Zhang, Yongxin Shi, Kai Ding, Lianwen Jin
Title: TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies through Parameter Sensitivity-Guided Instruction Tuning
Paperid: 1314,   
Authors: Zhuojun Wu, Dong Liu, Juan Liu, Yechen Wang, Linxi Li, Liwei Jin, Hui Bu, Pengyuan zhang, Ming Li
Title: SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus in Mandarin for LLM-Based Speech Synthesis
Paperid: 1315,   
Authors: Yulong Li, Yuxuan Zhang, Rui Chen, Feilong Tang, Zhixiang Lu, Ming Hu, Jianghao Wu, Haochen Xue, Mian Zhou, Chong Li, Jionglong Su, Imran Razzak
Title: Genesis: A Large-Scale Benchmark for Multimodal Large Language Model in Emotional Causality Analysis
Paperid: 1316,   
Authors: Tingrui Shen, Bangzhen Liu, Zhirun Fan, Shiting Zhang, Weifeng Pan, Sun Fan, Dan Cao, Shengfeng He
Title: Language-Driven 3D Human Pose Estimation in Multi-Person Scenarios: A New Dataset and Approach
Paperid: 1317,   
Authors: Weibin Wu, Zitong Wang, Zhengjie Luo, Wenqing Chen, Zibin Zheng
Title: Detecting Violations of Physical Common Sense in Images: A Challenge Dataset and Effective Model
Paperid: 1318,   
Authors: Zixi Wang, Yubo Huang, Jingzehua Xu, Jinzhu Wei, Shuai Zhang, Xin Lai
Title: Multi-Modal Gradual Domain Osmosis: Stepwise Dynamic Learning with Batch Matching for Gradual Domain Adaptation
Paperid: 1319,   
Authors: Yang Hu, Jingui Ma, Yucheng Yang, Jie Liang, Jinbo Yan, Jiahao Wu, Jiayu Yang, Yang Deng, Ronggang Wang
Title: Excavating the Most Critical Gaussians: Sparse Selection and Structural Optimization for Efficient 3DGS Compression
Paperid: 1320,   
Authors: Shangheng Chen, Shengsheng Qian, Quan Fang, Jun Hu, Changsheng Xu
Title: A Large-Scale Dataset for Short-Video Topic Peak Prediction and a Large Heterogeneous Graph Model
Paperid: 1321,   
Authors: Dirui Xie, Xiaofang Hu, ZihanWei, Zhengqiqi Yang, Yanlian Jiang, Yue Zhou
Title: Learning Structural Priors via Laplacian RWKV Diffusion with Light-Effect Dataset for Nighttime Visibility Enhancement
Paperid: 1322,   
Authors: Leidong Fan, Zhang Qian, Qing Li
Title: Inverse-Tone-Mapped HDR Video Quality Assessment for Broadcast Television: A Comprehensive Dataset and SDR-Referenced Method
Paperid: 1323,   
Authors: Qinfu Xu, Liyuan Pan, Shaozu Yuan, Yiwei Wei, Chunlei Wu
Title: From Subtle Hints to Grand Expressions – Mastering Fine-grained Emotions with Dynamic Multimodal Analysis
Paperid: 1324,   
Authors: Seungmi Choi, TaeHwa Lee, Jun Yeong Cha, Suhyun Jo, Hyunmin Ban, Kwan-Jung Oh, Hyunsuk Ko, Hui Yong Kim
Title: Phase Distribution Matters: On the Importance of Phase Distribution Alignment (PDA) in Holographic Applications
Paperid: 1325,   
Authors: Zhi Zeng, Jiaying Wu, Minnan Luo, Xiangzheng Kong, Zihan Ma, Guang Dai, Qinghua Zheng
Title: Understand, Refine and Summarize: Multi-Granularity Knowledge Progressive Enhancement Learning for Fake News Video Detection
Paperid: 1326,   
Authors: Changhao Pan, Wenxiang Guo, Yu Zhang, Zhiyuan Zhu, ZheTao Chen, Han Wang, Zhou Zhao
Title: A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference
Paperid: 1327,   
Authors: Nan Ma, Beining Sun, Yiheng Han, Genbao Xu
Title: Kinematic Enhanced Hypergraph Convolutional Network for Skeleton-based Human Action Recognition with LLM Training Guides
Paperid: 1328,   
Authors: Yefei Sheng, Jie Wang, Ming Tao, Bingkun BAO
Title: D²Gaussian: Dynamic Control with Discretized 3D View Modeling for Text-Driven 3D Gaussian Splatting Editing
Paperid: 1329,   
Authors: Jiye Xie, Yifei Gao, Liangliang You, Xiang Xu, Haoran Xu, Zhiqiang Kou, Kexue Fu, Youyang Qu, Wenjie Yang, Jianwei Guo, Weiliang Meng, Longxiang Gao, Haoran Yang, Changwei Wang, Yu Zhang
Title: Collaboration Wins More: Dual-Modal Collaborative Attention Reinforcement for Mitigating Large Vision Language Models Hallucination
Paperid: 1330,   
Authors: Shuyang Wang, Chunxiao Li, Anlong Ming
Title: IFS-Light: An Interactive Framework for Single-view Face Relighting with both Facial and Lighting Consistency
Paperid: 1331,   
Authors: Nguyen Duy, Hoang Hoan, Thanh-Trung Phan
Title: Like or Not to Like: A Use Case of Vietnamese Street Food Videos on YouTube
Paperid: 1332,   
Authors: Yuliang Chen, Xi Lin, Chao Sang, Xiu Su
Title: DualFPT: Handling Data Heterogeneity in Federated Prompt Tuning from both Generalized and Personalized Perspective
Paperid: 1333,   
Authors: Guoxin Zhang, Zhonghong Ou, Kaiwen Xue, Jiangfeng Sun, Yifan Zhu, Siyuan Yao, Yiran Shen, Meina Song
Title: DGFSD: Bridging the Gap between Dense and Sparse for Fully Sparse 3D Object Detection
Paperid: 1334,   
Authors: Feida Liu, Yifan Wang, Jiaqi Zheng, Boxi Liu, Guihai Chen
Title: Themis: Toward Stable Near-Zero Queuing Delay in Congestion Control for Low-Latency Interactive Video Streaming
Paperid: 1335,   
Authors: Ahmad Alhilal, Ze Wu, Teemu Kämäräinen, Tristan Braud, Matti Siekkinen
Title: Congestion Control for VR Cloud Gaming: Integration and Comparison in Real VR Gaming Environment
Paperid: 1336,   
Authors: Junxiao Ma, Jingjing Wang, Min Zhang, Guodong Zhou
Title: Skynet-V1: Towards Early Warning of Video Abnormal Events via A Spatial-temporal Causal-enhanced MoE Framework
Paperid: 1337,   
Authors: Gyeongjin Kim, Sebin Lee, Daye Kim, Jungjin Lee, Minju Kim
Title: Bring the VibeOn: Designing a Multimodal Interface for Shared Emotional Experiences in Live-streamed Concerts
Paperid: 1338,   
Authors: Yating Liu, Yang Zou, Xingyuan Li, Xingyue Zhu, Kaiqi Han, Zhiying Jiang, Long Ma, Jinyuan Liu
Title: Toward a Training-Free Plug-and-Play Refinement Framework for Infrared and Visible Image Registration and Fusion
Paperid: 1339,   
Authors: Zhenbo Yu, Jimin Dai, Yingzhen Zhang, Jian Yang, lei luo
Title: SSAIM: Not All Self-Attentions Contain Effective Spatial Structure in Diffusion Models for Text-to-Image Editing
Paperid: 1340,   
Authors: Junyu Gao, Xuan Yao, Yong Rui, Changsheng Xu
Title: Building Embodied EvoAgent: A Brain-inspired Paradigm for Bridging Multimodal Large Models and World Models
Paperid: 1341,   
Authors: Bichen Wang, Yixin Sun, Yanyan Zhao, Bing Qin
Title: Beyond Snapshots: A Multimodal User-Level Dataset for Depression Detection in Dynamic Social Media Streams
Paperid: 1342,   
Authors: Zhaohu Xing, Lihao Liu, Tian Ye, Sixiang Chen, Yijun Yang, Guang Liu, Lei Zhu
Title: Farther Than Mirror: Explore Pattern-Compensated Depth of Mirror with Temporal Changes for Video Mirror Detection
Paperid: 1343,   
Authors: Nan An, Siqi Xu, Long Ma, Zhu Liu, Guangchao Han, Tengyu Ma, Risheng Liu
Title: Inter-Task Weaving in Image Enhancement: From a New Unified Architecture to a Better Meta-Representation Learning
Paperid: 1344,   
Authors: Jiawei Zheng, Feiyan Liu, Xiaoli Wang
Title: Seeing Through Ambiguity: Effective Video-guided Machine Translation via Chaotic Fusion and Causally Aligned Spatio-temporal Attention
Paperid: 1345,   
Authors: Yuwu Lu, Chunzhi Liu, Yihan Yang
Title: CWCP: Generalizing Virtual Reality to Real World with Contextual-Weather Correlation Pairing for Deraining and Desnowing
Paperid: 1346,   
Authors: Zongxing Zhao, Shenzhi Yang, Xingkai Yao, Yuying Wang, Zhongqiu Chen, Xiaofang Zhang
Title: $\textbf{HGAC}_{\textbf{LLM}}$: Attribute Completion in Heterogeneous Graph with Integration of External Knowledge from Large Language Models
Paperid: 1347,   
Authors: Yue Ling, Dong Zhao, Kaikai Deng, Kangwen Yin, Zixiao He, Yizong Wang, Huadong Ma
Title: Venus: Generating Large-scale mmWave Radar Data via Few 2D Videos for Gesture Recognition While Lying Down
Paperid: 1348,   
Authors: Mingliang Yan, Yanhua Yu, Ruochi Zhang, Zhiyuan Liu, Ruicheng Zhang, Yimeng Ren, Kangkang Lu, Zhiyong Huang, Feng Luo, Zhen Cai
Title: DeepMolTex: Deep Alignment of Molecular Graphs with Large Language Models via Mixture of Modality Experts
Paperid: 1349,   
Authors: Tianzuo Xin, Jing Wang, Xiyuan Jin, Xiaojun Ning, Zhiyang Feng, Youfang Lin
Title: MoCERNet: A Modality-Complete Modeling Framework for Emotion Recognition in Physiological Signals under Imperfect Modal Matching
Paperid: 1350,   
Authors: Xiaojian Lin, Wenxin Zhang, Yuchu Jiang, Wangyu Wu, Yiran Guo, Kangxu Wang, Zongzheng Zhang, Guijin Wang, Lei Jin, Hao Zhao
Title: Butter: Frequency-Adaptive Feature Consistency and Progressive Hierarchical Fusion for Efficient Object Detection in Autonomous Driving
Paperid: 1351,   
Authors: Siyuan Zhang, Xiaoping Wang, Jiang Li, Weibin Feng, Xin Zhan, Hongzhi Huang
Title: HAFUNet: A Hierarchical Attention Fusion Network for Monocular Depth Estimation Integrating Event and Frame Data
Paperid: 1352,   
Authors: Weimin Cheng, Zhenyu Wang, Tao Huang, Fangfang Wu, Weisheng Dong
Title: Pushing the Limit of Binarized Neural Network for Image Super Resolution with Smooth Information Transmission
Paperid: 1353,   
Authors: Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, Xiaoshuai Hao
Title: RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation
Paperid: 1354,   
Authors: Nora Hofer, Rainer Böhme
Title: Challenging Cases of Neural Image Compression: A Dataset of Visually Compelling Yet Semantically Incorrect Reconstructions
Paperid: 1355,   
Authors: Shuai Yu, Xiaoliang He, Kangjie Dong, Yi Yu
Title: DUDA: A Two-stage Decoupling Unsupervised Domain Adaptation Framework for Semi-supervised Singing Melody Extraction from Polyphonic Music
Paperid: 1356,   
Authors: Jie Wan, Jianhao Fu, Ziqi Yang, Kui Ren
Title: BTUAP: Boosting the Transferability of Universal Adversarial Perturbations in the Black-box Setting under various data dependencies
Paperid: 1357,   
Authors: Muhammad Ali Farooq, Waseem Shariff, Peter Corcoran
Title: ThermVision: Exploring FLUX for Synthesizing Hyper-Realistic Thermal Face Data and Animations via Image to Video Translation
Paperid: 1358,   
Authors: Wei Miao, Jiangrong Shen, Hongming Xu, Tommi Kärkkäinen, Qi Xu, Yi Xu, Fengyu Cong
Title: Advanced SpikingYOLOX: Extending Spiking Neural Network on Object Detection with Spike-based Partial Self-Attention and 2D-Spiking Transformer
Paperid: 1359,   
Authors: Jingxing Guo, Guilian Chen, Yimu Sun, Huisi Wu, Jing Qin
Title: EchoVim: Making Vision Mamba Docile for Echocardiography Video Segmentation via Dynamic Interaction and Semantic Token-attentive Refinement
Paperid: 1360,   
Authors: Qi Shen, Junchang Xin, Bing Dai, Shudi Zhang, Xinyao Liu, Zhiqiong Wang
Title: ElaSleepNet: Exploring an Elastic Multimodal Neural Network for Sleep Staging via Temporal and Contextual Consistency Learning
Paperid: 1361,   
Authors: Kipp Freud, Daniel Collins, Delmiro Sampaio Neto, Grant Stevens
Title: AutoVec: Automatic generation of data and vector embeddings for arbitrary domains and cross-domain mappings using LLMs
Paperid: 1362,   
Authors: Junlin Fang, Wenya Wang, Lingli Zhang, Fengmao Lv
Title: Why is a Bird’s Caption a Good Demonstration? Towards Effective Multimodal In-Context Learning without Dedicated Data
Paperid: 1363,   
Authors: Weichen Zhang, Zile Zhou, Xin Zeng, LIU Xuchen, Jianjie Fang, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang
Title: Open3DVQA: A Benchmark for Embodied Spatial Concept Reasoning with Multimodal Large Language Model in Open Space
Paperid: 1364,   
Authors: Bingcai Wei, Hui Liu, Chuang Qian, Zijian Li, Wangyu Wu, Zijie Meng
Title: Robust Single Image Sand Removal by Leveraging Uncertainty-aware SAM Priors and Prompt Learning with Refined Perceptual Loss
Paperid: 1365,   
Authors: Sitian Gu, Zhiyu Pan, Chaoyi Hong, Chengxin Liu, Zhiguo Cao
Title: Dynamic Beauty is Easy to Find: A Large-Scale Composition-Aware Dataset and an End-to-End Framework for Video Reframing
Paperid: 1366,   
Authors: Ronghui Li, Lingxiao Han, Shi Shu, Yueyao Liu, Yukang Lin, Yue Ma, Jie Guo, Ziwei Liu, Xiu Li
Title: A Motion is Worth a Hybrid Sentence: Taming Language Model for Unified Motion Generation by Fine-grained Planning
Paperid: 1367,   
Authors: Yilin Zhang, Yanyan Wei, Zhao Zhang, Jicong Fan, Haijun Zhang, Shuicheng YAN
Title: From Outline to Detail: A Hierarchical End-to-end Framework for Coherent and Consistent Visual Novel Generation and Assembly
Paperid: 1368,   
Authors: Dezhi Zheng, Kaijun Deng, Xianxu Hou, Jinbao Wang, Xiaoqin Wang, Linlin Shen
Title: Unknown Pixel Mask based finetuning of 2D Inpainting Models for Unbounded 3D Scene Generation from a Single Image
Paperid: 1369,   
Authors: Dominika Wanat, Dawid Juszka, Mikołaj Leszczuk, Lucjan Janowski
Title: Bridging the Lab and the Wild: Behavioral Experiments as a Pathway to QoE Research Closer to Realistic Environment
Paperid: 1370,   
Authors: Lamei Di, Bin Zhang, Yiming Wang, Wenxia Zhang
Title: Frequency Meets Semantics: Text-Visual Fusion with Directional Spectral Enhancement for Salient Object Detection in Optical Remote Sensing Images
Paperid: 1371,   
Authors: Yu Chen, BinBin Yan, Shuo Chen, XinZhu Sang
Title: A Comprehensive Model for Visual Fatigue Assessment in 3D Light Field Displays Based on Eye Movement Data Analysis
Paperid: 1372,   
Authors: Ziang Li, Chengxiang Si, Zhenyu Cheng
Title: Zero in on the Target: A Composite Robust Model for Retrieving Information in Traffic Data to Discover Network Attacks
Paperid: 1373,   
Authors: Shuyang Chu, Jingang Shi, Xu Cheng, Haoyu Chen, Xin Liu, Xu Jian, Guoying Zhao
Title: To Remember, To Adapt, To Preempt: A Stable Continual Test-Time Adaptation Framework for Remote Physiological Measurement in Dynamic Domain Shifts