Abstract Paper Portal of SIGGRAPH 2025

PaperID: 1,   https://arxiv.org/pdf/2501.18672
Authors: Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Affiliations: the Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China ; Baidu Inc.
Title: Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
Abstract:
Recent advancements in generative models have significantly propelled 3D scene editing. While existing methods excel at text-guided texture modifications for 3D representations like 3D Gaussian Splatting (3DGS), they struggle with geometric transformations (e.g., rotating a character’s head) and lack precise spatial control over edits due to the inherent ambiguity of language-driven guidance. To address these limitations, we introduce DYG, a 3D drag-based editing framework for 3DGS. Users intuitively define editing regions using 3D masks and specify desired transformations through pairs of control points. DYG integrates the implicit triplane representation to establish the geometric scaffold of editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model through the proposed Drag-SDS loss, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG enables effective drag-based editing, outperforming other baselines in terms of editing effect and quality. Additional results are available on our project page: https://quyans.github.io/Drag-Your-Gaussian/.
PaperID: 2,   https://arxiv.org/pdf/2508.07905
Authors: Yongtao Ge, Kangyang Xie, Guangkai Xu, Li Ke, Mingyu Liu, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen
Affiliations: The University of Adelaide, Australia and Zhejiang University, China ; Zhejiang University, China ; Alibaba Group, China and Zhejiang University
Title: Generative Video Matting
Abstract:
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited onto background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hair, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach’s superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. Code is available at https://github.com/aim-uofa/GVM
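Editor's illustration (not part of the abstract): the compositing step mentioned above is conventionally the matting equation I = alpha * F + (1 - alpha) * B, applied per pixel. The sketch below shows that equation in plain Python. The function names `composite` and `composite_frame` and the nested-list frame layout are assumptions made for this example; they are not taken from the GVM code.

```python
def composite(fg, bg, alpha):
    """Blend one foreground pixel over one background pixel.

    fg, bg: RGB tuples with channels in [0, 1]; alpha: matte value in [0, 1].
    Implements I = alpha * F + (1 - alpha) * B per channel.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def composite_frame(fg_frame, bg_frame, alpha_frame):
    """Composite a whole frame given as nested lists (rows of pixels).

    The frame layout (list of rows, each a list of RGB tuples or alpha
    values) is a hypothetical convention chosen for this sketch.
    """
    return [
        [composite(f, b, a) for f, b, a in zip(fg_row, bg_row, a_row)]
        for fg_row, bg_row, a_row in zip(fg_frame, bg_frame, alpha_frame)
    ]
```

An alpha of 1 returns the foreground pixel unchanged and an alpha of 0 returns the background pixel, so any error in an annotated matte carries straight into the composited training image.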
PaperID: 3,   https://arxiv.org/pdf/2502.17327
Authors: Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit Haim Bermano, Daniel Cohen-Or
Affiliations: Tel Aviv
Title: AnyTop: Character Animation Diffusion with Any Topology
Abstract:
Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model’s latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation, and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code.
PaperID: 4,   https://arxiv.org/pdf/2507.18939
Authors: Jionghao Wang, Cheng Lin, Yuan Liu, Rui Xu, Zhiyang Dou, Xiaoxiao Long, Haoxiang Guo, Taku Komura, Wenping Wang, Xin Li
Affiliations: Texas A&M University, College Station, Hong Kong, China ; HKUST, China ; Nanjing University, China ; Skywork AI, Kunlun Inc.
Title: PDT: Point Distribution Transformation with Diffusion Models
Abstract:
Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods leverage the unordered and unconstrained nature of point clouds to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with a novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs, ranging from surface-aligned keypoints and inner sparse joints to continuous feature lines. The results showcase our framework’s ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be made available.
PaperID: 5,   https://arxiv.org/pdf/2505.07887
Authors: Songyin Wu, Zhaoyang Lv, Yufeng Zhu, Duncan Frost, Zhengqin Li, Ling-Qi Yan, Carl Ren, Richard Newcombe, Zhao Dong
Affiliations: Meta Reality Labs Research, Santa Barbara, San Francisco
Title: Monocular Online Reconstruction with Enhanced Detail Preservation
Abstract:
We propose an online 3D Gaussian-based dense mapping framework for photorealistic detail reconstruction from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians for capturing details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability. Project page: https://poiw.github.io/MODP/.
PaperID: 6,   https://arxiv.org/pdf/2507.20220
Authors: Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Title: Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models
Abstract:
The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs’ comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions.
PaperID: 7,   https://arxiv.org/pdf/2504.12240
Authors: Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan
Affiliations: Tsinghua University, Hong Kong, China ; Tencent
Title: Cobra: Efficient Line Art COlorization with BRoAder References
Abstract:
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.
PaperID: 8,   https://arxiv.org/pdf/2502.17796
Authors: Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, Liefeng Bo
Affiliations: Alibaba Group
Title: LAM: Large Avatar Model for One-shot Animatable Gaussian Head
Abstract:
We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass in seconds, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated using standard linear blend skinning (LBS) with corrective blendshapes, as in the FLAME model, and rendered in real time on various platforms. The experiments demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks.
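Editor's illustration (not part of the abstract): standard linear blend skinning moves a canonical point by a weighted sum of per-joint rigid transforms, v' = sum_j w_j (R_j v + t_j). The sketch below implements only that textbook formula in plain Python. It omits FLAME's corrective blendshapes, and the helper names `transform`, `linear_blend_skinning`, and `rotation_z` are hypothetical, not LAM's or FLAME's API.

```python
import math


def transform(rotation, translation, point):
    """Apply a rigid transform: 3x3 rotation (nested lists) plus translation."""
    return [
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    ]


def linear_blend_skinning(point, weights, rotations, translations):
    """Blend the per-joint rigid transforms of one canonical point.

    weights holds one skinning weight per joint and must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    skinned = [0.0, 0.0, 0.0]
    for w, rot, trans in zip(weights, rotations, translations):
        moved = transform(rot, trans, point)
        for i in range(3):
            skinned[i] += w * moved[i]
    return skinned


def rotation_z(angle):
    """Rotation matrix about the z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

In a Gaussian avatar the same blend would be applied to each Gaussian's center. Rotating its orientation along with it is a separate step that this sketch does not cover.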
PaperID: 9,   https://arxiv.org/pdf/2501.15981
Authors: Michael Birsak, John Femiani, Biao Zhang, Peter Wonka
Affiliations: King Abdullah University of Science and Technology (KAUST), Saudi Arabia
Title: MatCLIP: Light- and Shape-Insensitive Assignment of PBR Material Models
Abstract:
Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domains of PBR representations with photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released at https://birsakm.github.io/matclip/.
PaperID: 10,   https://arxiv.org/pdf/2507.01012
Authors: Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo
Affiliations: Sun Yat-sen University, China; Meituan, Hong Kong, China ; Meituan, China ; School of Artificial Intelligence, University of Chinese Academy of Sciences, China ; Nanjing University, China ; Harbin Institute of Technology
Title: DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution
Abstract:
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
PaperID: 11,   https://arxiv.org/pdf/2506.03118
Authors: Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
Affiliations: Department of Mathematics, Hong Kong, China ; Huawei, China ; State Key Laboratory of CAD&CG, Zhejiang University
Title: HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers
Abstract:
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
PaperID: 12,   https://arxiv.org/pdf/2505.01235
Authors: Youngsik Yun, Jeongmin Bae, Hyunseung Son, Seoha Kim, Hahyun Lee, Gun Bang, Youngjung Uh
Affiliations: Yonsei University, Republic of Korea
Title: Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting
Abstract:
Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings cause temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from temporally inconsistent observations, which are inevitable with real cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at https://bbangsik13.github.io/OR2.
PaperID: 13,   https://arxiv.org/pdf/2501.03847
Authors: Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
Affiliations: Singapore ; Wuhan University, Singapore ; Texas A&M University, College Station, Hong Kong
Title: Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Abstract:
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process—such as camera manipulation or content editing—remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation. More results are available in the supplementary materials.
PaperID: 14,   https://arxiv.org/pdf/2503.08061
Authors: DongHeun Han, Byungmin Kim, RoUn Lee, KyeongMin Kim, Hyoseok Hwang, HyeongYeop Kang
Affiliations: IIIXR LAB, Kyung Hee University, Korea University, Republic of Korea
Title: ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation
Abstract:
Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on a kinematic approach or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users’ intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user’s grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios—randomizing object shapes, wrist movements, and trigger input flows—to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip’s superior force controllability and plausibility compared to state-of-the-art methods. Demo videos are available as supplementary material and the code is provided at https://han-dongheun.github.io/ForceGrip.
PaperID: 15,   https://arxiv.org/pdf/2501.18630
Authors: Rong Liu, Dylan Sun, Meida Chen, Yue Wang, Andrew Feng
Affiliations: USC Institute for Creative Technologies (ICT), Los Angeles, USA ; University of Southern California
Title: Deformable Beta Splatting
Abstract:
3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta Kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extend the Beta Kernel to color encoding, which facilitates improved representation of diffuse and specular components, yielding superior results compared to SH-based methods. Furthermore, unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while utilizing only 45% of the parameters and rendering 1.5x faster than 3DGS-MCMC, highlighting the superior performance of DBS for real-time radiance field rendering. Interactive demonstrations and source code are available on our project website: https://rongliu-leo.github.io/beta-splatting/.
PaperID: 16,   https://arxiv.org/pdf/2410.06231
Authors: Nithin Raghavan, Krishna Mullia, Alexander Trevithick, Fujun Luan, Miloš Hašan, Ravi Ramamoorthi
Affiliations: University of California San Diego, La Jolla, USA ; Adobe Research, San Francisco, San Jose
Title: Generative Neural Materials
Abstract:
Advancements in neural rendering techniques have sparked renewed interest in neural materials, which are capable of representing bidirectional texture functions (BTFs) cheaply and with high quality. However, content creation in the neural material format is not straightforward. To address this limitation, we present the first image-conditioned diffusion model for neural materials, and show an extension to text conditioning. To achieve this, we make two main contributions: (1) we introduce a universal MLP variant of the NeuMIP architecture, defining a universal basis for neural materials as 16-channel feature textures, and (2) we train a conditional diffusion model for generating neural materials in this basis from flash images, natural images and text prompts. To achieve this, we also construct a new dataset of 150k neural materials in 16 categories, since no large-scale neural material data exists. To our knowledge, our work is the first to enable single-shot neural material generation from arbitrary text or image prompts.
PaperID: 17,   https://arxiv.org/pdf/2505.05022
Authors: Tingting Liao, Yujian Zheng, Yuliang Xiu, Adilbek Karmanov, Liwen Hu, Leyang Jin, Hao Li
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates ; Westlake University, United Arab Emirates ; Pinscreen, Los Angeles, United Arab Emirates and Pinscreen
Title: SOAP: Style-Omniscient Animatable Portraits
Abstract:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at github.com/TingtingLiao/soap.
PaperID: 18,   https://arxiv.org/pdf/2406.01300
Authors: Elad Richardson, Yuval Alaluf, Ali Mahdavi-Amiri, Daniel Cohen-Or
Affiliations: Tel Aviv, Israel ; Simon Fraser University
Title: pOps: Photo-Inspired Diffusion Operators
Abstract:
Text-guided image generation enables the creation of visual content from textual descriptions. However, certain visual concepts cannot be effectively conveyed through language alone. This has sparked a renewed interest in utilizing the CLIP image embedding space for more visually-oriented tasks through methods such as IP-Adapter. Interestingly, the CLIP image embedding space has been shown to be semantically meaningful, where linear operations within this space yield semantically meaningful results. Yet, the specific meaning of these operations can vary unpredictably across different images. To harness this potential, we introduce pOps, a framework that trains specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior model. While the Diffusion Prior model was originally trained to map between text embeddings and image embeddings, we demonstrate that it can be tuned to accommodate new input conditions, resulting in a diffusion operator. Working directly over image embeddings not only improves our ability to learn semantic operations but also allows us to directly use a textual CLIP loss as an additional supervision when needed. We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings. These operators can then serve as creative tools within a design process, enabling artists to semantically manipulate visual concepts as part of their generative workflow. Finally, we show that pOps can be easily plugged into pretrained image diffusion models alongside existing spatial adapters, offering control over both semantics and structure.
PaperID: 19,   https://arxiv.org/pdf/2505.22938
Authors: Ben Weiss
Affiliations: Google Research, Playa Vista
Title: Fast Isotropic Median Filtering
Abstract:
Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
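Editor's illustration (not part of the abstract): to make the terms concrete, the sketch below is a brute-force median filter with a circular kernel. It sorts every window, so its cost grows with the kernel area, and it is a reference for the definition only, not the fast algorithm the paper describes. The function names and the clamp-to-edge border handling are assumptions made for this example.

```python
def disk_offsets(radius):
    """Integer offsets (dy, dx) that fall inside a circle of the given radius."""
    r = int(radius)
    return [
        (dy, dx)
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if dy * dy + dx * dx <= radius * radius
    ]


def median_filter_disk(image, radius):
    """Brute-force median filter with a circular kernel.

    image: list of rows of numbers. Borders are handled by clamping
    coordinates to the image (edge replication).
    """
    height, width = len(image), len(image[0])
    offsets = disk_offsets(radius)
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy, dx in offsets
            )
            # The disk is symmetric about its center, so the window always
            # holds an odd number of samples and the median is one of them.
            row.append(window[len(window) // 2])
        out.append(row)
    return out
```

Because the output at each pixel is always one of the input samples, applying a strictly increasing function (such as a gamma curve) before or after filtering gives the same result. This is the invariance to monotonic transformations that the abstract mentions.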
PaperID: 20,   https://arxiv.org/pdf/2403.20193
Authors: Luozhou Wang, Ziyang Mai, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Ying-Cong Chen
Affiliations: Hong Kong University of Science and Technology, China ; Kuaishou Technology, China ; Adobe Research
Title: Motion Inversion for Video Customization
Abstract:
In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a debias operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments. Project page: https://wileewang.github.io/MotionInversion/
PaperID: 21,   https://arxiv.org/pdf/2405.14847     GitHub
Authors: Venkataram Edavamadathil Sivaram, Ravi Ramamoorthi, Tzu-Mao Li
Affiliations: San Diego
Title: Modeling and Rendering Glow Discharge
Abstract:
Previous research in material models for surface and volume scattering has enabled highly realistic scenes in modern rendering systems. However, there has been comparatively little study of light sources in computer graphics despite their critical importance in illuminating and bringing life into these scenes. In the real world, photons are emitted through numerous physical processes including combustion, incandescence, and fluorescence. The qualities of light produced in each of these processes are unique to their physics, making them interesting to study individually.
PaperID: 22,   https://arxiv.org/pdf/2412.14168     GitHub
Authors: Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao
Affiliations: Hong Kong, Alibaba Group
Title: FashionComposer: Compositional Fashion Image Generation
Abstract:
We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We construct scaled training data to enhance the model’s robust compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an “asset library” and employ a reference UNet [Hu et al. 2023] to extract appearance features. To inject the appearance features into the correct pixels in the generated result, we propose subject-binding attention. It binds the appearance features from different “assets” with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications like human album generation, diverse virtual try-on tasks, etc.
PaperID: 23,   https://arxiv.org/pdf/2501.14726     GitHub
Authors: Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito
Affiliations: ETH Zürich, Switzerland ; Codec Avatars Lab, USA ; Codec Avatars Lab, USA ; University of Tübingen, Germany and Tübingen AI Center
Title: Relightable Full-Body Gaussian Codec Avatars
Abstract:
We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.
PaperID: 24,   https://arxiv.org/pdf/2502.14844     GitHub
Authors: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Affiliations: Snap Research, Palo Alto, Santa Monica
Title: Dynamic Concepts Personalization from Single Videos
Abstract:
Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformer (DiT)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an appearance LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the appearance LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model’s output domain, enabling unprecedented editability and compositionality, and setting a new benchmark for personalizing dynamic concepts.
PaperID: 25,   https://arxiv.org/pdf/2505.19976     GitHub
Authors: Naoki Agata, Takeo Igarashi
Affiliations: The University of Tokyo
Title: Motion Control via Metric-Aligning Motion Matching
Abstract:
We introduce a novel method for controlling a motion sequence with an arbitrary temporal control sequence via temporal alignment. Temporal alignment of motion has gained significant attention owing to its applications in motion control and retargeting. Traditional methods rely on either learned or hand-crafted cross-domain mappings between frames in the original and control domains, which often require large, paired, or annotated datasets and time-consuming training. Our approach, named Metric-Aligning Motion Matching, achieves alignment by solely considering within-domain distances. It computes distances among patches in each domain and seeks a matching that optimally aligns the two within-domain distances. This framework allows for the alignment of a motion sequence to various types of control sequences, including sketches, labels, audio, and another motion sequence, all without the need for manually defined mappings or training with annotated data. We demonstrate the effectiveness of our approach through applications in efficient motion control, showcasing its potential in practical scenarios.
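The idea of matching two domains using only within-domain distances can be illustrated with a toy sketch. The abstract does not specify the optimizer, so the exhaustive search over permutations below is a stand-in suitable only for very small inputs, and the names `pairwise_distances` and `metric_aligning_match` are hypothetical. The sketch also assumes both sequences have the same number of patches and comparable distance scales.

```python
from itertools import permutations

import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix among the items of one domain."""
    points = np.asarray(points, dtype=float).reshape(len(points), -1)
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def metric_aligning_match(source, control):
    """Find the one-to-one matching that best aligns within-domain distances.

    Brute force over all permutations: an illustrative stand-in, not the
    paper's solver. No cross-domain distance is ever computed.
    """
    d_src = pairwise_distances(source)
    d_ctl = pairwise_distances(control)
    best, best_cost = None, np.inf
    for perm in permutations(range(len(control))):
        idx = np.array(perm)
        # Compare source distances with control distances under this matching.
        cost = np.sum((d_src - d_ctl[np.ix_(idx, idx)]) ** 2)
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost
```

Here `best[i]` is the index of the control item matched to source item `i`. The two domains may live in entirely different spaces, since only their internal distance matrices are compared.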
PaperID: 26,   https://arxiv.org/pdf/2502.10377     GitHub
Authors: Liyuan Zhu, Shengqu Cai, Shengyu Huang, Gordon Wetzstein, Naji Khosravan, Iro Armeni
Affiliations: Stanford University, USA ; NVIDIA Research, USA ; Zillow Group
Title: Scene-Level Appearance Transfer with Semantic Correspondences
Abstract:
We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. ReStyle3D first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization. Project page and code at https://restyle3d.github.io/.
PaperID: 27,   https://arxiv.org/pdf/2504.21836     GitHub
Authors: Ipek Oztas, Duygu Ceylan, Aysegul Dundar
Affiliations: Bilkent University, Turkiye ; Adobe Research
Title: 3D Stylization via Large Reconstruction Model
Abstract:
With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, appearance stylization being one example. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes. Code and models are available via our project website: https://github.com/ipekoztas/3D-Stylization-LRM.
PaperID: 28,   https://arxiv.org/pdf/2505.05678     GitHub
Authors: Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor
Affiliations: Tel Aviv, Israel and Meta, United Kingdom ; Meta, United Kingdom ; Cornell Tech, New York
Title: InstanceGen: Image Generation with Instance-level Instructions
Abstract:
Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances. Additionally, we contribute CompoundPrompts, a benchmark composed of complex prompts with three difficulty levels in which object instances are progressively compounded with attribute descriptions and spatial relations. Extensive experiments demonstrate that our method significantly surpasses the performance of prior models, particularly over complex multi-object and multi-attribute use cases.
PaperID: 29,   https://arxiv.org/pdf/2502.13951     GitHub
Authors: Sara Dorfman, Dana Cohen-Bar, Rinon Gal, Daniel Cohen-Or
Affiliations: Tel Aviv, Israel ; NVIDIA Research
Title: IP-Composer: Semantic Composition of Visual Concepts
Abstract:
Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image’s CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.
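One plausible reading of "composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces" can be sketched with plain linear algebra. The sketch below uses arbitrary vectors in place of real CLIP embeddings, and the names `concept_projector` and `compose`, the SVD-based subspace estimate, and the `rank` parameter are all assumptions made for illustration rather than details confirmed by the abstract.

```python
import numpy as np

def concept_projector(text_embeddings, rank):
    """Projection matrix onto a subspace spanned by text embeddings.

    `text_embeddings` has one row per text describing a variation of a
    single concept; the top `rank` right singular vectors span the subspace.
    """
    _, _, vt = np.linalg.svd(text_embeddings, full_matrices=False)
    basis = vt[:rank]            # (rank, dim), orthonormal rows
    return basis.T @ basis       # (dim, dim) orthogonal projector

def compose(base_embedding, concept_embedding, projector):
    """Swap the concept component of the base image for the reference's.

    The base keeps everything outside the concept subspace; inside it, the
    component comes from the concept reference image.
    """
    return (base_embedding
            - projector @ base_embedding
            + projector @ concept_embedding)
```

Under these assumptions the composite agrees with the concept reference inside the subspace and with the base image everywhere else; it would then be passed to the image-conditioned generator in place of a single image embedding.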
PaperID: 30,   https://arxiv.org/pdf/2410.18944     GitHub
Authors: Tianyu Huang, Jingwang Ling, Shuang Zhao, Feng Xu
Affiliations: School of Software and BNRist, Tsinghua University, China ; University of California Irvine
Title: Guiding-Based Importance Sampling for Walk on Stars
Abstract:
Walk on stars (WoSt) has shown its power in being applied to Monte Carlo methods for solving partial differential equations, but the sampling techniques in WoSt are not satisfactory, leading to high variance. We propose a guiding-based importance sampling method to reduce the variance of WoSt. Drawing inspiration from path guiding in rendering, we approximate the directional distribution of the recursive term of WoSt using online-learned parametric mixture distributions, decoded by a lightweight neural field. This adaptive approach enables importance sampling the recursive term, which lacks shape information before computation. We introduce a reflection technique to represent guiding distributions at Neumann boundaries and incorporate multiple importance sampling with learnable selection probabilities to further reduce variance. We also present a practical GPU implementation of our method. Experiments show that our method effectively reduces variance compared to the original WoSt, given the same time or the same sample budget. Code and data for this paper are at https://github.com/tyanyuy3125/elaina.
PaperID: 31,   https://arxiv.org/pdf/2507.10542     GitHub
Authors: Shivangi Aneja, Sebastian Weiss, Irene Baeza, Prashanth Chandran, Gaspard Zoss, Matthias Niessner, Derek Bradley
Affiliations: Technical University of Munich, Germany and DisneyResearch|Studios, Switzerland ; DisneyResearch|Studios, Germany ; DisneyResearch|Studios
Title: ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions
Abstract:
Generating high-fidelity real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. This is a challenging problem, particularly when rendering digital avatar close-ups that show a character’s facial microfeatures and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally-defined facial expressions with 3D Gaussian splatting to enable creating ultra-high fidelity, expressive and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar’s dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on-the-fly, conditioned by patch-expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence for high resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.
PaperID: 32,   https://arxiv.org/pdf/2412.15171     GitHub
Authors: Forrest Iandola, Stanislav Pidhorskyi, Igor Santesteban, Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Frank Yu, Emanuel Garbin, Tomas Simon, Shunsuke Saito
Affiliations: USA ; Meta, Switzerland ; Meta, Israel ; Meta
Title: SqueezeMe: Mobile-Ready Distillation of Gaussian Full-Body Avatars
Abstract:
Gaussian-based human avatars have achieved an unprecedented level of visual fidelity. However, existing approaches based on high-capacity neural networks typically require a desktop GPU to achieve real-time performance for a single avatar, and it remains non-trivial to animate and render such avatars on mobile devices including a standalone VR headset due to substantially limited memory and computational bandwidth. In this paper, we present SqueezeMe, a simple and highly effective framework to convert high-fidelity 3D Gaussian full-body avatars into a lightweight representation that supports both animation and rendering with mobile-grade compute. Our key observation is that the decoding of pose-dependent Gaussian attributes from a neural network creates non-negligible memory and computational overhead. Inspired by blendshapes and linear pose correctives widely used in Computer Graphics, we address this by distilling the pose correctives learned with neural networks into linear layers. Moreover, we further reduce the parameters by sharing the correctives among nearby Gaussians. Combining them with a custom splatting pipeline based on Vulkan, we achieve, for the first time, simultaneous animation and rendering of 3 Gaussian avatars in real-time (72 FPS) on a Meta Quest 3 VR headset.
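The step of "distilling the pose correctives learned with neural networks into linear layers" can be illustrated with an ordinary least-squares fit. This is a minimal sketch under stated assumptions: `teacher` stands in for any function mapping a pose vector to corrective offsets, and the names `distill_linear_corrective` and `apply_linear_corrective` are hypothetical. The abstract does not state how the authors perform the distillation, and sharing correctives among nearby Gaussians is not modeled here.

```python
import numpy as np

def distill_linear_corrective(teacher, poses):
    """Fit an affine map pose -> corrective that mimics `teacher`.

    `poses` is an (n, pose_dim) array of sampled poses; `teacher` returns a
    1-D array of corrective values for a single pose.
    """
    targets = np.stack([teacher(p) for p in poses])          # (n, out_dim)
    design = np.hstack([poses, np.ones((len(poses), 1))])    # add bias column
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights                                           # (pose_dim + 1, out_dim)

def apply_linear_corrective(weights, pose):
    """Evaluate the distilled affine layer for one pose (a single mat-vec)."""
    return np.append(pose, 1.0) @ weights
```

At run time the distilled layer costs one small matrix-vector product per evaluation, which is the kind of saving in memory and compute that the abstract attributes to replacing the network decoder.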
PaperID: 33,   https://arxiv.org/pdf/2505.21488     GitHub
Authors: Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Affiliations: Tel Aviv, Israel and Snap Research, Israel ; Snap Research, Palo Alto
Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
Abstract:
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject’s spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model’s prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model’s prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model’s original distribution.
PaperID: 34,   https://arxiv.org/pdf/2505.10566     GitHub
Authors: Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon, Alexander Schwing, Liang-Yan Gui, Matheus Gadelha, Paul Guerrero, Nanxuan Zhao
Affiliations: University of Illinois Urbana-Champaign, USA and Adobe Research, USA ; Adobe Research, San Jose, USA ; Adobe Research, USA ; Adobe Research, United Kingdom ; Adobe Research
Title: 3D-Fixup: Advancing Photo Editing with 3D Priors
Abstract:
Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/.
PaperID: 35,   https://arxiv.org/pdf/2505.17860     GitHub
Authors: Wenning Xu, Shiyu Fan, Paul Henderson, Edmond S. L. Ho
Affiliations: School of Computing Science, University of Glasgow, United Kingdom
Title: Multi-Person Interaction Generation from Two-Person Motion Priors
Abstract:
Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one another’s motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.
PaperID: 36,   https://arxiv.org/pdf/2502.06093     GitHub
Authors: Fanchao Zhong, Yang Wang, Peng-Shuai Wang, Lin Lu, Haisen Zhao
Affiliations: Shandong University, China ; Peking University
Title: DeepMill: Neural Accessibility Learning for Subtractive Manufacturing
Abstract:
Manufacturability is vital for product design and production, with accessibility being a key element, especially in subtractive manufacturing. Traditional methods for geometric accessibility analysis are time-consuming and struggle with scalability, while existing deep learning approaches in manufacturability analysis often neglect geometric challenges in accessibility and are limited to specific model types. In this paper, we introduce DeepMill, the first neural framework designed to accurately and efficiently predict inaccessible and occlusion regions under varying machining tool parameters, applicable to both CAD and freeform models. To address the challenges posed by cutter collisions and the lack of extensive training datasets, we construct a cutter-aware dual-head octree-based convolutional neural network (O-CNN) and generate an inaccessible and occlusion regions analysis dataset with a variety of cutter sizes for network training. Experiments demonstrate that DeepMill achieves 94.7% accuracy in predicting inaccessible regions and 88.7% accuracy in identifying occlusion regions, with an average processing time of 0.04 seconds for finely-tessellated geometries. Based on the outcomes, DeepMill implicitly captures both local and global geometric features, as well as the complex interactions between cutters and intricate 3D models. Code is publicly available at https://github.com/fanchao98/DeepMill.
PaperID: 37,   https://arxiv.org/pdf/2406.17774     GitHub
Authors: Ruben Wiersma, Julien Philip, Miloš Hašan, Krishna Mullia, Fujun Luan, Elmar Eisemann, Valentin Deschaintre
Affiliations: ETH Zurich, Switzerland; Delft University of Technology, Netherlands and Adobe Research, San Francisco, USA ; Adobe Research, San Jose, USA ; Delft University of Technology, Netherlands ; Adobe Research, United Kingdom
Title: Uncertainty for SVBRDF Acquisition using Frequency Analysis
Abstract:
This paper aims to quantify uncertainty for SVBRDF acquisition in multi-view captures. Under uncontrolled illumination and unstructured viewpoints, there is no guarantee that the observations contain enough information to reconstruct the appearance properties of a captured object. We study this ambiguity, or uncertainty, using entropy and accelerate the analysis by using the frequency domain, rather than the domain of incoming and outgoing viewing angles. The result is a method that computes a map of uncertainty over an entire object within a millisecond. We find that the frequency model allows us to recover SVBRDF parameters with competitive performance, that the accelerated entropy computation matches results with a physically-based path tracer, and that there is a positive correlation between error and uncertainty. We then show that the uncertainty map can be applied to improve SVBRDF acquisition using capture guidance, sharing information on the surface, and using a diffusion model to inpaint uncertain regions. Our code is available at https://github.com/rubenwiersma/svbrdf_uncertainty.
PaperID: 38,   https://arxiv.org/pdf/2406.04008     GitHub
Authors: Zhenyu Wang, Min Lu
Affiliations: Shenzhen University
Title: Image-Space Collage and Packing with Differentiable Rendering
Abstract:
Collage and packing techniques are widely used to organize geometric shapes into cohesive visual representations, facilitating the representation of visual features holistically, as seen in image collages and word clouds. Traditional methods often rely on object-space optimization, requiring intricate geometric descriptors and energy functions to handle complex shapes. In this paper, we introduce a versatile image-space collage technique. Leveraging a differentiable renderer, our method effectively optimizes the object layout with image-space losses, bringing the benefit of fixed complexity and easy accommodation of various shapes. Applying a hierarchical resolution strategy in image space, our method efficiently optimizes the collage with fast convergence, taking large coarse steps first and then small precise steps. The diverse visual expressiveness of our approach is demonstrated through various examples. Experimental results show that our method achieves an order-of-magnitude speedup compared to state-of-the-art techniques.
PaperID: 39,   https://arxiv.org/pdf/2501.01407     GitHub
Authors: Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Affiliations: Tel Aviv, Israel and Snap, Israel ; Snap, United Kingdom ; Snap, Los Angeles, USA ; Snap, San Francisco
Title: Nested Attention: Semantic-aware Attention Values for Concept Personalization
Abstract:
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
PaperID: 40,   https://arxiv.org/pdf/2505.03730     GitHub
Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
Affiliations: Tsinghua University, China ; Tencent
Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
Abstract:
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at FlexiAct.
PaperID: 41,   https://arxiv.org/pdf/2506.04228     GitHub
Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
Affiliations: Hong Kong, Alibaba Group
Title: LayerFlow: A Unified Model for Layer-aware Video Generation
Abstract:
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos of different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. To compensate for the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
PaperID: 42,   https://arxiv.org/pdf/2403.07764     GitHub
Authors: Yuxuan Zhang, Yirui Yuan, Yiren Song, Jiaming Liu
Affiliations: Shanghai Jiao Tong University, China ; Shanghai Tech University, China ; National University of Singapore, Singapore ; Tiamat AI
Title: StableMakeup: When Real-World Makeup Transfer Meets Diffusion Model
Abstract:
Current makeup transfer methods are limited to simple makeup styles, making them difficult to apply in real-world scenarios. In this paper, we introduce Stable-Makeup, a novel diffusion-based makeup transfer method capable of robustly transferring a wide range of real-world makeup onto user-provided faces. Stable-Makeup is based on a pre-trained diffusion model and utilizes a Detail-Preserving (D-P) makeup encoder to encode makeup details. It also employs content and structural control modules to preserve the content and structural information of the source image. With the aid of our newly added makeup cross-attention layers in U-Net, we can accurately transfer the detailed makeup to the corresponding position in the source image. After content-structure decoupling training, Stable-Makeup can maintain the content and the facial structure of the source image. Moreover, our method has demonstrated strong robustness and generalizability, making it applicable to various tasks such as cross-domain makeup transfer and makeup-guided text-to-image generation. Extensive experiments have demonstrated that our approach delivers state-of-the-art results among existing makeup transfer methods and shows broad potential for applications in various related fields.
PaperID: 43,   https://arxiv.org/pdf/2507.17029     GitHub
Authors: Luchuan Song, Yang Zhou, Zhan Xu, Yi Zhou, Deepali Aneja, Chenliang Xu
Affiliations: University of Rochester, New York, USA and Adobe Research, San Jose, USA ; Adobe Research
Title: StreamME: Simplify 3D Gaussian Avatar within Live Stream
Abstract:
We propose StreamME, a method that focuses on fast 3D avatar reconstruction. StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expressions. To further ensure high efficiency in on-the-fly training, we introduce a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, reducing the number of points while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects facial privacy and reduces communication bandwidth in VR systems or online conferencing. Additionally, it can be directly applied to downstream applications such as animation, toonification, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.
PaperID: 44,   https://arxiv.org/pdf/2502.04050     GitHub
Authors: Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka
Affiliations: King Abdullah University of Science and Technology (KAUST), Saudi Arabia
Title: PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models
Abstract:
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalize on diffusion models' deep understanding of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region. Leveraging these masks, we design feature-blending and adaptive thresholding strategies to execute the edits seamlessly. To evaluate our approach, we establish a benchmark and an evaluation protocol for part editing. Experiments show that our approach outperforms existing editing methods on all metrics and is preferred by users 66–90% of the time in conducted user studies.
PaperID: 45,   https://arxiv.org/pdf/2408.13252     GitHub
Authors: Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Gordon Wetzstein, Ziwei Liu, Dahua Lin
Affiliations: Shanghai Jiao Tong University, China and Shanghai Artificial Intelligence Laboratory, Hong Kong, Palo Alto, USA ; Nanyang Technological University
Title: LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation
Abstract:
3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for large-range exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) We introduce a new panorama dataset, Upright360, comprising 9k high-quality and upright panorama images, and finetune the advanced Flux model on Upright360 for high-quality, upright, and consistent panorama generation. 2) We pioneer the Layered 3D Panorama as the underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scenes in both full-view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications. For more examples, please visit our webpage: ys-imtech.github.io/projects/LayerPano3D/
PaperID: 46,   https://arxiv.org/pdf/2501.02048     GitHub
Authors: Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao
Affiliations: Hong Kong
Title: DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
Abstract:
Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world.
PaperID: 47,   https://arxiv.org/pdf/2412.06753     GitHub
Authors: Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, Kfir Aberman
Affiliations: New York City, USA and Electrical and Computer Engineering, Los Angeles, Tel Aviv, Israel ; Snap, USA ; Electrical and Computer Engineering, USA ; Snap, Palo Alto
Title: InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention
Abstract:
Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (∼4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
PaperID: 48,   https://arxiv.org/pdf/2506.07738     GitHub
Authors: Lanjiong Li, Guanhua Zhao, Lingting Zhu, Zeyu Cai, Lequan Yu, Jian Zhang, Zeyu Wang
Affiliations: China ; School of Electronic and Computer Engineering, Peking University, Hong Kong
Title: AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization
Abstract:
Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first generative framework designed to extract any asset from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research in downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to achieve a closed loop with feedback. We design the reward model to perform an inverse task that pastes the extracted assets back into the reference sources, which assists training with additional consistency and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Our code and dataset are available at https://github.com/Lanjiong-Li/AssetDropper.
PaperID: 49,   https://arxiv.org/pdf/2504.09975     GitHub
Authors: Si-Tong Wei, Rui-Huan Wang, Chuan-Zhi Zhou, Baoquan Chen, Peng-Shuai Wang
Affiliations: Wangxuan Institute of Computer Technology, Peking University, China ; School of Intelligent Science and Technology
Title: OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation
Abstract:
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., 1024³, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.
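The idea of turning an octree into a multiscale binary sequence can be illustrated with a minimal sketch. It assumes a set of occupied voxels in a 2^depth grid and emits, level by level, eight occupancy bits for every non-empty node. The function name `serialize_octree`, the child ordering, and the breadth-first layout are illustrative assumptions rather than the authors' implementation, and the VQVAE tokens for fine detail are not modeled here.

```python
# Illustrative sketch only: names, child ordering, and layout are assumptions,
# not taken from the OctGPT code base.

def serialize_octree(voxels, depth):
    """Return one list of binary tokens per octree level, coarse to fine.

    voxels: iterable of (x, y, z) integer coordinates in a (2**depth)^3 grid
    depth:  number of subdivision levels
    """
    size = 1 << depth
    occupied = set(voxels)
    nodes = [(0, 0, 0)]  # origins of non-empty nodes at the current level
    levels = []
    for level in range(depth):
        cell = size >> (level + 1)  # edge length of a child cell, in voxels
        tokens, next_nodes = [], []
        for (x, y, z) in nodes:
            for child in range(8):
                # Bits of the child index select the +x, +y, +z halves.
                dx, dy, dz = (child >> 2) & 1, (child >> 1) & 1, child & 1
                cx, cy, cz = x + dx * cell, y + dy * cell, z + dz * cell
                full = any(
                    cx <= vx < cx + cell
                    and cy <= vy < cy + cell
                    and cz <= vz < cz + cell
                    for (vx, vy, vz) in occupied
                )
                tokens.append(1 if full else 0)
                if full:
                    next_nodes.append((cx, cy, cz))  # only non-empty children are refined
        levels.append(tokens)
        nodes = next_nodes
    return levels


# Two occupied voxels at opposite corners of a 4x4x4 grid.
levels = serialize_octree([(0, 0, 0), (3, 3, 3)], depth=2)
sequence = [bit for tokens in levels for bit in tokens]  # flat coarse-to-fine sequence
```

Because only non-empty nodes are subdivided, the sequence length grows with the occupied surface rather than with the full grid volume, which is what makes such a representation attractive for autoregressive prediction.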
PaperID: 50,   https://arxiv.org/pdf/2505.04203     GitHub
Authors: Zhiping Qiu, Yitong Jin, Yuan Wang, Yi Shi, Chao Tan, Chongwu Wang, Xiaobing Li, Feng Yu, Tao Yu, Qionghai Dai
Affiliations: Central Conservatory of Music, China ; Weilan Tech, China ; BNRist, Tsinghua University
Title: ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition
Abstract:
The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of the instrument performance, we introduce Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR has shown great potential in generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, interactive art creation, etc. Our code and SPD-GEN dataset are available at https://github.com/Qzping/ELGAR.
PaperID: 51,   https://arxiv.org/pdf/2505.09608     GitHub
Authors: Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, Yedid Hoshen
Affiliations: Tel Aviv University, Israel and Google, Israel ; Google, USA ; Google, Mountain View, Israel ; Reichman University, Israel ; Hebrew University of Jerusalem
Title: LightLab: Controlling Light Sources in Images with Diffusion Models
Abstract:
We present a simple yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for the relighting task. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show that our method achieves compelling light editing results and outperforms existing methods based on user preference.
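The linearity-of-light step can be illustrated with a small sketch. Assuming linear (raw) pixel values, subtracting a photograph taken with the target light off from one taken with it on isolates that light's contribution, which can then be rescaled and tinted before being added back to a rescaled ambient image. The function name `relight` and the parameters `alpha`, `gamma`, and `tint` are hypothetical and chosen for illustration; they are not the authors' notation or code.

```python
# Illustrative sketch only: names and parameters are hypothetical, not from the paper.
# Images are lists of (r, g, b) tuples holding linear (raw) intensities.

def relight(img_off, img_on, alpha=1.0, gamma=1.0, tint=(1.0, 1.0, 1.0)):
    """Synthesize a relit image from an off/on photograph pair.

    img_off: scene with the target light switched off (ambient only)
    img_on:  same scene with the target light switched on
    alpha:   scale applied to the ambient illumination
    gamma:   scale applied to the target light's intensity
    tint:    per-channel color multiplier for the target light
    """
    out = []
    for off_px, on_px in zip(img_off, img_on):
        px = []
        for c in range(3):
            # Contribution of the target light alone, clamped against sensor noise.
            light = max(on_px[c] - off_px[c], 0.0)
            px.append(alpha * off_px[c] + gamma * tint[c] * light)
        out.append(tuple(px))
    return out


ambient = [(0.10, 0.10, 0.10), (0.20, 0.20, 0.20)]
lit = [(0.50, 0.40, 0.30), (0.20, 0.20, 0.20)]
half = relight(ambient, lit, gamma=0.5)      # target light at half intensity
dark = relight(ambient, lit, gamma=0.0)      # target light switched off
```

The subtraction is only meaningful in a linear color space, which is presumably why the abstract specifies raw photograph pairs.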
PaperID: 52,   https://arxiv.org/pdf/2501.15641     GitHub
Authors: Yuxin Zhang, Minyan Luo, Weiming Dong, Xiao Yang, Haibin Huang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, Changsheng Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences, China and School of Artificial Intelligence, China ; MAIS, China ; ByteDance Inc., San Jose, USA ; ByteDance Inc., USA ; University of Konstanz, Germany ; National Cheng-Kung University, Taiwan ; MAIS
Title: IP-Prompter: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting
Abstract:
The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. Such diversity also introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: Can image generation models directly leverage images as contextual input, similarly to how large language models use text as context? To address this, we present IP-Prompter, a novel training-free TSI generation method. IP-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that IP-Prompter achieves significantly better results and excels in preserving character identity, style consistency, and text alignment, offering a robust and flexible solution for theme-specific image generation. Our project page: https://ip-prompter.github.io/.
PaperID: 53,   https://arxiv.org/pdf/2408.00458     GitHub
Authors: Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
Affiliations: ETH Zürich, Switzerland and DisneyResearch|Studios, Switzerland ; DisneyResearch|Studios, Switzerland ; DisneyResearch|Studios
Title: Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
Abstract:
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion.
PaperID: 54,   https://arxiv.org/pdf/2506.23957     GitHub
Authors: Zinuo You, Stamatios Georgoulis, Anpei Chen, Siyu Tang, Dengxin Dai
Affiliations: ETH Zürich, Switzerland ; Huawei Research Zürich, Switzerland and University of Tübingen
Title: GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering
Abstract:
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues (e.g., geometric distortions, excessive cropping, poor generalization) that degrade the user experience. To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent ‘local reconstruction and rendering’ paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives, and finetune it at test-time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally-consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping. Our method is evaluated on a repurposed dataset, instilled with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of conventional task metrics and new geometry consistency. Qualitatively, our method produces noticeably better results compared to alternatives, validated by the user study. Project Page: sinoyou.github.io/gavs.
PaperID: 55,   https://arxiv.org/pdf/2505.21925     GitHub
Authors: Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
Affiliations: State Key Lab of CAD & CG, Zhejiang University, China and Microsoft Research Asia, China ; Microsoft Research Asia, China ; College of William & Mary
Title: RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
Abstract:
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
PaperID: 56,   https://arxiv.org/pdf/2506.18680     GitHub
Authors: Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, Chuan Guo
Affiliations: Saarland University, Max Planck Institute for Informatics, Germany ; Snap Inc., New York City, Germany ; DFKI
Title: DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling
Abstract:
We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers’ motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination. Code and model weights are available at https://github.com/anindita127/DuetGen.
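Iteratively filling an empty token sequence with a masked model can be sketched as follows. This is a generic confidence-based decoding loop in the style commonly used with masked generative transformers, not DuetGen's actual procedure: the cosine unmasking schedule, the `MASK` sentinel, and the `toy_predictor` stand-in for a trained transformer are all assumptions made for illustration, and music conditioning and the two-level token hierarchy are omitted.

```python
import math

# Illustrative sketch only: schedule, sentinel, and predictor are assumptions.
MASK = -1


def toy_predictor(tokens):
    """Hypothetical stand-in for a masked transformer.

    Returns a (predicted_token, confidence) pair for every position.
    """
    return [((i * 7) % 5, 1.0 / (1 + i)) for i in range(len(tokens))]


def iterative_fill(length, num_iters, predictor=toy_predictor):
    """Start from an all-masked sequence and commit tokens over several passes."""
    tokens = [MASK] * length
    for it in range(num_iters):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predictor(tokens)
        # Cosine schedule: how many positions should stay masked after this pass.
        keep_masked = int(length * math.cos(math.pi / 2 * (it + 1) / num_iters))
        n_unmask = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked for later passes.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_unmask]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


filled = iterative_fill(length=8, num_iters=4)
```

In a two-stage setup like the one described above, a loop of this kind would run once for the high-level tokens and again for the low-level tokens, with the second pass conditioned on the output of the first.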
PaperID: 57,   https://arxiv.org/pdf/2502.08639     GitHub
Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai
Affiliations: Dalian University of Technology, Hong Kong, China ; Kuaishou Technology
Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Abstract:
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware control signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals (rendered depth maps, camera trajectories, and object class labels) serve as the guidance for a text-to-video diffusion model, ensuring that it generates the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves strong 3D-aware text-to-video generation.
PaperID: 58,   https://arxiv.org/pdf/2502.20307     GitHub
Authors: Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao
Affiliations: Chongqing University of Post and Telecommunications, China ; Meituan, China ; GVC Lab, Great Bay University, China and Dongguan Key Laboratory for Intelligence and Information Technology, China ; University of Macau
Title: Mobius: Text to Seamless Looping Video Generation via Latent Shift
Abstract:
We present Mobius, a novel method to generate seamlessly looping videos directly from text descriptions without any user annotations, thereby creating new visual materials for multimedia presentation. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that the temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model’s context. Unlike previous cinemagraphs, the proposed method does not require an image as the appearance reference, which would restrict the motion of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.
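The latent-shift idea can be sketched with a toy example. Frame latents are treated as a ring; at every denoising step the ring is rotated by one frame before being cut into context windows, so window boundaries move from step to step and some window always spans the seam between the last and first frame. The helper names `window_indices`, `denoise_window`, and `looped_denoise`, the scalar latents, and the placeholder denoiser are hypothetical; how the authors actually partition the cycle per step is not stated in the abstract.

```python
# Illustrative sketch only: names, scalar latents, and the placeholder
# denoiser are assumptions, not the Mobius implementation.

def window_indices(n, context_len, step):
    """Index groups for one step: rotate the ring by (step + 1), then chunk it."""
    shift = (step + 1) % n
    order = [(k + shift) % n for k in range(n)]
    return [order[s:s + context_len] for s in range(0, n, context_len)]


def denoise_window(window, step):
    """Placeholder for a video diffusion model jointly denoising one window."""
    return [0.9 * x for x in window]


def looped_denoise(latents, context_len, num_steps):
    """Denoise a cycle of latents that may be longer than the model's context."""
    latents = list(latents)
    n = len(latents)
    for step in range(num_steps):
        for idx in window_indices(n, context_len, step):
            window = [latents[i] for i in idx]
            for i, value in zip(idx, denoise_window(window, step)):
                latents[i] = value  # every frame is updated once per step
    return latents


# Six frames in the cycle, but the (toy) model only sees four at a time.
first_step_windows = window_indices(6, 4, step=0)
result = looped_denoise([1.0] * 6, context_len=4, num_steps=3)
```

In this toy run the first step groups frames 5 and 0 into the same window, which is the property that lets the end of the clip stay consistent with its beginning. The final window is shorter than the context when the cycle length is not a multiple of it; a real model would need to handle that case.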
PaperID: 59,   https://arxiv.org/pdf/2505.07003     GitHub
Authors: Peng Li, Suizhi Ma, Jialiang Chen, Yuan Liu, Congyi Zhang, Wei Xue, Wenhan Luo, Alla Sheffer, Wenping Wang, Yike Guo
Affiliations: Hong Kong, Canada ; Texas A&M University, College Station
Title: CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation
Abstract:
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods rely only on an input image or a text prompt to generate a 3D model, and thus lack control over each component of the generated 3D model. Any modification of the input image leads to a complete regeneration of the 3D model. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
PaperID: 60,   https://arxiv.org/pdf/2505.06227     GitHub
Authors: Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, Jiajun Wu
Affiliations: Stanford University, USA and University of Cambridge
Title: Anymate: A Dataset and Baselines for Learning 3D Object Rigging
Abstract:
Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information—70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at https://anymate3d.github.io/.
PaperID: 61,   https://arxiv.org/pdf/2505.05672     GitHub
Authors: Gengyan Li, Paulo Gotardo, Timo Bolkart, Stephan Garbin, Kripasindhu Sarkar, Abhimitra Meka, Alexandros Lattas, Thabo Beeler
Affiliations: ETH Zürich, Switzerland and Google, Switzerland ; Google, United Kingdom ; Google, San Francisco, USA ; Google
Title: TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling
Abstract:
Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that are rendered under arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component in emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in input video images; due to inaccurate motion estimation, animatable models typically present a loss of fidelity and detail when compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, largely increasing the number of 3D Gaussians and modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modelled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure that allows our head model to preserve appearance detail, while capturing facial motion and other transient high-frequency features such as skin wrinkling.
PaperID: 62,   https://arxiv.org/pdf/2503.05639     GitHub
Authors: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
Affiliations: Hong Kong, Japan ; University of Macau, China ; Tencent
Title: VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Abstract:
Video inpainting, crucial for the media industry, aims to restore corrupted content. However, current methods relying on limited pixel propagation or single-branch image inpainting architectures face challenges with generating fully masked objects, balancing background preservation with foreground generation, and maintaining ID consistency over long videos. To address these issues, we propose VideoPainter, an efficient dual-branch framework featuring a lightweight context encoder. This plug-and-play encoder processes masked videos and injects background guidance into any pre-trained video diffusion transformer, generalizing across arbitrary mask types, enhancing background integration and foreground generation, and enabling user-customized control. We further introduce a strategy to resample inpainting regions for maintaining ID consistency in any-length video inpainting. Additionally, we develop a scalable dataset pipeline using advanced vision models and construct VPData and VPBench, the largest video inpainting dataset with segmentation masks and dense captions (>390K clips), to support large-scale training and evaluation. We also show VideoPainter’s promising potential in downstream applications such as video editing. Extensive experiments demonstrate VideoPainter’s state-of-the-art performance in any-length video inpainting and editing across 8 key metrics, including video quality, mask region preservation, and textual coherence.
PaperID: 63,   https://arxiv.org/pdf/2405.18133     GitHub
Authors: Jingrui Xing, Bin Wang, Mengyu Chu, Baoquan Chen
Affiliations: School of Intelligence Science and Technology, Peking University, China ; Independent Researcher, China ; State Key Laboratory of General Artificial Intelligence
Title: Gaussian Fluids: A Grid-Free Fluid Solver based on Gaussian Spatial Representation
Abstract:
We present a grid-free fluid solver featuring a novel Gaussian spatial representation (GSR). Drawing inspiration from the expressive capabilities of 3D Gaussian Splatting in multi-view image reconstruction, we model the continuous flow velocity as a weighted sum of multiple Gaussian functions. This representation is continuously differentiable, which enables us to derive spatial differentials directly and solve the time-dependent PDE via a custom first-order optimization tailored to fluid dynamics. Compared to traditional discretizations, which typically adopt Eulerian, Lagrangian, or hybrid perspectives, our approach is inherently memory-efficient and spatially adaptive, enabling it to preserve fine-scale structures and vortices with high fidelity. While these advantages are also sought by implicit neural representations, GSR offers enhanced robustness, accuracy, and generality across diverse fluid phenomena, with improved computational efficiency during temporal evolution. Though our first-order solver does not yet match the speed of fluid solvers using explicit representations, its continuous nature substantially reduces spatial discretization error and opens a new avenue for high-fidelity simulation. We evaluate the proposed solver across a broad range of 2D and 3D fluid phenomena, demonstrating its ability to preserve intricate vortex dynamics, accurately capture boundary-induced effects such as Kármán vortex streets, and remain robust across long time horizons, all without additional parameter tuning. Our results suggest that GSR offers a compelling direction for future research in fluid simulation. The source code for our fluid solver is publicly available at https://github.com/xjr01/Gaussian-Fluids-Code.
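To make the "weighted sum of Gaussians" idea concrete, the sketch below evaluates a velocity field and its analytic divergence. It assumes isotropic Gaussians with vector-valued weights, which is a simplification chosen for illustration; the function names and parameterization are hypothetical and not taken from the released code.

```python
import numpy as np

def gaussian_values(x, centers, sigmas):
    """g_i(x) = exp(-|x - c_i|^2 / (2 s_i^2)) for every Gaussian i (isotropic, an assumption)."""
    d = x - centers                                   # (N, D) offsets to each center
    return d, np.exp(-np.sum(d * d, axis=1) / (2.0 * sigmas ** 2))

def velocity(x, centers, sigmas, weights):
    """u(x) = sum_i g_i(x) * w_i, where w_i is a D-dimensional weight vector."""
    _, g = gaussian_values(x, centers, sigmas)
    return g @ weights                                # (D,)

def divergence(x, centers, sigmas, weights):
    """Analytic divergence: sum_i g_i(x) * (-(x - c_i) . w_i / s_i^2)."""
    d, g = gaussian_values(x, centers, sigmas)
    return np.sum(g * (-np.sum(d * weights, axis=1) / sigmas ** 2))
```

Because the field is a closed-form expression, spatial derivatives such as the divergence come from differentiating the Gaussians directly instead of from finite differences on a grid.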
PaperID: 64,   https://arxiv.org/pdf/2508.07557     GitHub
Authors: Minghao Yin, Yukang Cao, Songyou Peng, Kai Han
Affiliations: Singapore ; Google DeepMind, San Francisco, Hong Kong
Title: Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation
Abstract:
Generating high-quality 4D content from monocular videos—for applications such as digital humans and AR/VR—poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence, by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions. Project page: https://visual-ai.github.io/splat4d
PaperID: 65,   https://arxiv.org/pdf/2505.06523     GitHub
Authors: Xijie Yang, Linning Xu, Lihan Jiang, Dahua Lin, Bo Dai
Affiliations: Zhejiang University, China and Shanghai Artificial Intelligence Laboratory, Hong Kong, China ; University of Science and Technology of China, China and Feeling AI
Title: Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes
Abstract:
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5’s Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.
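The online selection stage can be pictured as a cut through a cluster hierarchy. The sketch below is a simplified illustration under stated assumptions: each cluster has a bounding radius, its footprint is approximated as `focal_px * radius / distance`, and a cluster is accepted once that footprint falls below a pixel tolerance. The `Cluster` class and both functions are hypothetical and do not reproduce the paper's footprint evaluation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Cluster:
    center: tuple            # bounding-sphere center (x, y, z)
    radius: float            # bounding-sphere radius
    children: list = field(default_factory=list)  # finer clusters, empty at leaves

def footprint_px(cluster, cam_pos, focal_px):
    """Approximate projected size in pixels (a simple pinhole assumption)."""
    dist = max(math.dist(cluster.center, cam_pos), 1e-6)
    return focal_px * cluster.radius / dist

def select_clusters(cluster, cam_pos, focal_px, tol_px):
    """Return the coarsest clusters whose footprint is within the tolerance."""
    if not cluster.children or footprint_px(cluster, cam_pos, focal_px) <= tol_px:
        return [cluster]
    selected = []
    for child in cluster.children:
        selected.extend(select_clusters(child, cam_pos, focal_px, tol_px))
    return selected
```

With this rule, a distant asset collapses to a single coarse cluster, while a nearby one is refined down to its leaves, so the number of rasterized Gaussians tracks what is perceptible on screen.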
PaperID: 66   GitHub
Authors: Xuejun Hu, Jinfan Lu, Kun Xu
Affiliations: Tsinghua University
Title: Kernel Predicting Neural Shadow Maps
Abstract:
Existing neural shadow mapping methods [Datta et al. 2022] have been shown to be promising in generating high-quality soft shadows. However, they demonstrate limited generalizability to new scenes. In this paper, we present a novel neural method, named kernel predicting neural shadow mapping, to address this issue. Specifically, we explicitly model soft shadow values as pixelwise local filtering from nearby base shadow values (i.e., the classic hard shadow values) in the screen space, where the local filter weights are predicted through a trained neural network. We use dilated filters as the representation of our local filters to maintain a balance between computational efficiency and receptive field of a local filter. We further enhance shadow quality by replacing the classic shadow map algorithm [Williams 1978] with moment shadow maps [Peters and Klein 2015] to generate the base shadow values. With carefully designed filters, input features, and loss functions with temporal regularization, our method runs at real-time frame rates (i.e., >100 fps at 2048 × 1024 resolution), produces temporally stable soft shadows with good generalizability, and consistently beats state-of-the-art methods in both visual quality and numeric measures. Code and model weights are available at https://github.com/Hoosus/KPNSM.
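The pixelwise filtering step can be sketched as follows. The example assumes the per-pixel weights have already been produced by some network and only shows how a single dilated K×K filter would be applied to the base shadow values; the array layout, the edge padding, and the function name are assumptions for illustration.

```python
import numpy as np

def apply_dilated_kernels(base, weights, dilation):
    """Filter base shadows with per-pixel dilated kernels.

    base:    (H, W) base (hard) shadow values
    weights: (H, W, K, K) per-pixel filter weights, K odd (assumed predicted elsewhere)
    """
    h, w = base.shape
    k = weights.shape[2]
    r = (k // 2) * dilation
    padded = np.pad(base, r, mode="edge")  # replicate borders so every tap is defined
    out = np.zeros_like(base, dtype=float)
    for i in range(k):
        for j in range(k):
            dy, dx = i * dilation, j * dilation
            # Tap (i, j) reads the pixel offset by ((i - K//2) * d, (j - K//2) * d).
            out += weights[:, :, i, j] * padded[dy:dy + h, dx:dx + w]
    return out
```

Increasing `dilation` widens the receptive field without adding taps, which is the efficiency trade-off the abstract refers to.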
PaperID: 67   GitHub
Authors: Linjun Wu, Xiangjun Tang, Jingyuan Cong, He Wang, Bo Hu, Xu Gong, Songnan Li, Yuchen Liao, Yiqian Wu, Chen Liu, Xiaogang Jin
Affiliations: State Key Lab of CAD and CG, Zhejiang University, San Diego, USA ; UCL Centre for Artificial Intelligence, Department of Computer Science, University College London (UCL), United Kingdom ; Tencent Technology (Shenzhen) Co., China ; Tencent Technology (Shenzhen) Co.
Title: Semantically Consistent Text-to-Motion with Unsupervised Styles
Abstract:
Text-to-stylized human motion generation leverages text descriptions for motion generation with fine-grained style control with respect to a reference motion. However, existing approaches typically rely on supervised style learning with labeled datasets, constraining their adaptability and generalization for effective diverse style control. Additionally, they have not fully explored the temporal correlations between motion, textual descriptions, and style, making it challenging to generate semantically consistent motion with precise style alignment. To address these limitations, we introduce a novel method that integrates unsupervised style from arbitrary references into a text-driven diffusion model to generate semantically consistent stylized human motion. The core innovation lies in leveraging text as a mediator to capture the temporal correspondences between motion and style, enabling the seamless integration of temporally dynamic style into motion features. Specifically, we first train a diffusion model on a text-motion dataset to capture the correlation between motion and text semantics. A style adapter then extracts temporally dynamic style features from reference motions and integrates a novel Semantic-Aware Style Injection (SASI) module to infuse these features into the diffusion model. The SASI module computes the semantic correlation between motion and style features based on text, selectively incorporating style features that align with motion content, ensuring semantic consistency and precise style alignment. Our style adapter does not require a labeled style dataset for training, enhancing adaptability and generalization of style control. Extensive evaluations show that our method outperforms previous approaches in terms of semantic consistency and style expressivity. Our webpage, https://fivezerojun.github.io/stylization.github.io/, includes links to the supplementary video and code.
PaperID: 68   GitHub
Authors: Hongbo Zhao, Jiaxing Li, Peiyi Zhang, Peng Xiao, Jianxin Lin, Yijun Wang
Affiliations: Hunan University
Title: ColorSurge: Bringing Vibrancy and Efficiency to Automatic Video Colorization via Dual-Branch Fusion
Abstract:
Automatic video colorization poses challenges, requiring efficient generation of results that ensure frame and color consistency. Previous video colorization works often suffer from issues such as color flickering, bleeding, artifacts, and low color richness due to the inherent ambiguity and limitations of the models. While diffusion-based video-to-video approaches can produce customized colorization models through fine-tuning, their high inference costs limit their suitability for real-time scenarios. To address these challenges, we propose ColorSurge, a lightweight network for efficient end-to-end video colorization. ColorSurge employs a dual-branch structure, consisting of a grayscale branch and a color branch. In the grayscale branch, we extract the semantic content of grayscale videos and reconstruct and output features at different spatial scales. In the color branch, we introduce learnable color tokens and fuse these multi-scale semantic features through stacked Color Alchemy Blocks (CABs). Within each CAB, we incorporate Color Spatial Transformer Blocks (CSTB) and Color Temporal Transformer Blocks (CTTB) to constrain the spatial harmony and temporal consistency of colors. Finally, we use a Color Mapper to unify the grayscale and color features, mapping them to obtain the final colorized video result. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art models in both qualitative and quantitative evaluations. Our code and model are available at https://github.com/ABTols/ColorSurge.
PaperID: 69   GitHub
Authors: Chengzhu He, Zhendong Wang, Zhaorui Meng, Junfeng Yao, Shihui Guo, Huamin Wang
Affiliations: Xiamen University, China and StyleD Research, China ; StyleD Research
Title: Automated Task Scheduling for Cloth and Deformable Body Simulations in Heterogeneous Computing Environments
Abstract:
The concept of the Internet of Things (IoT) has driven the development of system-on-a-chip (SoC) technology for embedded and mobile systems, which may define the future of next-generation computation. In SoC devices, efficient cloth and deformable body simulations require parallelized, heterogeneous computation across multiple processing units. The key challenge in heterogeneous computation lies in task distribution, which must account for varying inter-task dependencies and communication costs. This paper proposes a novel framework for automated task scheduling to optimize simulation performance by minimizing communication overhead and aligning tasks with the specific strengths of each device. To achieve this, we introduce an efficient scheduling method based on the Heterogeneous Earliest Finish Time (HEFT) algorithm, adapted for hybrid systems. We model simulation tasks—such as those in iterative methods like Jacobi and Gauss-Seidel—as a Directed Acyclic Graph (DAG). To maximize the parallelism of nonlinear Gauss-Seidel simulation tasks, we present an innovative asynchronous Gauss-Seidel method with specialized data synchronization across units. Additionally, we employ task merging and tailored task-sorting strategies for Gauss-Seidel tasks to achieve an optimal balance between convergence and efficiency. We validate the effectiveness of our framework across various simulations, including XPBD, vertex block descent, and second-order stencil descent, using Apple M-series processors with both CPU and GPU cores. By maximizing computational efficiency and reducing processing times, our method achieves superior simulation frame rates compared to approaches that rely on individual devices in isolation. The source code with hybrid Metal/C++ implementation is available at https://github.com/ChengzhuUwU/libAtsSim.
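For readers unfamiliar with HEFT, the sketch below shows the textbook idea in simplified form: rank tasks by their upward rank in the DAG, then place each task on the device that gives the earliest finish time. It omits HEFT's insertion-based slot search and none of the paper's adaptations (asynchronous Gauss-Seidel, task merging) are modeled; the function name and data layout are hypothetical, and all costs are assumed positive.

```python
from functools import lru_cache

def heft_schedule(costs, edges):
    """Simplified HEFT-style list scheduling.

    costs: {task: [run time on device 0, run time on device 1, ...]}
    edges: {(pred, succ): transfer time, paid only if the two tasks run on different devices}
    Returns {task: (device, finish_time)}.
    """
    succs = {t: [] for t in costs}
    preds = {t: [] for t in costs}
    for (p, s), c in edges.items():
        succs[p].append((s, c))
        preds[s].append((p, c))

    @lru_cache(maxsize=None)
    def upward_rank(t):
        avg = sum(costs[t]) / len(costs[t])
        return avg + max((c + upward_rank(s) for s, c in succs[t]), default=0.0)

    # Decreasing upward rank is a valid topological order when costs are positive.
    order = sorted(costs, key=upward_rank, reverse=True)
    n_dev = len(next(iter(costs.values())))
    device_free = [0.0] * n_dev
    placed = {}
    for t in order:
        best = None
        for d in range(n_dev):
            ready = max((placed[p][1] + (c if placed[p][0] != d else 0.0)
                         for p, c in preds[t]), default=0.0)
            finish = max(ready, device_free[d]) + costs[t][d]
            if best is None or finish < best[1]:
                best = (d, finish)
        placed[t] = best
        device_free[best[0]] = best[1]
    return placed
```

In a two-device setting (say, CPU and GPU), independent successor tasks end up split across devices whenever the transfer cost is smaller than the time saved by running them in parallel.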
PaperID: 70   GitHub
Authors: Fengqi Liu, Longji Huang, Zhengyu Huang, Zeyu Wang
Affiliations: Hong Kong
Title: Learning to Draw Is Learning to See: Analyzing Eye Tracking Patterns for Assisted Observational Drawing
Abstract:
Drawing is an artistic process involving extensive observation. Understanding how professional artists observe as they draw has significant value because it offers insight into their perception patterns and acquired skills. While previous studies used eye tracking to analyze the drawing process, they fell short in aligning gaze data with drawing actions due to the spatial and temporal gaps between observation and drawing in a model-to-paper setup. This paper presents a study in an image-to-image setup, in which artists observe a reference image and draw on a blank canvas on a tablet, capturing a clearer mapping between eye movements and drawn strokes. Our analysis demonstrates a strong spatial correlation between observed regions and corresponding strokes. We further find that artists initially follow a more structured region-by-region approach and then switch to a less constrained sequence for details. Based on these findings, we develop an assistive interface that integrates real-time visual guidance from professional artists’ eye tracking data, enabling novices to emulate their observation and drawing strategies. A user study shows that novices can draw significantly more accurate shapes using our assistive interface, highlighting the importance of modeling observation and the potential of leveraging eye tracking data in future educational and creativity support tools. Our datasets, analysis code, and assistive interface are available at https://github.com/CISLab-HKUST/Learning-to-Draw-Is-Learning-to-See.
PaperID: 71,   https://arxiv.org/pdf/2507.07521    
Authors: Mingyang Song, Yang Zhang, Marko Mihajlovic, Siyu Tang, Markus Gross, Tunç Ozan Aydın
Affiliations: Switzerland and ETH Zürich, Switzerland ; DisneyResearch|Studios, Switzerland ; ETH Zürich, Switzerland ; DisneyResearch|Studios
Title: Spline Deformation Field
Abstract:
Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods focus either on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities, preserving spatial coherence and accelerations, while mitigating temporal fluctuations. To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves competitive dynamic scene reconstruction quality compared to state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
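The benefit of a spline trajectory, namely closed-form velocities from a small set of knots, can be illustrated with a generic cubic Hermite spline. This is a stand-in chosen for illustration (uniform knot times and Catmull-Rom-style tangents are assumptions), not the spline formulation or the spatial encoding used in the paper.

```python
import numpy as np

def spline_position_velocity(knots, t):
    """Evaluate a cubic Hermite spline through `knots` at time t in [0, K-1].

    knots: (K, D) positions at integer times 0..K-1 (K >= 2).
    Returns (position, velocity), both analytic in t.
    """
    knots = np.asarray(knots, dtype=float)
    tangents = np.gradient(knots, axis=0)      # central differences, one-sided at the ends
    k = max(min(int(np.floor(t)), len(knots) - 2), 0)
    s = t - k                                  # local parameter in [0, 1]
    p0, p1 = knots[k], knots[k + 1]
    m0, m1 = tangents[k], tangents[k + 1]
    pos = ((2 * s**3 - 3 * s**2 + 1) * p0 + (s**3 - 2 * s**2 + s) * m0
           + (-2 * s**3 + 3 * s**2) * p1 + (s**3 - s**2) * m1)
    # Differentiate the Hermite basis polynomials with respect to s.
    vel = ((6 * s**2 - 6 * s) * p0 + (3 * s**2 - 4 * s + 1) * m0
           + (-6 * s**2 + 6 * s) * p1 + (3 * s**2 - 2 * s) * m1)
    return pos, vel
```

The number of knots directly sets the degrees of freedom of each point's trajectory, and the velocity needs no numerical differencing.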
PaperID: 72,   https://arxiv.org/pdf/2310.03861    
Authors: Leticia Mattos Da Silva, Silvia Sellán, Natalia Pacheco-Tallaj, Justin Solomon
Affiliations: Massachusetts Institute of Technology, USA ; Massachusetts Institute of Technology, USA and Columbia University, New York City
Title: Variational Elastodynamic Simulation
Abstract:
Numerical schemes for time integration are the cornerstone of dynamical simulations for deformable solids. The most popular time integrators for isotropic distortion energies rely on nonlinear root-finding solvers, most prominently, Newton’s method. These solvers are computationally expensive and sensitive to ill-conditioned Hessians and poor initial guesses; these difficulties can particularly hamper the effectiveness of variational integrators, whose momentum conservation properties require reliable root-finding. To tackle these difficulties, this paper shows how to express variational time integration for a large class of elastic energies as an optimization problem with a “hidden” convex substructure. This hidden convexity suggests uses of optimization techniques with rigorous convergence analysis, guaranteed inversion-free elements, and conservation of physical invariants up to tolerance/numerical precision. In particular, we propose an Alternating Direction Method of Multipliers (ADMM) algorithm combined with a proximal operator step to solve our formulation. Empirically, our integrator improves the performance of elastic simulation tasks, as we demonstrate in a number of examples.
PaperID: 73,   https://arxiv.org/pdf/2506.00222    
Authors: Jiabao Brad Wang, Amir Vaxman
Affiliations: University of Edinburgh, China ; University of Edinburgh, United Kingdom
Title: Power-Linear Polar Directional Fields
Abstract:
We introduce a novel method for directional-field design on meshes, enabling users to specify singularities at any location on a mesh. Our method uses a piecewise power-linear representation for phase and scale, offering precise control over field topology. The resulting fields are smooth and accommodate any singularity index and field symmetry. With this representation, we mitigate the artifacts caused by coarse or uneven meshes. We showcase our approach on meshes with diverse topologies and triangle qualities.
PaperID: 74,   https://arxiv.org/pdf/2505.14087    
Authors: Ziyi Chang, He Wang, George Koulieris, Hubert P. H. Shum
Affiliations: Durham University, Department of Computer Science, University College London (UCL), United Kingdom
Title: Large-Scale Multi-Character Interaction Synthesis
Abstract:
Generating large-scale multi-character interactions is a challenging and important task in character animation. Multi-character interactions involve not only natural interactive motions but also characters coordinated with each other for transition. For example, a dance scenario involves characters dancing with partners and also characters coordinated to new partners based on spatial and temporal observations. We term such transitions coordinated interactions and decompose them into interaction synthesis and transition planning. Previous methods of single-character animation do not consider interactions that are critical for multiple characters. Deep-learning-based interaction synthesis usually focuses on two characters and does not consider transition planning. Optimization-based interaction synthesis relies on manually designing objective functions that may not generalize well. While crowd simulation involves more characters, their interactions are sparse and passive. We identify two challenges to multi-character interaction synthesis: the lack of data and the planning of transitions among close and dense interactions. Existing datasets either do not have multiple characters or do not have close and dense interactions. The planning of transitions for multi-character close and dense interactions needs both spatial and temporal considerations. We propose a conditional generative pipeline comprising a coordinatable multi-character interaction space for interaction synthesis and a transition planning network for coordination. Our experiments demonstrate the effectiveness of our proposed pipeline for multi-character interaction synthesis, and the applications it enables show its scalability and transferability.
PaperID: 75,   https://arxiv.org/pdf/2508.05626    
Authors: Chris Careaga, Yağız Aksoy
Affiliations: Simon Fraser University
Title: Physically Controllable Relighting of Photographs
Abstract:
We present a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. We achieve this by combining the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. Our pipeline works by inferring a colored mesh representation of a given scene using monocular estimates of geometry and intrinsic components. This representation allows users to define their desired illumination configuration in 3D. The scene under the new lighting can then be rendered using a path-tracing engine. We send this approximate rendering of the scene through a feed-forward neural renderer to predict the final photorealistic relighting result. We develop a differentiable rendering process to reconstruct in-the-wild scene illumination, enabling self-supervised training of our neural renderer on raw image collections. Our method represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools, such as Blender, to in-the-wild relighting.
PaperID: 76,   https://arxiv.org/pdf/2505.11729    
Authors: Pedro Figueiredo, Qihao He, Steve Bako, Nima Khademi Kalantari
Affiliations: Texas A&M University, College Station, USA ; Aurora Innovation, San Francisco
Title: Neural Importance Sampling of Many Lights
Abstract:
We propose a neural approach for estimating spatially varying light selection distributions to improve importance sampling in Monte Carlo rendering, particularly for complex scenes with many light sources. Our method uses a neural network to predict the light selection distribution at each shading point based on local information, trained by minimizing the KL-divergence between the learned and target distributions in an online manner. To efficiently manage hundreds or thousands of lights, we integrate our neural approach with light hierarchy techniques, where the network predicts cluster-level distributions and existing methods sample lights within clusters. Additionally, we introduce a residual learning strategy that leverages initial distributions from existing techniques, accelerating convergence during training. Our method achieves superior performance across diverse and challenging scenes.
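The two ingredients named above, a KL-divergence objective and sampling from a light selection distribution, can be sketched with plain arrays. The KL direction used here, KL(target || learned), is an assumption, since the abstract does not specify it; the per-light contributions and the function names are hypothetical, and no neural network or light hierarchy is modeled.

```python
import numpy as np

def kl_divergence(target, learned, eps=1e-12):
    """KL(target || learned) between two discrete light selection distributions."""
    target = np.asarray(target, dtype=float) / np.sum(target)
    learned = np.asarray(learned, dtype=float) / np.sum(learned)
    return float(np.sum(target * (np.log(target + eps) - np.log(learned + eps))))

def estimate_direct_light(contrib, pmf, num_samples, rng):
    """Monte Carlo estimate of sum(contrib) by sampling lights from `pmf`."""
    idx = rng.choice(len(pmf), size=num_samples, p=pmf)
    return float(np.mean(contrib[idx] / pmf[idx]))
```

When the selection probabilities are exactly proportional to each light's contribution, every sample returns the same value and the estimator has zero variance, which is why driving the learned distribution toward the target reduces noise.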
PaperID: 77,   https://arxiv.org/pdf/2312.07021    
Authors: Pu Li, Wenhao Zhang, Jinglu Chen, Dongming Yan
Affiliations: Institute of Automation, Chinese Academy of Sciences, China and School of Artificial Intelligence, China ; MAIS, Chinese Academy of Sciences, China and School of Artificial Intelligence
Title: Stitch-A-Shape: Bottom-up Learning for B-Rep Generation
Abstract:
Boundary representation (B-Rep) models serve as the primary representation format in modern CAD systems for describing 3D shapes. While deep learning has achieved success with various geometric representations, B-Reps remain challenging due to their hybrid nature of combining continuous geometry with discrete topological relationships. In this paper, we present Stitch-A-Shape, a B-Rep generation framework that directly models both topology and geometry. This strategy departs from prior work that focuses on either topology or geometry while recovering the other through post-processing. Our method consists of a geometry module that determines the spatial configuration of geometric elements (vertices, curves, and surface control points) and a topology module that establishes connectivity relationships and identifies boundary structures, including outer and inner loops. Our approach leverages a sequential "stitching" representation that mirrors the native data structure and inherent bottom-up organization of B-Rep, assembling geometric entities from vertices through curves to faces. We validate that our framework can handle topological and geometric ambiguities, as well as open surfaces and compound solids. Experiments show that Stitch-A-Shape achieves superior generation quality and computational efficiency compared to existing approaches in unconditional generation tasks, while exhibiting effective capabilities in class-conditional generation and B-Rep autocompletion applications.
PaperID: 78,   https://arxiv.org/pdf/2405.14595    
Authors: Siyuan Shen, Tianjia Shao, Kun Zhou, Chenfanfu Jiang, Sheldon Andrews, Victor Zordan, Yin Yang
Affiliations: Zhejiang University, China ; UCLA, Los Angeles, USA ; École de technologie supérieure (ÉTS), Canada ; Roblox, San Mateo, USA ; University of Utah, Salt Lake City
Title: Elastic Locomotion with Mixed Second-order Differentiation
Abstract:
We present a framework of elastic locomotion, which allows users to enliven an elastic body to produce interesting locomotion by prescribing its high-level kinematics. We formulate this problem as an inverse simulation problem and seek the optimal muscle activations to drive the body to complete the desired actions. We employ the interior-point method to model wide-area contacts between the body and the environment with logarithmic barrier penalties. The core of our framework is a mixed second-order differentiation algorithm. By combining both analytic differentiation and numerical differentiation modalities, a general-purpose second-order differentiation scheme is made possible. Specifically, we augment complex-step finite difference (CSFD) with reverse automatic differentiation (AD). We treat AD as a generic function, mapping a computing procedure to its derivative w.r.t. output loss, and promote CSFD along the AD computation. To this end, we carefully implement all the arithmetic operations used in elastic locomotion, from elementary functions to linear algebra and matrix operations, for CSFD promotion. With this novel differentiation tool, elastic locomotion can directly exploit Newton’s method and use its strong second-order convergence to find the needed activations at muscle fibers. This is not possible with existing first-order inverse or differentiable simulation techniques. We showcase a wide range of interesting locomotions of soft bodies and creatures to validate our method.
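The complex-step finite difference (CSFD) at the heart of this scheme is easy to demonstrate. The sketch below shows the standard first-derivative formula and then applies the same perturbation to a gradient function to obtain a Hessian-vector product. The hand-written `grad_energy` is a stand-in for the gradient that reverse AD would produce, and the toy energy is an assumption for illustration only.

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-30):
    """f'(x) ~= Im(f(x + i*h)) / h, free of subtractive cancellation."""
    return np.imag(f(x + 1j * h)) / h

def grad_energy(x, A):
    """Gradient of the toy energy E(x) = 0.5 x^T A x + sum(sin(x)) (stand-in for reverse AD)."""
    return A @ x + np.cos(x)

def hessian_vector_product(grad, x, v, h=1e-30):
    """H(x) v ~= Im(grad(x + i*h*v)) / h: a complex step pushed through the gradient."""
    return np.imag(grad(x + 1j * h * v)) / h

A = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, -1.2])
v = np.array([1.0, 2.0])
hv = hessian_vector_product(lambda z: grad_energy(z, A), x, v)
```

For this to work, every operation inside the gradient computation must accept complex inputs, which is why the abstract stresses implementing all the arithmetic with CSFD promotion in mind.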
PaperID: 79,   https://arxiv.org/pdf/2508.13797    
Authors: Feng-Lin Liu, Shi-Yang Li, Yan-Pei Cao, Hongbo Fu, Lin Gao
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, China ; VAST, China ; Hong Kong University of Science and Technology
Title: Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing
Abstract:
Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing.
PaperID: 80,   https://arxiv.org/pdf/2505.08998    
Authors: Liwen Wu, Sai Bi, Zexiang Xu, Hao Tan, Kai Zhang, Fujun Luan, Haolin Lu, Ravi Ramamoorthi
Affiliations: University of California San Diego, La Jolla, USA ; Adobe Research, San Jose, USA ; Hillbot, USA ; Max Planck Institute for Informatics
Title: Neural BRDF Importance Sampling by Reparameterization
Abstract:
Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering. Yet their importance sampling remains a significant challenge. In this paper, we introduce a reparameterization-based formulation of neural BRDF importance sampling that seamlessly integrates into the standard rendering pipeline with precise generation of BRDF samples. The reparameterization-based formulation transfers the distribution learning task to a problem of identifying BRDF integral substitutions. In contrast to previous methods that rely on invertible networks and multi-step inference to reconstruct BRDF distributions, our model removes these constraints, which offers greater flexibility and efficiency. Our variance and performance analysis demonstrates that our reparameterization method achieves the best variance reduction in neural BRDF renderings while maintaining high inference speeds compared to existing baselines.
PaperID: 81,   https://arxiv.org/pdf/2506.23388    
Authors: Crane He Chen, Vladimir Kim
Affiliations: Industrial Light & Magic, San Francisco, USA and Northeastern University, USA ; Adobe Research
Title: Escher Tile Deformation via Closed-Form Solution
Abstract:
We present a real-time deformation method for Escher tiles—interlocking organic forms that seamlessly tessellate the plane following symmetry rules. We formulate the problem as determining a periodic displacement field. The goal is to deform Escher tiles without introducing gaps or overlaps. The resulting displacement field is obtained in closed form by an analytical solution. Our method processes tiles of 17 wallpaper groups across various representations such as images and meshes. Rather than treating tiles as mere boundaries, we consider them as textured shapes, ensuring that both the boundary and interior deform simultaneously. To enable fine-grained artistic input, our interactive tool features a user-controllable adaptive fall-off parameter, allowing precise adjustment of locality and supporting deformations with meaningful semantic control. We demonstrate the effectiveness of our method through various examples, including photo editing and shape sculpting, showing its use in applications such as fabrication and animation.
PaperID: 82,   https://arxiv.org/pdf/2601.01027    
Authors: Rafael Wampfler, Chen Yang, Dillon Elste, Nikola Kovacevic, Philine Witzig, Markus Gross
Affiliations: ETH Zurich
Title: A Platform for Interactive AI Character Experiences
Abstract:
From movie characters to modern science fiction — bringing characters into interactive, story-driven conversations has captured imaginations across generations. Achieving this vision is highly challenging and requires much more than just language modeling. It involves numerous complex AI challenges, such as conversational AI, maintaining character integrity, managing personality and emotions, handling knowledge and memory, synthesizing voice, generating animations, enabling real-world interactions, and integration with physical environments. Recent advancements in the development of foundation models, prompt engineering, and fine-tuning for downstream tasks have enabled researchers to address these individual challenges. However, combining these technologies for interactive characters remains an open problem. We present a system and platform for conveniently designing believable digital characters, enabling a conversational and story-driven experience while providing solutions to all of the technical challenges. As a proof-of-concept, we introduce Digital Einstein, which allows users to engage in conversations with a digital representation of Albert Einstein about his life, research, and persona. While Digital Einstein exemplifies our methods for a specific character, our system is flexible and generalizes to any story-driven or conversational character. By unifying these diverse AI components into a single, easy-to-adapt platform, our work paves the way for immersive character experiences, turning the dream of lifelike, story-based interactions into a reality.
PaperID: 83,   https://arxiv.org/pdf/2501.18635    
Authors: Sophie Kergaßner, Taimoor Tariq, Piotr Didyk
Affiliations: Università della Svizzera italiana
Title: Towards Understanding Depth Perception in Foveated Rendering
Abstract:
The true vision for real-time virtual and augmented reality is reproducing our visual reality in its entirety on immersive displays. To this end, foveated rendering leverages the limitations of spatial acuity in human peripheral vision to allocate computational resources to the fovea while reducing quality in the periphery. Such methods are often derived from studies on the spatial resolution of the human visual system and its ability to perceive blur in the periphery, enabling the potential for high spatial quality in real-time. However, the effects of blur on other visual cues that depend on luminance contrast, such as depth, remain largely unexplored. It is critical to understand this interplay, as accurate depth representation is a fundamental aspect of visual realism. In this paper, we present the first evaluation exploring the effects of foveated rendering on stereoscopic depth perception. We design a psychovisual experiment to quantitatively study the effects of peripheral blur on depth perception. Our analysis demonstrates that stereoscopic acuity remains unaffected (or even improves) by high levels of peripheral blur. Based on our studies, we derive a simple perceptual model that determines the amount of foveation that does not affect stereoacuity. Furthermore, we analyze the model in the context of common foveation practices reported in the literature. The findings indicate that foveated rendering does not impact stereoscopic depth perception, and stereoacuity remains unaffected with up to 2× stronger foveation than commonly used. Finally, we conduct a validation experiment and show that our findings hold for complex natural stimuli.
PaperID: 84,   https://arxiv.org/pdf/2501.13975    
Authors: Lei Lan, Tianjia Shao, Zixuan Lu, Yu Zhang, Chenfanfu Jiang, Yin Yang
Affiliations: University of Utah, Salt Lake City, USA ; Zhejiang University, China ; UCLA, Los Angeles
Title: 3DGS2: Near Second-order Converging 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a mainstream solution for novel view synthesis and 3D reconstruction. By explicitly encoding a 3D scene using a collection of Gaussian kernels, 3DGS achieves high-quality rendering with superior efficiency. As a learning-based approach, 3DGS training has been handled with the standard stochastic gradient descent (SGD) method, which offers at most linear convergence. Consequently, training often requires tens of minutes, even with GPU acceleration. This paper introduces a (near) second-order convergent training algorithm for 3DGS, leveraging its unique properties. Our approach is inspired by two key observations. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which endorses isolated and local optimization algorithms. We exploit this by splitting the optimization at the level of individual kernel attributes, analytically constructing small-size Newton systems for each parameter group, and efficiently solving these systems on GPU threads. This achieves Newton-like convergence per training image without relying on the global Hessian. Second, kernels exhibit sparse and structured coupling across input images. This property allows us to effectively utilize spatial information to mitigate overshoot during stochastic training. Our method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over 10× fewer iterations while maintaining or surpassing the quality of SGD-based 3DGS reconstructions.
PaperID: 85,   https://arxiv.org/pdf/2407.01866    
Authors: Yunxiang Zhang, Bingxuan Li, Alexandr Kuznetsov, Akshay Jindal, Stavros Diolatzis, Kenneth Chen, Anton Sochenov, Anton Kaplanyan, Qi Sun
Affiliations: Department of Computer Science and Engineering, Tandon School of Engineering, New York, USA ; Advanced Micro Devices (AMD), USA ; Intel Corporation
Title: Image-GS: Content-Adaptive Image Representation via 2D Gaussians
Abstract:
Neural image representations have emerged as a promising approach for encoding and rendering visual data. Combined with learning-based workflows, they demonstrate impressive trade-offs between visual fidelity and memory footprint. Existing methods in this domain, however, often rely on fixed data structures that suboptimally allocate memory or compute-intensive implicit models, hindering their practicality for real-time graphics applications.
PaperID: 86,   https://arxiv.org/pdf/2505.13390    
Authors: Chunlei Li, Peng Yu, Tiantian Liu, Siyuan Yu, Yuting Xiao, Shuai Li, Aimin Hao, Yang Gao, Qinping Zhao
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China ; Taichi Graphics, China ; Zenustech
Title: MGPBD: A Multigrid Accelerated Global XPBD Solver
Abstract:
We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles with low-frequency errors, leading to instability and stalling issues, especially in high-resolution, high-stiffness simulations. Our multigrid approach addresses these issues efficiently by leveraging AMG. To reduce the computational overhead of traditional AMG, where prolongator construction can consume up to two-thirds of the runtime, we propose a lazy setup strategy that reuses prolongators across iterations based on matrix structure and physical significance. Furthermore, we introduce a simplified method for constructing near-kernel components by applying a few sweeps of iterative methods to the homogeneous equation, achieving convergence rates comparable to adaptive smoothed aggregation (adaptive-SA) at a lower computational cost. Experimental results demonstrate that our method significantly improves convergence rates and numerical stability, enabling efficient and stable high-resolution simulations of deformable objects.
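To make the Preconditioned Conjugate Gradient (PCG) component of this abstract concrete, here is a minimal pure-Python sketch of PCG with a pluggable preconditioner. A simple Jacobi (diagonal) preconditioner stands in for the multigrid cycle, which is an assumption made for brevity and not the paper's UA-AMG preconditioner. The names pcg, mat_vec, dot, and jacobi, as well as the small test matrix, are hypothetical.

```python
def mat_vec(A, x):
    # Dense matrix-vector product on nested lists.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, apply_preconditioner, tol=1e-10, max_iters=100):
    """Solve A x = b for a symmetric positive-definite A.

    apply_preconditioner(r) should return an approximation of A^{-1} r.
    In a multigrid-preconditioned solver this would be one multigrid cycle.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = apply_preconditioner(r)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iters):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        Ap = mat_vec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        p = [z_i + (rz_new / rz) * p_i for z_i, p_i in zip(z, p)]
        rz = rz_new
    return x, max_iters


# Small symmetric positive-definite system, for illustration only.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def jacobi(r):
    # Stand-in preconditioner: divide each entry by the matrix diagonal.
    return [r_i / A[i][i] for i, r_i in enumerate(r)]


solution, iterations = pcg(A, b, jacobi)
residual = [b_i - ax_i for b_i, ax_i in zip(b, mat_vec(A, solution))]
```

Because the preconditioner is passed in as a function, swapping Jacobi for a multigrid cycle changes only the argument, not the PCG loop.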
PaperID: 87,   https://arxiv.org/pdf/2506.04444    
Authors: Zhaoyang Lv, Maurizio Monge, Ka Chen, Yufeng Zhu, Michael Goesele, Jakob Engel, Zhao Dong, Richard Newcombe
Affiliations: Reality Labs Research, Santa Clara
Title: Photoreal Scene Reconstruction from an Egocentric Device
Abstract:
In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct scenes in high dynamic range. Existing methodologies typically assume frame-rate 6DoF poses estimated from the device’s visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work that treats the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB sensing camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely-used variants of Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at https://www.projectaria.com/photoreal-reconstruction/
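For readers less familiar with the metric, the sketch below shows what a +1 dB PSNR gain means numerically under the standard definition PSNR = 10 log10(peak^2 / MSE). The peak value of 1.0 and the toy pixel values are assumptions made for this example. The abstract does not state how PSNR is computed for high-dynamic-range images, so this is not the paper's evaluation code.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical pixel intensities in [0, 1].
reference = [0.2, 0.5, 0.8]

# Baseline reconstruction: every pixel is off by 0.1.
baseline = [value + 0.1 for value in reference]

# Improved reconstruction: the error amplitude shrinks by 10 ** (-1 / 20),
# which shrinks the mean squared error by 10 ** (-1 / 10), about 0.794.
scale = 10.0 ** (-1.0 / 20.0)
improved = [value + 0.1 * scale for value in reference]

gain_db = psnr(reference, improved) - psnr(reference, baseline)
```

In other words, each +1 dB corresponds to roughly a 20% reduction in mean squared error.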
PaperID: 88,   https://arxiv.org/pdf/2502.12752    
Authors: Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
Affiliations: ETH Zürich, Switzerland and DisneyResearch|Studios, Switzerland ; DisneyResearch|Studios
Title: High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion
Abstract:
Despite recent advances in Novel View Synthesis (NVS), generating high-fidelity views from single or sparse observations remains challenging. Existing splatting-based approaches often produce distorted geometry due to splatting errors. While diffusion-based methods leverage rich 3D priors to achieve improved geometry, they often suffer from texture hallucination. In this paper, we introduce SplatDiff, a pixel-splatting-guided video diffusion model designed to synthesize high-fidelity novel views from a single image. Specifically, we propose an aligned synthesis strategy for precise control of target viewpoints and geometry-consistent view synthesis. To mitigate texture hallucination, we design a texture bridge module that enables high-fidelity texture generation through adaptive feature fusion. In this manner, SplatDiff leverages the strengths of splatting and diffusion for geometrically consistent, high-fidelity view synthesis. Extensive experiments verify the state-of-the-art performance of SplatDiff in single-view NVS. Additionally, without extra training, SplatDiff shows remarkable zero-shot performance across diverse tasks, including sparse-view NVS and stereo video conversion.
PaperID: 89,   https://arxiv.org/pdf/2505.15385    
Authors: Hendrik Junkawitsch, Guoxing Sun, Heming Zhu, Christian Theobalt, Marc Habermann
Affiliations: Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual Computing, Interaction and Artificial Intelligence
Title: EVA: Expressive Virtual Avatars from Multi-view Videos
Abstract:
With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.
PaperID: 90,   https://arxiv.org/pdf/2502.13994    
Authors: Saeed Hadadan, Benedikt Bitterli, Tizian Zeltner, Jan Novák, Fabrice Rousselle, Jacob Munkberg, Jon Hasselgren, Bartlomiej Wronski, Matthias Zwicker
Affiliations: College Park, USA and NVIDIA Research, USA ; NVIDIA Research, Switzerland ; NVIDIA Research, Czech Republic ; NVIDIA Research, Sweden ; NVIDIA Research, New York
Title: Generative detail enhancement for physically based materials
Abstract:
We present a tool for enhancing the detail of physically based materials using an off-the-shelf diffusion model and inverse rendering. Our goal is to increase the visual fidelity of existing materials by adding, for instance, signs of wear, aging, and weathering that are tedious to author. To obtain realistic appearance with minimal user effort, we leverage a generative image model trained on a large dataset of natural images. Given the geometry, UV mapping, and basic appearance of an object, we proceed as follows: We render multiple views of the object and use them, together with an appearance-defining text prompt, to condition a diffusion model. The generated details are then backpropagated from the enhanced images to the material parameters via inverse rendering. For inverse rendering to be successful, the generated appearance has to be consistent across all the images. We propose two priors to address the multi-view consistency of the diffusion model. First, we ensure that the noise that seeds the diffusion process is itself consistent across views by integrating it from a view-independent UV space. Second, we enforce spatial consistency by biasing the attention mechanism via a projective constraint so that pixels attend strongly to their corresponding pixel locations in other views. Our approach does not require any training or finetuning of the diffusion model, is agnostic to the used material model, and the enhanced material properties, i.e., 2D PBR textures, can be further edited by artists. We demonstrate prompt-based material edits exhibiting high levels of realism and detail. This project is available at https://generative-detail.github.io.
PaperID: 91,   https://arxiv.org/pdf/2509.10761    
Authors: Marcelo Sandoval-Castañeda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, Fabian Caba Heilbron
Affiliations: USA ; Adobe, San Francisco, USA and Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Czech Republic ; TTIC, USA ; Adobe, San Jose
Title: EditDuet: A Multi-Agent System for Video Non-Linear Editing
Abstract:
Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders the sequence if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of video editing systems and compare it with general human preference. We evaluate our system’s output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference. Please see our companion supplemental video for qualitative results.
PaperID: 92,   https://arxiv.org/pdf/2506.06462    
Authors: Nicolas Violante, Andréas Meuleman, Alban Gauthier, Fredo Durand, Thibault Groueix, George Drettakis
Affiliations: Université Côte d'Azur, Sophia Antipolis, France ; INRIA, France ; Massachusetts Institute of Technology (MIT), USA ; Adobe Research, San Francisco, USA ; INRIA
Title: Splat and Replace: 3D Reconstruction with Repetitive Elements
Abstract:
We leverage repetitive elements in 3D scenes to improve novel view synthesis. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly improved novel view synthesis but renderings of unseen and occluded parts remain low-quality if the training views are not exhaustive enough. Our key observation is that our environment is often full of repetitive elements. We propose to leverage those repetitions to improve the reconstruction of low-quality parts of the scene due to poor coverage and occlusions. We propose a method that segments each repeated instance in a 3DGS reconstruction, registers them together, and allows information to be shared among instances. Our method improves the geometry while also accounting for appearance variations across instances. We demonstrate our method on a variety of synthetic and real scenes with typical repetitive elements, leading to a substantial improvement in the quality of novel view synthesis.
PaperID: 93,   https://arxiv.org/pdf/2504.15835    
Authors: Yiqian Wu, Malte Prinzler, Xiaogang Jin, Siyu Tang
Affiliations: ETH Zürich, Switzerland and State Key Lab of CAD and CG, Zhejiang University, China ; State Key Lab of CAD and CG
Title: Text-based Animatable 3D Avatars with Morphable Model Alignment
Abstract:
The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation results. We discovered that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation, specifically: i) the avatar’s appearance and geometry are underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose a novel framework, AnimPortrait3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity. Our experiments show that the proposed method advances the state of the art in text-based, animatable 3D head avatar generation. Code and model for this paper are at AnimPortrait3D.
PaperID: 94,   https://arxiv.org/pdf/2505.18772    
Authors: Michal Edelstein, Hsueh-Ti Derek Liu, Mirela Ben-Chen
Affiliations: Technion - Israel Institute of Technology, Israel ; Roblox, Canada and University of British Columbia
Title: CageNet: A Meta-Framework for Learning on Wild Meshes
Abstract:
Learning on triangle meshes has recently proven to be instrumental to a myriad of tasks, from shape classification, to segmentation, to deformation and animation, to mention just a few. While some of these applications are tackled through neural network architectures which are tailored to the application at hand, many others use generic frameworks for triangle meshes where the only customization required is the modification of the input features and the loss function. Our goal in this paper is to broaden the applicability of these generic frameworks to “wild” meshes, i.e. meshes in-the-wild which often have multiple components, non-manifold elements, disrupted connectivity, or a combination of these. We propose a configurable meta-framework based on the concept of caged geometry: Given a mesh, a cage is a single-component manifold triangle mesh that envelops it closely. Generalized barycentric coordinates map between functions on the cage and functions on the mesh, allowing us to learn and test on a variety of data, in different applications. We demonstrate this concept by learning segmentation and skinning weights on difficult data, achieving better performance than state-of-the-art techniques on wild meshes.
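The mapping between functions on the cage and functions on the mesh can be pictured with the simplest possible case: ordinary barycentric coordinates in a single 2D triangle acting as the cage. This is a toy stand-in for the generalized barycentric coordinates the abstract refers to, and the names barycentric_coordinates and cage_to_point are hypothetical.

```python
def barycentric_coordinates(p, a, b, c):
    """Barycentric coordinates of the 2D point p in triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return (w_a, w_b, 1.0 - w_a - w_b)


def cage_to_point(weights, cage_values):
    # Transfer a per-cage-vertex function to an enclosed point as the
    # weighted sum of the cage values.
    return sum(w * value for w, value in zip(weights, cage_values))


# Toy cage: one triangle enclosing a single "mesh" point.
cage = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
mesh_point = (1.0, 2.0)
weights = barycentric_coordinates(mesh_point, *cage)

# Hypothetical scalar function stored on the cage vertices,
# for example a per-vertex prediction from a network.
cage_function = [1.0, 0.0, 0.5]
value_at_point = cage_to_point(weights, cage_function)

# Applying the same weights to the cage vertex positions recovers the point.
reconstructed = (
    cage_to_point(weights, [v[0] for v in cage]),
    cage_to_point(weights, [v[1] for v in cage]),
)
```

The weights depend only on geometry, so they can be computed once per mesh and reused for any function defined on the cage.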
PaperID: 95,   https://arxiv.org/pdf/2502.04299    
Authors: Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, Feng Liu
Affiliations: Hong Kong, San Jose, USA ; Adobe Research, Australia ; Adobe Research
Title: MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
Abstract:
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications. Code and model weights are at https://motion-canvas25.github.io
PaperID: 96,   https://arxiv.org/pdf/2502.08642    
Authors: Ellie Arar, Yarden Frenkel, Daniel Cohen-Or, Ariel Shamir, Yael Vinker
Affiliations: Tel Aviv, Israel ; Reichman University, Israel ; Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT)
Title: SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation
Abstract:
Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
PaperID: 97,   https://arxiv.org/pdf/2505.23708    
Authors: Lucas N. Alegre, Agon Serifi, Ruben Grandia, David Müller, Espen Knoop, Moritz Bächer
Affiliations: Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil and Disney Research, Switzerland ; Disney Research
Title: AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning
Abstract:
Reinforcement learning (RL) has significantly advanced the control of physics-based and robotic characters that track kinematic reference motion. However, methods typically rely on a weighted sum of conflicting reward functions, requiring extensive tuning to achieve a desired behavior. Due to the computational cost of RL, this iterative process is a tedious, time-intensive task. Furthermore, for robotics applications, the weights need to be chosen such that the policy performs well in the real world, despite inevitable sim-to-real gaps. To address these challenges, we propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Within this framework, weights can be selected and tuned after training, significantly speeding up iteration time. We demonstrate how this improved workflow can be used to perform highly dynamic motions with a robot character. Moreover, we explore how weight-conditioned policies can be leveraged in hierarchical settings, using a high-level policy to dynamically select weights according to the current task. We show that the multi-objective policy encodes a diverse spectrum of behaviors, facilitating efficient adaptation to novel tasks.
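The contrast between a fixed weighted sum of rewards and a weight-conditioned policy can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's training setup: weights are drawn by normalizing uniform samples, and the policy input is formed by concatenating the weights to the observation. The names sample_weights, scalarize, and policy_input are hypothetical.

```python
import random


def sample_weights(num_objectives, rng):
    # Draw a random point on the probability simplex by normalizing
    # positive uniform samples (an assumed sampling scheme).
    raw = [rng.random() + 1e-9 for _ in range(num_objectives)]
    total = sum(raw)
    return [value / total for value in raw]


def scalarize(reward_terms, weights):
    # Classic weighted-sum reward: one scalar from several objectives.
    return sum(w * r for w, r in zip(weights, reward_terms))


def policy_input(observation, weights):
    # A weight-conditioned policy sees the trade-off it should realize,
    # so the weights can be changed after training without retraining.
    return list(observation) + list(weights)


rng = random.Random(0)
weights = sample_weights(3, rng)

reward_terms = [0.9, 0.2, 0.5]  # hypothetical per-objective rewards
total_reward = scalarize(reward_terms, weights)
features = policy_input([0.1, -0.3], weights)
```

With a fixed-weight policy, each new weight choice means another training run. With the conditioned input, the weights become a runtime argument.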
PaperID: 98,   https://arxiv.org/pdf/2501.18291    
Authors: Sean Memery, Kevin Denamganaï, Jiaxin Zhang, Zehai Tu, Yiwen Guo, Kartic Subr
Affiliations: University of Edinburgh, United Kingdom
Title: CueTip: An Interactive and Explainable Physics-aware Pool Assistant
Abstract:
We present an interactive and explainable automated coaching assistant called CueTip for a variant of pool/billiards. CueTip’s novelty lies in its combination of three features: a natural-language interface, an ability to perform contextual, physics-aware reasoning, and explanations rooted in a set of predetermined guidelines developed by domain experts. We instrument a physics simulator so that it generates event traces in natural language alongside traditional state traces. Event traces lend themselves to interpretation by language models, which serve as the interface to our assistant. We design and train a neural adaptor that decouples tactical choices made by CueTip from its interactivity and explainability, allowing it to be reconfigured to mimic any pool-playing agent. Our experiments show that CueTip enables contextual query-based assistance and explanations while maintaining the strength of the agent in terms of win rate (improving it in some situations). The explanations generated by CueTip are physics-aware and grounded in the expert rules, and are therefore more reliable.
PaperID: 99,   https://arxiv.org/pdf/2505.01486    
Authors: Mingfeng Tang, Ningna Wang, Ziyuan Xie, Jianwei Hu, Ke Xie, Xiaohu Guo, Hui Huang
Affiliations: Shenzhen University, China ; University of Texas at Dallas, USA ; QiYuan Lab
Title: Aerial Path Online Planning for Urban Scene Updation
Abstract:
We present the first scene-update aerial path planning algorithm specifically designed for detecting and updating change areas in urban environments. While existing methods for large-scale 3D urban scene reconstruction focus on achieving high accuracy and completeness, they are inefficient for scenarios requiring periodic updates, as they often re-explore and reconstruct entire scenes, wasting significant time and resources on unchanged areas. To address this limitation, our method leverages prior reconstructions and change probability statistics to guide UAVs in detecting and focusing on areas likely to have changed. Our approach introduces a novel changeability heuristic to evaluate the likelihood of scene changes, driving the planning of two flight paths: a prior path informed by static priors and a dynamic real-time path that adapts to newly detected changes. Extensive experiments on real-world urban datasets demonstrate that our method significantly reduces flight time and computational overhead while maintaining high-quality updates comparable to full-scene re-exploration and reconstruction. These contributions pave the way for efficient, scalable, and adaptive UAV-based scene updates in complex urban environments.
PaperID: 100,   https://arxiv.org/pdf/2505.23907    
Authors: Amirhossein Alimohammadi, Aryan Mikaeili, Sauradip Nag, Negar Hassanpour, Andrea Tagliasacchi, Ali Mahdavi-Amiri
Affiliations: Simon Fraser University, Canada ; Huawei Canada
Title: Cora: Correspondence-aware image editing using few step diffusion
Abstract:
Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few-step editing approaches either produce artifacts such as irrelevant textures or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.
PaperID: 101,   https://arxiv.org/pdf/2508.08384    
Authors: Mutian Tong, Rundi Wu, Changxi Zheng
Affiliations: Columbia University, New York
Title: Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors
Abstract:
Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates, from an input video, a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors to optimize this light field, which is represented as an MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over the compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous works.
PaperID: 102,   https://arxiv.org/pdf/2505.04051    
Authors: Junming Huang, Chi Wang, Letian Li, Changxin Huang, Qiang Dai, Weiwei Xu
Affiliations: State Key Lab of CAD & CG, Zhejiang University, China ; LIGHTSPEED, China ; State Key Lab of CAD&CG
Title: BuildingBlock: A Hybrid Approach for Structured Building Generation
Abstract:
Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. Specifically, our method introduces a two-phase pipeline: the Layout Generation Phase (LGP) and the Building Construction Phase (BCP). LGP reframes box-based layout generation as a point-cloud generation task, utilizing a newly constructed architectural dataset and a Transformer-based diffusion model to create globally consistent layouts. With LLMs, these layouts are extended into rule-based hierarchical designs, seamlessly incorporating component styles and spatial structures. The BCP leverages these layouts to guide PCG, enabling locally customizable, high-quality structured building generation. Experimental results demonstrate BuildingBlock’s effectiveness in generating diverse and hierarchically structured buildings, achieving state-of-the-art results on multiple benchmarks, and paving the way for scalable and intuitive architectural workflows.
PaperID: 103,   https://arxiv.org/pdf/2505.10558    
Authors: Peiying Zhang, Nanxuan Zhao, Jing Liao
Affiliations: Hong Kong, China ; Adobe Research, San Jose
Title: Style Customization of Text-to-Vector Generation with Image Diffusion Priors
Abstract:
Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics.
PaperID: 104,   https://arxiv.org/pdf/2502.09608    
Authors: Mia Tang, Yael Vinker, Chuan Yan, Lvmin Zhang, Maneesh Agrawala
Affiliations: Stanford University, USA ; Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), USA and Roblox Research
Title: Instance Segmentation of Scene Sketches Using Natural Image Priors
Abstract:
Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance. It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components. While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles. We introduce InkLayer, a method for instance segmentation of raster scene sketches. Our approach adapts state-of-the-art image segmentation and object detection models to the sketch domain by employing class-agnostic fine-tuning and refining segmentation masks using depth cues. Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications. As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset, InkScenes, featuring sketches with diverse brush strokes and varying levels of detail. We use this dataset to demonstrate the robustness of our approach. Code and data for this paper are available on the project page: https://inklayer.github.io.
PaperID: 105,   https://arxiv.org/pdf/2502.02607    
Authors: Tianyang Xue, Longdu Liu, Lin Lu, Paul Henderson, Pengbin Tang, Haochen Li, Jikai Liu, Haisen Zhao, Hao Peng, Bernd Bickel
Affiliations: Shandong University, China ; University of Glasgow, United Kingdom ; ETH Zürich, Switzerland ; CrownCAD, China ; ETH Zürich
Title: MIND: Microstructure INverse Design with Generative Hybrid Neural Representation
Abstract:
The inverse design of microstructures plays a pivotal role in optimizing metamaterials with specific, targeted physical properties. While traditional forward design methods are constrained by their inability to explore the vast combinatorial design space, inverse design offers a compelling alternative by directly generating structures that fulfill predefined performance criteria. However, achieving precise control over both geometry and material properties remains a significant challenge due to their intricate interdependence. Existing approaches, which typically rely on voxel or parametric representations, often limit design flexibility and structural diversity.
PaperID: 106,   https://arxiv.org/pdf/2505.04002    
Authors: Michael Xu, Yi Shi, KangKang Yin, Xue Bin Peng
Affiliations: Simon Fraser University, Canada and NVIDIA
Title: PARC: Physics-based Augmentation with Reinforcement Learning for Character Controllers
Abstract:
Humans excel in navigating diverse, complex environments with agile motor skills, exemplified by parkour practitioners performing dynamic maneuvers, such as climbing up walls and jumping across gaps. Reproducing these agile movements with simulated characters remains challenging, in part due to the scarcity of motion capture data for agile terrain traversal behaviors and the high cost of acquiring such data. In this work, we introduce PARC (Physics-based Augmentation with Reinforcement Learning for Character Controllers), a framework that leverages machine learning and physics-based simulation to iteratively augment motion datasets and expand the capabilities of terrain traversal controllers. PARC begins by training a motion generator on a small dataset consisting of core terrain traversal skills. The motion generator is then used to produce synthetic data for traversing new terrains. However, these generated motions often exhibit artifacts, such as incorrect contacts or discontinuities. To correct these artifacts, we train a physics-based tracking controller to imitate the motions in simulation. The corrected motions are then added to the dataset, which is used to continue training the motion generator in the next iteration. PARC’s iterative process jointly expands the capabilities of the motion generator and tracker, creating agile and versatile models for interacting with complex environments. PARC provides an effective approach to develop controllers for agile terrain traversal, which bridges the gap between the scarcity of motion data and the need for versatile character controllers.
PaperID: 107,   https://arxiv.org/pdf/2507.19926    
Authors: Louis Sugy
Affiliations:
Title: A Fast Parallel Median Filtering Algorithm Using Hierarchical Tiling
Abstract:
Median filtering is a non-linear smoothing technique widely used in digital image processing to remove noise while retaining sharp edges. It is particularly well suited to removing outliers (impulse noise) or granular artifacts (speckle noise). However, the high computational cost of median filtering can be prohibitive. Sorting-based algorithms excel with small kernels but scale poorly with increasing kernel diameter, in contrast to constant-time methods characterized by higher constant factors but better scalability, such as histogram-based approaches or the 2D wavelet matrix.
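For orientation, the operation itself can be written as a naive per-pixel baseline. This sketch is not the hierarchical tiling algorithm of the paper; it is the straightforward window-by-window formulation, and the function name and the edge-replicating border handling are assumptions chosen for the example.

```python
import numpy as np

def median_filter_naive(image, radius):
    """Replace each pixel by the median of its (2*radius+1)^2 neighbourhood.

    Borders are handled by replicating edge pixels (an assumed convention).
    The cost grows with the window area, which is why faster algorithms
    are needed for large kernels.
    """
    img = np.asarray(image, dtype=float)
    if img.ndim != 2:
        raise ValueError("expected a 2D grayscale image")
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            # Window of size k x k centred on pixel (y, x) of the original image.
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out
```

Applied to a flat image with a single bright outlier, the filter removes the outlier entirely, and applied to a vertical step edge it leaves the edge in place, which matches the noise-removal and edge-retention behaviour described above.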
PaperID: 108,   https://arxiv.org/pdf/2506.02219    
Authors: Abhishek Madan, Nicholas Sharp, Francis Williams, Ken Museth, David I.W. Levin
Affiliations: University of Toronto, Canada ; NVIDIA, USA ; NVIDIA, New York, Los Angeles, Canada and NVIDIA
Title: Stochastic Barnes-Hut Approximation for Fast Summation on the GPU
Abstract:
We present a novel stochastic version of the Barnes-Hut approximation. Regarding the level-of-detail (LOD) family of approximations as control variates, we construct an unbiased estimator of the kernel sum being approximated. Through several examples in graphics applications such as winding number computation and smooth distance evaluation, we demonstrate that our method is well-suited for GPU computation, capable of outperforming a GPU-optimized implementation of the deterministic Barnes-Hut approximation by achieving equal median error in up to 9.4x less time.
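The control-variate idea can be illustrated on a single cluster of points. The sketch below is a simplified assumption-laden example rather than the paper's estimator: it uses one cluster instead of a level-of-detail hierarchy, a made-up smooth kernel, and uniform sampling. The coarse centroid approximation serves as the control variate, and a randomly sampled correction term makes the estimate unbiased.

```python
import numpy as np

def kernel(q, x):
    """Illustrative smooth kernel 1 / (1 + |q - x|^2); an assumption for the demo."""
    d = np.asarray(x, dtype=float) - np.asarray(q, dtype=float)
    return 1.0 / (1.0 + np.sum(d * d, axis=-1))

def coarse_sum(q, points):
    """Deterministic far-field approximation: collapse all points to their centroid."""
    return len(points) * kernel(q, points.mean(axis=0))

def estimate_from_indices(q, points, idx):
    """Coarse approximation plus a sampled correction for the chosen indices."""
    n = len(points)
    k_centroid = kernel(q, points.mean(axis=0))
    # Correction term: n times the average of (exact kernel - coarse kernel).
    correction = n * np.mean(kernel(q, points[idx]) - k_centroid)
    return n * k_centroid + correction

def stochastic_sum(q, points, rng, num_samples=1):
    """Unbiased estimate of sum_i kernel(q, x_i) using uniform index sampling."""
    idx = rng.integers(0, len(points), size=num_samples)
    return estimate_from_indices(q, points, idx)
```

Averaging `estimate_from_indices` over every possible single index recovers the exact sum, which is what unbiasedness means here; the closer the coarse approximation is to the true kernel values, the smaller the variance of the correction.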
PaperID: 109,   https://arxiv.org/pdf/2501.01427    
Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Affiliations: Hong Kong, Alibaba Group
Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Abstract:
Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motion at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve detailed appearance while also supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweight reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
PaperID: 110,   https://arxiv.org/pdf/2505.04622    
Authors: Jingwen Ye, Yuze He, Yanning Zhou, Yiqin Zhu, Kaiwen Xiao, Yong-Jin Liu, Wei Yang, Xiao Han
Affiliations: Tencent AIPD, China and Tsinghua University, China ; Tsinghua University
Title: PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive transformer
Abstract:
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io
PaperID: 111,   https://arxiv.org/pdf/2505.02592    
Authors: Yuki Tatsukawa, Anran Qi, I-Chao Shen, Takeo Igarashi
Affiliations: The University of Tokyo, Japan ; Centre Inria d'Université Côte d'Azur, Sophia Antipolis
Title: GarmentImage: Raster Encoding of Garment Sewing Patterns with Diverse Topologies
Abstract:
Garment sewing patterns are the design language behind clothing, yet their current vector-based digital representations weren’t built with machine learning in mind. Vector-based representation encodes a sewing pattern as a discrete set of panels, each defined as a sequence of lines and curves, stitching information between panels and the placement of each panel around a body. However, this representation causes two major challenges for neural networks: discontinuity in latent space between patterns with different topologies and limited generalization to garments with unseen topologies in the training data. In this work, we introduce GarmentImage, a unified raster-based sewing pattern representation. GarmentImage encodes a garment sewing pattern’s geometry, topology and placement into multi-channel regular grids. Machine learning models trained on GarmentImage achieve seamless transitions between patterns with different topologies and show better generalization capabilities compared to models trained on vector-based representation. We demonstrate the effectiveness of GarmentImage across three applications: pattern exploration in latent space, text-based pattern editing, and image-to-pattern prediction. The results show that GarmentImage achieves superior performance on these applications using only simple convolutional networks.
PaperID: 112,   https://arxiv.org/pdf/2504.08353    
Authors: Ren Li, Cong Cao, Corentin Dumery, Yingxuan You, Hao Li, Pascal Fua
Affiliations: Switzerland ; MBZUAI, Abu Dhabi, United Arab Emirates ; EPFL, Switzerland ; EPFL, Switzerland ; Pinscreen, Los Angeles, USA and MBZUAI
Title: Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates
Abstract:
Reconstructing 3D clothed humans from images is fundamental to applications like virtual try-on, avatar creation, and mixed reality. While recent advances have enhanced human body recovery, accurate reconstruction of garment geometry—especially for loose-fitting clothing—remains an open challenge. We present a novel method for high-fidelity 3D garment reconstruction from single images that bridges 2D and 3D representations. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space. A key innovation is our mapping model that establishes correspondences between 2D image pixels, UV pattern coordinates, and 3D geometry, enabling joint optimization of both 3D garment meshes and the corresponding 2D patterns by aligning learned priors with image observations. Despite training exclusively on synthetically simulated cloth data, our method generalizes effectively to real-world images, outperforming existing approaches on both tight- and loose-fitting garments. The reconstructed garments maintain physical plausibility while capturing fine geometric details, enabling downstream applications including garment retargeting and texture manipulation.
PaperID: 113,   https://arxiv.org/pdf/2501.14068    
Authors: Dong Xiao, Renjie Chen
Affiliations: University of Science and Technology of China
Title: Flexible 3D Cage-based Deformation via Green Coordinates on Bézier Patches
Abstract:
Cage-based deformation is a fundamental problem in geometry processing, where a cage, a user-specified boundary of a region, is used to deform the ambient space of a given mesh. Traditional 3D cages are typically composed of triangles and quads. While quads can represent non-planar regions when their four corners are not coplanar, they form ruled surfaces with straight isoparametric curves, which limits their ability to handle curved and high-curvature deformations. In this work, we extend the cage for curved boundaries using Bézier patches, enabling flexible and high-curvature deformations with only a few control points. The higher-order structure of the Bézier patch also allows for the creation of a more compact and precise curved cage for the input model. Based on Green’s third identity, we derive the Green coordinates for the Bézier cage, achieving shape-preserving deformation with smooth surface boundaries. These coordinates are defined based on the vertex positions and normals of the Bézier control net. Because the coordinates are computed approximately via Riemann summation, we propose a global projection technique to ensure that the coordinates accurately conform to the linear reproduction property. Experimental results show that our method achieves high performance in handling curved and high-curvature deformations.
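As background for the cage representation, a tensor-product Bézier patch can be evaluated from its control net with Bernstein polynomials. The sketch below covers only this standard surface evaluation, not the Green coordinates or the projection step; the function names and the `(n+1, m+1, 3)` array layout are assumptions made for the example.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_i^n(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_patch_point(control_net, u, v):
    """Evaluate a tensor-product Bezier patch at parameters (u, v) in [0, 1]^2.

    control_net has shape (n+1, m+1, 3): an (n, m)-degree grid of 3D control points.
    """
    P = np.asarray(control_net, dtype=float)
    n, m = P.shape[0] - 1, P.shape[1] - 1
    bu = np.array([bernstein(n, i, u) for i in range(n + 1)])
    bv = np.array([bernstein(m, j, v) for j in range(m + 1)])
    # Weighted sum of control points: sum_ij B_i(u) B_j(v) P_ij.
    return np.einsum("i,j,ijk->k", bu, bv, P)
```

The patch interpolates its four corner control points, and a control net laid out on a regular planar grid reproduces the plane exactly, so a few off-plane control points are enough to bend the cage face smoothly.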
PaperID: 114,   https://arxiv.org/pdf/2504.19174    
Authors: Xueqi Ma, Yilin Liu, Tianlong Gao, Qirui Huang, Hui Huang
Affiliations: Shenzhen University
Title: CLR-Wire: Towards Continuous Latent Representations for 3D Curve Wireframe Generation
Abstract:
We introduce CLR-Wire, a novel framework for 3D curve-based wireframe generation that integrates geometry and topology into a unified Continuous Latent Representation. Unlike conventional methods that decouple vertices, edges, and faces, CLR-Wire encodes curves as Neural Parametric Curves along with their topological connectivity into a continuous and fixed-length latent space using an attention-driven variational autoencoder (VAE). This unified approach facilitates joint learning and generation of both geometry and topology. To generate wireframes, we employ a flow matching model to progressively map Gaussian noise to these latents, which are subsequently decoded into complete 3D wireframes. Our method provides fine-grained modeling of complex shapes and irregular topologies, and supports both unconditional generation and generation conditioned on point cloud or image inputs. Experimental results demonstrate that, compared with state-of-the-art generative approaches, our method achieves substantial improvements in accuracy, novelty, and diversity, offering an efficient and comprehensive solution for CAD design, geometric reconstruction, and 3D content creation.
PaperID: 115,   https://arxiv.org/pdf/2505.01319    
Authors: Yifang Pan, Karan Singh, Luiz Gustavo Hafemann
Affiliations: Department of Computer Science, Dynamic Graphics Project, University of Toronto, North York, Canada ; La Forge
Title: Model See Model Do: Speech-Driven Facial Animation with Style Control
Abstract:
Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle stylistic cues while ensuring that the generated animations align closely with the input speech. Extensive qualitative, quantitative, and perceptual evaluations demonstrate the effectiveness of our method in faithfully reproducing the desired style while achieving superior lip synchronization across various speech scenarios.
PaperID: 116,   https://arxiv.org/pdf/2405.19507    
Authors: Maxine Perroni-Scharf, Zachary Ferguson, Thomas Butruille, Carlos Portela, Mina Konaković Luković
Affiliations: Massachusetts Institute of Technology (MIT), USA and CLO Virtual Fashion, New York
Title: Data-Efficient Discovery of Hyperelastic TPMS Metamaterials with Extreme Energy Dissipation
Abstract:
The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected VoR was published on January 7, 2026. One of the references was corrected. For reference purposes the VoR may still be accessed via the Supplemental Material section on this page.
PaperID: 117,   https://arxiv.org/pdf/2505.08985    
Authors: Liwen Wu, Fujun Luan, Miloš Hašan, Ravi Ramamoorthi
Affiliations: University of California San Diego, La Jolla, USA ; Adobe Research, San Jose
Title: Position-Normal Manifold for Efficient Glint Rendering on High-Resolution Normal Maps
Abstract:
Detailed microstructures on specular objects often exhibit intriguing glinty patterns under high-frequency lighting, which are challenging to render using a conventional normal-mapped BRDF. In this paper, we present a manifold-based formulation of the glint normal distribution function (NDF) that precisely captures the surface normal distributions over queried footprints. The manifold-based formulation reduces the integration required for glint NDF construction to a mesh intersection problem. Compared to previous works that rely on complex numerical approximations, our integral solution is exact and much simpler to compute, which also allows an easy adaptation of a mesh clustering hierarchy to accelerate the NDF evaluation of large footprints. Our performance and quality analysis shows that our NDF formulation achieves a glinty appearance similar to the baselines but is an order of magnitude faster. Within this framework, we further present a novel derivation of analytical shadow-masking for normal-mapped diffuse surfaces, a component that is often ignored in previous works.
PaperID: 118,   https://arxiv.org/pdf/2504.02830    
Authors: Weizheng Zhang, Hao Pan, Lin Lu, Xiaowei Duan, Xin Yan, Ruonan Wang, Qiang Du
Affiliations: Shandong University, China ; School of Software, Tsinghua University, China ; Institute of Engineering Thermophysics, Chinese Academy of Sciences
Title: DualMS: Implicit Dual-Channel Minimal Surface Optimization for Heat Exchanger Design
Abstract:
Heat exchangers are critical components in a wide range of engineering applications, from energy systems to chemical processing, where efficient thermal management is essential. The design objectives for heat exchangers include maximizing the heat exchange rate while minimizing the pressure drop, requiring both a large interface area and a smooth internal structure. State-of-the-art designs, such as triply periodic minimal surfaces (TPMS), have proven effective in optimizing heat exchange efficiency. However, TPMS designs are constrained by predefined mathematical equations, limiting their adaptability to freeform boundary shapes. Additionally, TPMS structures do not inherently control flow directions, which can lead to flow stagnation and undesirable pressure drops.
PaperID: 119,   https://arxiv.org/pdf/2501.18627    
Authors: Ziyi Zhang, Nicolas Roussel, Thomas Muller, Tizian Zeltner, Merlin Nimier-David, Fabrice Rousselle, Wenzel Jakob
Affiliations: Switzerland ; EPFL, Switzerland ; NVIDIA Research
Title: Radiance Surfaces: Optimizing Surface Representations with a 5D Radiance Field Loss
Abstract:
We present a fast and simple technique to convert images into a radiance surface-based scene representation. Building on existing radiance volume reconstruction algorithms, we introduce a subtle yet impactful modification of the loss function requiring changes to only a few lines of code: instead of integrating the radiance field along rays and supervising the resulting images, we project the training images into the scene to directly supervise the spatio-directional radiance field.
PaperID: 120,   https://arxiv.org/pdf/2504.09149    
Authors: Changhao Li, Yu Xin, Xiaowei Zhou, Ariel Shamir, Hao Zhang, Ligang Liu, Ruizhen Hu
Affiliations: University of Science and Technology of China, China ; State Key Laboratory of CAD & CG, Zhejiang University, China ; Reichman University, Israel ; Simon Fraser University, Canada ; Shenzhen University
Title: MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation
Abstract:
We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features. More information and resources can be found at: https://chli.top/MASH.
PaperID: 121,   https://arxiv.org/pdf/2502.03502    
Authors: Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho
Affiliations: Samsung Electronics, Republic of Korea
Title: DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior
Abstract:
Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-based approach, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach to produce spatially and temporally consistent VSR results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.
PaperID: 122,   https://arxiv.org/pdf/2504.19828    
Authors: Zhiming Hu, Daniel Haeufle, Syn Schmitt, Andreas Bulling
Affiliations: University of Stuttgart, Germany and The Hong Kong University of Science and Technology (Guangzhou), China ; University of Tuebingen, Germany and The Center for Bionic Intelligence Tuebingen Stuttgart
Title: HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination
Abstract:
We present HOIGaze – a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: Eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training – as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.
PaperID: 123,   https://arxiv.org/pdf/2502.00626    
Authors: Yue Chang, Mengfei Liu, Zhecheng Wang, Peter Yichen Chen, Eitan Grinspun
Affiliations: University of Toronto, Canada ; MIT CSAIL
Title: Lifting the Winding Number: Precise Discontinuities in Neural Fields for Physics Simulation
Abstract:
Cutting thin-walled deformable structures is common in daily life, but poses significant challenges for simulation due to the introduced spatial discontinuities. Traditional methods rely on mesh-based domain representations, which require frequent remeshing and refinement to accurately capture evolving discontinuities. These challenges are further compounded in reduced-space simulations, where the basis functions are inherently geometry- and mesh-dependent, making it difficult or even impossible for the basis to represent the diverse family of discontinuities introduced by cuts.
PaperID: 124,   https://arxiv.org/pdf/2505.02094    
Authors: Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, Qifeng Chen
Affiliations: Hong Kong
Title: SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations
Abstract:
We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinitely many physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.
PaperID: 125,   https://arxiv.org/pdf/2507.11465    
Authors: Nuri Ryu, Jiyun Won, Jooeun Son, Minsu Gong, Joo-Haeng Lee, Sunghyun Cho
Affiliations: Republic of Korea
Title: Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model
Abstract:
High-quality 3D assets are essential for various applications in computer graphics and 3D vision but remain scarce due to significant acquisition costs. To address this shortage, we introduce Elevate3D, a novel framework that transforms readily accessible low-quality 3D assets into higher-quality ones. At the core of Elevate3D is HFS-SDEdit, a specialized texture enhancement method that significantly improves texture quality, preserving appearance and geometry while fixing degradations. Furthermore, Elevate3D operates in a view-by-view manner, alternating between texture and geometry refinement. Unlike previous methods that have largely overlooked geometry refinement, our framework leverages geometric cues from images refined with HFS-SDEdit by employing state-of-the-art monocular geometry predictors. This approach ensures detailed and accurate geometry that aligns seamlessly with the enhanced texture. Elevate3D outperforms recent competitors by achieving state-of-the-art quality in 3D model refinement, effectively addressing the scarcity of high-quality open-source 3D assets.
PaperID: 126  
Authors: Yuan-Yuan Cheng, Qing Fang, Ligang Liu, Xiao-Ming Fu
Affiliations: University of Science and Technology of China
Title: Divide-and-Conquer Embedding
Abstract:
We propose an exact method for embedding a disk-topology triangular mesh onto any convex polygon. The method employs a divide-and-conquer approach, iteratively decomposing the embedding problem into smaller sub-problems that map sub-meshes to convex sub-polygons. The process continues until each triangle in the mesh is naturally embedded into a corresponding 3-sided polygon. The approach is supported by a constructive proof, ensuring its theoretical validity. We translate this proof into a practical algorithm, incorporating various dividing strategies and interpolation weights. Unlike previous methods, our approach preserves the connectivity of the input mesh throughout the embedding process. Extensive experiments demonstrate the efficiency and effectiveness of the proposed method.
PaperID: 127  
Authors: Eric Chen, Žiga Kovačič, Madhav Aggarwal, Abe Davis
Affiliations: Cornell University, USA and CSAIL
Title: Pocket Time-Lapse
Abstract:
This paper explores how to record, explore, and visualize long-term changes in an environment—at the scale of days, months, and even years—based on data that a single user can conveniently capture using the mobile phone they already carry. Our strategy involves making the data capture process as quick and convenient as possible so that it is easy to integrate into daily routines. This strategy yields large unstructured panoramic image datasets, which we process using novel registration and scene reconstruction approaches. Our central contribution lies in demonstrating pocket time-lapse as a novel application, made possible through several key technical contributions. These include a novel method for quickly and robustly registering thousands of unstructured panoramic images, a novel reconstruction technique for rendering time-lapse and performing state-of-the-art intrinsic image decomposition, and several large hand-captured datasets that span multiple years of data collection, totaling over 6k separate capture sessions and 50k images.
PaperID: 128  
Authors: Alvin Shi, Haomiao Wu, Theodore Kim
Affiliations: Yale University, New Haven
Title: Hyper-Dimensional Deformation Simulation
Abstract:
We present a method for simulating deformable bodies in four spatial dimensions. To accomplish this, we generalize several pieces of the traditional simulation pipeline. Starting from the meshing stage, we propose a simple method for generating a pentachoral mesh, the 4D analog of a tetrahedral mesh. Next, we show how to generalize the deformation invariants, allowing us to construct 4D hyperelastic energies that lead directly to hyper-dimensional deformation forces. Finally, we formulate collision detection and response in 4D. Our eigenanalyses of the resulting deformation and collision energies generalize to arbitrarily higher dimensions. The resulting simulations display a variety of previously unseen visual phenomena.
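The pentachoral mesh element mentioned above can be illustrated with a minimal sketch. It relies only on the standard fact that an n-simplex has volume |det(E)| / n!, where E holds the edge vectors from one vertex; the function names `det` and `pentachoron_volume` are hypothetical and are not taken from the paper.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, entry in enumerate(m[0]):
        # Minor: drop the first row and the current column.
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * entry * det(minor)
    return total


def pentachoron_volume(v0, v1, v2, v3, v4):
    """4D hypervolume of a pentachoron: |det(edge vectors)| / 4!."""
    edges = [[a - b for a, b in zip(v, v0)] for v in (v1, v2, v3, v4)]
    return abs(det(edges)) / 24.0


# Rest shape: origin plus the four unit axis points.
rest = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Deformed shape: uniform scaling by 2 in all four dimensions.
deformed = [[2.0 * c for c in v] for v in rest]

rest_volume = pentachoron_volume(*rest)
volume_ratio = pentachoron_volume(*deformed) / rest_volume
```

Under these assumptions the ratio of deformed to rest hypervolume equals the determinant of the 4D deformation gradient, here 2^4 = 16 for the uniform scaling.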
PaperID: 129  
Authors: Zizhou Huang, Chrystiano Araújo, Andrew Kunz, Denis Zorin, Daniele Panozzo, Victor Zordan
Affiliations: New York, USA and Roblox, San Mateo, USA ; Roblox, USA and Clemson University
Title: Intersection-Free Garment Retargeting
Abstract:
Manual design of garments for avatars requires considerable effort. Garment retargeting methods can reduce this effort by automatically deforming an existing garment design from one avatar to another. Previous methods are limited to human avatars with small variations in body shapes, while non-human avatars with unrealistic characteristics widely appear in games and animations. In this paper, the goal is to retarget artist-designed garments on a standard mannequin to a more general class of avatars. Since training data of various avatars wearing garments is lacking, we propose a training-free method that performs optimizations on the mesh representation of the garments, with a combination of loss functions that preserve the geometrical features in the original design, guarantee intersection-free results, and fit the garment adaptively to the avatars. Our method produces simulation-ready garment models that can be used later in avatar animations.
PaperID: 130  
Authors: Jerry Hsu, Tongtong Wang, Kui Wu, Cem Yuksel
Affiliations: University of Utah, Salt Lake City, USA ; LIGHTSPEED, Shen Zhen, China ; LIGHTSPEED, Los Angeles
Title: Stable Cosserat Rods
Abstract:
Cosserat rods have become an increasingly popular framework for simulating complex bending and twisting in thin elastic rods, used for hair, tree, and yarn-level cloth models. However, traditional approaches often encounter significant challenges in robustly and efficiently solving for valid quaternion orientations, even when employing small time steps or computationally expensive global solvers. We introduce stable Cosserat rods, a new solver that can achieve high accuracy with high stiffness levels and maintain stability under large time steps. It is also inherently suitable for parallelization. Our key contribution is a split position and rotation optimization scheme with a closed-form Gauss-Seidel quasi-static orientation update. This solver significantly improves the numerical stability with Cosserat rods, allowing faster computation and larger time steps. We validate our method across a wide range of applications, including simulations of hair, trees, yarn-level cloth, slingshots, and bridges, demonstrating its ability to handle diverse material behaviors and complex geometries. Furthermore, we show that our method is orders of magnitude faster and more stable than alternative rod solvers, such as extended position-based dynamics and discrete elastic rods.
PaperID: 131  
Authors: Anran Qi, Nico Pietroni, Maria Korosteleva, Olga Sorkine-Hornung, Adrien Bousseau
Affiliations: Centre Inria d'Université Côte d'Azur, Sophia Antipolis, France ; University of Technology Sydney, Australia ; ETH Zurich, Switzerland and Meshcapade, Switzerland ; ETH Zurich
Title: Rags2Riches: Computational Garment Reuse
Abstract:
We present the first algorithm to automatically compute sewing patterns for upcycling existing garments into new designs. Our algorithm takes as input two garment designs along with their corresponding sewing patterns and determines how to cut one of them to match the other by following garment reuse principles. Specifically, our algorithm favors the reuse of seams and hems present in the existing garment, thereby preserving the embedded value of these structural components and simplifying the fabrication of the new garment. Finding an optimal reused pattern is computationally challenging because it involves both discrete and continuous quantities. Discrete decisions include the choice of existing panels to cut from and the choice of seams and hems to reuse. Continuous variables include the precise placement of the new panels along seams and hems, and potential deformations of these panels to maximize reuse. Our key idea for making this optimization tractable is quantizing the shape of garment panels. This allows us to frame the search for an optimal reused pattern as a discrete assignment problem, which we solve efficiently with an ILP solver. We showcase our proposed pipeline on several reuse examples, including comparisons with reused patterns crafted by a professional garment designer. Additionally, we manufacture a physical reused garment to demonstrate the practical effectiveness of our approach.
PaperID: 132  
Authors: Naoto Shirashima, Hideki Todo, Yuki Yamaoka, Shizuo Kaji, Kunihiko Kobayashi, Haruna Shimotahira, Yonghao Yue
Affiliations: Japan ; Takushoku University, Japan ; AGU, Japan ; Kyushu University, Japan and Kyoto University
Title: Stroke Transfer for Participating Media
Abstract:
We present a method for generating stroke-based painterly drawings of participating media, such as smoke, fire, and clouds, by transferring stroke attributes—color, width, length, and orientation—from exemplar to animation frames. Building on the stroke transfer framework, we introduce features and basis fields capturing lighting-, view-, and geometry-dependent information, extending surface-based ones (e.g., intensity, apparent normals and curvatures, and distance from silhouettes) to volumetric scenes while supporting traditional surface objects. Novel features, including apparent relative velocity and mean free-path, address non-rigid flow and dynamic scenes. Our system combines automated exemplar selection, user-guided style learning, and temporally coherent stroke generation, enabling artistic and expressive visualizations of dynamic media.
PaperID: 133  
Authors: Mingi Lee, Dongsu Zhang, Clément Jambon, Young Min Kim
Affiliations: Seoul National University, Republic of Korea
Title: BrepDiff: Single-Stage B-rep Diffusion Model
Abstract:
The Boundary Representation (B-rep) is a widely used 3D model representation of most consumer products designed with CAD software. However, its highly irregular and sparse set of relationships poses significant challenges for designing a generative model tailored to B-reps. Existing methods use multi-stage pipelines to satisfy the complex constraints sequentially. As a result, the final geometry cannot incorporate user edits due to the non-deterministic dependencies between cascaded stages. In contrast, we propose BrepDiff, a single-stage diffusion model for B-rep generation. We present a masked UV grid representation consisting of structured point samples from faces, serving as input for a diffusion transformer. By introducing an asynchronous and shifted noise schedule, we improve the training signal, enabling the diffusion model to better capture the distribution of UV grids. The explicitness of our masked UV grid representation enables users to intuitively understand and freely design surface geometry without being constrained by topological validity. The interconnectivity can be derived from the face layout, which is later processed into a valid solid volume during post-processing. Our approach achieves performance on par with state-of-the-art cascaded models while offering complex and diverse manipulations of geometry and topology, such as shape completion, merging, and interpolation.
PaperID: 134  
Authors: Aviv Segall, Jing Ren, Martin Schwarz, Olga Sorkine-Hornung
Affiliations: ETH Zurich, Switzerland ; University of Basel
Title: Computational Modeling of Gothic Microarchitecture
Abstract:
Gothic microarchitecture is a design phenomenon widely observed in late medieval European art, comprising sculptural works that emulate the forms and structural composition of monumental Gothic architecture. Despite its prevalence in preserved artifacts, the design and construction methods of Gothic microarchitecture used by artisans remain a mystery, as these processes were orally transmitted and rarely documented. The Basel goldsmith drawings (“Basler Goldschmiedrisse”), a collection of over 200 late 15th-century design drawings from the Upper Rhine region, provide a rare glimpse into the workshop practices of Gothic artisans. This collection consists of unpaired 2D drawings, including top-view and side-view projections of Gothic microarchitecture, featuring nested curve networks without annotations or explicitly articulated design principles. Understanding these 2D drawings and reconstructing the 3D objects they represent has long posed a significant challenge due to the lack of documentation and the complexity of the designs. In this work, we propose a framework of simple yet expressive geometric principles to model Gothic microarchitecture as 3D curve networks, using limited input such as historical 2D drawings. Our approach formalizes a historically informed design space, constrained to tools traditionally available to artisans–namely compass and straightedge–and enables faithful reproduction of Gothic microarchitecture that conforms to physical artifacts. Our framework is intuitive and efficient, allowing users to interactively create 3D Gothic microarchitecture with minimal effort. It bridges the gap between historical artistry and modern computational design, while also shedding light on a lost chapter of Gothic craftsmanship.
PaperID: 135  
Authors: Karlis Martins Briedis, Abdelaziz Djelouah, Raphaël Ortiz, Markus Gross, Christopher Schroers
Affiliations: DisneyResearch|Studios, Switzerland and ETH Zürich, Switzerland ; DisneyResearch|Studios
Title: Controllable Tracking-Based Video Frame Interpolation
Abstract:
Temporal video frame interpolation has been an active area of research in recent years, with a primary focus on motion estimation, compensation, and synthesis of the final frame. While recent methods have shown good quality results in many cases, they can still fail in challenging scenarios. Moreover, they typically produce fixed outputs with no means of control, further limiting their application in film production pipelines. In this work, we address the less explored problem of user-assisted frame interpolation to improve quality and enable control over the appearance and motion of interpolated frames. To this end, we introduce a tracking-based video frame interpolation method that utilizes sparse point tracks, first estimated and interpolated with existing point tracking methods and then optionally refined by the user. Additionally, we propose a mechanism for controlling the levels of hallucination in interpolated frames through inference-time model weight adaptation, allowing a continuous trade-off between hallucination and blurriness.
PaperID: 136  
Authors: Kei Iwasaki, Yoshinori Dobashi
Affiliations: Saitama University, Japan and Prometech CG Research, Japan ; Hokkaido University
Title: Spherical Lighting with Spherical Harmonics Hessian
Abstract:
In this paper, we introduce a second-order derivative of spherical harmonics, the spherical harmonics Hessian, and solid spherical harmonics, a variant of spherical harmonics, to the computer graphics community. These mathematical tools are used to develop an analytical representation of the Hessian matrix of spherical harmonics coefficients for spherical lights. We apply our analytic representation of the Hessian matrix to grid-based SH lighting rendering applications with many spherical lights that store the incident light field as spherical harmonics coefficients and their spatial gradient on a sparse grid. We develop a Hessian-based error metric, with which our method automatically and adaptively subdivides the grid depending on whether interpolation using the spatial gradient is appropriate. Our method can be easily incorporated into the grid-based precomputed radiance transfer (PRT) framework with small additional storage. We demonstrate that our adaptive grid, subdivided using the Hessian-based error metric, can substantially improve the rendering quality in equal-time grid construction.
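As a hedged illustration of why a Hessian is a natural error indicator for gradient-based interpolation, consider the standard second-order Taylor argument. This is textbook reasoning, not the paper's exact metric, and the symbols $c$ (one SH coefficient as a function of position), $x$ (a grid point), $d$ (the offset to the shading point), and $H$ (the spatial Hessian of $c$) are assumed notation.

```latex
c(x + d) \;\approx\; c(x) + \nabla c(x)^{\top} d,
\qquad
\bigl|\, c(x + d) - c(x) - \nabla c(x)^{\top} d \,\bigr|
\;\approx\; \tfrac{1}{2}\,\bigl|\, d^{\top} H(x)\, d \,\bigr|
```

Under this reading, a grid cell where the quadratic term is large for offsets $d$ within the cell would be a plausible candidate for subdivision.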
PaperID: 137  
Authors: Karran Pandey, Anita Hu, Clement Fuji Tsang, Or Perel, Karan Singh, Maria Shugrina
Affiliations: University of Toronto, Canada ; NVIDIA, Canada ; NVIDIA
Title: Painting with 3D Gaussian Splat Brushes
Abstract:
We explore interactive painting on 3D Gaussian splat scenes and other surfaces using 3D Gaussian splat brushes, each containing a chunk of the realistic texture-geometry that makes capture representations so appealing. The suite of brush capabilities we propose enables 3D artists to capture and then remix real-world imagery and geometry with direct interactive control. In particular, we propose a set of algorithms for 1) selecting subsets of Gaussians as a brush pattern interactively, 2) applying the brush interactively to the same or other 3DGS scenes or other 3D surfaces using stamp-based painting, and 3) using an inpainting Diffusion Model to adjust stamp seams for seamless and realistic appearance. We also present an ensemble of artistic brush parameters, resulting in a wide range of appearance options for the same brush. Our contribution is a judicious combination of algorithms, design features, and creative affordances that together enable the first prototype implementation of interactive brush-based painting with 3D Gaussian splats. We evaluate our system by showing compelling results on a diverse set of 3D scenes, and through a user study with VFX/animation professionals that validates system features, workflow, and potential for creative impact. Code and data for this paper can be accessed from splatpainting.github.io.
PaperID: 138  
Authors: Zhi Zhou, Chao Li, Zhenyuan Zhang, Mingcong Tang, Zibin Li, Shuhang Luan, Zhangjin Huang
Affiliations: University of Science and Technology of China, China and Tencent, China ; Tencent
Title: Gaussian Compression for Precomputed Indirect Illumination
Abstract:
Precomputed global illumination (GI) techniques, such as light probes, particularly focus on capturing indirect illumination and have gained widespread adoption. However, as the scale of the scenes continues to expand, the demand for storage space and runtime memory for light probes also increases substantially. To address this issue, we propose a novel Gaussian fitting compression technique specifically designed for light field probes, which enables the use of denser samples to describe illumination in complex scenes. The core idea of our method is utilizing low-bit adaptive Gaussian functions to store the latent representation of light probes, enabling parallel and high-speed decompression on the GPU. Additionally, we implement a custom gradient propagation process to replace conventional inference frameworks, like PyTorch, ensuring an exceptional compression speed.
PaperID: 139  
Authors: Sarah Taylor, Salvador Medina, Jonathan Windle, Erica Alcusa Sáez, Iain Matthews
Affiliations: Epic Games
Title: xADA: Controllable and Expressive Audio-Driven Animation
Abstract:
We introduce xADA, a generative model for creating expressive and realistic animation of the face, tongue, and head directly from speech audio. Our approach leverages the pretrained Whisper audio encoder to extract rich speech features, which are decoded into face and head animation using a series of gated recurrent unit (GRU) networks. The generated animation maps directly onto MetaHuman-compatible rig controls, enabling seamless integration into industry-standard content creation pipelines. xADA operates fully automatically, with an option for users to override the detected emotion and/or blink timings. xADA generalizes across languages and voice styles, and can animate non-verbal sounds. Quantitative evaluation and a user study demonstrate that xADA produces state-of-the-art animation with high realism, frequently indistinguishable from ground truth performance. Additionally, we outline a comprehensive data capture protocol designed to collect an extensive range of speech and non-verbal sounds for training animation models.
PaperID: 140  
Authors: Corentin Salaün, Martin Balint, Laurent Belcour, Eric Heitz, Gurprit Singh, Karol Myszkowski
Affiliations: Max Planck Institute for Informatics, Germany ; Intel, France ; Intel
Title: Histogram Stratification for Spatio-Temporal Reservoir Sampling
Abstract:
Monte Carlo (MC) rendering is a widely used approach for photorealistic image synthesis, yet real-time applications often limit sampling to one path per pixel, resulting in high noise levels. To mitigate this, resampled importance sampling (RIS) has shown promise by approximating ideal sample distributions through a discrete set of candidates, avoiding the complexity of neural models or data-intensive structures. However, current RIS techniques often rely on random sampling, which fails to maximize the potential of the candidate pool. We propose a two-step approach that first organizes sample candidates into local histograms and then samples the histograms using quasi-Monte Carlo and antithetic patterns. This can be done with minimal overhead and reduces rendering error, increasing visual quality. Additionally, we show how it can be combined with blue noise error distribution to perceptually reduce noise artifacts. Our approach yields a higher-quality resampling estimator with enhanced noise reduction, demonstrating significant improvements in real-time rendering tasks.
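The selection step described above can be sketched generically. This is a minimal illustration, not the paper's algorithm. It assumes candidate resampling weights are already available, builds a normalized CDF over them, and picks candidates using a base-2 radical inverse (van der Corput) sequence and an antithetic pair (u, 1 - u) instead of independent random numbers. All function names are hypothetical.

```python
import bisect
import itertools


def build_cdf(weights):
    """Normalized cumulative distribution over candidate resampling weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("need at least one positive weight")
    cdf = list(itertools.accumulate(w / total for w in weights))
    cdf[-1] = 1.0  # guard against floating-point drift in the last bin
    return cdf


def select(cdf, u):
    """Index of the candidate whose CDF bin contains u, for u in [0, 1]."""
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


def radical_inverse_base2(i):
    """Van der Corput sequence: mirror the bits of i around the binary point."""
    result, digit_value = 0.0, 0.5
    while i > 0:
        if i & 1:
            result += digit_value
        i >>= 1
        digit_value *= 0.5
    return result


def antithetic_pair(cdf, u):
    """Select two candidates from mirrored positions u and 1 - u in the CDF."""
    return select(cdf, u), select(cdf, 1.0 - u)


weights = [1.0, 1.0, 2.0]  # illustrative candidate weights
cdf = build_cdf(weights)
picks = [select(cdf, radical_inverse_base2(i)) for i in range(4)]
counts = [picks.count(k) for k in range(len(weights))]
```

With four low-discrepancy points, the selection counts in this toy case are exactly proportional to the weights, which independent random numbers would only achieve on average.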
PaperID: 141  
Authors: Xiaochun Tong, Toshiya Hachisuka
Affiliations: University of Waterloo
Title: Practical Stylized Nonlinear Monte Carlo Rendering
Abstract:
The recent formulation of the stylized rendering equation (SRE) models stylization by applying nonlinear functions to reflected radiance recursively at each bounce, allowing a seamless blend between stylized and physically based light transport. A naive estimator has to branch at each stylized surface, resulting in exponential computation and storage cost. We propose a practical approach for rendering scenes with the SRE at a tractable cost. We first propose nonlinear path filtering (NL-PF) that caches the radiance evaluations at intermediate bounces, reducing the exponential sampling cost of the branching estimator of the SRE to polynomial. Despite the effectiveness of NL-PF, its high memory cost makes it less scalable. To further improve efficiency, we propose nonlinear radiance caching (NL-NRC) where we apply a compact neural network to store radiance fields. Our NL-NRC has the same linear time sampling cost as a non-branching path tracer and can solve the SRE with a high number of bounces and recursive stylization. Our key insight is that, by allowing the network to learn outgoing radiance prior to applying any nonlinear function, the network converges to the correct solution, even when we only have access to biased gradients due to nonlinearity. Our NL-NRC enables rendering scenes with arbitrary, highly nonlinear stylization while achieving significant speedup over branching estimators.
PaperID: 142  
Authors: Navid Ansari, Hans-Peter Seidel, Vahid Babaei
Affiliations: Max Planck Institute for Informatics
Title: Accelerated Gamut Discovery via Massive Parallelization
Abstract:
This paper presents a scalable framework for efficiently discovering the performance gamut of different processes. Gamut boundaries comprise the set of highest-performing solutions within a design space. While sampling methods are often inefficient or prone to premature convergence, Bayesian optimization struggles with taking advantage of existing large-scale parallel computation or experimentation. To address these challenges, we utilize Bayesian neural networks as scalable surrogates for performance prediction and uncertainty estimation. We further introduce a novel acquisition function that combines the diversity-driven exploration of stochastic optimization with the information-efficient exploitation of Bayesian optimization. This enables generating large, high-quality batches of samples. Our approach leverages large batch sizes to reduce the number of iterations needed for optimization. We demonstrate its effectiveness on real-world engineering and robotic problems, achieving faster and more extensive discovery of the performance gamut. Code and data are available at https://gitlab.mpi-klsb.mpg.de/nansari/lbn_mobo.
PaperID: 143  
Authors: Davide Sforza, Marzia Riso, Filippo Muzzini, Nicola Capodieci, Fabio Pellacini
Affiliations: Sapienza University of Rome, Italy and INRIA, Université Côte d'Azur, Sophia Antipolis, France ; University of Modena and Reggio Emilia, Italy ; University of Modena and Reggio Emilia
Title: Interactive Optimization of Scaffolded Procedural Patterns
Abstract:
A procedural program is the representation of a family of assets that share the same structural or semantic properties, whose final appearance is determined by different parameter assignments. Identifying the parameter values that define a desired asset is usually a time-consuming operation, since it requires manually tuning parameters separately and in a non-intuitive manner. In the domain of procedural patterns, recent works focused on estimating parameter values to match a target render or sketch, using parameter optimization or inference via neural networks. However, these approaches are neither fast enough for interactive design nor precise enough to give direct control. In this work, we propose an interactive method for procedural parameter estimation based on the idea of scaffolded procedural patterns. A scaffolded procedural pattern is a sequence of procedural programs that model a pattern in a coarse-to-fine manner, in which the desired pattern appearance is reached step-by-step by inheriting previously optimized parameters. Through scaffolding, patterns are more straightforward to sketch for users and easier to optimize for most algorithms. In our implementation, patterns are represented as procedural signed distance functions whose parameters are estimated with a gradient-free optimization method that runs in real-time on the GPU. We show that scaffolded patterns can be created with a node-based interface familiar to artists. We validate our approach by creating and interactively editing several scaffolded patterns. We show the effectiveness of scaffolding through a user study, where scaffolding enhances both the output quality and the editing experience with respect to approaches that optimize the procedural parameters all at once. We also perform a comparison with previous strategies and provide several recordings of real-time editing sessions in the accompanying materials.
PaperID: 144  
Authors: Fumiya Narita, Nimiko Ochiai, Takashi Kanai, Ryoichi Ando
Affiliations: The University of Tokyo, Japan and GAME FREAK Inc., Japan ; GAME FREAK Inc., Japan ; Unaffiliated
Title: Quadtree Tall Cells for Eulerian Liquid Simulation
Abstract:
This paper introduces a novel grid structure that extends tall cell methods for efficient deep water simulation. Unlike previous tall cell methods, which are designed to capture all the fine details around liquid surfaces, our approach subdivides tall cells horizontally, allowing for more aggressive adaptivity and a significant reduction in the number of cells. The foundation of our method lies in a new variational formulation of Poisson’s equations for pressure solve tailored for tall-cell grids, which naturally handles the transition of variable-sized cells. This variational view not only permits the use of the efficacy-proven conjugate gradient method but also facilitates monolithic two-way coupled rigid bodies. The key distinction between our method and previous general adaptive approaches, such as tetrahedral or octree grids, is the simplification of adaptive grid construction. Our method performs grid subdivision in a quadtree fashion, rather than an octree. These 2D cells are then simply extended vertically to complete the tall cell population. We demonstrate that this novel form of adaptivity, which we refer to as quadtree tall cells, delivers superior performance compared to traditional uniform tall cells.
PaperID: 145  
Authors: Diyang Zhang, Zhendong Wang, Zegao Liu, Xinming Pei, Weiwei Xu, Huamin Wang
Affiliations: StyleD Research, China ; State Key Laboratory of CAD & CG, Zhejiang University
Title: Physics-inspired Estimation of Optimal Cloth Mesh Resolution
Abstract:
In this paper, we tackle an important yet often overlooked question: What is the optimal mesh resolution for cloth simulation, without relying on preliminary simulations? The optimal resolution should be sufficient to capture fine details of all potential wrinkles, while avoiding an unnecessarily high resolution that wastes computational time and memory on excessive vertices. This challenge stems from the complex nature of wrinkle distribution, which varies spatially, temporally, and anisotropically across different orientations. To address this, we propose a method to estimate the optimal cloth mesh resolution, based on two key factors: material stiffness and boundary conditions.
PaperID: 146  
Authors: Avinab Saha, Yu-Chih Chen, Jean-Charles Bazin, Christian Häne, Ioannis Katsavounidis, Alexandre Chapiro, Alan Bovik
Affiliations: The University of Texas at Austin, USA ; Reality Labs
Title: FaceExpressions-70k: A Dataset of Perceived Expression Differences
Abstract:
Facial expressions are key to human communication, conveying emotions and intentions. Given the rising popularity of digital humans and avatars, the ability to accurately represent facial expressions in real time has become an important topic. However, quantifying perceived differences between pairs of expressions is difficult, and no comprehensive subjective datasets are available for testing. This work introduces a new dataset targeting this problem: FaceExpressions-70k. Obtained via crowdsourcing, our dataset contains 70,500 subjective expression comparisons rated by over 1,000 study participants. We demonstrate the applicability of the dataset for training perceptual expression difference models and guiding decisions on acceptable latency and sampling rates for facial expressions when driving a face avatar.
PaperID: 147  
Authors: Kai Yan, Cheng Zhang, Sébastien Speierer, Guangyan Cai, Yufeng Zhu, Zhao Dong, Shuang Zhao
Affiliations: University of California Irvine, USA ; Reality Labs
Title: Image-space Adaptive Sampling for Fast Inverse Rendering
Abstract:
Inverse rendering is crucial for many scientific and engineering disciplines. Recent progress in differentiable rendering has led to efficient differentiation of the full image formation process with respect to scene parameters, enabling gradient-based optimization.
PaperID: 148  
Authors: Huibiao Wen, Guilong He, Rui Xu, Shuangmin Chen, Shiqing Xin, Zhenyu Shu, Taku Komura, Jieqing Feng, Wenping Wang, Changhe Tu
Affiliations: Shandong University, China and University of Health and Rehabilitation Sciences, Hong Kong, China ; Qingdao University of Science and Technology, China ; NingboTech University, China ; State Key Laboratory of CAD & CG, Zhejiang University, China ; Texas A&M University, College Station
Title: Feature-Preserving Mesh Repair via Restricted Power Diagram
Abstract:
Mesh repair is a critical process in 3D geometry processing aimed at correcting errors and imperfections in polygonal meshes to produce watertight, manifold, and feature-preserving meshes suitable for downstream tasks. While errors such as degeneracies, duplication, holes, and overlaps can be addressed through standard repair processes, cracks along trimmed curves require special attention and should ideally be repaired to align with sharp feature lines.
PaperID: 149  
Authors: Bosheng Li, Nikolas Schwarz, Wojtek Palubicki, Sören Pirk, Dominik L. Michels, Bedrich Benes
Affiliations: Purdue University, West Lafayette, USA ; Kiel University, Germany ; Adam Mickiewicz University, Poland ; Kiel University, Germany ; King Abdullah University of Science and Technology (KAUST)
Title: Stressful Tree Modeling: Breaking Branches with Strands
Abstract:
We propose a novel approach for the computational modeling of lignified tissues, such as those found in tree branches and timber. We leverage a state-of-the-art strand-based representation for tree form, which we extend to describe biophysical processes at short and long time scales. Simulations at short time scales enable us to model different breaking patterns due to branch bending, twisting, and breaking. On long timescales, our method enables the simulation of realistic branch shapes under the influence of plausible biophysical processes, such as the development of compression and tension wood. We specifically focus on computationally fast simulations of woody material, enabling the interactive exploration of branches and wood breaking. By leveraging Cosserat rod physics, our method enables the generation of a wide variety of breaking patterns. We showcase the capabilities of our method by performing and visualizing numerous experiments.
PaperID: 150  
Authors: Kechun Wang, Renjie Chen
Affiliations: School of Mathematical Sciences, University of Science and Technology of China
Title: PaRas: A Rasterizer for Large-Scale Parametric Surfaces
Abstract:
The advantages of higher-order surfaces, such as their ability to represent complex geometry compactly and smoothly, have led to their increasing use in computer graphics. This trend underscores the importance of developing efficient rendering algorithms tailored for these representations. We introduce PaRas, a highly performant rasterizer for real-time rendering of large-scale parametric surfaces with high precision. Unlike conventional graphics pipelines that rely on hardware tessellation to convert smooth surfaces into numerous flat triangles, our method provides a highly efficient and parallel approach to directly rasterize parametric surfaces. PaRas seamlessly integrates into existing workflows, enabling smooth surfaces to be handled with the same ease as triangle meshes. To accomplish this, we formulate the rasterization of parametric surfaces as a point inversion problem, employing a Newton-type iteration on the GPU to compute precise solutions. The framework’s effectiveness is demonstrated on quartic triangular Bézier patches and rational Bézier patches, both commonly used in high-precision modeling and industrial applications. Experimental results indicate that our rendering pipeline achieves higher efficiency and greater accuracy compared to traditional hardware tessellation techniques.
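The point-inversion idea in the PaRas abstract can be illustrated with a minimal sketch: given a target screen position, a Newton-type iteration searches for the parameters (u, v) whose mapped position matches it. The function names, the toy mapping, and the stopping tolerances below are assumptions chosen for illustration; the paper's GPU implementation and its Bézier patch evaluation are not reproduced here.

```python
def newton_point_inversion(surface, jacobian, target, uv0=(0.5, 0.5),
                           tol=1e-12, max_iter=50):
    """Find (u, v) with surface(u, v) ~= target using 2D Newton iteration."""
    u, v = uv0
    tx, ty = target
    for _ in range(max_iter):
        x, y = surface(u, v)
        rx, ry = x - tx, y - ty              # residual in screen space
        if rx * rx + ry * ry < tol * tol:
            break
        (a, b), (c, d) = jacobian(u, v)      # 2x2 Jacobian of the mapping
        det = a * d - b * c
        if abs(det) < 1e-15:
            raise ZeroDivisionError("singular Jacobian in Newton step")
        # Solve J * [du, dv]^T = [rx, ry]^T with the closed-form 2x2 inverse.
        du = (d * rx - b * ry) / det
        dv = (-c * rx + a * ry) / det
        u -= du
        v -= dv
    return u, v


# Hypothetical toy mapping standing in for a projected parametric patch.
def toy_surface(u, v):
    return (u + 0.1 * v * v, v + 0.1 * u * u)


def toy_jacobian(u, v):
    return ((1.0, 0.2 * v), (0.2 * u, 1.0))


target = toy_surface(0.3, 0.7)
u_sol, v_sol = newton_point_inversion(toy_surface, toy_jacobian, target)
```

In a rasterizer, one such solve would run per covered pixel, which is why a method of this kind parallelizes naturally on the GPU.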
PaperID: 151  
Authors: Alon Feldman, Mirela Ben-Chen
Affiliations: Technion Israel Institute of Technology
Title: On Planar Shape Interpolation With Logarithmic Metric Blending
Abstract:
We present an interpolation method for planar shapes using logarithmic metric blending. Our approach generalizes prior work on pullback metrics to a framework, allowing us to employ different techniques, such as logarithmic blending of symmetric positive definite matrices, to have precise control over both conformal and area distortions. Key contributions include generalizing the continuous blending scheme and its adaptation to discrete mesh interpolation through different conformal and isometric parameterizations. Experimental results demonstrate that our method outperforms existing techniques in achieving bounded distortions, making it a compelling choice for applications in animation and morphing.
PaperID: 152  
Authors: Jeongmin Gu, Bochang Moon
Affiliations: Gwangju Institute of Science and Technology, Republic of Korea
Title: James-Stein Gradient Combiner for Inverse Monte Carlo Rendering
Abstract:
Inferring scene parameters such as BSDFs and volume densities from user-provided target images has been achieved using a gradient-based optimization framework, which iteratively updates the parameters using the gradient of a loss function defined by the differences between rendered and target images. The gradient can be unbiasedly estimated via physics-based rendering, i.e., differentiable Monte Carlo rendering. However, the estimated gradient can become noisy unless a large number of samples are used for gradient estimation, and relying on this noisy gradient often slows optimization convergence. An alternative is to exploit a biased version of the gradient, e.g., a filtered gradient, to achieve faster optimization convergence. Unfortunately, this can result in less noisy but overly blurred scene parameters compared to those obtained using unbiased gradients. This paper proposes a gradient combiner that blends unbiased and biased gradients in parameter space instead of relying solely on one gradient type (i.e., unbiased or biased). We demonstrate that optimization with our combined gradient enables more accurate inference of scene parameters than using unbiased or biased gradients alone.
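To make the blending idea concrete, the following is a minimal sketch of the textbook positive-part James-Stein estimator, which shrinks a noisy unbiased vector toward a fixed target. Treating the biased gradient as the shrinkage target and assuming a known, isotropic per-component variance are simplifications made here for illustration; the function name and parameters are hypothetical and this is not the combiner described in the paper.

```python
def james_stein_combine(unbiased, biased, variance):
    """Shrink the unbiased gradient toward the biased one (positive-part
    James-Stein, assuming isotropic noise with the given variance)."""
    p = len(unbiased)
    if p < 3:
        # The classical estimator only improves on the raw estimate for p >= 3.
        return list(unbiased)
    diff = [u - b for u, b in zip(unbiased, biased)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        return list(biased)
    # Shrinkage factor in [0, 1]: 1 keeps the unbiased gradient,
    # 0 falls back entirely to the biased gradient.
    shrink = max(0.0, 1.0 - (p - 2) * variance / sq_norm)
    return [b + shrink * d for b, d in zip(biased, diff)]


g_unbiased = [1.0, 2.0, 3.0, 4.0]
g_biased = [0.0, 0.0, 0.0, 0.0]
low_noise = james_stein_combine(g_unbiased, g_biased, 0.0)
mid_noise = james_stein_combine(g_unbiased, g_biased, 3.0)
high_noise = james_stein_combine(g_unbiased, g_biased, 100.0)
```

The more the noise variance dominates the squared distance between the two gradients, the more weight moves to the biased estimate.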
PaperID: 153  
Authors: Henglei Lv, Bailin Deng, Jianzhu Guo, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Lin Gao
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, China ; Cardiff University, United Kingdom ; Kuaishou Technology, China ; Kuaishou Technology
Title: GSHeadRelight: Fast Relightability for 3D Gaussian Head Synthesis
Abstract:
Relighting and novel view synthesis of human portraits are essential in applications such as portrait photography, virtual reality (VR), and augmented reality (AR). Despite recent progress, 3D-aware portrait relighting remains challenging due to the demands for photorealistic rendering, real-time performance, and generalization to unseen subjects. Existing works either rely on supervision from limited and expensive light stage captured data or produce suboptimal results. Moreover, many works are based on generative NeRFs, which suffer from poor 3D consistency and low real-time performance. We resort to recent progress on generative 3D Gaussians and design a lighting model based on a unified neural radiance transfer representation, which responds linearly to incident light. Using only in-the-wild images, our method achieves state-of-the-art relighting results and a significantly faster rendering speed (12×) compared to previous 3D-aware portrait relighting research.
PaperID: 154  
Authors: Zhengming Yu, Tianye Li, Jingxiang Sun, Omer Shapira, Seonwook Park, Michael Stengel, Matthew Chan, Xin Li, Wenping Wang, Koki Nagano, Shalini De Mello
Affiliations: Santa Clara, USA and Texas A&M University, College Station, USA ; NVIDIA, China and Tsinghua University, China ; NVIDIA, New York, USA ; Texas A&M University, Los Angeles, USA ; NVIDIA
Title: GAIA: Generative Animatable Interactive Avatars with Expression-conditioned Gaussians
Abstract:
3D generative models of faces trained on in-the-wild image collections have improved greatly in recent times, offering better visual fidelity and view consistency. Making such generative models animatable is a hard yet rewarding task, with applications in virtual AI agents, character animation, and telepresence. However, it is not trivial to learn a well-behaved animation model with the generative setting, as the learned latent space aims to best capture the data distribution, often omitting details such as dynamic appearance and entangling animation with other factors that affect controllability. We present GAIA: Generative Animatable Interactive Avatars, which is able to generate high-fidelity 3D head avatars for both realistic animation and rendering. To achieve consistency during animation, we learn to generate Gaussians embedded in an underlying morphable model for human heads via a shared UV parameterization. For modeling realistic animation, we further design the generator to learn expression-conditioned details for both geometric deformation and dynamic appearance. Finally, facing an inevitable entanglement problem between facial identity and expression, we propose a novel two-branch architecture that encourages the generator to disentangle identity and expression. On existing benchmarks, GAIA achieves state-of-the-art performance in visual quality as well as realistic animation. The generated Gaussian-based avatar supports highly efficient animation and rendering, making it readily available for interactive animation and appearance editing.
PaperID: 155  
Authors: Rahul Mitra, Mattéo Couplet, Tongtong Wang, Megan Hoffman, Kui Wu, Edward Chien
Affiliations: Boston University, USA and LightSpeed Studios, Los Angeles, USA ; LightSpeed Studios, Shen Zhen, China ; Northeastern University
Title: Curl Quantization for Automatic Placement of Knit Singularities
Abstract:
We develop a method for automatic placement of knit singularities based on curl quantization, extending the knit-planning frameworks of Mitra et al. [2024; 2023]. Stripe patterns are generated that closely follow the isolines of an underlying knitting time function, and have course and wale singularities in regions of high curl for the normalized time function gradient and its 90° rotated field, respectively. Singularities are placed in an iterative fashion, and we show that this strategy allows us to easily maintain the structural constraints necessary for machine-knitting, e.g., the helix-free constraint, and to satisfy user constraints such as stripe alignment and singularity placement. Our more performant approach obviates the need for a mixed-integer solve [Mitra et al. 2023], manual fixing of singularity positions, or the running of a singularity matching procedure in post-processing [Mitra et al. 2024]. Our global optimization also produces smooth knit graphs that provide quick simulation-free previews of rendered knits without the surface artifacts of competing methods. Furthermore, we extend our method to the popular cut-and-sew garment design paradigm. We validate our method by machine-knitting and rendering yarn-based visualizations of prototypical models in the 3D and cut-and-sew settings.
PaperID: 156  
Authors: Pengfei Zhu, Jie Guo, Yifan Liu, Qi Sun, Yanxiang Wang, Keheng Xu, Ligang Liu, Yanwen Guo
Affiliations: Nanjing University, China ; University of Science and Technology of China
Title: Appearance-aware Multi-view SVBRDF Reconstruction via Deep Reinforcement Learning
Abstract:
Recent advancements in deep learning have revolutionized the reconstruction of spatially-varying surface reflectance of real-world objects. Many existing methods have successfully recovered high-quality reflectance maps using a remarkably limited number of images captured by a lightweight handheld camera and a flash-like light source. As the samples become sparse, the choice of the sampling set has a significant impact on the results. To determine the best sampling set for each material while ensuring minimal capture costs, we introduce an appearance-aware adaptive sampling method in this paper. We model the sampling process as a sequential decision-making problem, and employ a deep reinforcement learning (DRL) framework to solve it. At each step, an agent (NBVL Planner), after being trained on a specially designed dataset, plans the next best view-lighting (NBVL) pair based on the appearance of the material recognized so far. Once stopped, the sequence of the NBVLs constitutes the best sampling set for the material. We show, through extensive experiments on both synthetic materials and real-world cases, that the best sampling set extracted by our method outperforms other sampling sets, especially for challenging materials featuring globally-varying specular reflectance.
PaperID: 157  
Authors: Youyang Du, Lu Wang, Beibei Wang
Affiliations: Shandong University, China ; Nanjing University
Title: Facial Microscopic Structures Synthesis from a Single Unconstrained Image
Abstract:
Obtaining 3D faces with microscopic structures from a single unconstrained image is challenging. The complexities of wrinkles and pores at a microscopic level, coupled with the blurriness of the input image, raise the difficulty. However, the distribution of wrinkles and pores tends to follow a specialized pattern, which can provide a strong prior for synthesizing them. Therefore, a key to microstructure synthesis is a parametric wrinkle and pore model with controllable semantic parameters. Additionally, ensuring differentiability is essential for enabling optimization through gradient descent methods. To this end, we propose a novel framework designed to reconstruct facial micro-wrinkles and pores from naturally captured images efficiently. At the core of our framework is a differentiable representation of wrinkles and pores via a graph neural network (GNN), which can simulate the complex interactions between adjacent wrinkles by multiple graph convolutions. Furthermore, to overcome the problem of inconsistency between the blurry input and clear wrinkles during optimization, we propose a Direction Distribution Similarity that ensures that the wrinkle-directional features remain consistent. Consequently, our framework can synthesize facial micro-structures from a blurry skin image patch, which is cropped from a naturally captured facial image, in about 2 seconds on average. Our framework can seamlessly integrate with existing macroscopic facial detail reconstruction methods to enhance their detailed appearance. We showcase this capability on several works, including DECA, HRN, and FaceScape.
PaperID: 158  
Authors: Jiawei Huang, Shaokun Zheng, Kun Xu, Yoshifumi Kitamura, Jiaping Wang
Affiliations: International Digital Economy Academy, China ; Tsinghua University, China ; Tohoku University
Title: Guided Lens Sampling for Efficient Monte Carlo Circle-of-Confusion Rendering
Abstract:
We introduce a guided lens sampling method for efficient rendering of circles of confusion (CoCs). While traditional Monte Carlo techniques simulate depth-of-field (DoF) effects by perturbing camera rays at the lens, uniform lens sampling often results in significant noise by failing to prioritize rays toward highlight regions in the scene. Although path guiding has proven effective for global illumination by learning importance distributions for incoming radiance, no comparable guiding technique for CoCs exists, primarily due to the strong parallax between adjacent pixels. We model highlight spots in world space using a globally shared radiance field, which is then transformed into lens space through a bipolar-cone projection to guide camera ray generation. We implement this theory using 3D Gaussians, achieving fast, robust guiding with minimal computational and storage overhead, making it suitable for production rendering. We also propose two extensions to further enhance local adaptation. Our experiments show that this approach significantly improves the sampling efficiency for CoC rendering.
PaperID: 159  
Authors: Jeffrey Liu, Daqi Lin, Markus Kettunen, Chris Wyman, Ravi Ramamoorthi
Affiliations: University of Illinois Urbana-Champaign, USA ; NVIDIA, Finland ; NVIDIA, La Jolla, USA and University of California San Diego
Title: Reservoir Splatting for Temporal Path Resampling and Motion Blur
Abstract:
Recent extensions to spatiotemporal path reuse, or ReSTIR, improve rendering efficiency in the presence of high-frequency content by augmenting path reservoirs to represent contributions over full pixel footprints. Still, if historical paths fail to contribute to future frames, these benefits disappear. Prior ReSTIR work backprojects to the prior frame to identify paths for reuse. Backprojection can fail to find relevant paths for many reasons, including moving cameras or subpixel geometry with differing motion.
PaperID: 160  
Authors: Dewen Guo, Zhendong Wang, Zegao Liu, Sheng Li, Guoping Wang, Yin Yang, Huamin Wang
Affiliations: Peking University, China and StyleD Research, China ; StyleD Research, China ; University of Utah, Salt Lake City, USA ; StyleD Research
Title: Fast Physics-Based Modeling of Knots and Ties using Templates
Abstract:
Knots and ties are captivating elements of digital garments and accessories, but they have been notoriously challenging and computationally expensive to model manually. In this paper, we propose a physics-based modeling system for knots and ties using templates. The primary challenge lies in transforming cloth pieces into desired knot and tie configurations in a controllable, penetration-free manner, particularly when interacting with surrounding meshes. To address this, we introduce a pipe-like parametric knot template representation, defined by a Bézier curve as its medial axis and an adaptively adjustable radius for enhanced flexibility and variation. This representation enables customizable knot sizes, shapes, and styles while ensuring intersection-free results through robust collision detection techniques. Using the defined knot template, we present a mapping and penetration-free initialization method to transform selected cloth regions from UV space into the initial 3D knot shape. We further enable quasistatic simulation of knots and their surrounding meshes through a fast and reliable collision handling and simulation scheme. Our experiments demonstrate the system’s effectiveness and efficiency in modeling a wide range of digital knots and ties with diverse styles and shapes, including configurations that were previously impractical to create manually.
PaperID: 161  
Authors: András Simon, Danwu Chen, Philipp Urban, Vincent Duveiller, Henning Lübbe
Affiliations: Fraunhofer IGD, Germany and Technical University of Darmstadt, Germany and Norwegian University of Science and Technology NTNU, Norway ; VITA Zahnfabrik H. Rauter GmbH & Co. KG, Bad Säckingen, Germany ; VITA Zahnfabrik H. Rauter GmbH & Co. KG
Title: Color Matching and Biomimicry for Multi-Material Dental 3D Printing
Abstract:
The growing global demand for removable partial and full dentures, driven by an aging population and the high prevalence of edentulism, emphasizes the importance of advancing manufacturing solutions. Multi-material jetting, with newly regulatory-approved dental resins, facilitates the production of monolithic, full-color dentures, reducing manual labor and enabling advanced aesthetic customization.
PaperID: 162  
Authors: Jungnam Park, Euikyun Jung, Jehee Lee, Jungdam Won
Affiliations: Seoul National University, Republic of Korea
Title: MAGNET: Muscle Activation Generation Networks for Diverse Human Movement
Abstract:
We introduce MAGNET (Muscle Activation Generation Networks), a scalable framework for reconstructing full-body muscle activations across diverse human movements. Our approach employs musculoskeletal simulation with a novel two-level controller architecture trained using three-stage learning methods. Additionally, we develop distilled models tailored for solving downstream tasks or generating real-time muscle activations, even on edge devices. The efficacy of our framework is demonstrated through examples of daily life and challenging behaviors, as well as comprehensive evaluations.
PaperID: 163  
Authors: Laurent Belcour, Alban Fichet, Pascal Barla
Affiliations: Intel Labs, France ; Inria - LaBRI
Title: A Fluorescent Material Model for Non-Spectral Editing & Rendering
Abstract:
Fluorescent materials are characterized by a spectral reradiation toward longer wavelengths. Recent work [Fichet et al. 2024] has shown that the rendering of fluorescence in a non-spectral engine is possible through the use of appropriate reduced reradiation matrices. But the approach has limited expressivity, as it requires the storage of one reduced matrix per fluorescent material, and only works with measured fluorescent assets.
PaperID: 164  
Authors: Bowen Zheng, Ke Chen, Yuxin Yao, Zijiao Zeng, Xinwei Jiang, He Wang, Joan Lasenby, Xiaogang Jin
Affiliations: State Key Lab of CAD&CG, Zhejiang University, China ; Department of Engineering, University of Cambridge, United Kingdom ; Department of Efficiency Product, Tencent Games, China ; Department of Efficiency Product, China ; UCL Centre for Artificial Intelligence, Department of Computer Science, University College London, United Kingdom ; Department of Engineering, China and ZJU-Tencent Game and Intelligent Graphics Innovation Technology Joint Lab
Title: AutoKeyframe: Autoregressive Keyframe Generation for Human Motion Synthesis and Editing
Abstract:
Keyframing has long been the cornerstone of standard character animation pipelines, offering precise control over detailed postures and dynamics. However, this approach is labor-intensive, necessitating significant manual effort. Automating this process while balancing the trade-off between minimizing manual input and maintaining full motion control has therefore been a central research challenge. In this work, we introduce AutoKeyframe, a novel framework that simultaneously accepts dense and sparse control signals for motion generation by generating keyframes directly. Dense signals govern the overall motion trajectory, while sparse signals define critical key postures at specific timings. This approach substantially reduces manual input requirements while preserving precise control over motion. The generated keyframes can be easily edited to serve as detailed control signals. AutoKeyframe operates by automatically generating keyframes from dense root positions, which can be determined through arc-length parameterization of the trajectory curve. This process is powered by an autoregressive diffusion model, which facilitates keyframe generation and incorporates a skeleton-based gradient guidance technique for sparse spatial constraints and frame editing. Extensive experiments demonstrate the efficacy of AutoKeyframe, achieving high-quality motion synthesis with precise and intuitive control.
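The arc-length parameterization mentioned for the dense root positions can be sketched as follows: a trajectory given as a polyline is resampled so that consecutive samples are equally spaced along the curve. The function name, the polyline input, and the uniform spacing are assumptions made for illustration; the abstract does not specify how AutoKeyframe represents or samples the trajectory.

```python
import math


def resample_by_arc_length(points, n):
    """Return n points spaced uniformly in arc length along a polyline.

    Requires at least two input points and n >= 2.
    """
    # Cumulative arc length at each polyline vertex.
    cum = [0.0]
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]

    samples = []
    seg = 0
    for i in range(n):
        s = total * i / (n - 1)              # target arc length
        # Advance to the segment that contains arc length s.
        while seg < len(points) - 2 and cum[seg + 1] < s:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0.0 else (s - cum[seg]) / seg_len
        p, q = points[seg], points[seg + 1]
        samples.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return samples


trajectory = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
roots = resample_by_arc_length(trajectory, 5)
```

Sampling by arc length rather than by the curve's native parameter keeps the spacing of root positions independent of how the curve was drawn.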
PaperID: 165  
Authors: Maximilian Kohlbrenner, Marc Alexa
Affiliations: Technical University of Berlin
Title: A Polyhedral Construction of Empty Spheres in Discrete Distance Fields
Abstract:
Lie sphere geometry provides a unified representation of points, oriented spheres and hyperplanes in Euclidean d-space as the subset of lines in R^{d+3} that are contained in a certain quadric. The natural scalar product in this construction is zero if two elements are in oriented contact. We show how the sign of this product can be used to decide if spheres are disjoint. This allows us to model the space of spheres that are not intersecting a given union of spheres as the intersection of half-spaces (and the quadric). The maximal spheres are on the boundary of this set and can be computed by first constructing the intersection of half-spaces, which is a convex hull problem, and then intersecting edges of the hull against the quadric, which are the roots of a univariate quadratic. We demonstrate the method on the example of contouring a discrete signed distance field: every sample of the signed distance field represents an empty sphere and the zero-level contour has to be disjoint from the union of these spheres. Maximal spheres outside the empty spheres provide samples on the zero-level contour. The quality of this sample set is comparable to existing methods relying on optimization, while being deterministic and faster in practice.
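The sign test can be illustrated in Euclidean terms. For two oriented spheres with centers c1, c2 and signed radii r1, r2, the quantity |c1 - c2|^2 - (r1 - r2)^2 vanishes exactly at oriented contact and, up to a sign and scale convention that is assumed here, plays the role of the Lie product. Flipping the orientation of one sphere turns it into a test for whether two balls are disjoint. The function names are hypothetical and the sketch does not use the paper's homogeneous coordinates.

```python
def oriented_contact_quantity(c1, r1, c2, r2):
    """|c1 - c2|^2 - (r1 - r2)^2 for signed radii; zero at oriented contact."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(c1, c2))
    return dist_sq - (r1 - r2) ** 2


def balls_disjoint(c1, r1, c2, r2):
    """True if two balls with positive radii do not touch or overlap.

    Reversing the orientation of the second sphere (r2 -> -r2) gives
    |c1 - c2|^2 - (r1 + r2)^2, which is positive only when the balls
    are strictly separated.
    """
    return oriented_contact_quantity(c1, r1, c2, -r2) > 0.0


separated = balls_disjoint((0.0, 0.0, 0.0), 1.0, (3.0, 0.0, 0.0), 1.0)
overlapping = balls_disjoint((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
tangent = balls_disjoint((0.0, 0.0, 0.0), 1.0, (2.0, 0.0, 0.0), 1.0)
```

Because the test reduces to a sign check, the set of spheres disjoint from a given sphere becomes one side of a hyperplane once spheres are written in suitable homogeneous coordinates, which is what makes the half-space construction described above possible.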
PaperID: 166  
Authors: Shaohua Mo, Chuankun Zheng, Zihao Lin, Dianbing Xi, Qi Ye, Rui Wang, Hujun Bao, Yuchi Huo
Affiliations: State Key Laboratory of CAD&CG, Zhejiang University, China and Zhejiang Lab
Title: Dual-Band Feature Fusion for Neural Global Illumination with Multi-Frequency Reflections
Abstract:
In this paper, we present a novel neural global illumination approach that enables multi-frequency reflections in dynamic scenes. Our method utilizes object-centric, spatial feature grids as the core framework to model rendering effects implicitly. A lightweight scene query, based on single-bounce ray tracing, is then performed on these feature grids to extract principal and secondary features separately. The principal features can capture a wide range of relatively low-frequency global illumination effects, such as diffuse indirect lighting and reflections on rough surfaces. In contrast, the secondary features can provide sparse scene-specific reflection details, typically with much higher frequencies than the final observed radiance. Inspired by the physical processes of light propagation, we introduce a novel dual-band feature fusion module that seamlessly blends these two types of features, generating fused features capable of modeling multi-frequency rendering effects. Additionally, we propose a two-stage training strategy tailored to accommodate the distinct characteristics of each feature type, significantly enhancing the overall quality and reducing artifacts in the rendered results. Experimental results demonstrate that our method delivers high-quality, multi-frequency dynamic reflections, outperforming state-of-the-art baselines, including path tracing with screen-space neural denoising and other neural global illumination methods.
PaperID: 167  
Authors: Pengbin Tang, Bernhard Thomaszewski, Stelian Coros, Bernd Bickel
Affiliations: ETH Zürich
Title: Inverse Design of Discrete Interlocking Materials with Desired Mechanical Behavior
Abstract:
We present a computational approach for designing Discrete Interlocking Materials (DIMs) with desired mechanical properties. Unlike conventional elastic materials, DIMs are kinematic materials governed by internal contacts among elements. These contacts induce anisotropic deformation limits that depend on the shape and topology of the elements. To enable gradient-based design optimization of DIMs with desired deformation limits, we introduce an implicit representation of interlocking elements based on unions of tori. Using this low-dimensional representation, we simulate DIMs with smoothly evolving contacts, allowing us to predict changes in deformation limits as a function of shape parameters. With this toolset in hand, we optimize for element shape parameters to design heterogeneous DIMs that best approximate prescribed limits. We demonstrate the effectiveness of our method by designing discrete interlocking materials with diverse limit profiles for in- and out-of-plane deformation and validate our method on fabricated physical prototypes.
PaperID: 168  
Authors: Qingqin Liu, Ziqi Fang, Jiayi Wu, Shaoyu Cai, Jianhui Yan, Tiande Mo, Shuk Ching Chan, Kening Zhu
Affiliations: School of Creative Media, Hong Kong
Title: VirCHEW Reality: On-Face Kinesthetic Feedback for Enhancing Food-Intake Experience in Virtual Reality
Abstract:
While haptic interfaces for virtual reality (VR) have received extensive research attention, on-face haptics in VR remains less explored, especially for virtual food intake. In this paper, we introduce VirCHEW Reality, a face-worn haptic device designed to provide on-face kinesthetic force feedback, to enhance the virtual food-chewing experience in VR. Leveraging a pneumatic actuation system, VirCHEW Reality controls the process of air inflation and deflation, to simulate the mechanical properties of food textures, such as hardness, cohesiveness, and stickiness. We evaluated the system through three user studies. First, a just-noticeable difference (JND) study examined users’ sensitivity to and the system’s capability of rendering different levels of on-face pneumatic-based kinesthetic feedback while users perform a chewing action. Building upon the user-distinguishable signal ranges found in the first study, we further conducted a matching study to explore the correspondence between the kinesthetic stimuli provided by our device and user-perceived food textures, revealing the capability of simulating food texture properties during chewing (e.g., hardness, cohesiveness, stickiness). Finally, a user study in a VR eating scenario showed that VirCHEW Reality could significantly improve the users’ ratings on the sense of presence, compared to the condition without haptic feedback. Our findings further highlighted possible applications in virtual/remote dining, healthcare, and immersive entertainment.
PaperID: 169  
Authors: Chunyi Sun, Junlin Han, Runjia Li, Weijian Deng, Dylan Campbell, Stephen Gould
Affiliations: Australian National University, Australia ; University of Oxford, United Kingdom ; University of Oxford
Title: Unsupervised Decomposition of 3D Shapes into Expressive and Editable Extruded Profile Primitives
Abstract:
Transforming 3D shapes into representations that support part-level editing, flexible redesign, and efficient compression is vital for asset customization, content creation, and optimization in digital design. Despite its importance, achieving a representation that balances expressivity, editability, compactness, and interpretability remains a challenge. We introduce 3D2EP, a novel method for 3D shape decomposition that represents objects as a collection of differentiable, parametric primitives. Given a 3D shape represented by a voxel grid, 3D2EP decomposes this into a set of primitive parts, each generated by extruding a scaled 2D profile along a 3D curve, with the requisite components being predicted in a feedforward manner. That is, each primitive is constrained to have a single cross-section profile, up to scale. This enables the primitives to adapt to the data, capturing the geometry with precision but without excess degrees-of-freedom that would stymie editability.
PaperID: 170  
Authors: Kenneth Chen, Nathan Matsuda, Jon McElvain, Yang Zhao, Thomas Wan, Qi Sun, Alexandre Chapiro
Affiliations: New York University, USA and Reality Labs Research, USA ; Reality Labs Research
Title: What is HDR? Perceptual Impact of Luminance and Contrast in Immersive Displays
Abstract:
The contrast and luminance capabilities of a display are central to the quality of the image. High dynamic range (HDR) displays have high luminance and contrast, but it can be difficult to ascertain whether a given set of characteristics qualifies for this label. This is especially unclear for new display modes, such as virtual reality (VR). This paper studies the perceptual impact of peak luminance and contrast of a display, including characteristics and use cases representative of VR. To achieve this goal, we first developed a haploscope testbed prototype display capable of achieving 1,000 nits peak luminance and 1,000,000:1 contrast with high precision. We then collected a novel HDR video dataset targeting VR-relevant content types. We also implemented custom tone mapping operators to map between display parameter sets. Finally, we collected subjective preference data spanning 3 orders of magnitude in each dimension. Our data was used to fit a model, which was validated using a subjective study on an HDR VR prototype head-mounted display (HMD). Our model helps provide guidance for future display design, and helps standardize the understanding of HDR.
PaperID: 171  
Authors: Jinseok Bae, Younghwan Lee, Donggeun Lim, Young Min Kim
Affiliations: Seoul National University, Republic of Korea
Title: PLT: Part-Wise Latent Tokens as Adaptable Motion Priors for Physically Simulated Characters
Abstract:
Physically simulated characters can learn highly natural full-body motion guided by motion capture datasets. However, the range of motion is limited to the existing high-quality datasets, and cannot effectively adapt to challenging scenarios. We propose a novel policy architecture that learns part-wise motion skills, where individual parts can be separately extended and combined for unobserved settings. Our method employs a set of part-specific codebooks, which robustly capture motion dynamics without catastrophic collapse or forgetting. This structured decomposition allows intuitive control over the character’s behavior and dynamic exploration of novel combinations of part-wise motion. We further incorporate a refinement network that compensates for subtle discrepancies in the disjoint discrete tokens, thus improving motion quality and stability. Our extensive evaluations show that our part-wise latent tokens achieve superior performance in imitating motions, even those from unseen distributions. We also validate our method in challenging tasks, including body tracking, navigation on complex terrains, and point-goal navigation with damaged body parts. Finally, we introduce a part-wise expansion of motion priors, where the physically simulated character incrementally adapts partial motion and produces unique combinations of whole-body motion, significantly diversifying motions.
PaperID: 172  
Authors: Xiao-Lei Li, Hao-Xiang Chen, Yanni Zhang, Kai Ma, Alan Zhao, Tai-Jiang Mu, Hao-Xiang Guo, Ran Zhang
Affiliations: Tsinghua University, China and Tencent Video AI Center, China ; Tencent Video AI Center, China ; Tencent PCG, China ; Skywork AI, Kunlun Inc., New York
Title: RELATE3D: REfocusing Latent Adapter for Targeted local Enhancement and Editing in 3D Generation
Abstract:
Recent advancements in 3D generation techniques have simplified the tedious manual process of 3D asset production. Among these methods, 3D native latent diffusion models are particularly effective in generating high-quality geometric details. However, achieving local enhancement and editing of the generated 3D models remains a challenge due to the limited understanding of the relationship between text, images, and 3D in terms of local semantics and feature space. We explore and reveal the characteristics of the native 3D latent space and make it decomposable and low-rank, thereby enabling efficient and effective learning for multimodal local alignment. Based on this, we introduce RELATE3D, a novel approach that combines a Refocusing Adapter with part-to-latent correspondence-guided training for precise local enhancement and part-level editing of 3D geometry. The Refocusing Adapter incorporates partial image and caption signals and, combined with part-to-latent mapping, directs modifications to the relevant latent dimensions during the latent diffusion process. We validate the effectiveness of our approach through extensive experiments and ablation studies, showcasing the capabilities of our generative local enhancement and editing process, as well as global refinement.
PaperID: 173  
Authors: Zilin Xu, Xiang Chen, Chen Liu, Beibei Wang, Lu Wang, Zahra Montazeri, Ling-Qi Yan
Affiliations: Santa Barbara, USA ; Shandong University, China ; Zhejiang Lingdi Digital Technology Co., China ; Nanjing University, China ; Shandong University, China ; University of Manchester
Title: Towards Comprehensive Neural Materials: Dynamic Structure-Preserving Synthesis with Accurate Silhouette at Instant Inference Speed
Abstract:
Photorealistic rendering aims to accurately replicate real-world appearances. Traditional methods, like microfacet-based models, often struggle with complex visuals. Consequently, neural material techniques have emerged, typically offering improved performance over traditional approaches. However, these neural material approaches address only one or a few essential aspects of the complete appearance (quality, parallax & silhouette, synthesis, performance) while neglecting the others. Although these aspects may seem separate, they are inherently intertwined parts of the complete appearance and cannot be isolated. In this paper, we take on the challenge of a comprehensive neural material representation by thoroughly considering the essential aspects of the complete appearance. We introduce an int8-quantized neural network that maintains high fidelity (quality) while achieving an order-of-magnitude speedup (performance) compared to previous methods. We also present a controllable structure-preserving synthesis strategy (synthesis), along with accurate displacement effects (parallax & silhouette) through a dynamic two-step displacement tracing technique.
PaperID: 174  
Authors: Rachel McDonnell, Bharat Vyas, Uros Sikimic, Pisut Wisessing
Affiliations: Trinity College Dublin, Ireland ; Epic Games, Serbia ; CMKL University
Title: Feeling Blue or Seeing Red? Investigating the effect of light color, shadow and realism on the perception of emotion of real and virtual humans
Abstract:
Cinematic lighting is a powerful tool used in film-making to create a mood or atmosphere and to influence the audience’s perception and emotional response to a scene. For example, red can be used to increase feelings of anxiety or excitement, while blue might have a more calming effect. These responses can be harnessed to enhance the storytelling. Previous studies in psychology have shown that light color has a direct impact on the perception of emotions and feelings. However, there is a lack of controlled empirical studies examining whether lighting alone can alter the interpretation of emotion. Realistic virtual humans are an underused tool for studying these effects in a controlled manner, as they retain the same emotional expression across lighting conditions and can display the same emotion across different genders and races. In this paper, we focus on studying the effect of light temperature, color, and shadow on the interpretation of emotions of realistic virtual humans, and compare against a human photo baseline. We are particularly interested in the recognition of emotion, emotion intensity, and the genuineness of the emotion. Our findings can be used by developers to increase the emotional intensity and genuineness of their virtual humans.