SIGGRAPH Asia 2025

Abstract:
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object’s geometry, texture, and articulation parameters—without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Code for this paper is at https://github.com/CzzzzH/FreeArt3D.
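
For reference, the standard Score Distillation Sampling gradient that the abstract builds on is commonly written as below. This is the generic form only; the abstract does not give FreeArt3D's exact 3D-to-4D objective, so the notation is an assumption made here for illustration: \theta denotes the optimized parameters, x = g(\theta) the differentiable output, \epsilon_\phi the pre-trained denoiser, y the conditioning, and w(t) a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad
\epsilon \sim \mathcal{N}(0, I)
```

Under this reading, treating articulation as an additional generative dimension would amount to including the articulation parameters in \theta alongside geometry and texture.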

Abstract:
We present CHARM, a novel parametric representation and generative framework for anime hairstyle modeling. While traditional hair modeling methods focus on realistic hair using strand-based or volumetric representations, anime hairstyles exhibit highly stylized, piecewise-structured geometry that challenges existing techniques. Existing works often rely on dense mesh modeling or hand-crafted spline curves, making them inefficient for editing and unsuitable for scalable learning. CHARM introduces a compact, invertible control-point-based parameterization, where a sequence of control points represents each hair card, and each point is encoded with only five geometric parameters. This efficient and accurate representation supports both artist-friendly design and learning-based generation. Built upon this representation, CHARM introduces an autoregressive generative framework that effectively generates anime hairstyles from input images or point clouds. By interpreting anime hairstyles as a sequential “hair language”, our autoregressive transformer captures both local geometry and global hairstyle topology, resulting in high-fidelity anime hairstyle creation. To facilitate both training and evaluation of anime hairstyle generation, we construct AnimeHair, a large-scale dataset of 37K high-quality anime hairstyles with separated hair cards and processed mesh data. Extensive experiments demonstrate state-of-the-art performance of CHARM in both reconstruction accuracy and generation quality, offering an expressive and scalable solution for anime hairstyle modeling. Project page: https://hyzcluster.github.io/charm

Abstract:
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360-degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.

Abstract:
Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM—a recent novel view synthesis (NVS) approach for static 3D objects—into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Code for this paper is at https://github.com/sylviayuan-sy/LARM.

Abstract:
3D Gaussian Splatting (3DGS) has shown impressive results in real-time novel view synthesis. However, it often struggles under sparse-view settings, producing undesirable artifacts such as floaters, inaccurate geometry, and overfitting due to limited observations. We find that a key contributing factor is uncontrolled densification, where adding Gaussian primitives rapidly without guidance can harm geometry and cause artifacts. We propose AD-GS, a novel alternating densification framework that interleaves high and low densification phases. During high densification, the model densifies aggressively, followed by photometric-loss-based training to capture fine-grained scene details. Low densification then primarily involves aggressive opacity pruning of Gaussians, followed by regularizing their geometry through pseudo-view consistency and edge-aware depth smoothness. This alternating approach helps reduce overfitting by carefully controlling model capacity growth while progressively refining the scene representation. Extensive experiments on challenging datasets demonstrate that AD-GS significantly improves rendering quality and geometric consistency compared to existing methods. The source code for our model can be found on our project page: https://gurutvapatle.github.io/publications/2025/ADGS.html.
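
The interleaving of high and low densification phases described above can be pictured as a periodic schedule over training iterations. The following is a minimal sketch only: the function names and the phase lengths `high_len` and `low_len` are hypothetical, and the abstract does not state how AD-GS actually sets phase boundaries.

```python
def densification_phase(iteration: int, high_len: int = 2000, low_len: int = 1000) -> str:
    """Return the phase ("high" or "low") an iteration falls in.

    Hypothetical periodic schedule: each cycle is `high_len` iterations of
    aggressive densification followed by `low_len` iterations of pruning
    and geometry regularization.
    """
    cycle = high_len + low_len
    return "high" if iteration % cycle < high_len else "low"


def plan(num_iters: int, high_len: int = 2000, low_len: int = 1000):
    """Collapse per-iteration phases into (phase, start, end) segments, end exclusive."""
    if num_iters <= 0:
        return []
    segments = []
    start = 0
    current = densification_phase(0, high_len, low_len)
    for it in range(1, num_iters):
        phase = densification_phase(it, high_len, low_len)
        if phase != current:
            segments.append((current, start, it))
            start, current = it, phase
    segments.append((current, start, num_iters))
    return segments
```

A training loop would query `densification_phase(it)` each step and choose between densifying with photometric training ("high") and opacity pruning with geometric regularization ("low").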

Abstract:
Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer’s well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos: https://toobaimt.github.io/lvt/.

Abstract:
Volumetric video is emerging as a key medium for digitizing the dynamic physical world, creating virtual environments with six degrees of freedom to deliver immersive user experiences. However, robustly modeling general dynamic scenes, especially those involving topological changes, while maintaining long-term tracking remains a fundamental challenge. In this paper, we present TaoGS, a novel topology-aware dynamic Gaussian representation that disentangles motion and appearance to support both long-range tracking and topological adaptation. We represent scene motion with a sparse set of motion Gaussians, which are continuously updated by a spatio-temporal tracker and photometric cues that detect structural variations across frames. To capture fine-grained texture, each motion Gaussian anchors and dynamically activates a set of local appearance Gaussians, which are non-rigidly warped to the current frame to provide strong initialization and significantly reduce training time. This activation mechanism enables efficient modeling of detailed textures and maintains temporal coherence, allowing high-fidelity rendering even under challenging scenarios such as changing clothes. To enable seamless integration into codec-based volumetric formats, we introduce a global Gaussian Lookup Table that records the lifespan of each Gaussian and organizes attributes into a lifespan-aware 2D layout. This structure aligns naturally with standard video codecs and supports up to 40× compression. TaoGS provides a unified, adaptive solution for scalable volumetric video under topological variation, capturing moments of both “Elegance in Motion” and “Power in Stillness” and delivering immersive experiences that harmonize with the physical world. Project page: https://guochch.github.io/TaoGS/.

Abstract:
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Our project page is available at https://snap-research.github.io/composeme/.

Abstract:
We propose FairyGen, an automatic system for generating story-driven videos from a single child’s drawing, while faithfully preserving its unique artistic style, maintaining consistent identity across shots, and generating natural anthropomorphic motion. Unlike previous works focusing solely on subject or motion, we treat the entire storytelling process as layered across character modeling, environment generation, and shot design. Given a single hand-drawn image, we first employ a Multimodal Large Language Model (MLLM) to generate a structured storyboard. Subsequently, for style-consistent background generation, we introduce a style propagation adapter that captures the character’s visual style and propagates it to the background, via a pre-trained background inpainting diffusion model. Furthermore, to animate the generated scenes, we reconstruct a 3D character proxy to derive plausible motion sequences. These sequences are then used to fine-tune an MMDiT-based image-to-video diffusion model, which learns complex motion through a motion customization adapter with a timestep-shift strategy. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful, narratively structured, and rich in smooth, natural motion, highlighting its potential for personalized and engaging story animation.

Abstract:
We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Code and data are available at https://github.com/Namisntimpot/NSL

Abstract:
Real-time 3D reconstruction is a fundamental task in computer graphics. Recently, differentiable-rendering-based SLAM systems have demonstrated significant potential, enabling photorealistic scene rendering through learnable scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Current differentiable rendering methods face dual challenges in real-time computation and sensor noise sensitivity, leading to degraded geometric fidelity in scene reconstruction and limited practicality. To address these challenges, we propose EGG-Fusion, a novel real-time system featuring robust sparse-to-dense camera tracking and a geometry-aware Gaussian surfel mapping module, together with an information-filter-based fusion method that explicitly accounts for sensor noise to achieve high-precision surface reconstruction. The proposed differentiable Gaussian surfel mapping effectively models multi-view consistent surfaces while enabling efficient parameter optimization. Extensive experimental results demonstrate that the proposed system achieves a surface reconstruction error of 0.6 cm on standardized benchmark datasets including Replica and ScanNet++, representing an improvement of over 20% in accuracy compared to state-of-the-art (SOTA) GS-based methods. Notably, the system maintains real-time processing capabilities at 24 FPS, establishing it as one of the most accurate differentiable-rendering-based real-time reconstruction systems. Project Page: https://zju3dv.github.io/eggfusion/.
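
To illustrate what an information-filter-style fusion that accounts for sensor noise can look like, here is a minimal scalar sketch: each noisy measurement contributes its inverse variance (information), so less noisy measurements weigh more. This is the textbook information form of Gaussian fusion for a single value, not EGG-Fusion's surfel update; the class `InfoState` and its fields are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class InfoState:
    """Scalar Gaussian estimate stored in information form (hypothetical helper)."""
    info: float = 0.0      # information Lambda = 1 / variance
    info_vec: float = 0.0  # information vector eta = Lambda * mean

    def fuse(self, z: float, sigma: float) -> None:
        """Fuse one measurement z with noise standard deviation sigma (> 0)."""
        w = 1.0 / (sigma * sigma)
        self.info += w           # information adds across independent measurements
        self.info_vec += w * z

    @property
    def mean(self) -> float:
        # Undefined (division by zero) until at least one measurement is fused.
        return self.info_vec / self.info

    @property
    def variance(self) -> float:
        return 1.0 / self.info
```

For example, fusing a depth of 1.0 with sigma 1.0 and a depth of 3.0 with sigma 0.5 gives weights 1 and 4, so the fused estimate lies much closer to the second, more reliable measurement.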

Abstract:
This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, freeing it from the need for spatial-temporal consistency across 2D cross-modal cues and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/.

Abstract:
We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation.
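
The triangulation step mentioned above, lifting a joint's 2D positions in several views to one 3D position, can be sketched with standard linear (inhomogeneous DLT) triangulation. The abstract does not say which triangulation variant AnimaX uses, so this is an assumption; the function names are hypothetical, and the code is written in plain Python to stay dependency-free.

```python
def solve3(M, b):
    """Solve a 3x3 linear system M x = b by Gaussian elimination with partial pivoting."""
    A = [list(M[i]) + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / A[i][i]
    return x


def triangulate(projections, points_2d):
    """Least-squares 3D point from two or more views.

    projections: 3x4 camera projection matrices (nested lists).
    points_2d:   one (u, v) observation per view, in the same order.
    """
    rows, rhs = [], []
    for P, (u, v) in zip(projections, points_2d):
        for coord, k in ((u, 0), (v, 1)):
            # From coord = (P[k] . X_h) / (P[2] . X_h), rearranged to be linear in X.
            rows.append([coord * P[2][j] - P[k][j] for j in range(3)])
            rhs.append(P[k][3] - coord * P[2][3])
    # Normal equations (A^T A) X = A^T b for the stacked linear constraints.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)
```

Applied per joint and per frame, this yields the 3D joint trajectories that an inverse-kinematics solver could then fit to the skeleton.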

Abstract:
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method. Codes are released at https://github.com/alibaba-damo-academy/Uni3C.
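
The unprojected point clouds from monocular depth that PCDController consumes can be illustrated with the standard pinhole back-projection. This is a minimal sketch under assumed conventions (depth measured along the camera z-axis, intrinsics `fx`, `fy`, `cx`, `cy` known); the function name is hypothetical and the code is not taken from the Uni3C repository.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map to camera-space 3D points with a pinhole model.

    depth: 2D list of per-pixel depths along the camera z-axis, indexed [v][u].
    Returns a list of (x, y, z) tuples; pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # treat non-positive depth as invalid
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Rendering such a point cloud from a new camera pose gives a geometry-consistent guide image, which is one plausible way a 3D prior can steer camera control.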

Abstract:
In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions. Our project: https://superhero-7.github.io/DreamID/.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a leading approach for high-quality novel view synthesis, with numerous variants extending its applicability to a broad spectrum of 3D and 4D scene reconstruction tasks. Despite its success, the representational capacity of 3DGS remains limited by the use of 3D Gaussian kernels to model local variations. Recent works have proposed to augment 3DGS with additional per-primitive capacity, such as per-splat textures, to enhance its expressiveness. However, these per-splat texture approaches primarily target dense novel view synthesis with a reduced number of Gaussian primitives, and their effectiveness tends to diminish when applied to more general reconstruction scenarios. In this paper, we aim to achieve concrete performance improvement over state-of-the-art 3DGS variants across a wide range of reconstruction tasks, including novel view synthesis, geometry and dynamic reconstruction, under both sparse and dense input settings. To this end, we introduce Neural Texture Splatting (NTS). At the core of our approach is a global neural field (represented as a hybrid of a tri-plane and a neural decoder) that predicts local appearance and geometric fields for each primitive. By leveraging this shared global representation that models local texture fields across primitives, we significantly reduce model size and facilitate efficient global information exchange, demonstrating strong generalization across tasks. Furthermore, our neural modeling of local texture fields introduces expressive view- and time-dependent effects, a critical aspect that existing methods fail to account for. Extensive experiments show that Neural Texture Splatting consistently improves models and achieves state-of-the-art results across multiple benchmarks. Codes are at https://github.com/19reborn/neural-texture-splatting.

Abstract:
Point clouds are widely used representations of 3D data, but determining the visibility of points from a given viewpoint remains a challenging problem due to their sparse nature and lack of explicit connectivity. Traditional methods, such as Hidden Point Removal (HPR), face limitations in computational efficiency, robustness to noise, and handling concave regions or low-density point clouds. In this paper, we propose a novel approach to visibility determination in point clouds by formulating it as a binary classification task. The core of our network consists of a 3D U-Net that extracts view-independent point-wise features and a shared multi-layer perceptron (MLP) that predicts point visibility using the extracted features and view direction as inputs. The network is trained end-to-end with ground-truth visibility labels generated from rendered 3D models. Our method significantly outperforms HPR in both accuracy and computational efficiency, achieving a speedup of up to 126× on large point clouds. Additionally, our network demonstrates robustness to noise and varying point cloud densities and generalizes well to unseen shapes. We validate the effectiveness of our approach through extensive experiments on ShapeNet, the ABC Dataset, and real-world datasets, showing substantial improvements in visibility accuracy. We also demonstrate the versatility of our method in various applications, including point cloud visualization, surface reconstruction, normal estimation, shadow rendering, and viewpoint optimization. Our code and models are available at https://github.com/octree-nn/neural-visibility.

Abstract:
We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture the identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject’s global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.

Abstract:
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a motion-aware optimization strategy that reweights training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization.
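
The idea behind Adaptive Motion Learning, reweighting the training loss with optical-flow-derived motion heatmaps, can be sketched as a per-pixel weighted error in which high-motion pixels count more. The weighting formula, the parameter `alpha`, and both function names below are assumptions for illustration; the abstract does not specify the actual form used by Proteus-ID.

```python
def motion_weights(flow_mag, alpha=1.0, eps=1e-8):
    """Map per-pixel optical-flow magnitudes to loss weights in [1, 1 + alpha].

    flow_mag: 2D list of non-negative flow magnitudes (the motion heatmap).
    Static pixels get weight 1; the fastest-moving pixel gets about 1 + alpha.
    """
    peak = max(max(row) for row in flow_mag)
    return [[1.0 + alpha * (m / (peak + eps)) for m in row] for row in flow_mag]


def weighted_mse(pred, target, weights):
    """Weighted mean squared error, normalized by the sum of the weights."""
    num = den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den
```

With this weighting, the same prediction error costs more in a moving region than in a static one, which pushes the model toward getting motion right without any extra input at inference time.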

Abstract:
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion. Code for this paper will be available soon at https://github.com/DynVFX/dynvfx.

Abstract:
UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of PartField, a recent learning-based part decomposition method, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets (man-made, CAD, AI-generated, and Common Shapes), PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Code for this paper is at https://github.com/EricWang12/PartUV.

Abstract:
Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps—an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization. Our project page is https://yuanze1024.github.io/SeqTex/.

Abstract:
Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler’s success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity. Our code and dataset will be available at https://layerpeeler.github.io/.

Abstract:
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method’s identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
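
As a rough illustration of the KV-mixing idea described above, the following sketch computes single-query attention in which the keys and the values come from two different feature sets. The function names, the toy feature vectors, and the plain scaled dot-product form are assumptions made for illustration; the abstract does not specify the encoders or the exact attention formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_attention(query, layout_keys, appearance_values):
    """Attend with keys from one feature set and values from another.

    layout_keys[i] and appearance_values[i] describe the same token i,
    but are assumed to come from two different (hypothetical) encoders.
    """
    dim = len(query)
    # Attention weights depend only on the layout-oriented keys.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in layout_keys
    ]
    weights = softmax(scores)
    # The output is a weighted sum of the appearance-oriented values.
    out_dim = len(appearance_values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, appearance_values))
        for j in range(out_dim)
    ]
    return weights, output


# Toy example: two tokens with identical layout keys but distinct appearance.
weights, mixed = kv_mixed_attention(
    query=[1.0, 0.0],
    layout_keys=[[1.0, 0.0], [1.0, 0.0]],
    appearance_values=[[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]],
)
```

Because the attention weights are computed from the keys alone, where a token attends is decided by the layout features, while what is copied into the output is decided by the appearance features.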

Abstract:
We focus on the problem of using generative diffusion models for the task of motion detailing: converting a rough version of a character animation, represented by a sparse set of coarsely posed and imprecisely timed blocking poses, into a detailed, natural-looking character animation. Current diffusion models can address the problem of correcting the timing of imprecisely timed poses, but we find that no good solution exists for leveraging the diffusion prior to enhance a sparse set of blocking poses with additional pose detail. We overcome this challenge using a simple inference-time trick. At certain diffusion steps, we blend the outputs of an unconditioned diffusion model with input blocking pose constraints using per-blocking-pose tolerance weights, and pass this result in as the input condition to a pre-existing motion retiming model. We find this approach works significantly better than existing approaches that attempt to add detail by blending model outputs or via expressing blocking pose constraints as guidance. The result is the first diffusion model that can robustly convert blocking-level poses into plausible detailed character animations. The project page for this work can be found at https://purvigoel.github.io/generative-motion-detailing/.

Abstract:
Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2×2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity-preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.

Abstract:
Multi-objective optimization problems, which require the simultaneous optimization of multiple objectives, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually-tuned aggregation functions to formulate a joint optimization objective. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking methods for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual tuning, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective reinforcement-learning tasks, including motion tracking. Our proposed Adversarial Differential Discriminator (ADD) receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually-designed reward functions.

Abstract:
Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new Mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at https://zquang2202.github.io/TCM.

Abstract:
Recently, extensive research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy to ensure smooth model convergence and correct the generation quality of the final output. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions. Project page: https://mc-e.github.io/project/DreamO

Abstract:
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.

Abstract:
Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
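
For readers unfamiliar with the mechanism, the sketch below shows the standard Classifier-Free Guidance combination of unconditional and conditional noise predictions, together with a guidance scale that varies over sampling steps. The linear schedule is only a hypothetical stand-in to show where a time-varying scale enters; the learned scheduling policy proposed in the abstract is not reproduced here, and all function names are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the
    conditional one, extrapolating when scale > 1."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_schedule(step, num_steps, w_start, w_end):
    """Hypothetical schedule: linearly interpolate the guidance scale
    from w_start (first step) to w_end (last step)."""
    if num_steps < 2:
        return w_start
    t = step / (num_steps - 1)
    return w_start + t * (w_end - w_start)


# Toy noise predictions for a two-element "image".
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

# Guidance scale and guided prediction at each of five sampling steps.
guided_per_step = [
    cfg_combine(eps_uncond, eps_cond, linear_schedule(i, 5, 7.5, 1.0))
    for i in range(5)
]
```

With a scale of 1 the result equals the conditional prediction, and with a scale of 0 it equals the unconditional one; a scheduler only changes which scale is used at each step, so it adds no extra network evaluations beyond those CFG already needs.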

Abstract:
Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker’s face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

Abstract:
Camera control is crucial for generating expressive and cinematic videos. Existing methods rely on explicit sequences of camera parameters as control conditions, which can be cumbersome for users to construct, particularly for intricate camera movements. To provide a more intuitive camera control method, we propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both Image-to-Video and Video-to-Video tasks within a unified framework. Furthermore, we present the Camera Clone Dataset, a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. Extensive experiments and user studies demonstrate that CamCloneMaster outperforms existing methods in terms of both camera controllability and visual quality. The dataset and code can be found at https://camclonemaster.github.io/.

Abstract:
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15 ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000× faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.

Abstract:
We introduce a 3D detailizer, a neural model which can instantaneously (in <1s) transform a coarse 3D shape proxy into a high-quality asset with detailed geometry and texture as guided by an input text prompt. Our model is trained using the text prompt, which defines the shape class and characterizes the appearance and fine-grained style of the generated details. The coarse 3D proxy, which can be easily varied and adjusted (e.g., via user editing), provides structure control over the final shape. Importantly, our detailizer is not optimized for a single shape; it is the result of distilling a generative model, so that it can be reused, without retraining, to generate any number of shapes, with varied structures, whose local details all share a consistent style and appearance. Our detailizer training utilizes a pretrained multi-view image diffusion model, with text conditioning, to distill the foundational knowledge therein into our detailizer via Score Distillation Sampling (SDS). To improve SDS and enable our detailizer architecture to learn generalizable features over complex structures, we train our model in two training stages to generate shapes with increasing structural complexity. Through extensive experiments, we show that our method generates shapes of superior quality and details compared to existing text-to-3D models under varied structure control. Our detailizer can refine a coarse shape in less than a second, making it possible to interactively author and adjust 3D shapes. Furthermore, the user-imposed structure control can lead to creative, and hence out-of-distribution, 3D asset generations that are beyond the current capabilities of leading text-to-3D generative models. We demonstrate an interactive 3D modeling workflow our method enables, and its strong generalizability over styles, structures, and object categories.

Abstract:
Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and suffering from inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method’s capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
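
To put a PSNR gain of 0.39 dB in context, the short sketch below uses the standard PSNR definition to convert a decibel gain into the equivalent ratio of mean squared errors. The helper names are illustrative, and the calculation relies only on the textbook formula, not on any detail of the paper's evaluation protocol.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse that corresponds to a PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


baseline_mse = 0.01
improved_mse = baseline_mse * mse_ratio_for_gain(0.39)
gain = psnr(improved_mse) - psnr(baseline_mse)  # recovers the 0.39 dB gain
```

Under this definition, a 0.39 dB improvement corresponds to the mean squared error shrinking to roughly 91% of its previous value.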

Abstract:
Tracking and mapping in large-scale, unbounded outdoor environments using only monocular RGB input presents substantial challenges for existing SLAM systems. Traditional Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) SLAM methods are typically limited to small, bounded indoor settings. To overcome these challenges, we introduce GigaSLAM, the first NeRF/3DGS-based SLAM framework for kilometer-scale outdoor environments, demonstrated mainly on four kilometer-scale datasets. Our approach employs a hierarchical sparse voxel map representation, where Gaussians are decoded by neural networks at multiple levels of detail. This design enables efficient, scalable mapping and high-fidelity viewpoint rendering across expansive, unbounded scenes. For front-end tracking, GigaSLAM utilizes a metric depth model combined with epipolar geometry and PnP algorithms to accurately estimate poses, while incorporating a Bag-of-Words-based loop closure mechanism to maintain robust alignment over long trajectories. Consequently, GigaSLAM delivers high-precision tracking and visually faithful rendering on urban outdoor benchmarks, establishing a robust SLAM solution for large-scale, long-term scenarios, and significantly extending the applicability of Gaussian Splatting SLAM systems to unbounded outdoor environments.

Abstract:
Representing and rendering dynamic scenes with complex motions remains challenging in computer vision and graphics. Recent dynamic view synthesis methods achieve high-quality rendering but often produce physically implausible motions. We introduce NeHaD, a neural deformation field for dynamic Gaussian Splatting governed by Hamiltonian mechanics. Our key observation is that existing methods using MLPs to predict deformation fields introduce inevitable biases, resulting in unnatural dynamics. By incorporating physics priors, we achieve robust and realistic dynamic scene rendering. Hamiltonian mechanics provides an ideal framework for modeling Gaussian deformation fields due to their shared phase-space structure, where primitives evolve along energy-conserving trajectories. We employ Hamiltonian neural networks to implicitly learn underlying physical laws governing deformation. Meanwhile, we introduce Boltzmann equilibrium decomposition, an energy-aware mechanism that adaptively separates static and dynamic Gaussians based on their spatial-temporal energy states for flexible rendering. To handle real-world dissipation, we employ second-order symplectic integration and local rigidity regularization as physics-informed constraints for robust dynamics modeling. Additionally, we extend NeHaD to adaptive streaming through scale-aware mipmapping and progressive optimization. Extensive experiments demonstrate that NeHaD achieves physically plausible results with a rendering quality-efficiency trade-off. To our knowledge, this is the first exploration leveraging Hamiltonian mechanics for neural Gaussian deformation, enabling physically realistic dynamic scene rendering with streaming capabilities.
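
As background on the second-order symplectic integration mentioned above, the following sketch applies the leapfrog (Störmer-Verlet) scheme to a unit-mass harmonic oscillator with Hamiltonian H(q, p) = p²/2 + q²/2. The analytic potential gradient is an assumption standing in for whatever a learned Hamiltonian network would provide; this is a generic illustration of the integrator, not NeHaD's implementation.

```python
def potential_gradient(q):
    """dV/dq for the toy potential V(q) = q^2 / 2 (assumed for illustration)."""
    return q


def hamiltonian(q, p):
    """Total energy of the unit-mass harmonic oscillator."""
    return 0.5 * p * p + 0.5 * q * q


def leapfrog_step(q, p, dt):
    """One second-order symplectic step: half kick, full drift, half kick."""
    p_half = p - 0.5 * dt * potential_gradient(q)
    q_next = q + dt * p_half
    p_next = p_half - 0.5 * dt * potential_gradient(q_next)
    return q_next, p_next


def integrate(q, p, dt, num_steps):
    """Roll the state forward and return the final (q, p)."""
    for _ in range(num_steps):
        q, p = leapfrog_step(q, p, dt)
    return q, p


q0, p0 = 1.0, 0.0
q_end, p_end = integrate(q0, p0, dt=0.1, num_steps=1000)
energy_drift = abs(hamiltonian(q_end, p_end) - hamiltonian(q0, p0))
```

The energy error of this scheme stays bounded over long rollouts instead of growing steadily, which is the property that makes symplectic integrators attractive for energy-conserving dynamics.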

Abstract:
We present Animus3D, a text-driven 3D animation framework that generates a motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while another inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released upon acceptance.
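
For reference, the vanilla SDS objective that the abstract contrasts against is usually written as the gradient below. This is the commonly used formulation from the text-to-3D literature, shown here as background; the notation (a renderer g with parameters θ, a noise predictor ε̂ with weights φ, a prompt y, and a weighting w(t)) is assumed for illustration, and the abstract does not give the corresponding expression for MSD.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta),
\quad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad
\epsilon \sim \mathcal{N}(0, I).
```

The residual compares the predicted noise with the pure Gaussian noise ε that was added, which is the "pure noise" reference the abstract refers to when describing how MSD differs.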

Abstract:
Recently, generating 3D assets with the control of condition images has achieved impressive quality. However, existing 3D generation methods are limited to handling a single control objective and lack the ability to utilize multiple images to independently control different regions of a 3D asset, which hinders their flexibility in applications. We propose Fuse3D, a novel method that enables generating 3D assets under the control of multiple images, allowing for the seamless fusion of multi-level regional controls from global views to intricate local details. First, we introduce a Multi-Condition Fusion Module to integrate the visual features from multiple image regions. Then, we propose a method to automatically align user-selected 2D image regions with their associated 3D regions based on semantic cues. Finally, to resolve control conflicts and enhance local control features from multi-condition images, we introduce a Local Attention Enhancement Strategy that flexibly balances region-specific feature fusion. Overall, we introduce the first method capable of controllable 3D asset generation from multiple condition images. The experimental results indicate that Fuse3D can flexibly fuse multiple 2D image regions into coherent 3D structures, resulting in high-quality 3D assets. Code and data for this paper are at https://jinnmnm.github.io/Fuse3d.github.io/.

Abstract:
We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text describable "parts" in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion, to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.
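
The label-fusion step described above can be pictured with a minimal majority-vote sketch. It assumes each rendered view has already been mapped to a dictionary from UV texel index to predicted part label; that data layout, the function name, and the handling of unseen texels are hypothetical choices for illustration, since the abstract only states that labels are fused in UV-space via voting.

```python
from collections import Counter


def fuse_labels_by_voting(view_labels, num_texels):
    """Fuse per-view texel labels into one label per texel by majority vote.

    view_labels: list of dicts, one per view, mapping texel index -> label
                 for the texels visible in that view (assumed layout).
    Returns a list of length num_texels; texels seen in no view get None.
    """
    votes = [Counter() for _ in range(num_texels)]
    for labels in view_labels:
        for texel, label in labels.items():
            votes[texel][label] += 1
    return [
        counter.most_common(1)[0][0] if counter else None
        for counter in votes
    ]


# Three views voting on three texels; texel 2 is never visible.
views = [
    {0: "handle", 1: "body"},
    {0: "handle", 1: "lid"},
    {0: "body", 1: "lid"},
]
fused = fuse_labels_by_voting(views, num_texels=3)
```

Texels left without a vote would need to be filled by a later refinement or propagation step, which is outside the scope of this sketch.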

Abstract:
Reconstructing object deformation from a single image remains a significant challenge in computer vision and graphics. Existing methods typically rely on multi-view video to recover deformation, limiting their applicability under constrained scenarios. To address this, we propose DeformSplat, a novel framework that effectively guides 3D Gaussian deformation from only a single image. Our method introduces two main technical contributions. First, we present Gaussian-to-Pixel Matching which bridges the domain gap between 3D Gaussian representations and 2D pixel observations. This enables robust deformation guidance from sparse visual cues. Second, we propose Rigid Part Segmentation consisting of initialization and refinement. This segmentation explicitly identifies rigid regions, crucial for maintaining geometric coherence during deformation. By combining these two techniques, our approach can reconstruct consistent deformations from a single image. Extensive experiments demonstrate that our approach significantly outperforms existing methods and naturally extends to various applications, such as frame interpolation and interactive object manipulation. Project page: https://vision3d-lab.github.io/deformsplat

Abstract:
As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable manipulation of object-level elements. Our key contributions are twofold: 1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance; and 2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training, and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks—such as object addition, removal, scaling, and replacement—while maintaining computational efficiency.

Abstract:
In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object’s effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

Abstract:
We present HRM2Avatar, a novel framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real-time on mobile devices. Monocular capture with commodity smartphones provides a low-cost, pervasive alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses significant challenges due to deficient visual and geometric data relative to multi-camera setups. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for detailed texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. First, we extract explicit garment meshes from monocular data to model clothing deformations more effectively. Second, we attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting changes. This representation efficiently learns high-resolution and dynamic information from our tailored monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for rendering and animating high-fidelity avatars in AR/VR, social gaming, and on-device creation, demanding sub-frame responsiveness. Our fully GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over 2.7 × faster than representative mobile-engine baselines. Experiments show that HRM2Avatar delivers superior visual realism and real-time interactivity at high resolutions, outperforming state-of-the-art monocular methods.

Abstract:
Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings to represent identity and emotion with given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures the identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to fuse dual transformer features and more effectively integrate emotional, motion-related, and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.

Abstract:
We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance—capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting. Please refer to https://byteaigc.github.io/X-Actor/ for more results.

Abstract:
Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces—particularly in large-scale, unbounded environments—remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.

Abstract:
We present PartComposer, a framework for part-level concept learning from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle with effectively learning fine-grained concepts or require a large dataset as input. We propose a dynamic data synthesis pipeline generating diverse part compositions to address one-shot data scarcity. Most importantly, we propose to maximize the mutual information between denoised latents and structured concept codes via a concept predictor, enabling direct regulation on concept disentanglement and re-composition supervision. Our method achieves strong disentanglement and controllable composition, outperforming subject and part-level baselines when mixing concepts from the same, or different, object categories. Our code is available at https://github.com/Junyu-Liu-Nate/partcomposer.

Abstract:
We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties—including joint type, axis, origin, range, and part category—into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.

Abstract:
Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics—such as depth-of-field via aperture—current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is our grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.

Abstract:
Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system that uses only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100–200 FPS without requiring specialized high-speed cameras. On the processing side, we propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture. Project page: https://openimaginglab.github.io/4DSloMo/
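
The frame-rate arithmetic behind staggered capture can be illustrated with a minimal sketch. It assumes the camera groups are offset evenly within one base frame interval, which the abstract does not state; the function name `staggered_schedule` and the group counts 4 and 8 are hypothetical, chosen only because they reproduce the quoted 100 and 200 FPS from a 25 FPS base.

```python
from fractions import Fraction

def staggered_schedule(base_fps, num_groups):
    """Return per-group start offsets (seconds) and the effective frame rate.

    Assumption: groups are spaced evenly inside one base frame interval.
    """
    frame_interval = Fraction(1, base_fps)
    offsets = [frame_interval * g / num_groups for g in range(num_groups)]
    effective_fps = base_fps * num_groups
    return offsets, effective_fps

# Hypothetical setup: 4 camera groups, each recording at 25 FPS.
offsets, fps = staggered_schedule(25, 4)
print([float(o) for o in offsets], fps)  # [0.0, 0.01, 0.02, 0.03] 100
```

Under the same assumption, 8 groups would give an equivalent 200 FPS, at the cost of fewer simultaneous viewpoints per timestamp, which is the sparsity the artifact-fix model is meant to compensate for.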

Abstract:
The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth-first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at https://github.com/AutodeskAILab/AutoBrep.
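
A breadth-first ordering of a face adjacency graph can be sketched as follows. This is only an illustration of the traversal itself: the abstract does not say how AutoBrep chooses the starting face or breaks ties, so the start index and the sorted-neighbor tie-break below are assumptions, and `bfs_face_order` and `cube` are hypothetical names.

```python
from collections import deque

def bfs_face_order(adjacency, start):
    """Return face indices in breadth-first order from `start`."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        face = queue.popleft()
        order.append(face)
        # Assumption: visit neighbors in ascending index order.
        for neighbor in sorted(adjacency[face]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Face adjacency of a cube: 0 = top, 1 = bottom, 2..5 = side faces in a ring.
cube = {
    0: {2, 3, 4, 5},
    1: {2, 3, 4, 5},
    2: {0, 1, 3, 5},
    3: {0, 1, 2, 4},
    4: {0, 1, 3, 5},
    5: {0, 1, 2, 4},
}
print(bfs_face_order(cube, 0))  # [0, 2, 3, 4, 5, 1]
```

With such an ordering, each newly emitted face is adjacent to at least one face already in the sequence, which fits the description of neighboring faces being generated progressively.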

Abstract:
In this paper, we present PanoDreamer, a novel method for producing a coherent 360° 3D scene from a single input image. Unlike existing methods that generate the scene sequentially, we frame the problem as single-image panorama and depth estimation. Once the coherent panoramic image and its corresponding depth are obtained, the scene can be reconstructed by inpainting the small occluded regions and projecting them into 3D space. Our key contribution is formulating single-image panorama and depth estimation as two optimization tasks and introducing alternating minimization strategies to effectively solve their objectives. We demonstrate that our approach outperforms existing techniques in single-image 360° 3D scene reconstruction in terms of consistency and overall quality.

Abstract:
Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. Our project page is publicly available at https://context-as-memory.github.io/.
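
To make the idea of retrieval by FOV overlap concrete, here is a rough sketch. It is not the paper's Memory Retrieval module: it assumes a simplified top-down 2D setting in which each camera is `(x, y, yaw)` with a fixed horizontal FOV and viewing range, and it scores overlap by sampling points inside the query camera's view sector. All function names and parameters are hypothetical.

```python
import math

def in_fov(cam, point, fov_deg, max_dist):
    """True if a 2D point lies inside the camera's view sector."""
    cx, cy, yaw = cam
    dx, dy = point[0] - cx, point[1] - cy
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_dist:
        return False
    angle = math.atan2(dy, dx) - yaw
    angle = (angle + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return abs(angle) <= math.radians(fov_deg) / 2

def fov_overlap(query, cand, fov_deg=90.0, max_dist=10.0, n=20):
    """Fraction of points sampled in the query sector that cand also sees."""
    half = math.radians(fov_deg) / 2
    hits = total = 0
    for i in range(n):
        r = max_dist * (i + 0.5) / n  # cell midpoints, strictly inside the sector
        for j in range(n):
            a = query[2] - half + 2 * half * (j + 0.5) / n
            p = (query[0] + r * math.cos(a), query[1] + r * math.sin(a))
            total += 1
            hits += in_fov(cand, p, fov_deg, max_dist)
    return hits / total

def retrieve(query, history, k=2, **kw):
    """Keep the k history cameras with the largest overlap with the query."""
    ranked = sorted(history, key=lambda c: fov_overlap(query, c, **kw), reverse=True)
    return ranked[:k]

history = [(0.0, 0.0, math.pi), (0.0, 0.0, 0.1), (0.0, 0.0, 0.0)]
print(retrieve((0.0, 0.0, 0.0), history, k=1))  # [(0.0, 0.0, 0.0)]
```

A camera facing the opposite direction scores zero and is dropped, which is the kind of pruning that reduces the number of candidate context frames.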

Abstract:
Generating 3D scenes is still a challenging task due to the lack of readily available scene data. Most existing methods only produce partial scenes and provide limited navigational freedom. We introduce a practical and scalable solution that uses 360° video as an intermediate scene representation, capturing the full-scene context and ensuring consistent visual content throughout the generation. We propose WorldPrompter, a generative pipeline that synthesizes traversable 3D scenes from text prompts. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model, trained with a mix of image and video data, achieves convincing spatial and temporal consistency for static scenes. This is validated by an average COLMAP matching rate of 94.6%, allowing for high-quality panoramic Gaussian splat reconstruction and improved navigation throughout the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Abstract:
We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high-level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors. We will release the code and prompts for academic research.

Abstract:
We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular human motion directly from a single image into a compact set of four disentangled latent tokens—one each for facial expression and body pose, and one per hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with distinct identities, poses and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end training framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale video datasets spanning diverse human motions. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. We further guide the learning of motion tokens using auxiliary decoders to promote fine-grained, semantically aligned, and normal-aware motion embeddings. Extensive experiments demonstrate that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion expressiveness and identity preservation. Please refer to https://byteaigc.github.io/X-Unimotion/ for more information.

Abstract:
3D Gaussian Splatting (3DGS) combines classic image-based rendering, point-based graphics, and modern differentiable techniques, and offers an interesting alternative to traditional physically-based rendering. However, 3DGS-family models are far from efficient for power-constrained Extended Reality (XR) devices, which need to operate at the watt level. This paper introduces PowerGS, the first framework to jointly minimize the rendering and display power in 3DGS under a quality constraint. We present a general problem formulation and show that solving the problem amounts to 1) identifying the iso-quality curve(s) in the landscape subtended by the display and rendering power and 2) identifying the power-minimal point on a given curve, which has a closed-form solution given a proper parameterization of the curves. PowerGS also readily supports foveated rendering for further power savings. Extensive experiments and user studies show that PowerGS achieves up to 86% total power reduction compared to state-of-the-art 3DGS models, with minimal loss in both subjective and objective quality. Code is available at https://github.com/horizon-research/PowerGS.

Abstract:
We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained from a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time. It also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production.

Abstract:
We propose DeMapGS, a structured Gaussian Splatting framework that jointly optimizes deformable surfaces and surface-attached 2D Gaussian splats. By anchoring splats to a deformable template mesh, our method overcomes topological inconsistencies and enhances editing flexibility, addressing limitations of prior Gaussian Splatting methods that treat points independently. The unified representation in our method supports extraction of high-fidelity diffuse, normal, and displacement maps, enabling the reconstructed mesh to inherit the photorealistic rendering quality of Gaussian Splatting. To support robust optimization, we introduce a gradient diffusion strategy that propagates supervision across the surface, along with an alternating 2D/3D rendering scheme to handle concave regions. Experiments demonstrate that DeMapGS achieves state-of-the-art mesh reconstruction quality and enables downstream applications for Gaussian splats such as editing and cross-object manipulation through a shared parametric surface.

Abstract:
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly “brushes” user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user’s intent through textual prompts. In this work, we propose In-Context Brush, a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with a subject that aligns with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head latent feature shifting within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head attention reweighting across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection. Project page: https://yuci-gpt.github.io/In-Context-Brush/.

Abstract:
Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Code and data for this paper are at https://github.com/fudan-generative-vision/hallo4.

Abstract:
Document dewarping aims to rectify deformations in photographic document images and thus improve text readability. The task has attracted much attention and seen great progress, but preserving document structures remains challenging. Given recent advances in diffusion models, it is natural to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control over highly complex document images (e.g., 2000 × 3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. Specifically, DvD introduces coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks cannot evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks, including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available at https://github.com/hanquansanren/DvD.
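
How a coordinate mapping rectifies an image, as opposed to predicting pixels directly, can be shown with a small sketch. It covers only the final resampling step, not the denoising that produces the mapping. It assumes a backward map (each output pixel stores the `(x, y)` source location to read from) and nearest-neighbor sampling; neither detail is given in the abstract, and `dewarp` is a hypothetical name.

```python
def dewarp(image, backward_map):
    """Resample `image` using per-pixel source coordinates (nearest neighbor).

    Assumption: backward_map[row][col] = (src_x, src_y) in the warped image.
    """
    src_h, src_w = len(image), len(image[0])
    output = []
    for map_row in backward_map:
        out_row = []
        for src_x, src_y in map_row:
            # Round to the nearest pixel, then clamp to the image bounds.
            x = min(max(int(src_x + 0.5), 0), src_w - 1)
            y = min(max(int(src_y + 0.5), 0), src_h - 1)
            out_row.append(image[y][x])
        output.append(out_row)
    return output

# Toy 2x2 "warped" image whose columns are swapped, and the map that undoes it.
warped = [[1, 2], [3, 4]]
mapping = [[(1, 0), (0, 0)], [(1, 1), (0, 1)]]
print(dewarp(warped, mapping))  # [[2, 1], [4, 3]]
```

Because the output is assembled entirely from existing source pixels, a model that predicts coordinates cannot hallucinate new text content, which is one plausible reading of why a mapping is attractive for high-resolution documents.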

Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated its potential for efficient and photorealistic 3D reconstructions, which is crucial for diverse applications such as robotics and immersive media. However, current Gaussian-based methods for dynamic scene reconstruction struggle with large inter-frame displacements, leading to artifacts and temporal inconsistencies under fast object motions. To address this, we introduce TrackerSplat, a novel method that integrates advanced point tracking methods to enhance the robustness and scalability of 3DGS for dynamic scene reconstruction. TrackerSplat utilizes off-the-shelf point tracking models to extract pixel trajectories and triangulate per-view pixel trajectories onto 3D Gaussians to guide the relocation, rotation, and scaling of Gaussians before training. This strategy effectively handles large displacements between frames, dramatically reducing the fading and recoloring artifacts prevalent in prior methods. By accurately positioning Gaussians prior to gradient-based optimization, TrackerSplat overcomes the quality degradation associated with large frame gaps when processing multiple adjacent frames in parallel across multiple devices, thereby boosting reconstruction throughput while preserving rendering quality. Experiments on real-world datasets confirm the robustness of TrackerSplat in challenging scenarios with significant displacements, achieving superior throughput under parallel settings and maintaining visual quality compared to baselines. The code is available at https://github.com/yindaheng98/TrackerSplat.

Abstract:
Reverse engineering 3D computer-aided design (CAD) models from images is an important task for many downstream applications, including interactive editing, manufacturing, architecture, and robotics. The difficulty of the task lies in the vast representational disparities between the CAD output and the image input. CAD models are precise, programmatic constructs that involve sequential operations combining discrete command structure with continuous attributes, making them challenging to learn and optimize in an end-to-end fashion. Concurrently, input images introduce inherent challenges such as photometric variability and sensor noise, complicating the reverse engineering process. In this work, we introduce a novel approach that conditionally factorizes the task into two sub-problems. First, we leverage a vision-language foundation model (VLM), a finetuned Llama3.2, to predict the global discrete base structure with semantic information. Second, we propose TrAssembler, which, conditioned on the discrete structure with semantics, predicts the continuous attribute values. To support the training of TrAssembler, we further construct an annotated CAD dataset of common objects from ShapeNet. Putting it all together, our approach and data demonstrate significant first steps towards CAD-ifying images in the wild. Code and data can be found at https://github.com/qq456cvb/Img2CAD.

Abstract:
In this paper, we propose VideoFrom3D, a novel framework for synthesizing high-quality 3D scene videos from coarse geometry, a camera trajectory, and a reference image. Our approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables. A straightforward approach to synthesizing a video from coarse geometry might condition a video diffusion model on geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes due to the difficulty of jointly modeling visual quality, motion, and temporal consistency. To address this, we propose a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, our framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates intermediate frames using a video diffusion model, enhanced by flow-based camera control and structural guidance. Notably, both modules operate without any paired dataset of 3D scene models and natural images, which is extremely difficult to obtain. Comprehensive experiments show that our method produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines. Code is available at github.com/KIMGEONUNG/VideoFrom3D.

Abstract:
Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting the camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer’s state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis. Code is available at https://github.com/zyhbili/MV-Performer.

Abstract:
We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing—such as instance removal—can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization.

Abstract:
Accurate measurement of images produced by electronic displays is critical for the evaluation of both traditional and computational displays. Traditional display measurement methods based on sparse radiometric sampling and fitting a model are inadequate for capturing spatially varying display artifacts, as they fail to capture high-frequency and pixel-level distortions. While cameras offer sufficient spatial resolution, they introduce optical, sampling, and photometric distortions. Furthermore, the physical measurement must be combined with a model of a visual system to assess whether the distortions are going to be visible. To enable perceptual assessment of displays, we propose a combination of a camera-based reconstruction pipeline with a visual difference predictor, which accounts for both the inaccuracy of camera measurements and visual difference prediction. The reconstruction pipeline combines HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction, enabling cameras to function as precise display measurement instruments. By incorporating a Visual Difference Predictor (VDP), our system models the visibility of various stimuli under different viewing conditions for the human visual system. We validate the proposed CameraVDP framework through three applications: defective pixel detection, color fringing awareness, and display non-uniformity evaluation. Our uncertainty analysis framework enables the estimation of the theoretical upper bound for defect pixel detection performance and provides confidence intervals for VDP quality scores. Our code is available on https://github.com/gfxdisp/CameraVDP.

Abstract:
Material creation and reconstruction are crucial for appearance modeling but traditionally require significant time and expertise from artists. While recent methods leverage visual foundation models to synthesize PBR materials from user-provided inputs, they often fall short in quality, flexibility, and user control. We propose a novel two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with user input. In the estimation stage, we introduce a chained decomposition scheme that sequentially predicts SVBRDF channels by passing previously extracted representations as input into a single-step image-conditional diffusion model. Our method is efficient, high quality, and enables flexible user control. We evaluate our approach against existing material generation and estimation methods, demonstrating superior performance. Our material estimation method shows strong robustness on both generated textures and in-the-wild photographs. Furthermore, we highlight the flexibility of our framework across diverse applications, including text-to-material, image-to-material, structure-guided generation, and material editing.

Abstract:
We propose a training-free method for feature field rendering in 3D Gaussian Splatting, enabling fast and scalable embedding of high-dimensional features into 3D scenes. Unlike training-based feature distillation methods, which are computationally expensive and often yield feature embeddings that poorly reflect the rendered semantics, our approach back-projects 2D features onto pre-trained 3D Gaussians using influence weights derived from the rendering equation. This projection produces a queryable 3D feature field, validated on tasks including 2D and 3D segmentation, affordance transfer, and identity encoding, spanning queries using language, pixel, and synthetic embeddings. These capabilities, in turn, enable downstream applications in augmented and virtual reality, interactive scene editing, and robotics. Across different tasks, our method achieves performance comparable to or better than training-based approaches, while significantly reducing computational cost. The project page is at https://jojijoseph.github.io/3dgs-backprojection.
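The back-projection step can be pictured as an influence-weighted average. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a rasterizer has already produced, for every pixel, the blending weight of each contributing Gaussian. The names `backproject_features` and `contributions` are hypothetical, and the paper's exact weighting may differ.

```python
def backproject_features(num_gaussians, contributions, pixel_features):
    """Accumulate 2D features onto Gaussians as an influence-weighted average.

    contributions: list of (pixel_index, gaussian_index, weight) triples, where
        weight is the Gaussian's blending weight at that pixel (assumed given).
    pixel_features: list of feature vectors, one per pixel.
    """
    dim = len(pixel_features[0])
    sums = [[0.0] * dim for _ in range(num_gaussians)]
    totals = [0.0] * num_gaussians
    for pixel, gaussian, weight in contributions:
        totals[gaussian] += weight
        for k in range(dim):
            sums[gaussian][k] += weight * pixel_features[pixel][k]
    # Normalize; Gaussians that never contributed keep a zero feature.
    return [
        [s / totals[g] for s in sums[g]] if totals[g] > 0 else sums[g]
        for g in range(num_gaussians)
    ]
```

Because the result is a single pass of accumulation and normalization, no gradient-based training is involved, which is the property the abstract emphasizes.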

Abstract:
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment–body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost—a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category—enhancing garment–body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
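Attention temperature scaling can be illustrated with a generic softmax attention in which the logits are divided by an extra temperature factor. This is a sketch of the general idea only: the function name and the `temperature` parameter are assumptions, and the abstract does not say how Voost chooses the temperature for a given resolution or mask size.

```python
import math

def attention_weights(query, keys, temperature=1.0):
    """Softmax attention weights with an extra temperature on the logits."""
    scale = temperature * math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A temperature below 1 sharpens the distribution toward the best-matching key, while a temperature above 1 flattens it.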

Abstract:
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.

Abstract:
The procedural occupancy function is a flexible and compact representation for creating 3D scenes. For rasterization and other tasks, it is often necessary to extract a mesh that represents the shape. Unbounded scenes with long-range camera trajectories, such as flying through a forest, pose a unique challenge for mesh extraction. A single static mesh representing all the geometric detail necessary for the full camera path can be prohibitively large. Therefore, independent meshes can be extracted for different camera views, but this approach may lead to popping artifacts during transitions. We propose a temporally coherent method for extracting meshes suitable for long-range camera trajectories in unbounded scenes represented by an occupancy function. The key idea is to perform 4D mesh extraction using a new spacetime tree structure called a binary-octree. Experiments show that, compared to existing baseline methods, our method offers superior visual consistency at a comparable cost. The code and the supplementary video for this paper are available at https://github.com/princeton-vl/BinocMesher.

Abstract:
Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. These limitations primarily stem from two key factors: the limited representation ability of commonly used control signals (e.g., landmarks, depth maps), and the lack of diversity in identity and pose variations within publicly available datasets. In this paper, we construct a powerful head model from both aspects by constructing learnable control signals and enabling the model to adaptively leverage synthetic data. Firstly, we introduce a novel control signal representation that is learnable, dense, expressive, and 3D consistent. Our method embeds learnable Gaussians onto a parametric head surface, which significantly enhances the consistency and expressiveness of diffusion-based head models. Secondly, in terms of data, we synthesize a large-scale dataset covering diverse poses and identities. To reduce the negative impact of artifacts in synthetic data, we introduce real/synthetic embeddings that allow the model to distinguish between real and synthetic samples and learn to utilize them adaptively. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released at https://ustc3dv.github.io/Learn2Control.

Abstract:
We present LLM-Primitives: Large Language Model for 3D Reconstruction with Primitives, a novel approach to shape abstraction. By incorporating multi-modal conditional inputs, our method enables LLMs to reconstruct high-quality 3D primitives using only a modest amount of training data (tens of thousands of samples). This work marks a significant milestone in applying large language models to 3D primitive-based reconstruction, demonstrating both their feasibility and effectiveness in this domain. Specifically, we leverage the point clouds of existing 3D models as conditional inputs to the LLM via a multi-modal connector. Instead of directly estimating primitive parameters, we introduce a center-to-surface vector representation, ensuring deterministic outputs and avoiding the ambiguity often associated with primitive parameterization. Experimental results show that LLM-Primitives surpass state-of-the-art 3D primitive methods across various quantitative metrics. Notably, the substantial improvements in visual quality further confirm that LLM-Primitives can reconstruct high-quality, practical 3D primitives. (Project page: https://llm-primitives.github.io/LLM-Primitives/)

Abstract:
This paper presents MODepth, a multi-frame monocular depth estimation system based on the controlled motion of an optical image stabilization (OIS) module. By actively injecting acoustic signals, we induce regular translational movements of the OIS lens, resulting in controllable camera pose changes and simplifying inter-frame pose estimation. Leveraging multi-frame images captured under OIS-controlled lens movements, we design a high-precision depth estimation network, MODNet, and introduce the principal point offset estimation module and pose estimation modules to fully exploit geometric information across frames. To validate the effectiveness of our approach, we collect a new dataset MODdata with 1100 samples in nearly 220 indoor scenarios and benchmark our model as an OIS-based multi-frame depth estimation method, comparing it to ground truth obtained from a depth sensor and other state-of-the-art monocular depth estimation algorithms. Our method achieves competitive or superior performance compared to fully supervised baselines, reaching an RMSE of 0.439, which outperforms all evaluated methods, demonstrating that self-supervised fine-tuning with OIS-induced parallax is a viable alternative to ground-truth supervision. Code and dataset are available at: https://github.com/liangjindeamo-yuer/MODEPTH

Abstract:
Existing underwater image processing methods often struggle due to the limited availability of real paired training data. Models trained on public datasets frequently fail to generalize across diverse underwater conditions and produce suboptimal color restoration. To address these challenges, we propose a self-supervised underwater color restoration framework based on a Wavelet-Diffusion Model with Filtered Multi-Scale Feature Distillation. Specifically, we introduce a wavelet-diffusion training paradigm on terrestrial images, guided by a stochastic underwater imaging model prior. This randomized control enables the model to learn diverse underwater imaging processes, facilitating effective generalization to real-world underwater images and achieving precise color restoration. Furthermore, to tackle feature entanglement in zero-shot domain generalization and mitigate the slow sampling and partial corruption issues of diffusion models, we integrate a Mamba-based U-shaped student network for multi-scale feature distillation. Additionally, we introduce a filtering mechanism to refine the diffusion sampled features, allowing the student model to outperform the teacher in both performance and image quality. Extensive experiments across multiple underwater datasets demonstrate that our approach effectively restores natural colors, eliminating water-induced distortions while achieving state-of-the-art performance in both qualitative and quantitative evaluations. Code and data for this paper are at https://github.com/zx826/FMFD

Abstract:
While recent advances in human-object interaction (HOI) video generation showcase promising capabilities for synthesizing coordinated human-object dynamics, existing methods remain constrained by their reliance on meticulously curated motion sequences and actor-specific data, thereby limiting practical scalability and user accessibility. Furthermore, generalization to novel object appearances and interaction scenarios remains understudied. To address these limitations, we propose HOMA, a weakly conditioned multimodal-driven HOI video generation framework that introduces sparse, decoupled motion guidance to enhance controllability and reduce dependency on stringent input conditions. Our approach encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to enable temporally consistent and physically plausible interactions. To optimize learning efficiency and feature injection accuracy, we introduce a parameter-space HOI adapter initialized with pretrained MMDiT weights to preserve prior knowledge while enabling efficient adaptation. Additionally, we design a facial cross-attention adapter for audio-driven lip synchronization, ensuring anatomically accurate speech animation. Extensive experiments demonstrate that HOMA achieves state-of-the-art performance in interaction naturalness and generalization under weak supervision, outperforming existing methods by significant margins. We further illustrate HOMA’s versatility through diverse applications, including text-conditioned generation and interactive object manipulation, facilitated by a user-friendly demo interface. The project page is https://bone-11.github.io/homa-page/.

Abstract:
High-speed, high-resolution, and accurate 3D scanning would open doors to many new applications in graphics, robotics, science, and medicine by enabling the accurate scanning of deformable objects during interactions. Past attempts to use structured light, time-of-flight, and stereo in high-speed settings have usually required tradeoffs in resolution or accuracy. In this paper, we introduce a method that enables, for the first time, 3D scanning at 450 frames per second at 1 Megapixel, or 1,450 frames per second at 0.4 Megapixel in an environment with controlled lighting. The key idea is to use a per-pixel lookup table that maps colors to depths, which is built using a linear stage. Imperfections, such as lens distortion and sensor defects, are baked into the calibration. We describe our method and test it on a novel hardware prototype. We compare the system with both ground-truth geometry as well as commercially available dynamic sensors like the Microsoft Kinect and Intel RealSense. Our results show the system acquiring geometry of objects undergoing high-speed deformations and oscillations and demonstrate the ability to recover physical properties from the reconstructions.
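The per-pixel lookup table can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: it assumes calibration frames are captured at known depths along the linear stage, and it uses a nearest-neighbor match in RGB space, which the abstract does not specify. The names `build_lut` and `lookup_depth` are hypothetical.

```python
def build_lut(calibration_frames):
    """calibration_frames: list of (depth, image), image[pixel] = (r, g, b).

    Returns lut[pixel] = list of (color, depth) samples for that pixel.
    """
    num_pixels = len(calibration_frames[0][1])
    lut = [[] for _ in range(num_pixels)]
    for depth, image in calibration_frames:
        for pixel, color in enumerate(image):
            lut[pixel].append((color, depth))
    return lut

def lookup_depth(lut, pixel, color):
    """Return the depth whose calibrated color is closest to the observed one."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, depth = min(lut[pixel], key=lambda sample: sq_dist(sample[0], color))
    return depth
```

Since every pixel keeps its own table, pixel-specific effects such as lens distortion are absorbed into the calibration data instead of being modeled explicitly.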

Abstract:
Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by pairwise matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as “left hand” versus “right hand” which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities while facilitating shared mapping for an entire family of 3D shapes. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new possibly imperfect 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape’s surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including 2D-to-3D and 3D-to-3D texture transfer, in-part segmentation, pose alignment, and motion transfer in low-data regimes. Unlike previous pairwise approaches, our solution constructs a joint embedding space, where both seen and unseen 3D shapes are implicitly aligned without further optimization. The code is available at https://graphics.tudelft.nl/SurfaceAware3DFeatures.

Abstract:
We propose mesh-free fluid simulations that exploit a kinematic neural basis for velocity fields represented by an MLP. We design a set of losses that ensures that these neural bases approximate fundamental physical properties such as orthogonality, divergence-freeness, boundary alignment, and smoothness. Our neural bases can then be used to fit an input sketch of a flow, which will inherit the same fundamental properties from the bases. We can then animate such flow in real-time using standard time integrators. Our neural bases can accommodate different domains, moving boundaries, and naturally extend to three dimensions.

Abstract:
The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable (e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for refocusing learning. However, the training requires pairs of data with different focus planes and bokeh levels in the same scene, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task. This requires a stronger constraint during training. Inspired by the photographic principle that photos of different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint enhances model training by imposing physically grounded refocusing behavior that the focusing results should be faithfully aligned with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model. Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.

Abstract:
Processing visual data often involves small adjustments or sequences of changes, e.g., image filtering, surface smoothing, and animation. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields—neural network parameterizations of visual or physical functions—has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA for instance-specific neural fields, avoiding the need for large pre-trained models and yielding lightweight updates. We validate our approach with experiments in image filtering, geometry editing, video compression, and energy-based editing, demonstrating its effectiveness and versatility for representing neural field updates.
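The low-rank update at the core of LoRA can be sketched for a single layer as W' = W + s * B A, where B and A are thin matrices of rank r. The sketch below is a minimal illustration in plain Python, not the paper's code: the helper names and the `scale` parameter are assumptions for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_update(weight, b_factor, a_factor, scale=1.0):
    """Return weight + scale * (b_factor @ a_factor) without modifying weight."""
    delta = matmul(b_factor, a_factor)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def lora_param_count(out_dim, in_dim, rank):
    """Parameters stored by the low-rank update instead of out_dim * in_dim."""
    return rank * (out_dim + in_dim)
```

For a 256 x 256 layer, a rank-4 update stores 2,048 numbers instead of 65,536, which is why an edit to a neural field can be encoded as a lightweight update.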

Abstract:
Neural fields excel at representing continuous visual signals but typically operate at a single, fixed resolution. We present a simple yet powerful method to optimize neural fields that can be prefiltered in a single forward pass. Key innovations and features include: (1) We perform convolutional filtering in the input domain by analytically scaling Fourier feature embeddings with the filter’s frequency response. (2) This closed-form modulation generalizes beyond Gaussian filtering and supports other parametric filters (Box and Lanczos) that are unseen at training time. (3) We train the neural field using single-sample Monte Carlo estimates of the filtered signal. Our method is fast during both training and inference, and imposes no additional constraints on the network architecture. We show quantitative and qualitative improvements over existing methods for neural-field filtering.
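Point (1) rests on a standard identity: convolving sin(ωx) or cos(ωx) with a symmetric filter multiplies it by the filter's frequency response at ω. For a Gaussian with standard deviation σ, that response is exp(-σ²ω²/2). The sketch below illustrates only this identity: the function names are hypothetical, frequencies are assumed to be in radians per unit length, and how the network consumes the scaled features is not covered here.

```python
import math

def gaussian_response(omega, sigma):
    """Frequency response of a unit-area Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (sigma * omega) ** 2)

def filtered_fourier_features(x, frequencies, sigma):
    """Fourier features of x, each attenuated by the Gaussian's response."""
    features = []
    for omega in frequencies:
        gain = gaussian_response(omega, sigma)
        features.append(gain * math.sin(omega * x))
        features.append(gain * math.cos(omega * x))
    return features
```

Setting `sigma` to zero leaves the features unchanged, and swapping `gaussian_response` for another filter's response (for example, a sinc for a box filter) changes the filter without touching the rest of the embedding.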

Abstract:
Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering applications. However, traditional feature grid representations often suffer from a substantial memory footprint, posing a significant bottleneck for modern parallel computing hardware. In this paper, we present neural vertex features, a generalized formulation of learnable representation for neural rendering tasks involving explicit mesh surfaces. Instead of uniformly distributing neural features throughout 3D space, our method stores learnable features directly at mesh vertices, leveraging the underlying geometry as a compact and structured representation for neural processing. This not only optimizes memory efficiency, but also improves feature representation by aligning compactly with the surface using task-specific geometric priors. We validate our neural representation across diverse neural rendering tasks, with a specific emphasis on neural radiosity. Experimental results demonstrate that our method reduces memory consumption to only one-fifth (or even less) of grid-based representations, while maintaining comparable rendering quality and lowering inference overhead.
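Querying features stored at mesh vertices can be sketched with barycentric interpolation over the triangle containing the surface point. This is an assumption made for illustration: the abstract does not state which interpolation scheme is used, and the function name is hypothetical.

```python
def interpolate_vertex_features(vertex_features, triangle, barycentric):
    """Blend the features of a triangle's three vertices at a surface point.

    vertex_features: list of feature vectors, one per mesh vertex.
    triangle: the three vertex indices of the face that was hit.
    barycentric: the three barycentric weights of the hit point (sum to 1).
    """
    dim = len(vertex_features[triangle[0]])
    return [
        sum(w * vertex_features[v][k] for v, w in zip(triangle, barycentric))
        for k in range(dim)
    ]
```

Storage then grows with the number of vertices instead of the volume of a 3D grid, which matches the memory argument made in the abstract.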

Abstract:
Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Specifically, in contrast to existing methods: 1) We investigate how different representations and VAE supervision strategies affect the generation process, and address issues like aliasing artifacts and fragmented thin-shell structures by using a TSDF-based representation supervised with BCE loss. 2) We scale up the resolution of 3D data, image conditioning inputs, and the number of latent tokens to enhance generation fidelity. 3) We adopt mixed conditioning using raw RGB images and normal maps during training, effectively resolving ambiguities caused by inconsistencies between ControlNet-generated RGB images and the underlying geometry from untextured assets. 4) We replace the original softmax attention with linear attention to improve training and inference efficiency when handling a large number of latent tokens. 5) We introduce an inference-time scaling strategy that enhances generation quality at test time. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.

Abstract:
Sketches are an important medium of expression, and many recent works concentrate on automatic sketch creation. One such ability, very useful for amateurs, is text-based completion of a partial sketch to create a complex scene while preserving the style of the partial sketch. Existing methods focus solely on generating sketches that match the content of the input prompt in a predefined style, ignoring the styles of the input partial sketches, e.g., the global abstraction level and local stroke styles. To address this challenge, we introduce AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles and supports iterative sketch completion. AutoSketch completes the input sketch in a style-consistent manner using a two-stage method. In the first stage, we initially optimize the strokes to match an input prompt augmented by style descriptions extracted from a vision-language model (VLM). Such style descriptions lead to non-photorealistic guidance images which enable more content to be depicted through new strokes. In the second stage, we utilize the VLM to adjust the strokes from the previous stage to adhere to the style present in the input partial sketch through an iterative style adjustment process. In each iteration, the VLM identifies a list of style differences between the input sketch and the strokes generated in the previous stage, translating these differences into adjustment codes to modify the strokes. We compare our method with existing methods using various sketch styles and prompts, perform extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch can support diverse sketching scenarios.

Abstract:
Motion in-betweening is the problem of synthesizing movement between keyposes. Traditional research has focused primarily on single characters. Extending these methods to densely interacting characters is highly challenging, as it demands precise spatial-temporal correspondence between the characters to maintain the interaction, while creating natural transitions towards predefined keyposes. In this research, we present a method for long-horizon interaction in-betweening that enables two characters to engage and respond to one another naturally. To effectively represent and synthesize interactions, we propose a novel solution called Cross-Space In-Betweening, which models the interactions of each character across different conditioning representation spaces. We further observe that the significantly increased constraints in interacting characters heavily limit the solution space, leading to degraded motion quality and diminished interaction over time. To enable long-horizon synthesis, we present two solutions to maintain long-term interaction and motion quality, thereby keeping synthesis in the stable region of the solution space. We first sustain interaction quality by identifying periodic interaction patterns through adversarial learning. We further maintain the motion quality by learning to refine the drifted latent space and prevent pose error accumulation. We demonstrate that our approach produces realistic, controllable, and long-horizon in-between motions of two characters with dynamic boxing and dancing actions across multiple keyposes, supported by extensive quantitative evaluations and user studies.

Abstract:
Stochastic PDE solvers have emerged as a powerful alternative to traditional discretization-based methods for solving partial differential equations (PDEs), especially in geometry processing and graphics. While off-centered estimators enhance sample reuse in WoS-type Monte Carlo solvers, they introduce correlation artifacts and bias when Green’s functions are approximated. In this paper, we propose a statistically weighted off-centered WoS-type estimator that leverages local similarity filtering to selectively combine samples across neighboring evaluation points. Our method balances bias and variance through a principled weighting strategy that suppresses unreliable estimators. We demonstrate our approach’s effectiveness on various PDEs—including screened Poisson equations—and boundary conditions, achieving consistent improvements over existing solvers such as vanilla Walk on Spheres, mean value caching, and boundary value caching. Our method also naturally extends to gradient field estimation and mixed boundary problems.
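For context, the vanilla Walk on Spheres baseline named above can be sketched for the Laplace equation with Dirichlet boundary values on the unit disk. This is the standard centered estimator, not the weighted off-centered estimator proposed in the paper. The function name, the choice of domain, and the parameters `num_walks` and `eps` are assumptions for this example.

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, num_walks=2000, eps=1e-3, rng=random):
    """Estimate a harmonic function at (x, y) inside the unit disk."""
    total = 0.0
    for _ in range(num_walks):
        px, py = x, y
        while True:
            radius = 1.0 - math.hypot(px, py)  # distance to the unit circle
            if radius < eps:
                break
            # Jump to a uniformly random point on the largest empty circle.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            px += radius * math.cos(angle)
            py += radius * math.sin(angle)
        norm = math.hypot(px, py)
        total += boundary_value(px / norm, py / norm)  # project onto boundary
    return total / num_walks
```

Each evaluation point runs its own independent walks here, so no samples are shared between neighboring points. That missing reuse is what caching and off-centered estimators aim to recover.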

Abstract:
Acquiring bidirectional reflectance distribution functions (BRDFs) is essential for simulating light transport and analytically modeling material properties. Over the past two decades, numerous intensity-only BRDF datasets in the visible spectrum have been introduced, primarily for RGB image rendering applications. However, in scientific and engineering domains, there remains an unmet need to model light transport with polarization, a fundamental wave property of light, across hyperspectral bands. To address this gap, we present the first hyperspectral-polarimetric BRDF (hpBRDF) dataset of real-world materials, spanning wavelengths from 414 to 950 nm and densely sampled at 68 spectral bands. This dataset covers both the visible and near-infrared (NIR) spectra, enabling detailed material analysis and light reflection simulations that incorporate polarization at each narrow spectral band. We develop an efficient hpBRDF acquisition system that captures high-dimensional hpBRDFs within a feasible acquisition time. Using this system, we demonstrate hyperspectral-polarimetric rendering using the acquired hpBRDFs. To provide insights on hpBRDF, we analyze the hpBRDFs with respect to their dependencies on wavelength, polarization state, material type, and illumination/viewing geometry. Also, we propose compact representations through principal component analysis and implicit neural hpBRDF modeling. The dataset is available on our project page.

Abstract:
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible – one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing.

Abstract:
Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

Abstract:
Real-time visibility determination in expansive or dynamically changing environments has long posed a significant challenge in computer graphics. Existing techniques are computationally expensive and often applied as a precomputation step on a static scene. We present NeuralPVS, the first deep-learning approach for visibility computation that efficiently determines from-region visibility in a large scene, running at approximately 100 Hz with less than 1% missing geometry. This approach is made possible by a neural network operating on a froxelized representation of the scene. The network’s performance is achieved by combining sparse convolution with a 3D volume-preserving interleaving for data compression. Moreover, we introduce a novel repulsive visibility loss that can effectively guide the network to converge to the correct data distribution. This loss provides enhanced robustness and generalization to unseen scenes. Our results demonstrate that NeuralPVS outperforms existing visibility methods in terms of both accuracy and efficiency.

Abstract:
Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

Abstract:
We present UltraZoom, a system for generating gigapixel-resolution images of objects from casually captured inputs, such as handheld phone photos. Given a full-shot image (global, low-detail) and one or more close-ups (local, high-detail), UltraZoom upscales the full image to match the fine detail and scale of the close-up examples. To achieve this, we construct a per-instance paired dataset from the close-ups and adapt a pretrained generative model to learn object-specific low-to-high resolution mappings. At inference, we apply the model in a sliding window fashion over the full image. Constructing these pairs is non-trivial: it requires registering the close-ups within the full image for scale estimation and degradation alignment. We introduce a simple, robust method for achieving registration on arbitrary materials in casual, in-the-wild captures. Together, these components form a system that enables seamless pan and zoom across the entire object, producing consistent, photorealistic gigapixel imagery from minimal input. For full-resolution results and code, visit our project page at ultra-zoom.github.io.

Abstract:
We propose a transformer architecture and training strategy for tree generation. The architecture processes data at multiple resolutions and has an hourglass shape, with middle layers processing fewer tokens than outer layers. Similar to convolutional networks, we introduce longer-range skip connections to complement this multi-resolution approach. The key advantages of this architecture are the faster processing speed and lower memory consumption. We are, therefore, able to process more complex trees than would be possible with a vanilla transformer architecture. Furthermore, we extend this approach to perform image-to-tree and point-cloud-to-tree conditional generation and to simulate the tree growth processes, generating 4D trees. Empirical results validate our approach in terms of speed, memory consumption, and generation quality.

Abstract:
Artist-drawn sketches only loosely conform to analytical models of perspective projection; the deviation of human-drawn perspective from analytical perspective models is persistent and well documented, but has yet to be algorithmically replicated. We encode this deviation between human and analytic perspectives as a continuous function in 3D space and develop a method to learn it. We seek deviation functions that (i) mimic artist deviation on our training data; (ii) generalize to other shapes; (iii) are consistent across different views of the same shape; and (iv) produce outputs that appear human-drawn. The natural data for learning this deviation is pairs of artist sketches of 3D shapes and best-matching analytical camera views of the same shapes. However, a core challenge in learning perspective deviation is the heterogeneity of human drawing choices, combined with relative data paucity (the datasets we rely on have only a few dozen training pairs). We sidestep this challenge by learning perspective deviation from an individual pair of an artist sketch of a 3D shape and the contours of the same shape rendered from a best-matching analytical camera view. We first match contours of the depicted shape to artist strokes, then learn a spatially continuous local perspective deviation function that modifies the camera perspective projecting the contours to their corresponding strokes. This function retains key geometric properties that artists strive to preserve when depicting 3D content, thus satisfying (i) and (iv) above. We generalize our method to alternative shapes and views (ii,iii) via a self-augmentation approach that algorithmically generates training data for nearby views, and enforces spatial smoothness and consistency across all views. We compare our results to potential alternatives, demonstrating the superiority of the proposed approach. Code and models will be released upon acceptance.

Abstract:
This work studies the challenge of transferring animations between characters whose skeletal topologies differ substantially. While many techniques have advanced motion retargeting over the past decades, transferring motions across diverse topologies remains less explored. The primary obstacle lies in the inherent topological inconsistency between source and target skeletons, which restricts the establishment of straightforward one-to-one bone correspondences. In addition, the current lack of large-scale paired motion datasets spanning different topological structures severely constrains the development of data-driven approaches. To address these limitations, we introduce Motion2Motion, a novel, training-free framework. Simple yet effective, Motion2Motion works with only one or a few example motions on the target skeleton, by accessing a sparse set of bone correspondences between the source and target skeletons. Through comprehensive qualitative and quantitative evaluations, we demonstrate that Motion2Motion achieves efficient and reliable performance in both similar-skeleton and cross-species skeleton transfer scenarios. The practical utility of our approach is further evidenced by its successful integration in downstream applications and user interfaces, highlighting its potential for industrial applications. Code and data are available at https://lhchen.top/Motion2Motion.

Abstract:
Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
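
To make the idea of a per-pixel ray embedding for panoramas concrete, the sketch below maps an equirectangular pixel to a unit ray direction through spherical coordinates and then forms standard Plücker coordinates (moment, direction). It is an illustration under stated assumptions, not the paper's exact panoramic Plücker embedding: the axis convention, the pixel-center offset, the ordering of the six components, and the function names are all hypothetical, and the camera is assumed to be axis-aligned with the world frame (no rotation).

```python
import math


def equirect_ray_direction(u, v, width, height):
    """Unit ray direction for pixel (u, v) of an equirectangular image."""
    # Longitude in [-pi, pi), latitude in (-pi/2, pi/2); pixel centers at +0.5.
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    # Assumed convention: y is up, z is forward.
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Six-component Plücker coordinates (o x d, d) of a ray."""
    moment = cross(origin, direction)
    return moment + direction  # tuple concatenation yields a 6-tuple
```

Applying `plucker_ray` to every pixel of a frame, with `origin` set to the camera position, yields a six-channel map that could serve as a pose conditioning signal. The moment is always orthogonal to the direction, which is a quick sanity check on the construction.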

Abstract:
Generating realistic and controllable 3D human avatars is a long-standing challenge. The difficulty increases when covering a broad range of attributes such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in both scale and diversity. The central question we address in this paper is: Can we distill existing foundation models to generate theoretically unbounded richly annotated 3D human data? We introduce InfiniHuman, a novel framework to distill these models synergistically, to generate richly annotated human data with minimal cost and theoretically unlimited scalability. Specifically, we propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. Remarkably, users cannot distinguish our automatically generated identities from scan renderings. InfiniHumanData contains 111K identities and covers unprecedented diversity in ethnicity, age, clothing styles, and more. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body shape parameters. Based on this, we learn InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate that InfiniHuman significantly surpasses existing state-of-the-art methods in terms of visual quality, generation speed, and controllability. Importantly, our approach democratizes high-quality avatar generation with fine-grained control at infinite scale through a practical and affordable solution. To facilitate future research, we will publicly release our automatic data generation pipeline and the comprehensive dataset InfiniHumanData, and the generative models InfiniHumanGen. 
The code and data of InfiniHuman are publicly available at https://yuxuan-xue.com/infini-human.

Abstract:
The computation of a potentially visible set (PVS) can accelerate many computer graphics algorithms, such as framerate upsampling, streaming rendering, global illumination, and multi-fragment effects. Algorithms for from-region PVS have an inherently high complexity. Previous from-region PVS algorithms propagate occlusion through the scene in a front-to-back manner and are order-dependent, which places bounds on parallelism and restricts execution speed. We introduce the disocclusion buffer, which operates on a sparse, layered representation of the scene with quantized depth. In this representation, we invert the traditional PVS problem formulation and explicitly compute disocclusion rather than occlusion. Disocclusion can be computed in parallel in an order-independent manner, overcoming the main bottleneck in traditional PVS computation. Our PVS algorithm is over six times faster than the previous state of the art at the same level of accuracy in a direct comparison. It runs in shaders on the GPU without requiring any hardware extensions. We demonstrate how our work outperforms previous PVS algorithms in the range of supported camera motion without compromising quality.

Abstract:
We introduce a method that automatically and jointly updates both continuous and discrete parameters of a compound lens design, to improve its performance in terms of sharpness, speed, or both. Previous methods for compound lens design use gradient-based optimization to update continuous parameters (e.g., curvature of individual lens elements) of a given lens topology, requiring extensive expert intervention to realize topology changes. By contrast, our method can additionally optimize discrete parameters such as number and type (e.g., singlet or doublet) of lens elements. Our method achieves this capability by combining gradient-based optimization with a tailored Markov chain Monte Carlo sampling algorithm, using transdimensional mutation and paraxial projection operations for efficient global exploration. We show experimentally on a variety of lens design tasks that our method effectively explores an expanded design space of compound lenses, producing better designs than previous methods and pushing the envelope of speed-sharpness tradeoffs achievable by automated lens design.

Abstract:
The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias.
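
The bias described above can be reproduced numerically. A denoiser trained with an L2 loss converges toward the mean of its targets, so the sketch below simply compares the mean of nonlinearly transformed noisy targets with the transformed clean value. The Reinhard-style tone curve, the uniform multiplicative noise model, and all constants are assumptions picked for illustration and are not taken from the paper.

```python
import random


def tone_map(v):
    # Reinhard-style curve, a common nonlinear operator (illustrative choice).
    return v / (1.0 + v)


def mean(values):
    return sum(values) / len(values)


def simulate(clean=4.0, n=200_000, seed=0):
    """Return (mean of noisy, mean of tone-mapped noisy, tone-mapped clean)."""
    rng = random.Random(seed)
    # Unbiased noise: each noisy target has expected value equal to `clean`.
    noisy = [clean * rng.uniform(0.0, 2.0) for _ in range(n)]
    linear_target = mean(noisy)
    nonlinear_target = mean([tone_map(v) for v in noisy])
    return linear_target, nonlinear_target, tone_map(clean)
```

With these settings the first value stays near 4.0, while the mean of the tone-mapped targets falls noticeably below `tone_map(4.0) = 0.8`. That gap is the skew a nonlinearity introduces into Noise2Noise training.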

Abstract:
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements, realistic head poses, and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality. Demos and code are available at https://xg-chu.site/project_artalk/.

Abstract:
Despite rapid advancements in video generation models, generating coherent, long-form storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover, collectively realizing multi-character, multi-scene animation. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards. 
Code and data for this paper are at https://animaker-dev.github.io/

Abstract:
We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences that feature substantial object deformation, large-scale camera movement, and limited view coverage, which typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of a deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation. Please refer to our project page for more details: PAD3R.github.io.

Abstract:
This paper presents a new approach to estimate accurate and robust 3D semantic correspondence with the hierarchical neural semantic representation. Our work has three key contributions. First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features to preserve fine details, by carefully harnessing 3D priors from pre-trained 3D generative models. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature, then iteratively refines it with local geometric features, yielding accurate and semantically-consistent mappings. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories. Our method also supports various applications, such as shape co-segmentation, keypoint matching, and texture transfer, and generalizes well to structurally diverse shapes, with promising results even in cross-category scenarios. Both qualitative and quantitative evaluations show that our method outperforms previous state-of-the-art techniques.

Abstract:
Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application-specific, and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (i.e., teammates). We employ a novel variation of Low-Rank Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore, Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.

Abstract:
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing image and video generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. For instance, in color-editing tasks, they struggle to maintain structural consistency in edited regions while preserving the rest intact. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to Multi-Modal Diffusion Transformers (MM-DiT) has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without manual design, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control. ConsistEdit represents a significant advancement in generative model editing and unlocks the full editing potential of MM-DiT architectures.

Abstract:
Recently, 3D Gaussian Splatting (3DGS) has achieved impressive results in novel view synthesis, demonstrating high fidelity and efficiency. However, it easily exhibits needle-like artifacts, especially when increasing the sampling rate. Mip-Splatting tries to remove these artifacts with a 3D smoothing filter for frequency constraints and a 2D Mip filter for approximated supersampling. Unfortunately, it tends to produce over-blurred results, and sometimes needle-like Gaussians still persist. Our spectral analysis of the covariance matrix during optimization and densification reveals that current 3DGS lacks shape awareness, relying instead on spectral radius and view positional gradients to determine splitting. As a result, needle-like Gaussians with small positional gradients and low spectral entropy fail to split and overfit high-frequency details. Furthermore, the filters used in both 3DGS and Mip-Splatting reduce the spectral entropy and increase the condition number when zooming in to synthesize novel views, causing view inconsistencies and more pronounced artifacts. Our Spectral-GS, based on spectral analysis, introduces 3D shape-aware splitting and 2D view-consistent filtering strategies, effectively addressing these issues, enhancing 3DGS’s capability to represent high-frequency details without noticeable artifacts, and achieving high-quality realistic rendering.
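
As a rough illustration of the two quantities named above, the sketch below computes a spectral entropy and a condition number from the per-axis scales of a single 3D Gaussian. It assumes the usual 3DGS factorization of the covariance into a rotation and a diagonal scale matrix, so the covariance eigenvalues are the squared scales. The entropy definition used here (Shannon entropy of the trace-normalized eigenvalues) is one common choice and may differ from the paper's exact formula; the function name is hypothetical.

```python
import math


def spectral_stats(scales):
    """Return (spectral entropy, condition number) of a Gaussian's covariance.

    `scales` are the positive per-axis standard deviations of the Gaussian.
    """
    if any(s <= 0.0 for s in scales):
        raise ValueError("scales must be strictly positive")
    eigenvalues = [s * s for s in scales]
    total = sum(eigenvalues)
    # Normalize eigenvalues so they sum to one, then take Shannon entropy.
    probabilities = [e / total for e in eigenvalues]
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    condition_number = max(eigenvalues) / min(eigenvalues)
    return entropy, condition_number
```

An isotropic Gaussian such as `(1.0, 1.0, 1.0)` attains the maximum entropy ln 3 and a condition number of 1, whereas a needle such as `(10.0, 0.1, 0.1)` has entropy close to zero and a very large condition number. This matches the abstract's observation that needle-like Gaussians have low spectral entropy.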

Abstract:
Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project Page: https://shapeformotion.github.io.

Abstract:
Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated— through manual labeling or heuristic algorithms—and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. On our benchmark dataset, we further show that cleanup models trained with our method on unpaired corrupted data outperform state-of-the-art methods trained on clean or paired data, while also achieving comparable performance in preserving the content of the original motion clips.

Affiliations: Max Planck Institute for Informatics, Germany and Saarbrücken Research Center for Visual Computing, Interaction and Artificial Intelligence, Germany ; Google Inc., San Francisco, USA ; Harvard University, USA ; Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany ; ETH Zürich, Switzerland and IST Austria, Austria ; Harvard University, USA ; Computer Science and Artificial Intelligence Laboratory (CSAIL), USA and Massachusetts Institute of Technology (MIT), USA ; Google Inc.

Abstract:
Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our light stage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of light stage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment map yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering.
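
The final step above, combining OLAT images according to an environment map, relies on the linearity of light transport: the image under any lighting is a weighted sum of the one-light-at-a-time images. The sketch below shows that sum on toy data. It assumes the per-light weights have already been obtained by sampling the environment map at each light direction (including any solid-angle factor), and it uses single-channel images stored as flat lists. The function name and data layout are hypothetical.

```python
def relight(olat_images, light_weights):
    """Weighted sum of OLAT images, one weight per light.

    olat_images: list of images, each a flat list of pixel intensities.
    light_weights: environment-map radiance assigned to each light.
    """
    if len(olat_images) != len(light_weights):
        raise ValueError("need exactly one weight per OLAT image")
    n_pixels = len(olat_images[0])
    relit = [0.0] * n_pixels
    for image, weight in zip(olat_images, light_weights):
        if len(image) != n_pixels:
            raise ValueError("all OLAT images must have the same size")
        for p in range(n_pixels):
            relit[p] += weight * image[p]
    return relit
```

For example, with two lights and two pixels, `relight([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.25])` blends the first image at half strength and the second at quarter strength. For RGB data the same sum would be applied per color channel.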

Abstract:
Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

Abstract:
We pose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening to interpolate two single-view images. In contrast to video/4D generation from only text or a single image, our interpolative task can leverage more precise motion control to better constrain the generation. Given two monocular RGB images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D, without making assumptions on the object category, motion type, length, or complexity. To handle such arbitrary and diverse motions, we utilize a foundational video interpolation model for motion prediction. However, large frame-to-frame motion gaps can lead to ambiguous interpretations. To this end, we employ a hierarchical approach to identify keyframes that are visually close to the input states while exhibiting significant motions, then generate smooth fragments between them. For each fragment, we construct a 3D representation of the keyframe using Gaussian Splatting (3DGS). The temporal frames within the fragment guide the motion, enabling their transformation into dynamic 3DGS through a deformation field. To improve temporal consistency and refine the 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions. Through extensive qualitative and quantitative experiments as well as a user study, we demonstrate the effectiveness of our method and design choices.

Abstract:
We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space and avoiding the limitations of abstract SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Based on Assembler, we further introduce an interesting part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating potential for interactive and compositional design. Project page: https://assembler3d.github.io/

Abstract:
In user-generated-content (UGC) applications, non-expert users often rely on image-to-3D generative models to create 3D assets. In this context, primitive-based shape abstraction offers a promising solution for UGC scenarios by compressing high-resolution meshes into compact, editable representations. To this end, effective shape abstraction must be structure-aware, characterized by low overlap between primitives, part-aware alignment, and primitive compactness. We present Light-SQ, a novel superquadric-based optimization framework that explicitly emphasizes structure-awareness from three aspects. (a) We introduce SDF carving to iteratively update the target signed distance field, discouraging overlap between primitives. (b) We propose a block-regrow-fill strategy guided by structure-aware volumetric decomposition, enabling structural partitioning to drive primitive placement. (c) We implement adaptive residual pruning based on SDF update history to suppress over-segmentation and ensure compact results. In addition, Light-SQ supports multiscale fitting, enabling localized refinement to preserve fine geometric details. To evaluate our method, we introduce 3DGen-Prim, a benchmark extending 3DGen-Bench with new metrics for both reconstruction quality and primitive-level editability. Extensive experiments demonstrate that Light-SQ enables efficient, high-fidelity, and editable shape abstraction with superquadrics for complex generated geometry, advancing the feasibility of 3D UGC creation. Project page: https://johann.wang/Light-SQ/
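
As background for the superquadric primitives this abstract refers to, the sketch below evaluates the standard superquadric inside-outside function in the primitive's local frame. It illustrates the primitive only, not Light-SQ's SDF carving, block-regrow-fill, or pruning steps. The function name and parameter names (`scale`, `eps1`, `eps2`) are illustrative choices, not taken from the paper.

```python
import numpy as np


def superquadric_inside_outside(points, scale, eps1, eps2):
    """Standard superquadric inside-outside function F in the local frame.

    points: (..., 3) query points, already expressed in the primitive's frame.
    scale:  (3,) positive semi-axis lengths (a1, a2, a3).
    eps1, eps2: positive shape exponents; eps1 = eps2 = 1 gives an ellipsoid.
    returns F with F < 1 inside, F == 1 on the surface, F > 1 outside.
    """
    p = np.abs(np.asarray(points, dtype=np.float64)) / np.asarray(scale, dtype=np.float64)
    xy_term = p[..., 0] ** (2.0 / eps2) + p[..., 1] ** (2.0 / eps2)
    return xy_term ** (eps2 / eps1) + p[..., 2] ** (2.0 / eps1)
```

With `eps1 = eps2 = 1` the expression reduces to the familiar ellipsoid equation, which is a convenient sanity check before trying boxier (small exponents) or pinched (large exponents) shapes.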

Abstract:
We propose a novel optimization framework for computing the medial axis transform that simultaneously preserves the medial structure and ensures high medial mesh quality. The medial structure, consisting of interconnected sheets, seams, and junctions, provides a natural volumetric decomposition of a 3D shape. Our method introduces a structure-aware, particle-based optimization pipeline guided by the restricted power diagram (RPD), which partitions the input volume into convex cells whose dual encodes the connectivity of the medial mesh. Structure-awareness is enforced through a spherical quadratic error metric (SQEM) projection that constrains the movement of medial spheres, while a Gaussian kernel energy encourages an even spatial distribution. Compared to feature-preserving methods such as MATFP [Wang et al. 2022] and MATTopo [Wang et al. 2024b], our approach produces cleaner medial structures with significantly improved mesh quality. In contrast to voxel-based, point-cloud-based, and variational methods, our framework is the first to integrate structural awareness into the optimization process, yielding medial meshes with explicit structural decomposition, topological correctness, and geometric fidelity. Our code is available at our project website.

Abstract:
Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave’s performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.
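
The frequency-split idea in this abstract, keeping low-frequency content from the base image while taking high-frequency detail from elsewhere, can be illustrated with a single-level 2D Haar transform. This is a sketch under stated assumptions: the abstract does not say which wavelet or how many levels HiWave uses, the function names are hypothetical, and the real method applies the split during diffusion sampling rather than to finished images.

```python
import numpy as np


def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D array."""
    img = np.asarray(img, dtype=np.float64)
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out


def mix_frequency_bands(base, detail):
    """Keep the low band of `base` and the three detail bands of `detail`."""
    ll_base, _, _, _ = haar_dwt2(base)
    _, lh, hl, hh = haar_dwt2(detail)
    return haar_idwt2(ll_base, lh, hl, hh)
```

Because the transform is orthonormal, the mixed result has exactly the base image's low band, which is what preserves global structure in this toy version.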

Abstract:
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the computational cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands the applicability of Tensor Cores (TCUs) to 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms and integrates seamlessly with rendering pipeline designs, such as Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18× speedup over existing Gaussian acceleration algorithms, thereby achieving a total acceleration of up to 5.6×.
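
One way to see how alpha computation can become a matrix multiplication: the Gaussian exponent is a quadratic polynomial in the pixel coordinates, so it factors into a per-pixel feature vector times a per-Gaussian coefficient vector. The NumPy sketch below shows that algebra on the CPU. It is an illustration of the general idea, not the TC-GS kernel; the function names and the conic layout `(a, b, c)` for the inverse 2D covariance `[[a, b], [b, c]]` are assumptions.

```python
import numpy as np


def pixel_features(pixels):
    """(P, 2) pixel coordinates -> (P, 6) monomials [x^2, xy, y^2, x, y, 1]."""
    x, y = pixels[:, 0], pixels[:, 1]
    return np.stack([x * x, x * y, y * y, x, y, np.ones_like(x)], axis=1)


def gaussian_coeffs(means, conics):
    """(G, 2) means and (G, 3) conics -> (6, G) polynomial coefficients."""
    mx, my = means[:, 0], means[:, 1]
    a, b, c = conics[:, 0], conics[:, 1], conics[:, 2]
    coeffs = np.stack([
        a,
        2.0 * b,
        c,
        -2.0 * (a * mx + b * my),
        -2.0 * (b * mx + c * my),
        a * mx**2 + 2.0 * b * mx * my + c * my**2,
    ], axis=0)
    return -0.5 * coeffs


def alphas_via_matmul(pixels, means, conics, opacities):
    """All pixel/Gaussian exponents from one (P, 6) @ (6, G) product."""
    power = pixel_features(pixels) @ gaussian_coeffs(means, conics)
    return opacities[None, :] * np.exp(power)


def alphas_reference(pixels, means, conics, opacities):
    """Direct per-pair evaluation, used only to check the factorisation."""
    d = pixels[:, None, :] - means[None, :, :]
    a, b, c = conics[:, 0], conics[:, 1], conics[:, 2]
    power = -0.5 * (a * d[..., 0]**2 + 2.0 * b * d[..., 0] * d[..., 1] + c * d[..., 1]**2)
    return opacities[None, :] * np.exp(power)
```

The `x^2`, `xy`, and `y^2` features grow quickly with absolute pixel coordinates, which is consistent with the abstract's remark that quadratic terms cause rounding errors in half precision and that local coordinates help.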

Abstract:
The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.

Abstract:
We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings. Project page: https://priorenhancedgaussian.github.io/

Abstract:
The development of intelligent robots seeks to seamlessly integrate them into the human world, providing assistance and companionship in daily life and work, with the ultimate goal of achieving human-robot symbiosis. This requires robots with intelligent interaction abilities to work naturally and effectively with humans. However, current robotic simulators fail to support real human participation, limiting their ability to provide authentic interaction experiences and gather valuable human feedback essential for enhancing robotic capabilities. In this paper, we introduce SymBridge, the first human-in-the-loop cyber-physical interactive system designed to enable the safe and efficient development, evaluation, and optimization of human-robot interaction methods. Specifically, we employ augmented reality technology to enable real humans to interact with virtual robots in physical environments, creating an authentic interactive experience. Building on this, we propose a novel robotic interaction model that generates responsive, precise robot actions in real time through continuous human behavior observation. The model incorporates multi-resolution human motion features and environmental affordances, ensuring contextually adaptive robotic responses. Additionally, SymBridge enables continuous robot learning by collecting human feedback and dynamically adapting the robotic interaction model. By leveraging a designed system architecture and modules, SymBridge builds a bridge between humans and robots, as well as between cyber and physical spaces, providing a natural and realistic interaction experience while facilitating the continuous evolution of robotic intelligence. Extensive experiments, user studies, and robot testing demonstrate the system’s promising performance and highlight its potential to significantly advance research on human-robot symbiosis.

Abstract:
This paper presents a novel adjoint solver for differentiable fluid simulation based on bidirectional flow maps. Our key observation is that the forward fluid solver and its corresponding backward adjoint solver share the same flow map. In the forward pass, this map transports fluid impulse variables from the initial frame to the current frame to simulate vortical dynamics. In the backward pass, the same map propagates adjoint variables from the current frame back to the initial frame to compute gradients. This shared long-range map allows the accuracy of gradient computation to benefit directly from improvements in flow map construction. Building on this insight, we introduce a novel adjoint solver that solves the adjoint equations directly on the flow map, enabling long-range and accurate differentiation of incompressible flows without differentiating intermediate numerical steps or storing intermediate variables, as required in conventional adjoint methods. To further improve efficiency, we propose a long-short time-sparse flow map representation for evolving adjoint variables. Our approach has low memory usage, requiring only 6.53 GB of data at a resolution of 192³ while preserving high accuracy in tracking vorticity, enabling new differentiable simulation tasks that require precise identification, prediction, and control of vortex dynamics.

Abstract:
3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules—an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. Firstly, a diffusion model is learned to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions have been incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively affirm that our proposed model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.

Abstract:
Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. StyleSculptor does not require prior training and enables instant adaptation to any reference models while maintaining strict user-specified style consistency. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.
Code will be available at the project page.

Abstract:
3D scene reconstruction from a single measurement is challenging, especially in the presence of occluded regions and specular materials, such as mirrors. We address these challenges by leveraging single-photon lidars. These lidars estimate depth from light that is emitted into the scene and reflected directly back to the sensor. However, they can also measure light that bounces multiple times in the scene before reaching the sensor. This multi-bounce light contains additional information that can be used to recover dense depth, occluded geometry, and material properties. Prior work with single-photon lidar, however, has only demonstrated these use cases when a laser sequentially illuminates one scene point at a time. We instead focus on the more practical – and challenging – scenario of illuminating multiple scene points simultaneously. The complexity of light transport due to the combined effects of multiplexed illumination, two-bounce light, shadows, and specular reflections is challenging to invert analytically. Instead, we propose a data-driven method to invert light transport in single-photon lidar. To enable this approach, we create the first large-scale simulated dataset of ~100k lidar transients for indoor scenes. We use this dataset to learn a prior on complex light transport, enabling measured two-bounce light to be decomposed into the constituent contributions from each laser spot. Finally, we experimentally demonstrate how this decomposed light can be used to infer 3D geometry in scenes with occlusions and mirrors from a single measurement. Our code and dataset are released on our project webpage.

Abstract:
We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios, including human-human, human-object, and human-scene, within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.

Abstract:
We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at gsgd-motiontransfer.github.io.

Abstract:
Fusing cross-category objects into a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose Adaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce Cross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, on both simple and complex prompts.

Abstract:
We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

Abstract:
Reconstructing metrically accurate humans and their surrounding scenes from a single image is crucial for virtual reality, robotics, and comprehensive 3D scene understanding. However, existing methods struggle with depth ambiguity, occlusions, and physically inconsistent contacts. To address these challenges, we introduce PhySIC, a unified framework for physically plausible Human–Scene Interaction and Contact reconstruction. PhySIC recovers metrically consistent SMPL-X human meshes, dense scene surfaces, and vertex-level contact maps within a shared coordinate frame, all from a single RGB image. Starting from coarse monocular depth and parametric body estimates, PhySIC performs occlusion-aware inpainting, fuses visible depth with unscaled geometry for a robust initial metric scene scaffold, and synthesizes missing support surfaces like floors. A confidence-weighted optimization subsequently refines body pose, camera parameters, and global scale by jointly enforcing depth alignment, contact priors, interpenetration avoidance, and 2D reprojection consistency. Explicit occlusion masking safeguards invisible body regions against implausible configurations. PhySIC is highly efficient, requiring only 9 seconds for joint human-scene optimization and less than 27 seconds for the end-to-end reconstruction process. Moreover, the framework naturally handles multiple humans, enabling reconstruction of diverse human-scene interactions. Empirically, PhySIC substantially outperforms single-image baselines, reducing mean per-vertex scene error from 641 mm to 227 mm, halving the pose-aligned mean per-joint position error (PA-MPJPE) to 42 mm, and improving contact F1-score from 0.09 to 0.51. Qualitative results demonstrate that PhySIC yields realistic foot-floor interactions, natural seating postures, and plausible reconstructions of heavily occluded furniture.
By converting a single image into a physically plausible 3D human-scene pair, PhySIC advances accessible and scalable 3D scene understanding. Our implementation is publicly available at https://yuxuan-xue.com/physic.

Abstract:
We propose FreeMusco, a motion-free framework that jointly learns latent representations and control policies for musculoskeletal characters. By leveraging the musculoskeletal model as a strong prior, our method enables energy-aware and morphology-adaptive locomotion to emerge without motion data. The framework generalizes across human, non-human, and synthetic morphologies, where distinct energy-efficient strategies naturally appear: for example, quadrupedal gaits in Chimanoid versus bipedal gaits in Humanoid. The latent space and corresponding control policy are constructed from scratch, without demonstration, and enable downstream tasks such as goal navigation and path following; to our knowledge, this is the first motion-free method to provide such capabilities. FreeMusco learns diverse and physically plausible locomotion behaviors through model-based reinforcement learning, guided by a locomotion objective that combines control, balancing, and biomechanical terms. To better capture the periodic structure of natural gait, we introduce a temporally averaged loss formulation, which compares simulated and target states over a time window rather than on a per-frame basis. We further encourage behavioral diversity by randomizing target poses and energy levels during training, enabling locomotion to be flexibly modulated in both form and intensity at runtime. Together, these results demonstrate that versatile and adaptive locomotion control can emerge without motion capture, offering a new direction for simulating movement in characters where data collection is impractical or impossible.
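
To make the contrast between a per-frame loss and a temporally averaged loss concrete, the sketch below compares the two on state trajectories of shape `(T, D)`. The exact loss used by FreeMusco is not given in the abstract, so the squared-error form, the choice to compare window means, and both function names are assumptions made for illustration.

```python
import numpy as np


def per_frame_loss(sim_states, target_states):
    """Mean over frames of the squared state error at each frame."""
    sim = np.asarray(sim_states, dtype=np.float64)
    tgt = np.asarray(target_states, dtype=np.float64)
    return float(np.mean(np.sum((sim - tgt) ** 2, axis=-1)))


def temporally_averaged_loss(sim_states, target_states):
    """Squared error between the window means of the two trajectories.

    Both inputs hold one time window, shape (T, D). Averaging first lets a
    periodic motion oscillate around the target without being penalised.
    """
    sim = np.asarray(sim_states, dtype=np.float64)
    tgt = np.asarray(target_states, dtype=np.float64)
    diff_of_means = sim.mean(axis=0) - tgt.mean(axis=0)
    return float(np.sum(diff_of_means ** 2))
```

For a trajectory that swings symmetrically around a constant target, the per-frame loss stays positive while the averaged loss is zero, which is one plausible reading of why a windowed comparison suits periodic gaits.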

Abstract:
Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV-based inspection, and smart city digital twins. While aerial imagery offers broad coverage and complements limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances like 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from MVS point clouds, then uses zero-order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements. Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation achieves the fastest training speed while delivering visual fidelity comparable to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on mobile consumer GPUs, with substantially reduced memory usage and energy consumption.

Abstract:
With the advancement of Gaussian Splatting techniques, a growing number of datasets based on this representation have been developed. However, performing accurate and efficient clipping for Gaussian Splatting remains a challenging and unresolved problem, primarily due to the volumetric nature of Gaussian primitives, which makes hard clipping incapable of precisely localizing their pixel-level contributions. In this paper, we propose a hybrid rendering framework that combines rasterization and ray tracing to achieve efficient and high-fidelity clipping of Gaussian Splatting data. At the core of our method is the RaRa strategy, which first leverages rasterization to quickly identify Gaussians intersected by the clipping plane, followed by ray tracing to compute attenuation weights based on their partial occlusion. These weights are then used to accurately estimate each Gaussian’s contribution to the final image, enabling smooth and continuous clipping effects. We validate our approach on diverse datasets, including general Gaussians, hair strand Gaussians, and multi-layer Gaussians, and conduct user studies to evaluate both perceptual quality and quantitative performance. Experimental results demonstrate that our method delivers visually superior results while maintaining real-time rendering performance and preserving high fidelity in the unclipped regions.

Abstract:
Synthesizing 3D scenes from open-vocabulary text descriptions is a challenging, important, and recently popular application. One of its critical subproblems is layout generation: given a set of objects, lay them out to produce a scene matching the input description. Nearly all recent work adopts a declarative paradigm for this problem: using an LLM to generate a specification of constraints between objects, then solving those constraints to produce the final layout. In contrast, we explore an alternative imperative paradigm, in which an LLM iteratively places objects, with each object’s position and orientation computed as a function of previously placed objects. The imperative approach allows for a simpler scene specification language while also handling a wider variety and larger complexity of scenes. We further improve the robustness of our imperative scheme by developing an error correction mechanism that iteratively improves the scene’s validity while staying as close as possible to the original layout generated by the LLM. In forced-choice perceptual studies, participants preferred layouts generated by our imperative approach 82% and 94% of the time, respectively, when compared against two declarative layout generation methods. We also present a simple, automated evaluation metric for 3D scene layout generation that aligns well with human preferences.

Abstract:
Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
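
The pixel-as-ray idea can be illustrated with a standard pinhole camera model. The sketch below is not the PhysHMR implementation: the function name `pixel_to_world_ray`, the intrinsics `fx, fy, cx, cy`, and the extrinsic convention x_cam = R x_world + t are all assumptions made for illustration.

```python
import math

def pixel_to_world_ray(u, v, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) to a ray (origin, unit direction) in world space.

    Assumes a pinhole camera with x_cam = R x_world + t, where R is a
    3x3 rotation given as nested lists and t is a length-3 translation.
    """
    # Back-project the pixel to a direction in camera coordinates.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world space: d_world = R^T d_cam.
    d_world = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    direction = tuple(c / norm for c in d_world)
    # Camera center in world space: C = -R^T t.
    origin = tuple(-sum(R[k][i] * t[k] for k in range(3)) for i in range(3))
    return origin, direction
```

A 2D keypoint lifted this way constrains the corresponding joint to lie somewhere along the ray, which is a softer cue than a single predicted 3D root position.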

Abstract:
Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. To address these limitations, we present SPGen, which encodes geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.

Abstract:
Natural head rotation is critical for believable embodied virtual agents, yet this micro-level behavior remains largely underexplored. While head-rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task-relevant cues, diminishing realism in a virtual environment. We introduce SCORE, a Symbolic COgnitive Reasoning framework for Embodied Head Rotation, a data-agnostic framework that produces context-aware head movements without task-specific training or hand-tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision–Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM-LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi-agent crowds while retaining behavioral plausibility.

Abstract:
The efficient simulation of incompressible fluids remains a difficult and open problem. Prior works often make various tradeoffs between incompressibility, stability, and cost, and it is rare to obtain all three. In this paper, we introduce a novel incompressible Smoothed Particle Hydrodynamics (SPH) scheme which uses a second-order implicit descent scheme to optimize a variational energy specially formulated to approach incompressibility. We demonstrate that our method is superior in both incompressibility and stability at minimal additional computational cost. Furthermore, we demonstrate that our method is unconditionally stable even under extreme time steps, making it suitable for interactive applications.

Abstract:
We introduce a divergence-free nD vector noise defined as the n-dimensional cross product of the gradients of n − 1 noise functions. We show that this vector noise function is divergence-free and hence volume preserving for any dimension n. Our method enables precise integration and extends to new settings by substituting noise functions with implicit surfaces, (hyper)surfaces, or custom functions. We demonstrate applications including image warping, surface texturing, noise bounded by implicit surfaces, anisotropic curl-noise, and high-dimensional point jittering up to 7D.
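
For n = 3 the construction reduces to the cross product of two gradients, and the divergence-free property follows from the identity div(∇f × ∇g) = ∇g · curl(∇f) − ∇f · curl(∇g) = 0. The following sketch checks this numerically; the two smooth trigonometric functions `f` and `g` are stand-ins chosen for illustration, not the noise functions used in the paper.

```python
import math

def grad_f(x, y, z):
    # Analytic gradient of f = sin(1.3x + 0.7y) * cos(0.9z).
    a = 1.3 * x + 0.7 * y
    return (1.3 * math.cos(a) * math.cos(0.9 * z),
            0.7 * math.cos(a) * math.cos(0.9 * z),
            -0.9 * math.sin(a) * math.sin(0.9 * z))

def grad_g(x, y, z):
    # Analytic gradient of g = cos(0.8x) * sin(1.1y + 0.5z).
    b = 1.1 * y + 0.5 * z
    return (-0.8 * math.sin(0.8 * x) * math.sin(b),
            1.1 * math.cos(0.8 * x) * math.cos(b),
            0.5 * math.cos(0.8 * x) * math.cos(b))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def vector_noise(x, y, z):
    """3D vector field built as the cross product of two gradients."""
    return cross(grad_f(x, y, z), grad_g(x, y, z))

def divergence(field, p, h=1e-4):
    """Central-difference estimate of the divergence of `field` at point p."""
    div = 0.0
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        div += (field(*hi)[axis] - field(*lo)[axis]) / (2.0 * h)
    return div
```

Evaluating `divergence(vector_noise, (0.3, -1.2, 2.0))` returns a value that is zero up to finite-difference error, while the field itself is generally nonzero at that point.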

Abstract:
We present a computational framework for designing geometric metamaterials capable of approximating freeform 3D surfaces via rotationally deployable kirigami patterns. While prior inverse design methods typically rely on standard, well-studied patterns, such as equilateral triangles or quadrilaterals, we step back to examine the broader design space of the patterns themselves. Specifically, we derive principled rules to determine whether a given planar tiling can be cut into a rotationally deployable hinged kirigami structure with possible curvature adaptation. These insights allow us to generate and validate a broad family of novel tiling patterns beyond traditional examples. We further analyze two key deployment states of a general pattern: the commonly used maximal area expansion, and the maximal rotation angle reached just before face collisions occur, which we adopt as the default for inverse design as it allows for simple deployment in practice, i.e., rotating the faces to their natural limit. Finally, we solve the inverse problem: given a target 3D surface, we compute a planar tiling that, when cut and deployed to its maximal rotation angle, approximates the input geometry. We show that for a subset of patterns, the deployed configurations are hole-free, demonstrating that curvature can be achieved from planar sheets through local combinatorial changes. Our experiments, including physical fabrications, demonstrate the effectiveness of our approach and validate a wide range of previously unexplored patterns that are both physically realizable and geometrically expressive.

Abstract:
Artist-created meshes in the wild often do not have a well-defined interior. We observe that they typically consist of a mix of solid elements, faces that bound a volume, and shell elements that represent the medial surface of a thin shell. The lack of a well-defined interior prevents downstream applications, such as solid modeling, simulation, and manufacturing. We present a method that takes as input a surface mesh and assigns to each face a label determining whether it belongs to a solid or shell. These labels reduce ambiguity by defining the interior for solid faces through thresholding the generalized winding number field, and for shell faces as the volume within an offset. We cast the labeling problem as an optimization that outputs a solid/shell label for each face, guided by a sparse set of user inputs. Once labeling is complete, we show how the shape can be volume meshed by passing the shell faces through an offset mesher and the solid faces to an off-the-shelf tetrahedral mesher, producing a final volumetric mesh by taking their union. Experiments on diverse meshes with defects and multiple solid and shell components demonstrate that our approach delivers the desired labels, enabling modeling and simulation on wild meshes in a way that respects the user intent.
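
Thresholding the generalized winding number can be sketched for a triangle mesh by summing signed solid angles, computed here with the Van Oosterom and Strackee formula. This is a generic illustration, not the paper's pipeline: the tetrahedron test mesh, the helper names, and the threshold of 0.5 are assumptions.

```python
import math

def solid_angle(a, b, c):
    """Signed solid angle of triangle (a, b, c) as seen from the origin."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    la, lb, lc = (math.sqrt(dot(v, v)) for v in (a, b, c))
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    denom = la * lb * lc + dot(a, b) * lc + dot(b, c) * la + dot(c, a) * lb
    return 2.0 * math.atan2(det, denom)

def winding_number(vertices, faces, q):
    """Generalized winding number of the mesh at query point q."""
    total = 0.0
    for i, j, k in faces:
        # Express the triangle's vertices relative to the query point.
        a, b, c = (tuple(p - s for p, s in zip(vertices[n], q)) for n in (i, j, k))
        total += solid_angle(a, b, c)
    return total / (4.0 * math.pi)

# Unit tetrahedron with outward-oriented faces (illustrative test mesh).
VERTS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
FACES = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

def is_inside(q, threshold=0.5):
    # The threshold value is an assumption made for this sketch.
    return winding_number(VERTS, FACES, q) > threshold
```

For a closed, consistently oriented mesh the winding number is 1 inside and 0 outside; on meshes with holes or open shells it varies smoothly, which is why a threshold is needed to define the interior.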

Abstract:
We introduce mathematical tools to describe the geometric problem of sieves, two-dimensional holes that admit certain three-dimensional objects to pass through them, but block others. This is achieved by formulating the sieve design problem as a two-player game where both players (the one that wants to pass, and the one that wants to block) try to find a set of rigid transformations to achieve their objective. We also introduce an algorithm for solving this game by solving a global optimization problem employing both differentiable rendering with gradient-based optimization as well as particle swarm optimization. Our procedure accounts for real-world manufacturing concerns, and we fabricate a variety of examples demonstrating the practical viability of our sieves. Our implementation takes advantage of GPUs and does not rely on any clean or manifold input geometry as long as it is a triangle mesh. We can produce intricate sieves that block an arbitrary set of shapes B but admit another arbitrary set of shapes A (if finding a solution is possible for our method).

Abstract:
State-of-the-art cloth simulations rely on linear triangular elements in mass-spring or continuum-based finite element formulations. These methods typically decompose the surface energy density into in-plane (shearing and stretching) and out-of-plane (bending) components, with bending energies modeled using discrete mean curvature measures. While effective, they are prone to mesh-dependent behavior and locking. Higher-order formulations can mitigate these issues, but their adoption poses significant challenges due to the requirement for continuity of basis functions’ derivatives across element boundaries to accurately represent surface curvature. We introduce a novel continuum-based approach that addresses the limitations of existing methods without requiring globally smooth (H2-continuous) basis functions. Our method uses non-conforming function spaces and weakly enforces the continuity of the tangent basis through carefully derived interface terms. Specifically, the proposed method builds on Interior Penalty methods, which we adapt to effectively handle simulations of curved surfaces. Our approach uses standard Lagrangian basis functions, and supports straightforward extension to high-order bases, while adhering to the in-plane/out-of-plane decoupling paradigm widely adopted in cloth simulation. We demonstrate the robustness and versatility of our method through garment simulations, illustrating its ability to handle complex deformations and a variety of bending behaviors with high fidelity.

Abstract:
3D Gaussian Splatting (3DGS) has shown strong capability in reconstructing and rendering photorealistic 3D scenes with high efficiency. However, extending 3DGS to synthesize large-scale or infinite terrains from a single captured exemplar remains an open challenge. In this paper, we propose a tile-based framework that addresses this problem. Our method builds on Wang Tiles, where each tile encodes a local field of Gaussians with boundary constraints to ensure seamless transitions. This enables stochastic yet continuous tiling of Gaussian fields over arbitrary surfaces, allowing for procedural generation of expansive terrains with high spatial diversity. Furthermore, we introduce several rendering optimizations tailored to the unique characteristics of 3DGS Wang tiles, achieving real-time rendering of large-scale 3DGS terrains.
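
The stochastic-yet-seamless property of Wang tiling comes from matching edge labels between neighbors. A minimal sketch follows, in which tiles are reduced to four edge colors; in the paper's setting each tile would additionally carry a local Gaussian field, which is omitted here. The two-color complete tile set and the function names are assumptions.

```python
import itertools
import random

COLORS = (0, 1)
# Each tile is (north, east, south, west) edge colors; the complete set has 16 tiles.
TILES = list(itertools.product(COLORS, repeat=4))

def tile_grid(rows, cols, seed=0):
    """Fill a grid in scanline order, choosing randomly among compatible tiles."""
    rng = random.Random(seed)
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            candidates = [
                t for t in TILES
                if (r == 0 or t[0] == grid[r - 1][c][2])   # north edge matches tile above
                and (c == 0 or t[3] == grid[r][c - 1][1])  # west edge matches tile to the left
            ]
            grid[r][c] = rng.choice(candidates)
    return grid

def is_seamless(grid):
    """Check that every pair of adjacent tiles agrees on the shared edge color."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if r > 0 and grid[r][c][0] != grid[r - 1][c][2]:
                return False
            if c > 0 and grid[r][c][3] != grid[r][c - 1][1]:
                return False
    return True
```

Because the tile set is complete, at least four candidates exist at every cell, so the scanline fill never gets stuck, and different seeds yield different but always seamless layouts.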

Abstract:
Scratch-represented 3D visual arts can create compelling visual effects by manipulating light reflections across surfaces. Established works, such as those involving scratch holograms, have realized impressive multi-view imagery effects of reflection arts. However, creating a continuous view of 3D virtual objects with shading effects, especially view-dependent shading, remains a challenge. Moreover, most reported works are demonstrated on planar surfaces, leaving the potential benefits of curved surfaces for diverse imagery scenarios an open research avenue. This work explores continuous view-dependent imagery with rich shading effects via scratch-based reflection, whose design space has the potential to be extended to arbitrary curved surfaces. This is achieved by solving ordinary differential equations under constraints calculated from established bidirectional reflectance distribution function models to optimize scratch distribution on substrate surfaces. Importantly, we create real-world examples by manufacturing optimized reflectors using off-the-shelf carving machines, delivering state-of-the-art specular view-dependent imagery that features continuous and realistic shading effects on both planar and developable curved surfaces.

Abstract:
3D sketches are an effective representation of a 3D shape, convenient to create via modern Virtual or Augmented Reality (VR/AR) interfaces or from 2D sketches. For 3D sketches drawn by designers, human observers can consistently imagine the surface they imply, yet reconstructing such a surface with modern methods remains an open problem. Existing methods either assume a clean, well-structured 3D curve network (while in reality most 3D sketches are rough and unstructured), or make no effort to produce a surface consistent with perceptual observations. We propose a novel method that addresses this challenge by designing a system that reconstructs a surface that better aligns with human perception from a clean or rough set of 3D sketches. As the topology of the desired surface is unknown, we use an implicit neural surface representation, parameterized via its gradient field.

Abstract:
We propose a novel method to automatically approximate a free-form surface using a set of near-developable patches that form a tensile-like structure when anchored at a sparse set of points. These structures are appealing for their ability to span large areas with low material cost and structural weight, while also offering strong aesthetic potential. Our algorithm strikes a balance between approximation accuracy, patch simplicity, and visual quality, while ensuring manufacturability and structural feasibility. The layout is guided by a curvature field and refined through a combinatorial process that incrementally adds patches until performance and fabrication constraints are met. Redundant elements are then removed to improve clarity and elegance.

Abstract:
A K-hedral tiling of a 2D finite domain is a covering of the domain with tiles without gaps or overlaps, where each tile is congruent to one of K distinct shapes called prototiles. The number of prototiles, K, should be as small as possible to keep the tiling's appearance uniform and to reduce fabrication cost, e.g., by molding. Typically, a forward approach is adopted to produce K-hedral tilings by prescribing a set of prototiles and placing prototile instances (i.e., tiles) to cover the input domain. However, the prescribed prototile set may not be sufficient to tile the domain (for small K) or may lead to tilings that use more prototiles than needed (for large K).

Abstract:
Monte Carlo methods are a cornerstone of physics-based light transport simulations, valued for their ability to produce high-quality photorealistic images. These stochastic methods often suffer from variance, resulting in undesirable noise in the rendered images. Gradient-domain rendering (GDR) techniques mitigate this problem by estimating unbiased image-space gradients via so-called shift-mapping operators. While these mappings are computationally efficient, they can yield high-variance gradients—and thus poor reconstruction quality—when applied to pixels with wildly different integrals. We tackle this challenge by dynamically selecting the optimal set of neighboring pixels for applying shift-mapping under random sequence replay. Key to our approach is a differentiable sorting network that softly ranks the output of a convolutional neural network conditioned on input sample features for weighted reconstruction. This module is carefully rigidified over time to converge to a hard top-k selection, allowing end-to-end optimization with respect to the reconstruction error. Our method is versatile and can be jointly optimized with other adaptive sampling strategies. We demonstrate variance reduction over other traditional adaptive gradient-domain methods across scenes of varying radiometric complexity.

Abstract:
When an image is seen on an optical see-through augmented reality (AR) display, the light from the display is mixed with the background light from the environment. This can severely limit the available contrast in AR, which is often orders of magnitude below that of traditional displays. Yet, the presented images appear sharper and show more details than the reduction in physical contrast would indicate. In this work, we hypothesize two effects that are likely responsible for the enhanced perceived contrast in AR: background discounting, which allows observers focused on the display plane to partially discount the light from the environment; and supra-threshold contrast perception, which explains the differences in contrast perception across luminance levels. In a series of controlled experiments on an AR high-dynamic-range multi-focal haploscope testbed, we found no statistical evidence supporting the effect of background discounting on contrast perception. Instead, the increase of visibility in AR is better explained with models of supra-threshold contrast perception. Our findings can be generalized to incorporate an image input, and this model serves to design better algorithms and hardware for display systems affected by additive light, such as AR.

Abstract:
The miniaturization of shell structures presents a versatile and complex challenge, bridging geometry with diverse practical applications. In this paper, we introduce a novel approach for computing origami crease patterns to compress arbitrary 3D shell objects. First, we employ the adapted Material Point Method (MPM) to simulate the compression of a target surface and obtain an initial folded configuration. Since MPM produces overly smooth curved surfaces, their crease patterns are unsuitable for practical origami fabrication. We then propose a novel Folding Line Extraction (FLE) method that optimizes these smoothed surfaces to extract folding lines that achieve the target compression with minimal deformation and stretching outside the crease lines. This method produces smooth curved folding lines. Fabrication and experimental validation of the extracted patterns demonstrate their effectiveness and applicability in real-world scenarios.

Abstract:
Reconstructing the geometry and appearance of a given scene is a fundamental task in 3D computer graphics and computer vision. Recently, radiance fields have emerged as a representation of light transport in the scene, allowing 3D geometry to be extracted, as a byproduct, solely from multi-view imagery. Initially designed for RGB captures, existing approaches have been extended to other sensor modalities. Among these, transient imaging, which measures the time-of-flight of light at picosecond resolution, has emerged as a promising alternative, offering rich spatio-temporal information to improve reconstruction quality from limited viewpoints and obstructed views. However, its applicability to outdoor scenarios has been highly problematic due to interference from ambient light and the different sensor behavior under high-photon-flux conditions typical of outdoor settings. Addressing this gap, we introduce Transient LASSO, a neural scene reconstruction method operating on raw transient measurements of outdoor in-the-wild captures to accurately reconstruct the underlying scene geometry and properties. We demonstrate the effectiveness of our method across a variety of outdoor environments, including complex urban scenes with dense traffic and infrastructure. Finally, we also show the potential use cases of our method for downstream applications such as sensor parameter optimization.

Abstract:
We tackle the challenges of synthesizing versatile, physically simulated human motions for full-body object manipulation. Unlike prior methods that are focused on detailed motion tracking, trajectory following, or teleoperation, our framework enables users to specify versatile high-level objectives such as target object poses or body poses. To achieve this, we introduce MaskedManipulator, a generative control policy distilled from a tracking controller trained on large-scale human motion capture data. This two-stage learning process allows the system to perform complex interaction behaviors, while providing intuitive user control over both character and object motions. MaskedManipulator produces goal-directed manipulation behaviors that expand the scope of interactive animation systems beyond task-specific solutions.

Abstract:
Recovering high-fidelity spatially varying bidirectional reflectance distribution function (SVBRDF) maps from a single image remains an ill-posed and challenging problem, especially in the presence of saturated highlights. Existing methods often fail to reconstruct the underlying texture in regions overwhelmed by intense specular reflections. Such baked-in artifacts caused by highlight corruption can be greatly alleviated by providing a series of material images under different lighting conditions. To this end, our key insight is to leverage the strong priors of diffusion models to generate images of the same material under varying lighting conditions. These generated images are then used to aid a multi-image SVBRDF estimator in recovering highlight-free reflectance maps. However, strong highlights in the input image lead to inconsistencies across the relighting results. Moreover, texture reconstruction becomes unstable in saturated regions, with variations in background structure, specular shape, and overall material color. These artifacts degrade the quality of SVBRDF recovery. To address this issue, we propose a shuffle-based background consistency module that extracts stable background features and implicitly identifies saturated regions. This guides the diffusion model to generate coherent content while preserving material structures and details. Furthermore, to stabilize the appearance of generated highlights, we introduce a lightweight specular prior encoder that estimates highlight features and then performs grid-based latent feature translation, injecting consistent specular contour priors while preserving material color fidelity. Both quantitative analysis and qualitative visualization demonstrate that our method enables stable neural relighting from a single image and can be seamlessly integrated into multi-input SVBRDF networks to estimate highlight-free reflectance maps.

Abstract:
Simulating the interactions between fluids and porous media has attracted significant attention in computer graphics. A key challenge in this domain is modeling the Poro-Elasto-Capillary (PEC) coupling effect which describes the intricate interplay of three physical phenomena in soft porous materials: pore-structure evolution, elastic deformation, and wetting driven by capillary pressure. These phenomena collectively govern dynamic behavior such as the softening and fracturing of biscuits upon water absorption or the swelling of cellulose sponges due to liquid infiltration. Most existing simulation methods model porous media either as static grids or as solid particles with augmented water content attributes, failing to capture the full spectrum of PEC-driven effects due to the lack of physical modeling for elasticity, dynamic porosity changes, and capillary interactions. We propose a multiphase particle-based framework to holistically simulate PEC coupling effects with porous media. We develop a physics-driven model that captures elasticity and dynamic pore-structure evolution under capillary action, enabling realistic simulation of softening and swelling. We derive a saturation-aware pressure Poisson equation to enforce fluid incompressibility within and around the porous medium, ensuring accurate capillary-driven flow while preserving mass and momentum. Finally, we propose a representative elementary volume-based formulation to unify the modeling of homogeneous macro-porous media and cavity-embedded structures, enhancing the representation of pore-scale PEC effects. Comparisons with prior work and real footage show the advantages of our approach in achieving visually realistic fluid-porous media interactions.

Abstract:
Denoising is an important post-processing step in physically based Monte Carlo (MC) rendering. While neural networks are widely used in practice, statistical analysis has recently become a viable alternative for denoising. In this paper, we present a general framework for statistics-based error reduction of both estimated radiance and variance. Specifically, we introduce a novel denoising approach for variance estimates, which can either improve variance-aware adaptive sampling or provide additional input for image denoising in a cascaded manner. Furthermore, we present multi-transform denoising: a general and efficient correction scheme for non-normal distributions, which typically occur in MC rendering. All these contributions combine to a robust denoising pipeline that does not require any pretraining and can run efficiently on current GPU hardware. Our results show distinct advantages over previous denoising methods, especially in the range of a few hundred samples per pixel, which is of high practical relevance. Finally, we demonstrate good convergence behavior as the number of samples increases, providing predictable results with low bias that are free of hallucinated neural artifacts. In summary, our statistics-based algorithms for adaptive sampling and denoising deliver fast, consistent, low-bias variance and radiance estimates.

Abstract:
With the rise of digital fashion, reusing high-quality garment assets to assemble new outfits has become increasingly important for improving design efficiency and reducing production costs. However, combining multiple garments often introduces complex inter-garment intersections that are difficult to resolve. In this paper, we propose a novel framework that introduces a midsurface representation to simplify multilayered garments for intersection-free outfit assembly. Each garment is approximated by a watertight tetrahedral enclosure, enabling efficient resolution of inter-garment collisions on the midsurface level. To assemble an outfit, our method progressively untangles pairs of single-layer midsurfaces and incrementally constructs a merged midsurface. To recover the intersection-free full geometry from these deformed midsurfaces and enable instantaneous transfer across different poses, we use embedded anchors to drive inversion-free deformation of enclosing tetrahedral cages. Through various examples, we demonstrate that our method provides a scalable and automated solution for virtual outfit coordination, enabling the direct reuse of garment assets in high-fidelity, collision-free digital fashion workflows.

Abstract:
We propose SCom Tree, a compact representation for geometry based on Signed Distance Fields (SDF), which outperforms previous approaches in both size and quality while maintaining comparable rendering speed. Our representation builds on the observation that many surface regions are similar to each other up to a linear transformation, so only a small fraction of them needs to be stored to represent the whole 3D model. At the top level, we use BVH trees to accelerate ray tracing and efficiently cull empty space. The BVH leaves store octrees, with nodes referencing surface regions, each represented by a small 3D grid (brick) and transformation. The transformations themselves are a limited set of rotations and translations that are encoded with a short index. SCom Tree supports a continuous level of detail that enables efficient streaming and performance control at far distances, making our method particularly useful for rendering large 3D models with ray tracing.

Abstract:
We present ReSTIR Path Guiding (ReSTIR-PG), a real-time method that extracts guiding distributions from resampled paths produced by ReSTIR and uses them to generate improved initial candidates for the next frame. While ReSTIR significantly reduces variance through spatiotemporal resampling, its effectiveness is ultimately limited by the quality of the initial candidates, which are often poorly distributed and introduce correlation artifacts. Our key observation is that ReSTIR’s accepted paths already approximate the target path contribution density, and that their bounce directions follow the ideal distribution for local path guiding: the product of incident radiance and the cosine-weighted BSDF. We exploit this structure to fit lightweight guiding distributions using each frame’s resampled paths by density estimation. Compared to conventional guiding based on raw path-traced samples, ReSTIR-PG closes the loop between guiding and resampling. Our method achieves lower variance, faster response time to scene changes, and reduced correlation artifacts, all while preserving real-time performance.
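
The resampling step that ReSTIR builds on can be illustrated with streaming weighted reservoir sampling, which keeps one candidate with probability proportional to its weight. This is a generic sketch rather than ReSTIR-PG itself; the function name `resample` is an assumption, and in a renderer each weight would typically be a target value divided by the candidate's source density.

```python
import random

def resample(candidates, weights, rng):
    """Select one candidate with probability proportional to its weight.

    Streams over the candidates once, keeping only the current choice
    and the running weight sum (weighted reservoir sampling).
    """
    chosen, w_sum = None, 0.0
    for x, w in zip(candidates, weights):
        w_sum += w
        # Replace the current choice with probability w / w_sum.
        if w > 0.0 and rng.random() < w / w_sum:
            chosen = x
    return chosen, w_sum
```

When the candidates are already distributed close to the target, their weights are nearly uniform and little is lost in the selection, which is the motivation for generating better initial candidates through guiding.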

Abstract:
Automated LEGO® design is challenging due to the extensive variety of LEGO® brick types and the necessity of constructing semantically meaningful models from individually meaningless components. Current automatic LEGO® generation methods face two key challenges: i) They typically rely on explicit modeling of brick connectivity to ensure structural validity. However, this requires extensive manual annotation, which is labor-intensive as the variety of LEGO® primitives increases. This limits training data diversity, restricting the variety of LEGO® bricks that can be effectively utilized. ii) To facilitate learning within neural networks, current methods often employ either volume or text-based descriptions to represent LEGO® models. However, volumetric representations are computationally expensive and hamper large-scale generative training, while text-based approaches rely on large language models and dedicated text-to-brick mapping rules, introducing a semantic gap between language tokens and 3D brick structures.

Abstract:
In 3D object reconstruction from photographs, estimating material properties is challenging. We propose an inverse rendering method that uses active area lighting: as this provides a wider range of lighting angles per photo than point lighting, material reconstruction can be more accurate for the same number of photos. We compare area light shading with point lighting. With either mesh or 3D Gaussian splatting pipelines, area lighting can improve BRDF reconstruction and leads to +3 dB relighting PSNR over point lights, or needs only 1/5 of the input photos for the same quality. We also compare area light shading with Monte Carlo ray tracing and with differential linearly transformed cosines (LTC) plus shadow visibility weighting. LTC can be faster, improving optimization times by 25%. In SOTA method-level comparisons, our approach improves material reconstruction, particularly for material roughness, leading to superior relighting quality.
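
To put the reported +3 dB relighting PSNR in perspective, the sketch below uses the standard PSNR definition to convert a decibel gain into a mean-squared-error ratio. The function names and the peak value of 1.0 are assumptions made for illustration; the abstract does not state how PSNR was computed.

```python
import math


def psnr_db(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in decibels for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)


def mse_reduction_factor(psnr_gain_db: float) -> float:
    """Factor by which the MSE shrinks when PSNR rises by psnr_gain_db."""
    return 10.0 ** (psnr_gain_db / 10.0)


# A 3 dB gain corresponds to roughly halving the mean squared error.
factor = mse_reduction_factor(3.0)
print(f"+3 dB PSNR corresponds to an MSE lower by a factor of {factor:.3f}")
```

Under this definition, a 3 dB improvement means the relit images have about half the mean squared error of the point-light baseline.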

Abstract:
Vertical binocular misalignment (VBM) can degrade image quality and contribute to visual discomfort in stereoscopic head-mounted displays, particularly for see-through AR. In this project, we investigate whether VBM impairs visual performance, namely users’ ability to process briefly presented AR content, like text notifications. We also quantify how the impacts of VBM vary with an AR system’s virtual image distance (VID). Across three experiments, participants were asked to (a) detect and (b) resolve, fuse and process AR content presented with constant and time-varying VBM. Short text stimuli (words or sentences) were briefly presented on a multi-display haploscope, using additive and transmissive displays to emulate see-through AR. Experiments were repeated at three VIDs: 57, 100, 139 cm (1.75, 1, 0.72 D). The magnitude and frequency of VBM were adaptively sampled on each trial. Visual performance (as measured by participants’ time to fuse and read text) was steadily impaired with increasing VBM. For high VBM magnitudes, time to fuse did not meaningfully differ between VIDs; for low VBM, time to fuse was fastest at the furthest VID. Participants’ ability to detect VBM also improved at further VIDs. Correlations were observed between all three user outcome measures: detection, visual performance, and comfort. Overall, we find that visual performance metrics provide a useful framework to complement detection and visual comfort approaches, consistent with recent work on VBM and related artifacts in AR. The results of this study can be used to inform VBM tolerance guidelines and VID placement tradeoffs in future AR devices.
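
The diopter values quoted for the three virtual image distances follow from the reciprocal relation D = 1/d, with d in metres. A minimal sketch of that conversion (the function name is illustrative and not from the paper):

```python
def distance_cm_to_diopters(distance_cm: float) -> float:
    """Convert a viewing distance in centimetres to diopters (inverse metres)."""
    if distance_cm <= 0:
        raise ValueError("distance must be positive")
    return 100.0 / distance_cm


vids_cm = [57, 100, 139]
vids_diopters = [round(distance_cm_to_diopters(d), 2) for d in vids_cm]
print(vids_diopters)  # [1.75, 1.0, 0.72]
```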

Abstract:
In this paper, we introduce a state-of-the-art blendshape compression algorithm that significantly reduces storage requirements and computational complexity in facial animation. Our approach leverages large sparse matrix factorization and quantization to compress high-dimensional blendshape coefficients into a compact representation, preserving essential features and high-frequency geometric details. The proposed algorithm outperforms existing methods in terms of compression ratio, reconstruction quality, and computational efficiency. We demonstrate its effectiveness through extensive experiments on various animated face models, achieving compression factors of up to 100 × over sparse blendshapes with minimal impact on quality. Our technique offers compression rates up to 4.6 × better than the prior state-of-the-art while also improving approximation error and preserving features like wrinkles. Additionally, our runtime computation is up to 3 × faster than state-of-the-art on CPU and 70% faster than state-of-the-art on GPU, facilitating high-quality facial animation on low-powered computing platforms with limited resources.

Abstract:
3D cellular metamaterials are valued for many unique and useful mechanical properties. They enable lightweight, high-strength structures, with a wide range of directional stiffness profiles and possible auxetic behaviour. Infill patterns based on triply-periodic minimal surfaces (TPMS) are commonly used in additive manufacturing due to their high strength-to-weight ratio and near-isotropic mechanical behaviour. While existing work provides a wide range of cellular metamaterials to choose from, optimization of these patterns remains a significant challenge due to the diverse space of possible surface topologies and the lack of a unified parameterization. As a promising alternative, Voronoi diagrams with star-shaped distance metrics have been shown to provide a continuous parameterization of 2D cellular metamaterials, opening a rich space of possible designs. Extending the work of [Zhou et al. 2025], we provide a novel, differentiable construction of 3D volumetric Voronoi diagrams with star-shaped metrics. We integrate our formulation into a complete pipeline for mechanical metamaterial optimization, demonstrating the flexibility of star-shaped metric Voronoi diagrams to create periodic structures with a diverse range of directional stiffness profiles and stress-strain curves. Furthermore, we demonstrate the applicability of this framework to heterogeneous, smoothly graded cellular structures.
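
To illustrate what a Voronoi diagram under a star-shaped distance metric means, the sketch below assigns a 2D query point to the site with the smallest direction-dependent distance. It is only a toy under stated assumptions: the radial functions, site positions, and function names are hypothetical, and the paper's differentiable 3D construction is not reproduced here.

```python
import math


def star_distance(point, site, radial):
    """Distance from site to point, scaled by the site's radial function.

    radial(theta) > 0 is the extent of the star-shaped unit ball in direction
    theta, so points on the ball's boundary are at distance exactly 1.
    """
    dx, dy = point[0] - site[0], point[1] - site[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0
    return norm / radial(math.atan2(dy, dx))


def nearest_site(point, sites, radials):
    """Index of the Voronoi cell (site) that contains the point."""
    distances = [star_distance(point, s, r) for s, r in zip(sites, radials)]
    return min(range(len(sites)), key=distances.__getitem__)


def elongated(theta):
    # Hypothetical star shape: stretched along the x-axis, squeezed along y.
    return 1.0 + 0.5 * math.cos(2.0 * theta)


def circle(theta):
    # Constant radius recovers the ordinary Euclidean metric.
    return 1.0


sites = [(0.0, 0.0), (2.0, 0.0)]
query = (1.1, 0.0)
cell_star = nearest_site(query, sites, [elongated, circle])
cell_euclid = nearest_site(query, sites, [circle, circle])
print(cell_star, cell_euclid)  # 0 1
```

The query point is Euclidean-closer to the second site, yet the elongated metric of the first site claims it. Changing the radial functions continuously moves cell boundaries, which is the kind of continuous parameterization the abstract refers to.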

Abstract:
We present a method for automatically converting strand-based hair models into an efficient mesh-based representation, known as hair cards, for real-time rendering. Our method takes strands as inputs and outputs polygon strips with semi-transparent texture, preserving the appearance of the original strand-based hairstyle. To achieve this, we first cluster strands into groups, referred to as wisps, and generate hairstyle-preserving texture maps for each wisp by skinning-based alignment of the strands into a normalized pose in UV space. These textures can further be shared among similar wisps to better utilize the limited texture resolution. Next, polygon strips are fitted to the clustered strands via tailored differentiable rendering that can optimize transparent cluster-colored coverage masks. The proposed method successfully handles a wide range of hair models and outperforms existing approaches in representing volumetric hairstyles such as curly and wavy ones. Furthermore, our strip optimization can efficiently convert a full-hair model with more than 100 thousand strands within 20 seconds. Our method was extensively tested on both a hair database and many complex real-world hairstyles acquired using state-of-the-art hair capture methods.

Abstract:
Generative models have recently demonstrated impressive capabilities in producing high-quality 3D shapes from a variety of user inputs (e.g., text or images). However, generated objects often lack physical integrity. We introduce PhysiOpt, a differentiable physics optimizer designed to improve the physical behavior of 3D generative outputs, enabling them to transition from virtual designs to physically plausible, real-world objects. While most generative models represent geometry as continuous implicit fields, physics-based approaches often rely on the finite element method (FEM), requiring ad hoc mesh extraction to perform shape optimization. In addition, these methods are typically slow, limiting their integration in fast, iterative generative design workflows. Instead, we bridge the representation gap and propose a fast and effective differentiable simulation pipeline that optimizes shapes directly in the latent space of generative models using an intuitive and easy-to-implement differentiable mapping. This approach enables fast optimization while preserving semantic structure, unlike traditional methods relying on local mesh-based adjustments. We demonstrate the versatility of our optimizer across a range of shape priors, from global and part-based latent models to a state-of-the-art large-scale 3D generator, and compare it to a traditional mesh-based shape optimizer. Our method preserves the native representation and capabilities of the underlying generative model while supporting user-specified materials, loads, and boundary conditions. The resulting designs exhibit improved physical behavior, remain faithful to the learned priors, and are suitable for fabrication. We demonstrate the effectiveness of our approach on both virtual and fabricated objects.

Abstract:
In this paper, we introduce a novel approach to spatial regularization of optimal transport problems. Based on the notion of forward and backward “mean maps” of a transport plan, we introduce a convex formulation of optimal transport problems that incorporates regularization of these mean maps to promote spatial continuity of the resulting optimal plan. Unlike previous regularization approaches that required the optimization of all the transport plan coefficients, our formulation translates into an ADMM-based solver combined with Sinkhorn type algorithms, which drastically reduces the number of variables and scales up to large problems. We demonstrate the usefulness and efficiency of this new computational tool for various applications and for different regularizations.
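
For readers unfamiliar with the building blocks named here, the sketch below runs plain entropic Sinkhorn iterations on a tiny 1D problem and then computes a forward mean map. Interpreting the "mean map" as the mass-weighted average target position per source point (the barycentric projection) is an assumption made here, as are all names and parameter values; the paper's regularized formulation and ADMM solver are not reproduced.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Entropic optimal transport plan between histograms a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def forward_mean_map(plan, targets):
    """Mass-weighted average target position for each source point."""
    return [sum(p * y for p, y in zip(row, targets)) / sum(row) for row in plan]


sources = [0.0, 1.0]
targets = [0.0, 1.0]
cost = [[(x - y) ** 2 for y in targets] for x in sources]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
mean_map = forward_mean_map(plan, targets)
print(mean_map)
```

Here each source point is mapped almost exactly onto the target at the same position. A spatial regularizer in the spirit of the abstract would act on such per-point maps rather than on every coefficient of the plan.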

Abstract:
Cloud computing has seen rapid growth in recent years, accompanied by the increasing popularity of game streaming services that allow users to play high-end games on low-end devices, across platforms, and from virtually anywhere. The rise of multiplayer games, shared immersive experiences, and metaverse-style applications (such as exhibitions or social virtual spaces) presents unique opportunities for improving rendering efficiency. In particular, the presence of multiple viewers within the same virtual environment opens the door for computation reuse across rendering instances. We propose a scalable, multi-GPU cloud rendering system tailored for multi-viewer scenarios. Built on top of on-surface caches (OSC), our system extends the core idea of decoupling shading from viewpoints to enable efficient reuse of shading information across multiple users. Our system is designed to scale with an increasing number of viewers by dynamically distributing rendering workloads across multiple GPUs. We further enhance scalability and significantly reduce inter-GPU bandwidth requirements, by a factor of 6× up to 65×, through a novel sparse cache update strategy. Instead of copying full frames between GPUs, our method selectively propagates only relevant cache updates, enabling efficient data sharing while minimizing redundant transfers.

Abstract:
Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline. First, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, it fine-tunes a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation.

Abstract:
Radiance fields such as 3D Gaussian Splatting allow real-time rendering of scenes captured from photos. They also reconstruct most specular reflections with high visual quality, but typically model them with “fake” reflected geometry, using primitives behind the reflector. Our goal is to correctly reconstruct the reflector and the reflected objects so as to make specular reflections editable; we present a proof of concept which exploits promising learning-based methods to extract diffuse and specular buffers from photos, as well as geometry and BRDF buffers. Our method builds on three key components. First, by using diffuse/specular buffers of input training views, we optimize a diffuse version of the scene and use path tracing to efficiently generate physically-based specular reflections. Second, we present a specialized training method that allows this process to converge. Finally, we present a fast ray tracing algorithm for 3D Gaussian primitives that enables efficient multi-bounce reflections. Our method reconstructs reflectors and reflected objects, including those not seen in the input images, in a unique scene representation. Our solution allows real-time, consistent editing of captured scenes with specular reflections, including multi-bounce effects, changing roughness, etc. We mainly show results using ground truth buffers from synthetic scenes, and also preliminary results in real scenes with currently imperfect learning-based buffers. Code and data are available at: https://repo-sam.inria.fr/nerphys/editable-gaussian-reflections/.

Abstract:
Recent advances in neural rendering have mainly focused on modeling radiance fields with neural representations, often overlooking the underlying mechanisms for producing various lighting effects, and consequently leading to limited adaptability to dynamic scenes. Lighting effects, such as highlights, shadows, and indirect illumination, are typically computed using physically-based rendering methods like path tracing, which can be computationally intensive for complex indoor luminaires. Although several recent studies have aimed to model global illumination effects with neural representations, they commonly suffer from long training times or poor generalizability to new scenes. Addressing these challenges, this work presents a novel neural lighting function generation model capable of synthesizing diverse lighting effects in real time for unseen dynamic scenes and complex indoor luminaires, achieving results comparable to state-of-the-art rendering pipelines. Our model operates in two stages. First, multi-view observation images of the luminaire are captured to encode a compact, scene-independent 3D neural lighting field. Subsequently, light information is sampled from this neural lighting field and integrated with G-buffers and shadow clues to produce the shading results. In parallel, we employ a state-of-the-art generative model together with our training-free Inverse HDR Splatting module to generate HDR 3D Gaussians representing the luminaire. This strategy capitalizes on the powerful generalization capabilities of advanced generative models, enabling efficient and accurate appearance reconstruction for a diverse range of complex luminaires. In our experiments, the model trained on a dataset of 10,000 modern indoor scenes and thousands of illuminations demonstrates strong generalizability, high efficiency, and visually convincing results across a wide range of test scenes, highlighting its potential as a practical and flexible solution for high-fidelity, real-time neural indoor rendering.

Abstract:
We propose a Reinforcement Learning (RL) algorithm that combines several novel techniques to achieve more stable and robust control results for coupled solid-fluid systems. Our method utilizes the twin-delayed actor-critic algorithm to efficiently utilize off-policy data and achieve faster convergence. For more accurate estimations of the value function to guide the search of optimal policies, we use the Boltzmann softmax operator to reduce the bias of estimation. We further introduce a novel two-step Q-value estimator to reduce the well-known under-estimation issue. Finally, to mitigate the requirement of excessive exploration under sparse rewards, we propose the Fluid Effective Domain Guidance (FEDG) algorithm to guide policy exploration, where the policy for an easier task is trained jointly with that for a harder task. Put together, our framework achieves state-of-the-art performance in complex fluid-solid coupling control benchmarks, delivering stable and reliable performance in both 2D and 3D tasks over long horizons.
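
The abstract does not spell out the Boltzmann softmax operator, so the sketch below uses its standard definition: a softmax-weighted average of the action values, controlled by an inverse temperature. The function name and the example values are illustrative assumptions, and how the operator is combined with the twin critics in the paper is not shown.

```python
import math


def boltzmann_softmax(q_values, beta):
    """Softmax-weighted average of action values.

    beta = 0 gives the plain mean; large beta approaches max(q_values).
    """
    shift = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - shift)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


q = [1.0, 2.0, 3.0]
for beta in (0.0, 1.0, 50.0):
    print(beta, boltzmann_softmax(q, beta))
```

Because the result always lies between the mean and the maximum of the action values, the operator gives a tunable middle ground between averaging and the hard max used in standard Q-learning targets.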

Abstract:
In graphics applications featuring dynamically moving visual targets – such as film and gaming – we have to rotate our eyes to follow objects as they move across the screen. Because target motion is often unpredictable and ever-changing, we must rapidly respond to motion cues and adjust eye movements to maintain the target within the fovea, a process known as catch-up. This catch-up behavior reflects how efficiently the eyes react to and compensate for sudden changes in motion, making it a critical indicator for both task performance and the overall visual experience. In this work, we study and measure the eye catch-up performance during visual tracking. In particular, we present a behavioral analysis that predicts users’ reaction latency to abrupt target motion based on target visibility. Our numerical analysis and human subject studies evidence the effectiveness and generalizability. We further show how the catch-up metric can be applied to evaluate video quality, adjust game difficulty, and optimize display configurations for enhanced user performance. We envision this research to create a computational link between human perception and behavioral performance in dynamic graphics contexts.

Abstract:
Lightweight, mesh-level models of knit fabric behavior are useful for both interactive pattern editing and initialization of yarn-level simulations. However, existing mesh-level simulation methods abstract knitting as a homogeneous material, which prevents them from capturing more complicated mixed structures. Furthermore, these methods require different simulation parameters depending on the knit pattern, or arrangement of stitches within the knit. Thus, fitting these parameters to physical examples must be done for each new pattern, even when the same types of stitches are used. To address this, we observe that the physical behavior of a stitch is determined not only by its individual structure but also by the stitch types that surround it. In our work, we extend the stitch mesh model to allow for neighbor-aware material properties at the stitch level. Using structural analysis of stitch connections, we derive a finite set of four-way kernels that combine to create general knit-purl patterns for relaxation. From this, we generate a set of reference patterns that can be measured to infer the rest lengths of the kernels using a linear model. After knitting and measuring these reference patterns, we used the derived kernel rest lengths to run relaxation on our stitch mesh models with mixtures of knits and purls that we then validated against physical examples. Our results show that the 4 neighbors of each stitch are sufficient to account for much of the neighborhood-dependent deformation, while remaining simple enough to directly fit to measured data with a set of 11 basis swatches. This allows our relaxation method to efficiently estimate the rest shape of mixed knit-purl patterns, which enables fast fabric preview and more accurate yarn-level simulation.

Abstract:
Thermal imaging, as a promising approach for scalable and robust scene perception, is invaluable for many applications in various fields, such as architecture and building physics. Although many recent works have demonstrated the ability to incorporate thermal images into radiance field methods, they typically do not explicitly model how radiation interacts and reflects within the scene before reaching the camera, which is essential for inferring thermal physics and properties of objects in a scene. Using Gaussian primitives as the scene representation, our method estimates surface temperature and material properties to generate infrared renderings that closely match the input images. Taking inspiration from radiosity and hemicube rasterization, our method decomposes the outgoing radiation from each Gaussian primitive into two parts: self-emission and reflection originating from other primitives and the environment. This formulation allows us to simulate radiation under novel heating conditions and to find the best-fit temperature and material parameters given thermal images. The method is verified using both synthetic and real capture datasets.

Abstract:
We propose a fast, robust, and user-controllable algorithm for knot untangling and volume-filling curves. We extend prior work on surface-filling curves to the more challenging case of 3D volumes, equipped with a specialized gradient preconditioner that allows larger step sizes. Our method exhibits orders of magnitude faster runtime than existing methods. Our framework provides a whole new set of parameters to guide the shape of the curve, making it ideal for interactive design applications.

Abstract:
Steered Mixtures-of-Experts (SMoE) is an existing regression framework that has previously been applied for modeling and compression of 2D images and higher-dimensional imagery, including compression of light fields and light-field video. SMoE models are sparse, edge-aware representations that allow rendering of imagery with few Gaussians with excellent quality. In this paper, a novel, edge-aware "3D SMoE Splatting" (3DSMoES) framework for 3D rendering is introduced, adapted to fit into the existing "3D Gaussian Splatting" (3DGS) CUDA optimization pipeline. Here, SMoE regression serves as a "plug-and-play" solution that replaces the established 3DGS regression as a novel workhorse. 3DSMoES achieves significant visual quality gains with drastically fewer Gaussian kernels compared to 3DGS. We observe up to approximately 4 dB improvement in PSNR on individual scenes with kernel reductions between 20 and 50 percent. The sparse models are significantly faster to train and allow rendering speeds improved by up to 30 to 50 percent.

Abstract:
High-quality Physically-Based Rendering (PBR) materials are crucial for visual realism in 3D asset creation, yet existing methods primarily target static objects, leading to challenges in maintaining multi-frame consistency for animatable entities. To tackle this issue, we introduce AniTex, the first generative pipeline that utilizes diffusion models to synthesize high-quality PBR materials for animatable objects based on text prompts. The pipeline consists of three key stages: First, sequences of RGB images are generated using a video diffusion model conditioned on depth, normals, irradiance, and motion vectors to ensure temporal coherence and geometric alignment across multiple frames and viewpoints. Second, these RGB image sequences are decomposed into per-view, per-frame PBR material maps (albedo, roughness, metallic) by a specialized Intrinsic Diffusion Model (IDM), which is conditioned on the RGB images along with consistent geometry and lighting cues to disentangle material from illumination. Finally, these per-view, per-frame PBR maps are hierarchically blended. This process first ensures temporal coherence within each view’s frame sequence, then amalgamates these into globally consistent PBR materials for the animatable object, maintaining overall temporal coherence and visual consistency throughout its animation. Extensive experiments show that AniTex produces more realistic PBR materials for both static and animated objects, outperforming baseline methods in visual appeal.

Abstract:
Caustics rendering remains a long-standing challenge in Monte Carlo rendering because high-energy specular paths occupy only a small region of path space, making them difficult to sample effectively. Recent work such as Specular Manifold Sampling (SMS) [Zeltner et al. 2020] can stochastically sample these specular paths and estimate their unbiased weights using Bernoulli trials. However, applying SMS in interactive rendering is non-trivial because it is slow and delivers noisy images given a very limited time budget.

Abstract:
We introduce RibbonSculpt, the first method for interactive freeform shape design in VR through progressive sketching of sparse, oriented ribbons. Instead of reconstructing a surface from a fully drawn VR sketch, our method allows the real-time creation and progressive refinement of a closed surface of any topological genus, thanks to the continuous update of a volumetric proxy. The latter corresponds to a filtered subset of the Voronoi balls defined by the user-sketched ribbons. At each visualization step, a mesh extracted from the proxy is beautified through Laplacian-based energy minimization, yielding a smooth surface that interpolates the ribbons. Guided by this surface, users can easily refine their design by adding or removing ribbons, which sculpts, in return, the set of Voronoi balls forming the proxy. Our results, supported by user studies, show that RibbonSculpt allows VR users to easily and quickly draft the 3D shapes they have in mind.

Abstract:
Recovering a spatially varying bidirectional reflectance distribution function (SVBRDF) from as few captured images as possible has been a challenging task in computer graphics. Benefiting from the co-located flashlight-camera capture strategy and data-driven priors, SVBRDF can be estimated from few input images. However, this capture strategy usually requires a controllable darkroom environment, ensuring the flashlight is the single light source. This is often impractical during on-site capture in real-world scenarios. To support SVBRDF estimation in an uncontrolled environment, the key challenge lies in the high-precision estimation of unknown environment lighting and its effective utilization in SVBRDF recovery. To address this issue, we propose a novel exemplar-based environment lighting representation, which is easier for neural networks to use. These exemplars are a set of rendered images of selected materials under the environment lighting. By embedding the rendering process, our approach transforms environment lighting represented in the spherical domain into the sample-surface domain, thereby achieving domain alignment with the input images. This significantly reduces the network’s learning burden, resulting in a more precise environment lighting estimation. Furthermore, after lighting prediction, we also present a dominant lighting extraction algorithm and an adaptive exemplar selection algorithm to enhance the guidance of environment lighting in SVBRDF estimation. Finally, considering the distinct contributions of environment lighting and point lighting to SVBRDF recovery, we propose a well-designed cascaded network. Quantitative assessments and qualitative analysis have demonstrated that our method achieves superior SVBRDF estimations compared to previous approaches. The source code will be released.

Abstract:
We propose closed-form Cauchy coordinates and their derivatives for 2D closed high-order input cages composed of arbitrary-order polynomial curves. Our coordinates facilitate the transformation of input polynomial curves into output curves of any desired polynomial order. Central to our derivation is the creative use of the residue theorem with the logarithmic function to obtain the integral of a rational polynomial required for extending the classical 2D Cauchy coordinates to high-order input cages. Our coordinates enable smooth cage-aware angle-preserving deformations, and the derivatives allow for point-to-point deformation. Moreover, our derivation can be extended to the input cages with rational polynomial curves. Through various 2D deformations, we demonstrate how users can intuitively manipulate Bézier control points to achieve desired deformations easily.

Abstract:
We present a compact, learning-based representation that captures the full Monte Carlo sampling distribution of a rendered image. Our approach enables rendering at arbitrary samples per pixel (SPP) during inference without requiring expensive path tracing operations. This is achieved by fitting parametric distributions to per-pixel radiance values, which can be efficiently estimated, stored, and sampled. Our method proceeds in three stages. First, we map radiance samples into radial log space, which encourages Gaussian-like distributions while preserving angular relationships. Second, we fit each pixel’s distribution using 3D Gaussian Mixture Models (GMMs), trained online with minimal memory overhead, making the approach compatible with standard path tracers. For inference, we introduce an optimized sampling scheme whose complexity is independent of the target SPP, enabling fast synthesis of high-SPP images. Additionally, we demonstrate that the learned representations can be heavily compressed using quantization and codebook techniques with negligible quality loss. Experiments show that GMMs strike an effective balance between expressiveness and sparsity. Compared to alternative models, our method better captures pixel-wise Monte Carlo distributions. Lastly, we illustrate the versatility of our representation with applications such as firefly rejection and ray-distribution-driven denoising.

Abstract:
Transparent object reconstruction in an uncontrolled natural scene is a challenging task due to the complex appearance of transparent objects. Existing methods optimize the object shape with RGB color as supervision, which suffers from locality and ambiguity and fails to recover accurate structures. In this paper, we present RCTrans, which uses ray-background intersection as a more efficient constraint to achieve high-quality reconstruction, while maintaining a convenient setup. The key technology to achieve this is a novel pre-trained correspondence estimation network, which allows us to acquire ray-background correspondence under uncontrolled scenes and camera views. In addition, a confidence evaluation is introduced to protect the reconstruction from inaccurately estimated correspondences. Extensive experiments on both synthetic and real data demonstrate that our method can produce highly accurate results, without any extra acquisition burden. The code and dataset will be publicly available.

Abstract:
Creating high-quality, photorealistic 3D digital humans from a single image remains challenging. While existing methods can generate visually appealing multi-view outputs, they often suffer from inconsistencies in viewpoints and camera poses, resulting in suboptimal 3D reconstructions with reduced realism. Furthermore, most approaches focus on body generation while overlooking facial consistency – a perceptually critical issue caused by the fact that the face occupies only a small area in a full-body image (e.g., ∼ 80 × 80 pixels out of a 512 × 512 image). This limited resolution and low weight for the facial regions during optimization lead to insufficient facial details and inconsistent facial identity features across multiple views. To address these challenges, we leverage the powerful capabilities of 2D video diffusion models for consistent multi-view RGB and Normal human image generation, combined with the 3D SMPL-X representation to enable spatial consistency and geometrical details. By fine-tuning the DiT models (HumanWan-DiTs) on realistic 3D human datasets using the LoRA technique, our method ensures both generalizability and 3D visual consistency on realistic multi-view human image generation. The proposed facial enhancement is integrated into 3D Gaussian optimization to enhance facial details. To further refine results, we apply super-resolution and generative priors to reduce facial blurring alongside SMPL-X parameter tuning and the assistance of generated multi-view normal images, achieving photorealistic and consistent rendering from a single image. Extensive experiments demonstrate that our approach outperforms existing methods, producing photorealistic, consistent, and fine-detailed human renderings.

Abstract:
Blind Face Restoration (BFR) aims to recover face images suffering from unknown degradations. A recent approach to solve BFR is via plug-and-play methods for image restoration, which combine a likelihood function with pre-trained diffusion models as priors. However, as the likelihood is inherently unknown in BFR, existing methods rely instead on heuristic constraints. This leads to suboptimal distortion and identity preservation metrics. We introduce Expectation-based Likelihood Approximation with Diffusion prior (ELAD), a novel plug-and-play approach that explicitly models the likelihood function for BFR. ELAD estimates the first and second moments of the likelihood distribution by employing a Degradation Estimator to predict the degradation sequence from the input. This enables principled Bayesian inference without requiring end-to-end training. Our method achieves state-of-the-art distortion and identity preservation results compared to existing plug-and-play BFR techniques, while maintaining competitive perceptual quality. As we show, while being plug-and-play, our method still rivals end-to-end trained BFR models.

Abstract:
Shape grammars offer a powerful framework for computational design, but synthesizing shape programs to achieve specific goals remains challenging. Inspired by the success of gradient-based optimization in high-dimensional, nonconvex spaces such as those in machine learning, we ask: what makes a shape grammar amenable to gradient-based optimization? To explore this, we introduce Stochastic Rewrite Descent (SRD), an algorithm that interleaves structural rewrites with continuous parameter updates, taking steps in both to optimize a given objective. We analyze the core challenges which have previously prevented optimizing shape programs via descent, and identify a set of desirable properties for grammars that support effective optimization, along with concrete grammar design recommendations to achieve them. We validate this approach across three shape grammars, demonstrating its effectiveness in diverse domains including image fitting, text-driven generation, and topology optimization. Through ablations and comparisons, we show that grammars satisfying our proposed properties lead to significantly better optimization performance. The goal of this work is to open the door to more general and flexible computational paradigms for inverse design with shape grammars.
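The interleaving of structural rewrites and continuous updates described above can be illustrated with a toy sketch. This is not the authors' SRD implementation: the one-dimensional "program" (a list of centers), the split/delete rewrite rules, the per-primitive penalty, and the greedy acceptance test are all assumptions chosen only to make the idea concrete.

```python
import random

def loss(centers, targets, penalty=0.05):
    # Fit term: mean squared distance from each target to its nearest center.
    # The penalty per primitive is a hypothetical complexity cost.
    fit = sum(min((t - c) ** 2 for c in centers) for t in targets) / len(targets)
    return fit + penalty * len(centers)

def gradient_step(centers, targets, lr=0.5):
    # One gradient-descent step on the fit term for the continuous parameters.
    grads = [0.0] * len(centers)
    for t in targets:
        i = min(range(len(centers)), key=lambda k: (t - centers[k]) ** 2)
        grads[i] += 2.0 * (centers[i] - t) / len(targets)
    return [c - lr * g for c, g in zip(centers, grads)]

def random_rewrite(centers, rng):
    # Structural rewrite (assumed rules): delete one primitive or split one in two.
    new = list(centers)
    i = rng.randrange(len(new))
    if len(new) > 1 and rng.random() < 0.5:
        del new[i]
    else:
        new.insert(i, new[i] + rng.uniform(-1.0, 1.0))
    return new

def rewrite_descent(targets, steps=200, inner=5, seed=0):
    rng = random.Random(seed)
    centers = [0.0]
    for _ in range(steps):
        # Continuous phase: refine the parameters of the current structure.
        for _ in range(inner):
            centers = gradient_step(centers, targets)
        # Structural phase: propose a rewrite, refine it briefly, keep it if better.
        candidate = random_rewrite(centers, rng)
        for _ in range(inner):
            candidate = gradient_step(candidate, targets)
        if loss(candidate, targets) < loss(centers, targets):
            centers = candidate
    return centers

targets = [-2.0, -1.9, -2.1, 3.0, 3.1, 2.9]
result = sorted(rewrite_descent(targets))
```

On this input the search settles on two centers, one near each cluster of targets, because a third primitive costs more in penalty than it can recover in fit.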

Abstract:
Generating realistic and robust motion for virtual characters under complex physical conditions, such as irregular terrain, real-time control scenarios, and external disturbances, remains a key challenge in computer graphics. While deep reinforcement learning has enabled high-fidelity physics-based character animation, such methods often suffer from limited generalizability, as learned controllers tend to overfit to the environments they were trained in. In contrast, simplified models, such as single rigid bodies, offer better adaptability, but traditionally require hand-crafted heuristics and can only handle short motion segments. In this paper, we present a general learning framework that trains a single-rigid-body (SRB) character controller from long and unstructured datasets, without relying on human-crafted rules. Our method enables zero-shot adaptation to diverse environments and unseen motion styles. The resulting controller generates expressive and physically plausible motions in real time and seamlessly integrates with high-level kinematic motion planners without retraining, enabling a wide range of downstream tasks.

Abstract:
High-fidelity avatar reconstruction from monocular videos faces significant challenges due to imperfect foreground segmentation and inaccurate body poses. Existing methods typically depend on add-on components, such as explicit background modeling, which introduce additional overhead and reduce the flexibility of avatar reconstruction. We argue that these challenges need to be addressed fundamentally. To this end, we propose PriorAvatar, which leverages a learned 3D human prior to guide the reconstruction of 3D avatars without increasing model complexity. At the core of our method is a learned 3D prior, which consists of a multi-person feature codebook that stores the 3D shapes and appearances derived from human scans. These latent features are complemented by a shared U-Net decoder that converts them into a set of renderable 3D Gaussians. During reconstruction, the learned 3D prior allows for fitting to unseen subjects in the monocular videos by fine-tuning with 2D photometric losses using 3D Gaussians. This approach ensures that the reconstruction process effectively utilizes the learned latent spaces while minimizing discrepancies with the 2D observations. In our experiments, we demonstrate the efficiency and robustness of our novel reconstruction scheme, as evidenced by its state-of-the-art quantitative and qualitative performance without relying on complex regularizers or additional model enhancements. The results of ablation studies further verify the effectiveness of incorporating a learned human prior for monocular avatar reconstruction.

Abstract:
Video data provides an accessible and rich source beyond expensive action-labeled robot data for advancing robotic learning paradigms. Motivated by this potential, researchers investigate methods to exploit video data in robotic learning. Recent approaches fall into two main categories: action-based approaches, which tokenize latent actions from videos for policy pre-training, and state-based approaches, which pre-train models to predict subsequent states. The former establishes rich motion priors, while the latter empowers the robot to anticipate future events. These complementary capabilities suggest significant potential for integration into a unified framework. In this paper, we propose UniMimic, a novel approach unifying latent action and latent state pre-training from videos. We first train a unified tokenizer to learn latent states from video frames while deriving latent actions between state tokens. Subsequently, the policy is pre-trained on videos to predict these latent actions and subsequent latent states. Finally, the policy is fine-tuned on an action-labeled robot dataset to transfer the learned priors to precise robot execution. Experiments show that our pre-training stage improves performance by 19% on the Libero benchmark and improves the average number of tasks completed in a row of 5 from 2.50 and 2.35 to 3.89 and 3.73 on the CALVIN benchmark. In real-world experiments, our method still delivers improvements exceeding 36%.

Abstract:
Conversational behavior generation is a crucial capability of embodied agents and a significant factor influencing human-computer interaction. Generating high-quality conversational motions requires not only appropriate audio-motion mapping but also interactive responses to interlocutor behaviors and comprehensive understanding of conversational semantics. Existing methods primarily rely on audio signals and interlocutor motions to generate the main agent's motion and lack high-level semantic understanding of the conversational content, leading to moderate-quality motions that are not appropriate for the dialogue. To address these limitations, we leverage the powerful semantic understanding capabilities of large language models to comprehend complex conversational contexts. Inspired by the observation that, in human conversation, motions are closely related to both global and local semantic factors, including the conversational context and the intentions, emotions, and passive or active states of the participants, we propose an agentic system named Echo that analyzes such information. To achieve comprehensive conversational understanding, Echo leverages multiple prompts and test-time recipes to guide large language models in decomposing conversational structures and extracting fine-grained semantic information. Furthermore, we design a hierarchical feature fusion network that integrates information progressively, from frame-level audio-motion features to sentence-level semantic understanding and finally to conversation-level contextual comprehension, combining fine-grained semantic features from large language models with audio and motion characteristics. Experimental results demonstrate that our framework can be effectively integrated with several state-of-the-art motion generation models to enhance their performance in generating high-quality conversational behaviors.

Abstract:
In music-to-motion generation, the interplay between movements and music tempo variations significantly influences the emotional expressiveness and realism of performances. However, tempo-changing mechanisms remain underexplored in neural network-based music-to-motion tasks due to the scarcity of relevant datasets. Therefore, in this paper, we propose to use novel music features explicitly representing tempo variations, and introduce a dataset, JoruriPuppet, incorporating the Japanese traditional Jo-Ha-Kyu principle characterized by expressive tempo changes. Furthermore, we design three metrics to quantitatively evaluate the synchronization and expressiveness of generated motions. Experiments on our dataset highlight the limitations of SOTA methods in capturing fine-grained tempo changes. We demonstrate that integrating tempo-changing features into them improves neural network-based music-to-motion performance across existing datasets, validating the general effectiveness and applicability of our research. The dataset and source code are available at the project link: https://www.dr-lab.org/projects/joruripuppet/.
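As a rough illustration of what a feature that explicitly represents tempo variation might look like, the sketch below derives a local tempo and its relative change from beat timestamps. The abstract does not specify the paper's actual features, so the function name `tempo_features`, the beat-time input, and the choice of relative change between consecutive inter-beat tempi are all assumptions.

```python
def tempo_features(beat_times):
    """Return (local_bpm, relative_change) from beat timestamps in seconds.

    Hypothetical feature definition, not the paper's:
    local_bpm[i] is the tempo implied by the i-th inter-beat interval, and
    relative_change[i] is the fractional tempo change between consecutive intervals.
    """
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    local_bpm = [60.0 / d for d in intervals]
    relative_change = [
        (cur - prev) / prev for prev, cur in zip(local_bpm, local_bpm[1:])
    ]
    return local_bpm, relative_change

# Beats one second apart, then half a second apart: the tempo doubles mid-sequence.
bpm, change = tempo_features([0.0, 1.0, 2.0, 2.5, 3.0])
```

Here `bpm` is `[60.0, 60.0, 120.0, 120.0]` and `change` is `[0.0, 1.0, 0.0]`, so the single acceleration shows up as one nonzero entry that a model could consume alongside conventional audio features.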