SIGGRAPH Asia 2024

Abstract:
We present a novel method for recovering world-grounded human motion from monocular video. The main challenge lies in the ambiguity of defining the world coordinate system, which varies between sequences. Previous approaches attempt to alleviate this issue by predicting relative motion in an autoregressive manner, but are prone to accumulating errors. Instead, we propose estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, largely reducing the ambiguity of learning image-pose mapping. The estimated poses can be transformed back to the world coordinate system using camera rotations, forming a global motion sequence. Additionally, the per-frame estimation avoids error accumulation in the autoregressive methods. Experiments on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed. The code is available at https://zju3dv.github.io/gvhmr.
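
To make the idea of a frame "defined by the world gravity and the camera view direction" concrete, here is a minimal sketch that builds an orthonormal basis from those two vectors. The axis convention (y points up, opposite to gravity; x = y × view; z = x × y) and the name `gravity_view_frame` are assumptions for illustration and are not stated in the abstract.

```python
import math

def normalize(v):
    """Return v scaled to unit length; reject degenerate input."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / n for c in v]

def cross(a, b):
    """Right-handed cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def gravity_view_frame(gravity, view):
    """Build a gravity-aligned frame from a gravity vector and a view direction.

    Both inputs are 3-vectors in the same (e.g. camera) coordinates.
    Assumed convention: y is up, z is the view direction projected onto
    the horizontal plane, x completes a right-handed basis.
    """
    y = normalize([-c for c in gravity])   # up = opposite to gravity
    x = normalize(cross(y, view))          # raises if view is parallel to gravity
    z = cross(x, y)                        # horizontal part of the view direction
    return x, y, z
```

Because the frame depends only on the current gravity and view direction, it can be computed independently for each frame, which matches the per-frame estimation described above. The construction is undefined when the camera looks straight along gravity; the sketch raises an error in that case.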

Abstract:
Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly focus on capturing object appearances (i.e., the “look”). However, how to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose the Relation Inversion task, which aims to learn a specific relation (represented as “relation prompt”) from exemplar images. Specifically, we learn a relation prompt with a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. To tackle the Relation Inversion task, we propose the ReVersion Framework. Specifically, we propose a novel “relation-steering contrastive learning” scheme to steer the relation prompt towards relation-dense regions, and disentangle it away from object appearances. We further devise “relation-focal importance sampling” to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations. Our proposed task and method could be good inspirations for future research in various domains like generative inversion, few-shot learning, and visual relation detection.

Abstract:
The seamless integration of music with dance movements is essential for communicating the artistic intent of a dance piece. This alignment also significantly improves the immersive quality of gaming experiences and animation productions. Although there has been remarkable advancement in creating high-fidelity music from textual descriptions, current methodologies mainly focus on modulating overall characteristics such as genre and emotional tone. They often overlook the nuanced management of temporal rhythm, which is indispensable in crafting music for dance, since it intricately aligns the musical beats with the dancers’ movements. Recognizing this gap, we propose an encoder-based textual inversion technique to augment text-to-music models with visual control, facilitating personalized music generation. Specifically, we develop dual-path rhythm-genre inversion to effectively integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model. Contrary to traditional textual inversion methods, which directly update text embeddings to reconstruct a single target object, our approach utilizes separate rhythm and genre encoders to obtain text embeddings for two pseudo-words, adapting to the varying rhythms and genres. We collect a new dataset called In-the-wild Dance Videos (InDV) and demonstrate that our approach outperforms state-of-the-art methods across multiple evaluation metrics. Furthermore, our method is able to adapt to changes in tempo and effectively integrates with the inherent text-guided generation capability of the pre-trained model. Our source code and demo videos are available at https://github.com/lsfhuihuiff/Dance-to-music_Siggraph_Asia_2024.

Abstract:
Learning 3D head priors from large 2D image collections is an important step towards high-quality 3D-aware human modeling. A core requirement is an efficient architecture that scales well to large-scale datasets and large image resolutions. Unfortunately, existing 3D GANs struggle to scale to generating samples at high resolutions due to their relatively slow training and rendering speeds, and typically have to rely on 2D superresolution networks at the expense of global 3D consistency. To address these challenges, we propose Generative Gaussian Heads (GGHead), which adopts the recent 3D Gaussian Splatting representation within a 3D GAN framework. To generate a 3D representation, we employ a powerful 2D CNN generator to predict Gaussian attributes in the UV space of a template head mesh. This way, GGHead exploits the regularity of the template’s UV layout, substantially facilitating the challenging task of predicting an unstructured set of 3D Gaussians. We further improve the geometric fidelity of the generated 3D representations with a novel total variation loss on rendered UV coordinates. Intuitively, this regularization encourages that neighboring rendered pixels should stem from neighboring Gaussians in the template’s UV space. Taken together, our pipeline can efficiently generate 3D heads trained only from single-view 2D image observations. Our proposed framework matches the quality of existing 3D head GANs on FFHQ while being both substantially faster and fully 3D consistent. As a result, we demonstrate real-time generation and rendering of high-quality 3D-consistent heads at 1024² resolution for the first time.
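
The total variation idea can be illustrated with a small sketch: given an image whose pixels store rendered UV coordinates, the penalty grows when neighboring pixels map to distant UV locations. The anisotropic L1 form, the mean normalization, and the name `total_variation` are assumptions for illustration; the abstract does not specify the exact loss.

```python
def total_variation(uv_image):
    """Mean absolute difference between horizontally and vertically
    adjacent pixels of an H x W image of (u, v) tuples.

    This is a generic anisotropic L1 total variation, used here only
    to illustrate the regularizer described in the text.
    """
    height, width = len(uv_image), len(uv_image[0])
    total, count = 0.0, 0
    for i in range(height):
        for j in range(width):
            for c in range(len(uv_image[i][j])):
                if j + 1 < width:    # horizontal neighbor
                    total += abs(uv_image[i][j + 1][c] - uv_image[i][j][c])
                    count += 1
                if i + 1 < height:   # vertical neighbor
                    total += abs(uv_image[i + 1][j][c] - uv_image[i][j][c])
                    count += 1
    return total / count if count else 0.0
```

A smoothly varying UV image yields a small value, while an image in which adjacent pixels jump between far-apart UV locations yields a large one, which is the behavior the regularizer is meant to discourage.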

Abstract:
We present iSeg, a new interactive technique for segmenting 3D shapes. Previous works have focused mainly on leveraging pre-trained 2D foundation models for 3D segmentation based on text. However, text may be insufficient for accurately describing fine-grained spatial segmentations. Moreover, achieving a consistent 3D segmentation using a 2D model is highly challenging, since occluded areas of the same semantic region may not be visible together from any 2D view. Thus, we design a segmentation method conditioned on fine user clicks, which operates entirely in 3D. Our system accepts user clicks directly on the shape’s surface, indicating the inclusion or exclusion of regions from the desired shape partition. To accommodate various click settings, we propose a novel interactive attention module capable of processing different numbers and types of clicks, enabling the training of a single unified interactive segmentation model. We apply iSeg to a myriad of shapes from different domains, demonstrating its versatility and faithfulness to the user’s specifications. Our project page is at https://threedle.github.io/iSeg/.

Abstract:
We present a novel approach to synthesize dexterous motions for physically simulated hands in tasks that require coordination between the control of two hands with high temporal precision. Instead of directly learning a joint policy to control two hands, our approach performs bimanual control through cooperative learning where each hand is treated as an individual agent. The individual policies for each hand are first trained separately, and then synchronized through latent space manipulation in a centralized environment to serve as a joint policy for two-hand control. By doing so, we avoid directly performing policy learning in the joint state-action space of two hands with higher dimensions, greatly improving the overall training efficiency. We demonstrate the effectiveness of our proposed approach in the challenging guitar-playing task. The virtual guitarist trained by our approach can synthesize motions from unstructured reference data of general guitar-playing practice motions, and accurately play diverse rhythms with complex chord pressing and string picking patterns based on the input guitar tabs that do not exist in the references. Along with this paper, we provide the motion capture data that we collected as the reference for policy training.

Abstract:
We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model’s prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch that learns to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model’s pre-existing capability and the newly augmented personalized intervention, thereby offering a more disentangled subject-context control that was previously unattainable. Please visit the submitted (supplementary) website.

Abstract:
Recent single-view 3D generative methods have made significant advancements by leveraging knowledge distilled from extensive 3D object datasets. However, challenges persist in the synthesis of 3D scenes from a single view, primarily due to the complexity of real-world environments and the limited availability of high-quality prior resources. In this paper, we introduce a novel approach called Pano2Room, designed to automatically reconstruct high-quality 3D indoor scenes from a single panoramic image. These panoramic images can be easily generated using a panoramic RGBD inpainter from captures at a single location with any camera. The key idea is to initially construct a preliminary mesh from the input panorama, and iteratively refine this mesh using a panoramic RGBD inpainter while collecting photo-realistic 3D-consistent pseudo novel views. Finally, the refined mesh is converted into a 3D Gaussian Splatting field and trained with the collected pseudo novel views. This pipeline enables the reconstruction of real-world 3D scenes, even in the presence of large occlusions, and facilitates the synthesis of photo-realistic novel views with detailed geometry. Extensive qualitative and quantitative experiments have been conducted to validate the superiority of our method in single-panorama indoor novel view synthesis compared to the state-of-the-art. Our code and data are available at https://github.com/TrickyGo/Pano2Room.

Abstract:
We propose TextToon, a method to generate a drivable toonified avatar. Given a short monocular video sequence and a written instruction about the avatar style, our model can generate a high-fidelity toonified avatar that can be driven in real-time by another video with arbitrary identities. Existing related works heavily rely on multi-view modeling to recover geometry via texture embeddings, presented in a static manner, leading to control limitations. The multi-view video input also makes it difficult to deploy these models in real-world applications. To address these issues, we adopt a conditional embedding Tri-plane to learn realistic and stylized facial representations in a Gaussian deformation field. Additionally, we expand the stylization capabilities of 3D Gaussian Splatting by introducing an adaptive pixel-translation neural network and leveraging patch-aware contrastive learning to achieve high-quality images. To push our work into consumer applications, we develop a real-time system that can operate at 48 FPS on a GPU machine and 15-18 FPS on a mobile machine. Extensive experiments demonstrate the efficacy of our approach in generating textual avatars over existing methods in terms of quality and real-time animation. Please refer to our project page for more details: https://songluchuan.github.io/TextToon/.

Abstract:
Video Frame Interpolation (VFI) is important for video enhancement, frame rate up-conversion, and slow-motion generation. The introduction of event cameras, which capture per-pixel brightness changes asynchronously, has significantly enhanced VFI capabilities, particularly for high-speed, nonlinear motions. However, these event-based methods encounter challenges in low-light conditions, notably trailing artifacts and signal latency, which hinder their direct applicability and generalization. Addressing these issues, we propose a novel per-scene optimization strategy tailored for low-light conditions. This approach utilizes the internal statistics of a sequence to handle degraded event data under low-light conditions, improving the generalizability to different lighting and camera settings. To evaluate its robustness in low-light conditions, we further introduce EVFI-LL, a unique RGB+Event dataset captured under low-light conditions. Our results demonstrate state-of-the-art performance in low-light environments. Project page: https://openimaginglab.github.io/Sim2Real/.

Abstract:
Recently, 2D speaking avatars have increasingly participated in everyday scenarios due to the fast development of facial animation techniques. However, most existing works neglect the explicit control of human bodies. In this paper, we propose to drive not only the faces but also the torso and gesture movements of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only short footage of monocular video. Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network for control information injection, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to enhance the bond between driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to help with the difficulties in hand-shape preserving. After pre-training, our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework.

Abstract:
Texture plays a vital role in enhancing visual richness in both real photographs and computer-generated imagery. However, the process of editing textures often involves laborious and repetitive manual adjustments of textons, which are the recurring local patterns that characterize textures. This work introduces a fully unsupervised approach for representing textures using a compositional neural model that captures individual textons. We represent each texton as a 2D Gaussian function whose spatial support approximates its shape, and an associated feature that encodes its detailed appearance. By modeling a texture as a discrete composition of Gaussian textons, the representation offers both expressiveness and ease of editing. Textures can be edited by modifying the compositional Gaussians within the latent space, and new textures can be efficiently synthesized by feeding the modified Gaussians through a generator network in a feed-forward manner. This approach enables a wide range of applications, including transferring appearance from an image texture to another image, diversifying textures, texture interpolation, revealing/modifying texture variations, edit propagation, texture animation, and direct texton manipulation. The proposed approach contributes to advancing texture analysis, modeling, and editing techniques, and opens up new possibilities for creating visually appealing images with controllable textures.
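
A minimal sketch of the compositional representation: each texton is a 2D Gaussian with a center, two scales, and an orientation, carrying a feature vector, and a location's feature is the Gaussian-weighted sum over textons. The parameterization (axis scales plus rotation angle), the plain weighted sum, and the names `texton_weight` and `compose_features` are assumptions for illustration; the generator network mentioned above is not modeled.

```python
import math

def texton_weight(point, mean, scales, theta):
    """Evaluate an unnormalized, rotated 2D Gaussian at `point`.

    mean   -- (x, y) center of the texton
    scales -- (sx, sy) standard deviations along the texton's own axes
    theta  -- rotation of the texton's axes in radians
    """
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    c, s = math.cos(theta), math.sin(theta)
    u = c * dx + s * dy      # offset along the first (rotated) axis
    v = -s * dx + c * dy     # offset along the second (rotated) axis
    return math.exp(-0.5 * ((u / scales[0]) ** 2 + (v / scales[1]) ** 2))

def compose_features(point, textons):
    """Sum texton features at `point`, weighted by each texton's Gaussian."""
    dim = len(textons[0]["feature"])
    out = [0.0] * dim
    for t in textons:
        w = texton_weight(point, t["mean"], t["scales"], t["theta"])
        for k in range(dim):
            out[k] += w * t["feature"][k]
    return out
```

In this form, moving, scaling, rotating, or re-coloring a texton amounts to changing one entry's `mean`, `scales`, `theta`, or `feature`, which is what makes a discrete Gaussian composition convenient to edit.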

Abstract:
We introduce a dual contouring method that provides state-of-the-art performance for occupancy functions while achieving computation times of a few seconds. Our method is learning-free and carefully designed to maximize the use of GPU parallelization. The recent surge of implicit neural representations has led to significant attention to occupancy fields, resulting in a wide range of 3D reconstruction and generation methods based on them. However, the outputs of such methods have been underestimated due to the bottleneck in converting the resulting occupancy function to a mesh. Marching Cubes tends to produce staircase-like artifacts, and most subsequent works focusing on exploiting signed distance functions as input also yield suboptimal results for occupancy functions. Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information. We introduce auxiliary 2D points that are used to compute local surface normals along with the 1D points, helping identify 3D points via the quadric error function. To search the 1D, 2D, and 3D points, we develop fast algorithms that are parallelizable across all grid edges, faces, and cells. Our experiments with several 3D neural generative models and a 3D mesh dataset demonstrate that our method achieves the best fidelity compared to prior works.
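
Locating a surface point on a grid edge without distance information can be illustrated with bisection on a binary occupancy function. The sketch below is a generic bisection and is an assumption for illustration; the abstract does not state which search procedure ODC uses, and the name `find_edge_point` is hypothetical.

```python
def find_edge_point(occupancy, p0, p1, iterations=20):
    """Approximate where the occupancy boundary crosses the segment p0-p1.

    occupancy  -- function mapping a 3D point to True (inside) / False (outside)
    p0, p1     -- edge endpoints as sequences of three floats
    Returns None if both endpoints have the same occupancy.
    """
    inside0 = occupancy(p0)
    if inside0 == occupancy(p1):
        return None                      # no sign change on this edge
    lo, hi = 0.0, 1.0                    # parameter range along the edge
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        pm = [a + mid * (b - a) for a, b in zip(p0, p1)]
        if occupancy(pm) == inside0:
            lo = mid                     # boundary lies beyond the midpoint
        else:
            hi = mid                     # boundary lies before the midpoint
    t = 0.5 * (lo + hi)
    return [a + t * (b - a) for a, b in zip(p0, p1)]
```

Each edge is processed independently of all others, so a search of this kind maps naturally onto the per-edge parallelism mentioned above.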

Abstract:
Achieving efficient, high-fidelity, high-resolution garment simulation is challenging due to its computational demands. Conversely, low-resolution garment simulation is more accessible and ideal for low-budget devices like smartphones. In this paper, we introduce a lightweight, learning-based method for garment dynamic super-resolution, designed to efficiently enhance high-resolution, high-frequency details in low-resolution garment simulations. Starting with low-resolution garment simulation and underlying body motion, we utilize a mesh-graph-net to compute super-resolution features based on coarse garment dynamics and garment-body interactions. These features are then used by a hyper-net to construct an implicit function of detailed wrinkle residuals for each coarse mesh triangle. Considering the influence of coarse garment shapes on detailed wrinkle performance, we correct the coarse garment shape and predict detailed wrinkle residuals using these implicit functions. Finally, we generate detailed high-resolution garment geometry by applying the detailed wrinkle residuals to the corrected coarse garment. Our method enables roll-out prediction by iteratively using its predictions as input for subsequent frames, producing fine-grained wrinkle details to enhance the low-resolution simulation. Despite training on a small dataset, our network robustly generalizes to different body shapes, motions, and garment types not present in the training data. We demonstrate significant improvements over state-of-the-art alternatives, particularly in enhancing the quality of high-frequency, fine-grained wrinkle details. Code and data are released at https://github.com/MengZephyr/Neural-Garment-Dynamic-Super-resolution/

Abstract:
Differentiable rendering is a key ingredient for inverse rendering and machine learning, as it allows scene parameters (shape, materials, lighting) to be optimized to best fit target images. Differentiable rendering requires that each scene parameter relates to pixel values through differentiable operations. While 3D mesh rendering algorithms have been implemented in a differentiable way, these algorithms do not directly extend to Constructive-Solid-Geometry (CSG), a popular parametric representation of shapes, because the underlying boolean operations are typically performed with complex black-box mesh-processing libraries. We present an algorithm, DiffCSG, to render CSG models in a differentiable manner. Our algorithm builds upon CSG rasterization, which displays the result of boolean operations between primitives without explicitly computing the resulting mesh and, as such, bypasses black-box mesh processing. We describe how to implement CSG rasterization within a differentiable rendering pipeline, taking special care to apply antialiasing along primitive intersections to obtain gradients in such critical areas. Our algorithm is simple and fast, can be easily incorporated into modern machine learning setups, and enables a range of applications for computer-aided design, including direct and image-based editing of CSG primitives. Code and data: https://yyyyyhc.github.io/DiffCSG/.

Abstract:
We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation.
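
The vector-quantization step at the core of a VQ-VAE can be sketched as a nearest-neighbor lookup in a codebook. This shows only the generic quantization operation, under the assumption of squared Euclidean distance; the sparse convolutional encoder and decoder described above are not modeled, and the name `quantize` is hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    best_index, best_dist = -1, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    if best_index < 0:
        raise ValueError("codebook must not be empty")
    return best_index, list(codebook[best_index])
```

After quantization, each latent cell is represented by a discrete code index, which is the compressed space a latent diffusion model would then operate in.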

Abstract:
The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives. Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance. In this work, we propose Neural Parametric Gaussian Avatars (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings. We build our method around 3D Gaussian splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. In contrast to previous work, we condition our avatars’ dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we distill the backward deformation field of our underlying NPHM into forward deformations which are compatible with rasterization-based rendering. All remaining fine-scale, expression-dependent details are learned from the multi-view videos. For increased representational capacity of our avatars, we propose per-Gaussian latent features that condition each primitive’s dynamic behavior. To regularize this increased dynamic expressivity, we propose Laplacian terms on the latent features and predicted dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by ≈ 2.6 dB PSNR. Furthermore, we demonstrate accurate animation capabilities from real-world monocular videos.

Abstract:
We present a new approach to creating photorealistic and relightable head avatars from a phone scan with unknown illumination. The reconstructed avatars can be animated and relit in real time with the global illumination of diverse environments. Unlike existing approaches that estimate parametric reflectance parameters via inverse rendering, our approach directly models learnable radiance transfer that incorporates global light transport in an efficient manner for real-time rendering. However, learning such a complex light transport that can generalize across identities is non-trivial. A phone scan in a single environment lacks sufficient information to infer how the head would appear in general environments. To address this, we build a universal relightable avatar model represented by 3D Gaussians. We train on hundreds of high-quality multi-view human scans with controllable point lights. High-resolution geometric guidance further enhances the reconstruction accuracy and generalization. Once trained, we finetune the pretrained model on a phone scan using inverse rendering to obtain a personalized relightable avatar. Our experiments establish the efficacy of our design, outperforming existing approaches while retaining real-time rendering capability.

Abstract:
We present a spatial and angular Gaussian based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images. To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian. To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron. To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple. The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage. We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU. Our results compare favorably with state-of-the-art techniques in terms of quality/performance.

Abstract:
Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.

Abstract:
High-quality eyelid reconstruction and animation are challenging because of the subtle details and complicated deformations involved. Previous works usually suffer from the trade-off between the capture costs and the quality of details. In this paper, we propose a novel method that can achieve detailed eyelid reconstruction and animation by only using an RGB video captured by a mobile phone. Our method utilizes both static and dynamic information of eyeballs (e.g., positions and rotations) to assist the eyelid reconstruction, cooperating with an automatic eyeball calibration method to get the required eyeball parameters. Furthermore, we develop a neural eyelid control module to achieve the semantic animation control of eyelids. To the best of our knowledge, we present the first method for high-quality eyelid reconstruction and animation from lightweight captures. Extensive experiments on both synthetic and real data show that our method can provide more detailed and realistic results compared with previous methods based on the same-level capture setups. The code is available at https://github.com/StoryMY/AniEyelid.

Abstract:
Achieving high efficiency in modern photorealistic rendering hinges on using Monte Carlo sampling distributions that closely approximate the illumination integral estimated for every pixel. Samples are typically generated from a set of simple distributions, each targeting a different factor in the integrand, which are combined via multiple importance sampling. The resulting mixture distribution can be far from the actual product of all factors, leading to sub-optimal variance even for direct-illumination estimation. We present a learning-based method that uses normalizing flows to efficiently importance sample illumination product integrals, e.g., the product of environment lighting and material terms. Our sampler composes a flow head warp with an emitter tail warp. The small conditional head warp is represented by a neural spline flow, while the large unconditional tail is discretized per environment map and its evaluation is instant. If the conditioning is low-dimensional, the head warp can be also discretized to achieve even better performance. We demonstrate variance reduction over prior methods on a range of applications comprising complex geometry, materials and illumination.
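
The baseline that this abstract argues against, combining per-factor sampling strategies with multiple importance sampling, can be illustrated in one dimension. The sketch below estimates the integral of a product f(x) = x²(1−x)² on [0, 1] using two densities, each proportional to one factor, combined with the balance heuristic. The toy integrand, the densities, and the name `mis_estimate` are assumptions for illustration and do not come from the paper.

```python
import random

def f(x):
    """Toy product integrand: one factor peaks at x=1, the other at x=0."""
    return x * x * (1.0 - x) * (1.0 - x)

def p1(x):
    return 3.0 * x * x                  # density proportional to the first factor

def p2(x):
    return 3.0 * (1.0 - x) * (1.0 - x)  # density proportional to the second factor

def mis_estimate(n, seed=0):
    """Balance-heuristic MIS estimate of the integral of f over [0, 1],
    drawing n samples from each of the two strategies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x1 = rng.random() ** (1.0 / 3.0)         # inverse-CDF sample from p1
        x2 = 1.0 - rng.random() ** (1.0 / 3.0)   # inverse-CDF sample from p2
        # With equal sample counts, the balance heuristic reduces to
        # dividing each sample's contribution by the sum of the densities.
        total += f(x1) / (p1(x1) + p2(x1))
        total += f(x2) / (p1(x2) + p2(x2))
    return total / n
```

The exact value of this integral is 1/30. The effective sampling density here is the average of `p1` and `p2`, which is largest near the endpoints, whereas the product integrand peaks at x = 0.5; this mismatch is the kind of sub-optimality that sampling the product directly is meant to remove.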

Abstract:
We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person’s identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on.
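
The abstract does not give the formula for split classifier-free guidance. As a rough illustration of the general idea of weighting conditioning inputs separately, the sketch below implements a multi-condition guidance rule in which conditioning groups are added one at a time and each increment gets its own weight. This specific form, and the name `split_guidance`, are assumptions and should not be read as the paper's definition.

```python
def split_guidance(predictions, weights):
    """Combine denoiser predictions with one guidance weight per condition group.

    predictions -- list of equal-length vectors; predictions[0] is the
                   unconditional prediction and predictions[i] is conditioned
                   on the first i condition groups (assumed ordering)
    weights     -- weights[i - 1] scales the change introduced by group i
    """
    if len(predictions) != len(weights) + 1:
        raise ValueError("need exactly one weight per conditioning group")
    guided = list(predictions[0])
    for i, w in enumerate(weights, start=1):
        for k in range(len(guided)):
            guided[k] += w * (predictions[i][k] - predictions[i - 1][k])
    return guided
```

With all weights equal to 1 the sum telescopes to the fully conditioned prediction, and with all weights equal to 0 it returns the unconditional one; raising a single weight emphasizes only the corresponding conditioning group.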

Abstract:
Large text-to-video (T2V) models such as Sora have the potential to revolutionize visual effects and the creation of some types of movies. Current T2V models require tedious trial-and-error experimentation to achieve desired results, however. This motivates the search for methods to directly control desired attributes. In this work, we take a step toward this goal, introducing a method for high-level, temporally-coherent control over the basic trajectories and appearance of objects. Our algorithm, TrailBlazer, allows the general positions and (optionally) appearance of objects to be controlled simply by keyframing approximate bounding boxes and (optionally) their corresponding prompts. Importantly, our method does not require a pre-existing control video signal that already contains an accurate outline of the desired motion, yet the synthesized motion is surprisingly natural with emergent effects including perspective and movement toward the virtual camera as the box size increases. The method is efficient, making use of a pre-trained T2V model and requiring no training or fine-tuning, with negligible additional computation. Specifically, the bounding box controls are used as soft masks to guide manipulation of the self-attention and cross-attention modules in the video diffusion model. While our visual results are limited by those of the underlying model, the algorithm may generalize to future models that use standard self- and cross-attention components.
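
Keyframing approximate bounding boxes implies deriving a box for every frame between keyframes. A minimal sketch follows, assuming linear interpolation between neighboring keyframes and holding the nearest keyframe outside their range; the interpolation scheme and the name `interpolate_boxes` are assumptions for illustration, and the attention manipulation itself is not modeled.

```python
def interpolate_boxes(keyframes, num_frames):
    """Return one (x0, y0, x1, y1) box per frame.

    keyframes  -- dict mapping frame index to a box in normalized coordinates
    num_frames -- total number of frames in the clip
    """
    frames = sorted(keyframes)
    if not frames:
        raise ValueError("at least one keyframe is required")
    boxes = []
    for f in range(num_frames):
        if f <= frames[0]:
            boxes.append(tuple(keyframes[frames[0]]))    # hold first keyframe
        elif f >= frames[-1]:
            boxes.append(tuple(keyframes[frames[-1]]))   # hold last keyframe
        else:
            for a, b in zip(frames, frames[1:]):
                if a <= f <= b:
                    t = (f - a) / (b - a)                # position between keyframes
                    box_a, box_b = keyframes[a], keyframes[b]
                    boxes.append(tuple(u + t * (v - u)
                                       for u, v in zip(box_a, box_b)))
                    break
    return boxes
```

A box that grows over time, for example from a small keyframe to a large one, corresponds to the case described above in which the object appears to move toward the virtual camera.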

Abstract:
Animating various character drawings is an engaging visual content creation task. Given a single character drawing, existing animation methods are limited to flat 2D motions and thus lack 3D effects. An alternative solution is to reconstruct a 3D model from a character drawing as a proxy and then retarget 3D motion data onto it. However, existing image-to-3D methods do not work well for amateur character drawings in terms of appearance and geometry. We observe that the contour lines commonly found in character drawings introduce significant ambiguity in texture synthesis due to their view-dependence. Additionally, thin regions represented by single-line contours (e.g., slim limbs of a stick figure) are difficult to reconstruct due to their delicate structures. To address these issues, we propose a novel system, DrawingSpinUp, to produce plausible 3D animations and breathe life into character drawings, allowing them to freely spin up, leap, and even perform a hip-hop dance. For appearance improvement, we adopt a removal-then-restoration strategy to first remove the view-dependent contour lines and then render them back after retargeting the reconstructed character. For geometry refinement, we develop a skeleton-based thinning deformation algorithm to refine the slim structures represented by the single-line contours. The experimental evaluations and a perceptual user study show that our proposed method outperforms the existing 2D and 3D animation methods and generates high-quality 3D animations from a single character drawing. Please refer to our project page (https://lordliang.github.io/DrawingSpinUp) for the code and generated animations.

Abstract:
The recent developments in neural fields have brought phenomenal capabilities to the field of shape generation, but they lack crucial properties, such as incremental control, a fundamental requirement for artistic work. Triangular meshes, on the other hand, are the representation of choice for most geometry-related tasks, offering efficiency and intuitive control, but do not lend themselves to neural optimization. To support downstream tasks, previous art typically proposes a two-step approach, where first, a shape is generated using neural fields, and then a mesh is extracted for further processing. Instead, in this paper, we introduce a hybrid approach that consistently maintains both a mesh and a Signed Distance Field (SDF) representation. Using this representation, we introduce MagicClay, a tool for sculpting regions of a mesh according to textual prompts while keeping other regions untouched. Our method is designed to be compatible with existing mesh sculpting workflows. The user sculpts the desired shape using the existing brushes and our pipeline then evolves the geometry and triangulation of the selected mesh part according to the given textual prompt. This process operates on the original mesh while preserving its meta-data. Our framework carefully and efficiently balances consistency between the representations and regularizations in every step of the shape optimization. Relying on the mesh representation, we show how to render the SDF at higher resolutions and faster. In addition, we employ recent work in differentiable mesh reconstruction to adaptively allocate triangles in the mesh where required, as indicated by the SDF. Using an implemented prototype, we demonstrate superior generated geometry compared to the state-of-the-art and novel consistent control, allowing sequential prompt-based edits to the same mesh for the first time. We will release the code upon acceptance.

Abstract:
We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity. To address these challenges, Follow-Your-Emoji equips the powerful Stable Diffusion model with two well-designed techniques. Specifically, we first adopt a new explicit motion signal, namely the expression-aware landmark, to guide the animation process. We find that this landmark not only ensures accurate motion alignment between the reference portrait and the target motion during inference but also increases the ability to portray exaggerated expressions (e.g., large pupil movements) and avoids identity leakage. Then, we propose a facial fine-grained loss that uses both expression and facial masks to improve the model’s ability to perceive subtle expressions and reconstruct the appearance of the reference portrait. Accordingly, our method demonstrates significant performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals. By leveraging a simple and effective progressive generation strategy, we extend our model to stable long-term animation, thus increasing its potential application value. To address the lack of a benchmark for this field, we introduce EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. We show extensive evaluations on EmojiBench to verify the superiority of Follow-Your-Emoji. The code, training dataset and benchmark can be found at https://github.com/mayuelala/FollowYourEmoji.

Abstract:
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods, however, suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (e.g., aging, face swapping, and digital make-up). In this paper, we propose a method for high-precision 3D face capture taking advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and high-fidelity mesh inference.

Abstract:
Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera’s field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features.

Abstract:
We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting.
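
As a rough illustration of decoding several SDF values from one tri-plane, the sketch below samples each of the three planes at a point's projections, sums the features, and applies one small head per part. Summation as the aggregation rule, the linear heads, and all function names are assumptions made for this example; the abstract does not describe the decoder.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a (R, R, C) feature plane at normalized coords u, v in [0, 1]."""
    res = plane.shape[0]
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x1]
    bottom = (1 - fx) * plane[y1, x0] + fx * plane[y1, x1]
    return (1 - fy) * top + fy * bottom

def triplane_features(planes, point):
    """Sum the features of a 3D point projected onto the XY, XZ, and YZ planes."""
    x, y, z = point
    return (sample_plane(planes["xy"], x, y)
            + sample_plane(planes["xz"], x, z)
            + sample_plane(planes["yz"], y, z))

def decode_sdfs(features, heads):
    """One linear head per part; heads has shape (num_parts, C + 1), bias in the last column."""
    return heads[:, :-1] @ features + heads[:, -1]
```

Each entry of the returned vector is the signed distance of the query point to one semantic part, so the parts can be extracted as separate surfaces.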

Abstract:
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts. Traditional portrait video editing methods often struggle with 3D and temporal consistency, and typically lack in rendering quality and efficiency. To address these issues, we lift the portrait video frames to a unified dynamic 3D Gaussian field, which ensures structural and temporal coherence across frames. Furthermore, we design a novel Neural Gaussian Texture mechanism that not only enables sophisticated style editing but also achieves rendering speed over 100FPS. Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models. Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates. Extensive experiments demonstrate the temporal consistency, editing efficiency, and superior rendering quality of our method. The broad applicability of the proposed approach is demonstrated through various applications, including text-driven editing, image-driven editing, and relighting, highlighting its great potential to advance the field of video editing. Demo videos and released code are provided in our project page: https://ustc3dv.github.io/PortraitGen/

Abstract:
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality.
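
The sliced Wasserstein loss mentioned above can be sketched as follows: project both feature sets onto random unit directions, sort each one-dimensional projection, and compare the sorted values. This is a generic NumPy version that assumes equally sized feature sets; the feature extractor, the number of projections, and the function name are illustrative and not taken from the paper.

```python
import numpy as np

def sliced_wasserstein(feat_a, feat_b, num_projections=64, seed=0):
    """Squared sliced Wasserstein-2 distance between two feature sets of shape (N, C)."""
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    if feat_a.shape != feat_b.shape:
        raise ValueError("this sketch assumes equally sized feature sets")
    rng = np.random.default_rng(seed)
    # Random unit directions, one per column.
    dirs = rng.normal(size=(feat_a.shape[1], num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # In 1D, optimal transport between equal-size sets matches sorted samples.
    proj_a = np.sort(feat_a @ dirs, axis=0)
    proj_b = np.sort(feat_b @ dirs, axis=0)
    return float(np.mean((proj_a - proj_b) ** 2))
```

Because only the sorted projections are compared, the loss ignores the spatial arrangement of the features and depends only on their distribution, which is a property that makes it a natural fit for style matching.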

Abstract:
For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight the speaker’s persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing the speaker’s unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, including geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive the speaker’s template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With our innovative design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy and persona preservation. Furthermore, as a person-generic framework, PersonaTalk can achieve performance competitive with state-of-the-art person-specific methods. Project Page: https://grisoon.github.io/PersonaTalk/.

Abstract:
We present a technique for dynamically projecting 3D content onto human hands with short perceived motion-to-photon latency. Computing the pose and shape of human hands accurately and quickly is a challenging task due to their articulated and deformable nature. We combine a slower 3D coarse estimation of the hand pose with high-speed 2D correction steps which improve the alignment of the projection to the hands, increase the projected surface area, and reduce perceived latency. Since our approach leverages a full 3D reconstruction of the hands, any arbitrary texture or reasonably performant effect can be applied, which was not possible before. We conducted two user studies to assess the benefits of using our method. The results show that subjects are less sensitive to latency artifacts and perform an associated task faster and with more ease than with the naïve approach of directly projecting rendered frames from the 3D pose estimation. We demonstrate several novel use cases and applications.

Abstract:
Emerging holographic display technology offers unique capabilities for next-generation virtual reality systems. Current holographic near-eye displays, however, only support a small étendue, which results in a direct tradeoff between achievable field of view and eyebox size. Étendue expansion has recently been explored, but existing approaches are either fundamentally limited in the image quality that can be achieved or they require extremely high-speed spatial light modulators. We describe a new étendue expansion approach that combines multiple coherent sources with content-adaptive amplitude modulation of the hologram spectrum in the Fourier plane. To generate time-multiplexed phase and amplitude patterns for our spatial light modulators, we devise a pupil-aware gradient-descent-based computer-generated holography algorithm that is supervised by a large-baseline target light field. Compared with relevant baseline approaches, ours demonstrates significant improvements in image quality and étendue in simulation and with an experimental holographic display prototype.

Abstract:
Optical motion capture (MoCap) is the "gold standard" for accurately capturing full-body motions. To make use of raw MoCap point data, the system labels the points with corresponding body part locations and solves the full-body motions. However, MoCap data often contains mislabeling, occlusion and positional errors, requiring extensive manual correction. To alleviate this burden, we introduce RoMo, a learning-based framework for robustly labeling and solving raw optical motion capture data. In the labeling stage, RoMo employs a divide-and-conquer strategy to break down the complex full-body labeling challenge into manageable subtasks: alignment, full-body segmentation and part-specific labeling. To utilize the temporal continuity of markers, RoMo generates marker tracklets using a K-partite graph-based clustering algorithm, where markers serve as nodes, and edges are formed based on positional and feature similarities. For motion solving, to prevent error accumulation along the kinematic chain, we introduce a hybrid inverse kinematic solver that utilizes joint positions as intermediate representations and adjusts the template skeleton to match estimated joint positions. We demonstrate that RoMo achieves high labeling and solving accuracy across multiple metrics and various datasets. Extensive comparisons show that our method outperforms state-of-the-art research methods. On a real dataset, RoMo improves the F1 score of hand labeling from 0.94 to 0.98, and reduces joint position error of body motion solving by 25%. Furthermore, RoMo can be applied in scenarios where commercial systems are inadequate. The code and data for RoMo are available at https://github.com/non-void/RoMo.

Abstract:
We introduce FabricDiffusion, a method for transferring fabric textures from a single clothing image to 3D garments of arbitrary shapes. Existing approaches typically synthesize textures on the garment surface through 2D-to-3D texture mapping or depth-aware inpainting via generative models. Unfortunately, these methods often struggle to capture and preserve texture details, particularly due to challenging occlusions, distortions, or poses in the input image. Inspired by the observation that in the fashion industry, most garments are constructed by stitching sewing patterns with flat, repeatable textures, we cast the task of clothing texture transfer as extracting distortion-free, tileable texture materials that are subsequently mapped onto the UV space of the garment. Building upon this insight, we train a denoising diffusion model with a large-scale synthetic dataset to rectify distortions in the input texture image. This process yields a flat texture map that enables a tight coupling with existing Physically-Based Rendering (PBR) material generation pipelines, allowing for realistic relighting of the garment under various lighting conditions. We show that FabricDiffusion can transfer various features from a single clothing image including texture patterns, material properties, and detailed prints and logos. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods on both synthetic data and real-world, in-the-wild clothing images while generalizing to unseen textures and garment shapes.

Abstract:
Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. Towards this end, in this study, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration allows dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality. Our approach demonstrates unprecedented flexibility, enabling text-guided image manipulation without the need for text-specific training data and facilitating seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.

Abstract:
Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages. Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director’s script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output. To assess the effectiveness of our framework, we collect varied short narratives and incorporate various image/video evaluation metrics including visual consistency and video quality. The experimental results and case studies demonstrate the Anim-Director’s versatility and significant potential to streamline animation creation.

Abstract:
In this paper, we propose SRIF, a novel Semantic shape Registration framework based on diffusion-based Image morphing and Flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multiple views, and then utilize an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are later fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process intermediate point clouds that respect the image morphing process. In the end, tailored for the above, we propose a novel registration module to estimate a continuous normalizing flow, which deforms the source shape consistently towards the target, with intermediate point clouds as weak guidance. Our key insight is to leverage large vision models (LVMs) to associate shapes and therefore obtain much richer semantic information on the relationship between shapes than ad-hoc feature extraction and alignment. As a consequence, SRIF not only achieves high-quality dense correspondences on challenging shape pairs, but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence justifies the effectiveness and superiority of our method as well as specific design choices. The code is released at https://github.com/rqhuang88/SRIF.

Abstract:
Motion Planning (MP) is a critical challenge in robotics, especially pertinent with the burgeoning interest in embodied artificial intelligence. Traditional MP methods often struggle with high-dimensional complexities. Recently neural motion planners, particularly physics-informed neural planners based on the Eikonal equation, have been proposed to overcome the curse of dimensionality. However, these methods perform poorly in complex scenarios with shaped robots due to multiple solutions inherent in the Eikonal equation. To address these issues, this paper presents PC-Planner, a novel physics-constrained self-supervised learning framework for robot motion planning with various shapes in complex environments. To this end, we propose several physical constraints, including monotonic and optimal constraints, to stabilize the training process of the neural network with the Eikonal equation. Additionally, we introduce a novel shape-aware distance field that considers the robot’s shape for efficient collision checking and Ground Truth (GT) speed computation. This field reduces the computational intensity, and facilitates adaptive motion planning at test time. Experiments in diverse scenarios with different robots demonstrate the superiority of the proposed method in efficiency and robustness for robot motion planning, particularly in complex environments.
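
For reference, the Eikonal equation is commonly written in the form below. The notation is an assumption made for illustration and does not come from the abstract: $T(q_s, q)$ denotes the arrival time from a start configuration $q_s$ to a configuration $q$, and $S(q)$ is a speed field that is small near obstacles.

```latex
\left\lVert \nabla_{q}\, T(q_s, q) \right\rVert = \frac{1}{S(q)}
```

A path can then be traced by following the gradient of $T$, which is why a well-behaved solution for $T$ matters for planning.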

Abstract:
Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Manipulating features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by the phrase Monkey See, Monkey Do, relating to human mimicry. Our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training. Webpage: https://monkeyseedocg.github.io.

Abstract:
The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored. Traditional stroke-based rendering methods break down images into sequences of brushstrokes, yet they fall short of replicating the authentic processes of artists, with limitations confined to basic brushstroke modifications. Text-to-image models, which use diffusion processes to generate images through iterative denoising, also diverge substantially from artists’ painting processes. To address these challenges, we introduce ProcessPainter, a text-to-video model that is initially pre-trained on synthetic data and subsequently fine-tuned with a select set of artists’ painting sequences using low-rank adaptation (LoRA). This approach successfully generates painting processes from text prompts for the first time. Furthermore, we introduce an Artwork Replication Network capable of accepting arbitrary-frame input, which facilitates the controlled generation of painting processes, decomposing images into painting sequences, and completing semi-finished artworks. This paper offers new perspectives and tools for advancing art education and image generation technology. Our code is available at: https://github.com/nicolaus-huang/ProcessPainter.
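
For readers unfamiliar with LoRA, the idea is to keep a pre-trained weight matrix frozen and learn only a low-rank update to it. The sketch below shows the forward pass for a single linear layer in NumPy; the function name and shapes are illustrative, and which layers ProcessPainter adapts is not stated in the abstract.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus the low-rank update (alpha / r) * B @ A.

    x: (batch, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    Only A and B would be trained; W stays fixed.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Initializing `B` to zeros makes the adapted layer start out identical to the pre-trained one, which is the usual convention.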

Abstract:
To improve novel view synthesis of curved-surface reflections and refractions, we revisit local geometry-guided ray interpolation techniques with modern differentiable rendering and optimization. In contrast to depth or mesh geometries, our approach uses a local or per-view density represented as Gaussian mixtures along each ray. To synthesize novel views, we warp and fuse local volumes, then alpha-composite using input photograph ray colors from a small set of neighboring images. For fusion, we use a neural blending weight from a shallow MLP. We optimize the local Gaussian density mixtures using both a reconstruction loss and a consistency loss. The consistency loss, based on per-ray KL-divergence, encourages more accurate geometry reconstruction. In scenes with complex reflections captured in our LGDM dataset, the experimental results show that our method outperforms state-of-the-art novel view synthesis methods by 12.2%–37.1% in PSNR, due to its ability to maintain sharper view-dependent appearances. Project webpage: https://xchaowu.github.io/papers/lgdm/index.html

Abstract:
We present a method for prediction of a person’s hairstyle from a single image. Despite growing use cases in user digitization and enrollment for virtual experiences, available methods are limited, particularly in the range of hairstyles they can capture. Human hair is extremely diverse and lacks any universally accepted description or categorization, making this a challenging task. Most current methods rely on parametric models of hair at a strand level. These approaches, while very promising, are not yet able to represent short, frizzy, coily hair and gathered hairstyles. We instead choose a classification approach which can represent the diversity of hairstyles required for a truly robust and inclusive system. Previous classification approaches have been restricted by poorly labeled data that lacks diversity, imposing constraints on the usefulness of any resulting enrollment system. We use only synthetic data to train our models. This allows for explicit control of diversity of hairstyle attributes, hair colors, facial appearance, poses, environments and other parameters. It also produces noise-free ground-truth labels. We introduce a novel hairstyle taxonomy developed in collaboration with a diverse group of domain experts which we use to balance our training data, supervise our model, and directly measure fairness. We annotate our synthetic training data and a real evaluation dataset using this taxonomy and release both to enable comparison of future hairstyle prediction approaches. We employ an architecture based on a pre-trained feature extraction network in order to improve generalization of our method to real data and predict taxonomy attributes as an auxiliary task to improve accuracy. Results show our method to be significantly more robust for challenging hairstyles than recent parametric approaches. Evaluation with taxonomy-based metrics also demonstrates the fairness of our method across diverse hairstyles.

Abstract:
We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model’s efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.
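
Because light transport is linear, an image under complex lighting can be approximated as a weighted sum of images lit by individual directional lights. The sketch below illustrates this standard image-based relighting step, assuming linear-radiance inputs and per-light RGB weights (for instance, sampled from an environment map). The function name is hypothetical, and the paper's actual composition procedure may differ.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Weighted sum of one-light-at-a-time images.

    olat_images:   (num_lights, H, W, 3) linear-radiance images, one per light.
    light_weights: (num_lights, 3) RGB intensity assigned to each light.
    Returns an (H, W, 3) relit image.
    """
    olat_images = np.asarray(olat_images, dtype=float)
    light_weights = np.asarray(light_weights, dtype=float)
    # Scale each light's image per color channel, then sum over lights.
    return np.einsum("lhwc,lc->hwc", olat_images, light_weights)
```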

Abstract:
Neural Radiance Fields (NeRF) have demonstrated exceptional capabilities in reconstructing complex scenes with high fidelity. However, NeRF’s view dependency can only handle low-frequency reflections. It falls short when dealing with complex planar reflections, often interpreting them as erroneous scene geometries and leading to duplicated and inaccurate scene representations. To address this challenge, we introduce a planar reflection-aware NeRF that jointly models planar reflectors, such as windows, and explicitly casts reflected rays to capture the source of the high-frequency reflections. We query a single radiance field to render the primary color and the source of the reflection. We propose a sparse edge regularization to help utilize the true source of reflections for rendering planar reflections rather than creating a duplicate along the primary ray at the same depth. As a result, we obtain accurate scene geometry. Rendering along the primary ray results in a clean, reflection-free view, while explicitly rendering along the reflected ray allows us to reconstruct highly detailed reflections. Our extensive quantitative and qualitative evaluations on real-world datasets demonstrate our method’s performance in accurately handling reflections.
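
As an illustration of the reflected-ray casting described above, the following is a minimal geometric sketch, not the authors' implementation. It assumes the reflector is an ideal plane given by a unit normal and an offset; the function name `reflect_ray` and the plane parameterization are hypothetical choices made for this example.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def reflect_ray(origin: Vec3, direction: Vec3,
                normal: Vec3, offset: float) -> Optional[Tuple[Vec3, Vec3]]:
    """Reflect a ray about the plane {x : normal . x + offset = 0}.

    `normal` is assumed to be unit length. Returns the hit point and the
    mirrored direction, or None if the ray never reaches the plane.
    """
    denom = dot(direction, normal)
    if abs(denom) < 1e-12:          # ray is parallel to the plane
        return None
    t = -(dot(origin, normal) + offset) / denom
    if t <= 0.0:                    # plane lies behind the ray origin
        return None
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    # Mirror the direction: r = d - 2 (d . n) n
    reflected = tuple(d - 2.0 * denom * n for d, n in zip(direction, normal))
    return hit, reflected


# A ray travelling down and to the right onto the plane z = 0.
print(reflect_ray((0.0, 0.0, 1.0), (1.0, 0.0, -1.0), (0.0, 0.0, 1.0), 0.0))
```

In a pipeline of this kind, the radiance field would then be queried along the reflected ray starting at the hit point, while the primary ray gives the reflection-free color.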

Abstract:
This paper introduces a new learning-based method, NASM, for anisotropic surface meshing. Our key idea is to propose a graph neural network that embeds an input mesh into a high-dimensional (high-d) Euclidean embedding space and preserves the curvature-based anisotropic metric by using a dot product loss between high-d edge vectors. This can dramatically reduce the computational time and increase the scalability. Then, we propose a novel feature-sensitive remeshing on the generated high-d embedding to automatically capture sharp geometric features. We define a high-d normal metric, and then derive an automatic differentiation on a high-d centroidal Voronoi tessellation (CVT) optimization with the normal metric to simultaneously preserve the geometric features and curvature anisotropy exhibited in the original 3D shapes. To our knowledge, this is the first time that a deep learning framework and a large dataset are proposed to construct a high-d Euclidean embedding space for 3D anisotropic surface meshing. Experimental results are evaluated and compared with the state of the art in anisotropic surface meshing on a large number of surface models from the Thingi10K dataset, and tested on extensive unseen 3D shapes from the Multi-Garment Network dataset and the FAUST human dataset.

Abstract:
High-fidelity simulation of fluid dynamics is challenging because of the high dimensional state data needed to capture fine details and the large computational cost associated with advancing the system in time. We present neural implicit reduced fluid simulation (NIRFS), a reduced fluid simulation technique that combines an implicit neural representation of fluid shapes and a neural ordinary differential equation to model the dynamics of fluid in the reduced latent space. The latent trajectories are computed at very little cost in comparison to simulations for training, while preserving fine physical details. We show that this approach can work well, capturing the shapes and dynamics involved in a variety of scenarios with constrained initial conditions, e.g., droplet-droplet collisions, crown splashes, and fluid slosh in a container. In each scenario, we learn the latent implicit representation of fluid shapes with a deep-network signed distance function, as well as the energy function and parameters of a damped Hamiltonian system, which helps guarantee desirable properties of the latent dynamics. To ensure that latent shape representations form smooth and physically meaningful trajectories, we simultaneously learn the latent representation and dynamics. We evaluate novel simulations for conservation of volume and momentum, discuss design decisions, and demonstrate an application of our method to fluid control.
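
To make the damped Hamiltonian latent dynamics concrete, here is a minimal sketch under strong assumptions: a two-dimensional latent state (q, p), a fixed quadratic energy H = (q² + p²)/2 standing in for the learned energy network, a scalar damping coefficient, and a hand-written RK4 integrator in place of a neural ODE solver. None of these choices come from the paper; they only show why such a structure keeps the energy from growing.

```python
def energy(q: float, p: float) -> float:
    # Stand-in for the learned energy function (assumption): H = (q^2 + p^2) / 2
    return 0.5 * (q * q + p * p)


def grad_energy(q: float, p: float):
    # (dH/dq, dH/dp) for the quadratic stand-in above
    return q, p


def rhs(q: float, p: float, damping: float):
    """Damped Hamiltonian vector field: dq/dt = dH/dp, dp/dt = -dH/dq - c * dH/dp."""
    dh_dq, dh_dp = grad_energy(q, p)
    return dh_dp, -dh_dq - damping * dh_dp


def rk4_step(q: float, p: float, dt: float, damping: float):
    k1 = rhs(q, p, damping)
    k2 = rhs(q + 0.5 * dt * k1[0], p + 0.5 * dt * k1[1], damping)
    k3 = rhs(q + 0.5 * dt * k2[0], p + 0.5 * dt * k2[1], damping)
    k4 = rhs(q + dt * k3[0], p + dt * k3[1], damping)
    q_new = q + dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
    p_new = p + dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
    return q_new, p_new


def rollout(q0: float, p0: float, dt: float, steps: int, damping: float):
    """Integrate the latent trajectory and return all visited states."""
    states = [(q0, p0)]
    for _ in range(steps):
        states.append(rk4_step(*states[-1], dt, damping))
    return states


trajectory = rollout(1.0, 0.0, dt=0.01, steps=1000, damping=0.2)
print(energy(*trajectory[0]), energy(*trajectory[-1]))
```

Along exact trajectories of this system, dH/dt = -c (dH/dp)² ≤ 0, so the energy can only decay. That is one example of the kind of property a damped Hamiltonian structure can guarantee.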

Abstract:
Given an input painting, we reconstruct a time-lapse video of how it may have been painted. We formulate this as an autoregressive image generation problem, in which an initially blank “canvas” is iteratively updated. The model learns from real artists by training on many painting videos. Our approach incorporates text and region understanding to define a set of painting “instructions” and updates the canvas with a novel diffusion-based renderer. The method extrapolates beyond the limited, acrylic style paintings on which it has been trained, showing plausible results for a wide range of artistic styles and genres.

Abstract:
The focus of this paper is 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The challenges include the lack of training data and the design of a model that faithfully edits the source motion. In this paper, we address both these challenges. We build a methodology to semi-automatically collect a dataset of triplets in the form of (i) a source motion, (ii) a target motion, and (iii) an edit text, and create the new MotionFix dataset. Having access to such data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We further build various baselines trained only on text-motion pair datasets, and show superior performance of our model trained on triplets. We introduce new retrieval-based metrics for motion editing, and establish a new benchmark on the evaluation set of MotionFix. Our results are encouraging, paving the way for further research on fine-grained motion generation. Code, models and data are available at our project website.

Abstract:
The advent of the digital age has driven the development of coherent optical modems—devices that modulate the amplitude and phase of light in multiple polarization states. These modems transmit data through fiber optic cables that are thousands of kilometers in length at data rates exceeding one terabit per second. This remarkable technology is made possible through near-THz-rate programmable control and sensing of the full optical wavefield. While coherent optical modems form the backbone of telecommunications networks around the world, their extraordinary capabilities also provide unique opportunities for imaging. Here, we repurpose off-the-shelf coherent optical modems to introduce full-wavefield lidar: a type of random modulation continuous wave lidar that simultaneously measures depth, axial velocity, and polarization. We demonstrate this modality by combining a 74 GHz-bandwidth coherent optical modem with free-space coupling optics and scanning mirrors. We develop a time-resolved image formation model for this system and formulate a maximum-likelihood reconstruction algorithm to recover depth, velocity, and polarization information at each scene point from the modem’s raw transmitted and received symbols. Compared to existing lidars, full-wavefield lidar promises improved mm-scale ranging accuracy from brief, microsecond exposure times, reliable velocimetry, and robustness to interference from ambient light or other lidar signals.

Abstract:
This paper introduces a novel approach to synthesize texture to dress up a 3D object, given a text prompt. Based on the pre-trained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. However, this tends to generate inconsistent texture due to the asynchronous diffusion of multiple views. We believe that such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistent artifacts. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes from different views to reach a consensus on the generated content early in the process, and hence ensures the texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically by blending the latent content in the texture domain from overlapping views. Our method demonstrates superior performance in generating consistent, seamless and highly detailed textures, compared to state-of-the-art methods.

Abstract:
This paper introduces a new formulation for material homogenization of thin-shell microstructures. It addresses important challenges that limit the quality of previous approaches: methods that fit the energy response neglect visual impact, methods that fit the stress response are not conservative, and all of them are limited to a low-dimensional interplay between deformation modes. The new formulation is rooted in the following design principles: the material energy functions are conservative by definition, they are formulated on the high-dimensional membrane and bending domain to capture the complex interplay of the different deformation modes, the material function domain is maximally aligned with the training data, and the material parameters and the optimization are formulated on stress instead of energy for better correlation with visual impact. The key novelty of our formulation is a new type of high-order RBF interpolant for polar coordinates, which allows us to fulfill all the design principles. We design a material function using this novel interpolant, as well as an overall homogenization workflow. Our results demonstrate very accurate fitting of diverse microstructure behaviors, both quantitatively and qualitatively superior to previous work.

Abstract:
We introduce a novel adaptive eigenvalue filtering strategy to stabilize and accelerate the optimization of Neo-Hookean energy and its variants under the Projected Newton framework. For the first time, we show that Newton’s method, Projected Newton with eigenvalue clamping and Projected Newton with absolute eigenvalue filtering can be unified using ideas from the generalized trust region method. Based on the trust-region fit, our model adaptively chooses the correct eigenvalue filtering strategy to apply during the optimization. Our method is simple but effective, requiring only two lines of code change in the existing Projected Newton framework. We validate that our method outperforms stand-alone variants across a number of experiments on quasistatic simulation of deformable solids over a large dataset.
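
The two filtering strategies named above can be sketched on a small symmetric Hessian block. This is a minimal sketch, not the paper's code: it uses a closed-form eigendecomposition of a 2×2 symmetric matrix to stay dependency-free, and the selection rule in `choose_mode` (including the threshold `rho_min`) is a hypothetical placeholder for the paper's trust-region criterion.

```python
import math


def eig_sym2(h):
    """Eigenvalues and orthonormal eigenvectors of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = h
    mean, diff = 0.5 * (a + c), 0.5 * (a - c)
    r = math.hypot(diff, b)
    theta = 0.5 * math.atan2(b, diff)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return (mean + r, mean - r), (v1, v2)


def filter_hessian(h, mode, eps=1e-8):
    """Rebuild the matrix after filtering its eigenvalues.

    mode == "clamp": negative eigenvalues are raised to eps.
    mode == "abs":   negative eigenvalues are replaced by their absolute value.
    """
    lams, vecs = eig_sym2(h)
    if mode == "clamp":
        lams = [max(lam, eps) for lam in lams]
    elif mode == "abs":
        lams = [max(abs(lam), eps) for lam in lams]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [[sum(lam * v[i] * v[j] for lam, v in zip(lams, vecs))
             for j in range(2)] for i in range(2)]


def choose_mode(rho, rho_min=0.5):
    """Hypothetical rule: trust the quadratic model (clamp) when the
    trust-region ratio rho is high, otherwise fall back to abs filtering."""
    return "clamp" if rho >= rho_min else "abs"


indefinite = [[1.0, 0.0], [0.0, -2.0]]
print(filter_hessian(indefinite, choose_mode(rho=0.9)))
print(filter_hessian(indefinite, choose_mode(rho=0.1)))
```

In an existing Projected Newton loop, only the per-element eigenvalue filter would change, which is consistent with the small code change the abstract mentions.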

Abstract:
Measured Bidirectional Texture Function (BTF) can faithfully reproduce a realistic appearance but is costly to acquire and store due to its 6D nature (2D spatial and 4D angular). Therefore, it is practical and necessary for rendering to synthesize BTFs from a small example patch. While previous methods managed to produce plausible results, we find that they seldom consider dynamic synthesis, so a BTF must be synthesized before the rendering process, resulting in limited size, costly pre-generation and storage issues. In this paper, we propose a dynamic BTF synthesis scheme, where a BTF at any position only needs to be synthesized when being queried. Our insight is that, with the recent advances in neural dimension reduction methods, a BTF can be decomposed into disjoint low-dimensional components. We can perform dynamic synthesis only on the positional dimensions, and during rendering, recover the BTF by querying and combining these low-dimensional functions with the help of a lightweight Multilayer Perceptron (MLP). Consequently, we obtain a fully dynamic 6D BTF synthesis scheme that does not require any pre-generation, which enables efficient rendering of our infinitely large and non-repetitive BTFs on the fly. We demonstrate the effectiveness of our method through various types of BTFs taken from UBO2014 [Weinmann et al. 2014].

Abstract:
Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding “top-view”) to achieve coarse view control. In this work, we introduce a new task – enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object’s properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object’s appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object’s identity while following the target object viewpoint and the text prompt.

Abstract:
Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a single prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.

Abstract:
Neural Radiance Fields (NeRFs) typically struggle to reconstruct and render highly specular objects, whose appearance varies quickly with changes in viewpoint. Recent works have improved NeRF’s ability to render detailed specular appearance of distant environment illumination, but are unable to synthesize consistent reflections of closer content. Moreover, these techniques rely on large computationally-expensive neural networks to model outgoing radiance, which severely limits optimization and rendering speed. We address these issues with an approach based on ray tracing: instead of querying an expensive neural network for the outgoing view-dependent radiance at points along each camera ray, our model casts reflection rays from these points and traces them through the NeRF representation to render feature vectors which are decoded into color using a small inexpensive network. We demonstrate that our model outperforms prior methods for view synthesis of scenes containing shiny objects, and that it is the only existing NeRF method that can synthesize photorealistic specular appearance and reflections in real-world scenes, while requiring comparable optimization time to current state-of-the-art view synthesis models.

Abstract:
The household rearrangement task involves spotting misplaced objects in a scene and moving them to proper places. It depends both on common-sense knowledge on the objective side and human user preference on the subjective side. To achieve such a task, we propose to mine object functionality with user preference alignment directly from the scene itself, without relying on human intervention. To do so, we work with a scene graph representation and propose LLM-enhanced scene graph learning, which transforms the input scene graph into an affordance-enhanced graph (AEG) with information-enhanced nodes and newly discovered edges (relations). In the AEG, the nodes corresponding to receptacle objects are augmented with context-induced affordance, which encodes what kind of carriable objects can be placed on them. New edges encode newly discovered non-local relations. With the AEG, we perform task planning for scene rearrangement by detecting misplaced carriables and determining a proper placement for each of them. We test our method by implementing a tidying robot in a simulator and evaluate it on a new benchmark we build. Extensive evaluations demonstrate that our method achieves state-of-the-art performance in misplacement detection and the subsequent rearrangement planning.

Abstract:
Synthesizing human motions in 3D environments, particularly those with complex activities such as locomotion, hand-reaching, and Human-Object Interaction (HOI), presents substantial demands for user-defined waypoints and stage transitions. These requirements pose challenges for current models, leading to a notable gap in automating the animation of characters from simple human inputs. This paper addresses this challenge by introducing a comprehensive framework for synthesizing multi-stage scene-aware interaction motions directly from a single text instruction and goal location. Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, along with an autonomous scheduler predicting the transition for each action stage. To ensure that the synthesized motions are seamlessly integrated within the environment, we propose a scene representation that considers the local perception both at the start and the goal location. We further enhance the coherence of the generated motion by integrating frame embeddings with language input. Additionally, to support model training, we present a comprehensive motion-captured (MoCap) dataset comprising 16 hours of motion sequences in 120 indoor scenes covering 40 types of motions, each annotated with precise language descriptions. Experimental results demonstrate the efficacy of our method in generating high-quality, multi-stage motions closely aligned with environmental and textual conditions.

Abstract:
Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag’s capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.

Abstract:
Motion style transfer changes the style of a motion while retaining its content and is useful in computer animations and games. Contact is an essential component of motion style transfer that should be controlled explicitly in order to express the style vividly while enhancing motion naturalness and quality. However, it is unknown how to decouple and control contact to achieve fine-grained control in motion style transfer. In this paper, we present a novel style transfer method for fine-grained control over contacts while achieving both motion naturalness and spatial-temporal variations of style. Based on our empirical evidence, we propose controlling contact indirectly through the hip velocity, which can be further decomposed into the trajectory and contact timing, respectively. To this end, we propose a new model that explicitly models the correlations between motions and trajectory/contact timing/style, allowing us to decouple and control each separately. Our approach is built around a motion manifold, where hip controls can be easily integrated into a Transformer-based decoder. It is versatile in that it can generate motions directly as well as be used as post-processing for existing methods to improve quality and contact controllability. In addition, we propose a new metric that measures a correlation pattern of motions based on our empirical evidence, aligning well with human perception in terms of motion naturalness. Based on extensive evaluation, our method outperforms existing methods in terms of style expressivity and motion quality.

Abstract:
3D Gaussian Splatting (3DGS), an explicit 3D representation, has achieved high-quality reconstruction and real-time rendering of complex scenes. However, the rasterization pipeline still suffers from unnecessary overhead resulting from avoidable serial Gaussian culling, and from uneven load due to the differing number of Gaussians per pixel, which hinders wider promotion and application. To accelerate Gaussian splatting, we propose AdR-Gaussian, which employs an adaptive radius to narrow the rendering pixel range for each Gaussian, and introduces a load balancing method to minimize thread waiting time during pixel-parallel rendering. Our contributions are threefold, achieving a rendering speed of 310% while maintaining equal or even better quality than the state of the art. Firstly, we propose an adaptive radius, which reduces the number of affected tiles through the Gaussian bounding circle, thus reducing unnecessary overhead and achieving faster rendering speed. Secondly, we propose an axis-aligned bounding box for Gaussian splatting, which achieves a more significant reduction in ineffective expenses by accurately calculating the Gaussian size in the 2D directions. Thirdly, we propose a balancing algorithm for pixel thread load, which compresses the information of heavy-load pixels to reduce thread waiting time, and enhances the information of light-load pixels to offset quality loss. Experiments on datasets demonstrate that our algorithm can significantly improve the Gaussian Splatting rendering speed.
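
The bounding-circle and axis-aligned bounding-box ideas can be illustrated with a short calculation. This is a sketch under assumptions, not the AdR-Gaussian code: a splat with projected 2D covariance Σ and opacity o is taken to contribute α(d) = o · exp(-½ dᵀΣ⁻¹d) at pixel offset d, and pixels with α below a cutoff `alpha_min` are treated as negligible. The default cutoff of 1/255 and the function name `footprint` are choices made for this example.

```python
import math


def footprint(cov, opacity, alpha_min=1.0 / 255.0):
    """Opacity-aware extent of a projected 2D Gaussian.

    cov is (sxx, sxy, syy), the projected covariance. Returns
    (radius, half_width, half_height), or None when the splat never
    reaches alpha_min anywhere and can be culled outright.
    """
    sxx, sxy, syy = cov
    if opacity <= alpha_min:
        return None
    # alpha(d) >= alpha_min  <=>  d^T Sigma^-1 d <= c
    c = 2.0 * math.log(opacity / alpha_min)
    # The largest eigenvalue of Sigma gives the bounding circle of that ellipse.
    mean = 0.5 * (sxx + syy)
    lam_max = mean + math.hypot(0.5 * (sxx - syy), sxy)
    radius = math.sqrt(c * lam_max)
    # Tight axis-aligned half extents of the ellipse d^T Sigma^-1 d = c.
    half_width = math.sqrt(c * sxx)
    half_height = math.sqrt(c * syy)
    return radius, half_width, half_height


# An elongated splat: the box is much tighter than the circle along y.
print(footprint((4.0, 0.0, 1.0), opacity=1.0))
# A faint splat covers far fewer pixels than an opaque one.
print(footprint((4.0, 0.0, 1.0), opacity=0.01))
```

The design point is that a fixed multiple of the standard deviation ignores opacity, while the radius above shrinks as the splat gets fainter, and the box shrinks further for anisotropic splats.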

Abstract:
Neural shape representations, such as the neural signed distance field (NSDF), have become increasingly popular in shape modeling owing to their ability to handle complex topology and arbitrary resolution. Because features are used implicitly for shape representation, manipulating the shapes is inherently inconvenient, since the features cannot be intuitively edited. In this work, we propose the neural generalized cylinder (NGC) for explicit manipulation of NSDF, which is an extension of the traditional generalized cylinder (GC). Specifically, we first define a central curve and assign neural features along the curve to represent the profiles. The NSDF is then defined on the relative coordinates of a specialized GC with oval-shaped profiles. By using the relative coordinates, the NSDF can be explicitly controlled via manipulation of the GC. We apply NGC to many non-rigid deformation tasks, such as complex curved deformation, local scaling, and twisting of shapes. Comparisons with other methods on shape deformation demonstrate the effectiveness and efficiency of NGC. Furthermore, NGC can use the neural features for shape blending through simple neural feature interpolation.

Abstract:
Generating diverse and realistic human motion that can physically interact with an environment remains a challenging research area in character animation. Meanwhile, diffusion-based methods, as proposed by the robotics community, have demonstrated the ability to capture highly diverse and multi-modal skills. However, naively training a diffusion policy often results in unstable motions for high-frequency, under-actuated control tasks like bipedal locomotion due to rapidly accumulating compounding errors, pushing the agent away from optimal training trajectories. The key idea lies in using RL policies not just for providing optimal trajectories but for providing corrective actions in sub-optimal states which gives the policy a chance to correct for errors caused by environmental stimulus, model errors, or numerical errors in simulation. Our method, Physics-Based Character Animation via Diffusion Policy (PDP), combines reinforcement learning (RL) and behavior cloning (BC) to create a robust diffusion policy for physics-based character animation. We demonstrate PDP on perturbation recovery, universal motion tracking, and physics-based text-to-motion synthesis.

Abstract:
Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating "votes" for each symmetry but struggle with noise. In contrast, learning-based methods may be more robust to noise, but often overlook partial symmetries due to the scarcity of annotated data. In this work, we address this challenge by proposing a novel symmetry detection method that marries classical symmetry detection techniques with recent advances in generative modeling. Specifically, we apply Langevin dynamics to a redefined symmetry space to enhance robustness against noise. We provide empirical results on a variety of shapes that suggest our method is not only robust to noise, but can also identify both partial and global symmetries. Moreover, we demonstrate the utility of our detected symmetries in various downstream tasks, such as compression and symmetrization of noisy shapes.
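
For readers unfamiliar with Langevin dynamics, the update rule is x ← x + (ε/2) ∇log p(x) + √ε · z with z drawn from a standard normal. The sketch below applies it to a toy one-dimensional density, an equal-weight Gaussian mixture whose modes stand in for clusters of symmetry "votes". The density, its parameters and the `noise_scale` switch are assumptions for illustration; the paper's symmetry space and score model are not reproduced here.

```python
import math
import random


def mixture_score(x, centers, sigma):
    """d/dx log p(x) for an equal-weight 1D Gaussian mixture."""
    logits = [-(x - c) ** 2 / (2.0 * sigma ** 2) for c in centers]
    m = max(logits)                               # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w / total * (c - x) / sigma ** 2 for w, c in zip(weights, centers))


def langevin(x0, centers, sigma, step, n_steps, rng, noise_scale=1.0):
    """Unadjusted Langevin dynamics; noise_scale=0 reduces to gradient ascent."""
    x = x0
    for _ in range(n_steps):
        drift = 0.5 * step * mixture_score(x, centers, sigma)
        noise = noise_scale * math.sqrt(step) * rng.gauss(0.0, 1.0)
        x = x + drift + noise
    return x


rng = random.Random(0)
centers = [-2.0, 2.0]
samples = [langevin(rng.uniform(-3.0, 3.0), centers, 0.5, 0.01, 500, rng)
           for _ in range(5)]
print(samples)
```

Noisy starting points are pulled toward high-density regions, which is the intuition behind using such dynamics to make vote aggregation more robust to noise.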

Abstract:
Immersive VR telepresence ideally means being able to interact and communicate with digital avatars that are indistinguishable from and precisely reflect the behaviour of their real counterparts. The core technical challenge is twofold: creating a digital double that faithfully reflects the real human, and tracking the real human solely from egocentric sensing devices that are lightweight and have a low energy consumption, e.g. a single RGB camera. To date, no unified solution to this problem exists, as recent works solely focus on egocentric motion capture, only model the head, or build avatars from multi-view captures. In this work, we, for the first time in the literature, propose a person-specific egocentric telepresence approach, which jointly models the photoreal digital avatar while also driving it from a single egocentric video. We first present a character model that is animatable, i.e. can be solely driven by skeletal motion, while being capable of modeling geometry and appearance. Then, we introduce a personalized egocentric motion capture component, which recovers full-body motion from an egocentric video. Finally, we apply the recovered pose to our character model and perform a test-time mesh refinement such that the geometry faithfully projects onto the egocentric view. To validate our design choices, we propose a new and challenging benchmark, which provides paired egocentric and dense multi-view videos of real humans performing various motions. Our experiments demonstrate a clear step towards egocentric and photoreal telepresence as our method outperforms baselines as well as competing methods. For more details, code, and data, we refer to our project page.

Abstract:
Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Because they consider only textures instead of materials, these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. They also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.

Abstract:
3D Gaussian Splatting (3DGS) has transformed novel-view synthesis with its fast, interpretable, and high-fidelity rendering. However, its resource requirements limit its usability. Especially on constrained devices, training performance degrades quickly and often cannot complete due to excessive memory consumption of the model. The method converges with an indefinite number of Gaussians—many of them redundant—making rendering unnecessarily slow and preventing its usage in downstream tasks that expect fixed-size inputs. To address these issues, we tackle the challenges of training and rendering 3DGS models on a budget. We use a guided, purely constructive densification process that steers densification toward Gaussians that raise the reconstruction quality. Model size continuously increases in a controlled manner towards an exact budget, using score-based densification of Gaussians with training-time priors that measure their contribution. We further address training speed obstacles: following a careful analysis of 3DGS’ original pipeline, we derive faster, numerically equivalent solutions for gradient computation and attribute updates, including an alternative parallelization for efficient backpropagation. We also propose quality-preserving approximations where suitable to reduce training time even further.

Abstract:
Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose PairCustomization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.

Abstract:
We propose a novel technique for adding geometric details to an input coarse 3D mesh guided by a text prompt. Our method is composed of three stages. First, we generate a single-view RGB image conditioned on the input coarse geometry and the input text prompt. This single-view image generation step allows the user to pre-visualize the result and offers stronger conditioning for subsequent multi-view generation. Second, we use our novel multi-view normal generation architecture to jointly generate six different views of the normal images. The joint view generation reduces inconsistencies and leads to sharper details. Third, we optimize our mesh with respect to all views and generate a fine, detailed geometry as output. The resulting method produces an output within seconds and offers explicit user control over the coarse structure, pose, and desired details of the resulting 3D mesh.

Abstract:
Low-resolution quantized imagery, such as pixel art, is seeing a revival in modern applications ranging from video game graphics to digital design and fabrication, where creativity is often bound by a limited palette of elemental units. Despite their growing popularity, the automated generation of quantized images from raw inputs remains a significant challenge, often necessitating intensive manual input. We introduce SD-π XL, an approach for producing quantized images that employs score distillation sampling in conjunction with a differentiable image generator. Our method enables users to input a prompt and optionally an image for spatial conditioning, set any desired output size H × W, and choose a palette of n colors or elements. Each color corresponds to a distinct class for our generator, which operates on an H × W × n tensor. We adopt a softmax approach, computing a convex sum of elements, thus rendering the process differentiable and amenable to backpropagation. We show that employing Gumbel-softmax reparameterization allows for crisp pixel art effects. Unique to our method is the ability to transform input images into low-resolution, quantized versions while retaining their key semantic features. Our experiments validate SD-π XL’s performance in creating visually pleasing and faithful representations, consistently outperforming the current state-of-the-art. Furthermore, we showcase SD-π XL’s practical utility in fabrication through its applications in interlocking brick mosaic, beading and embroidery design.
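
The softmax and Gumbel-softmax steps described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the temperature parameter `tau`, and the NumPy setting (no score distillation, no gradients) are all assumptions made for the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def render_soft(logits, palette, tau=1.0):
    """Convex sum of palette entries per pixel.

    logits:  (H, W, n) per-pixel class scores (hypothetical generator state)
    palette: (n, 3) colors, one per class
    """
    weights = softmax(logits / tau, axis=-1)   # (H, W, n), rows sum to 1
    return weights @ palette                   # (H, W, 3)

def render_gumbel(logits, palette, tau=1.0, rng=None):
    """Same convex sum, but with Gumbel noise added to the logits.

    A low temperature tau pushes the weights toward one-hot selections.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))               # standard Gumbel(0, 1) samples
    weights = softmax((logits + gumbel) / tau, axis=-1)
    return weights @ palette

def render_hard(logits, palette):
    """Final quantized image: every pixel takes exactly one palette entry."""
    return palette[logits.argmax(axis=-1)]
```

Because every pixel is a convex combination of palette colors, the soft image always stays inside the palette's color range, while `render_hard` yields the strictly quantized result.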

Abstract:
Multiple importance sampling (MIS) is an indispensable tool in rendering that constructs robust sampling strategies by combining the respective strengths of individual distributions. Its efficiency can be greatly improved by carefully selecting the number of samples drawn from each distribution, but automating this process remains a challenging problem. Existing works are mostly limited to mixture sampling, in which only a single sample is drawn in total, and the works that do investigate multi-sample MIS only optimize the sample counts at a per-pixel level, which cannot account for variations beyond the first bounce. Recent work on Russian roulette and splitting has demonstrated how fixed-point schemes can be used to spatially vary sample counts to optimize image efficiency but is limited to choosing the same number of samples across all sampling strategies. Our work proposes a highly flexible sample allocation strategy that bridges the gap between these areas of work. We show how to iteratively optimize the sample counts to maximize the efficiency of the rendered image using a lightweight data structure, which allows us to make local and individual decisions per technique. We demonstrate the benefits of our approach in two applications, path guiding and bidirectional path tracing, in both of which we achieve consistent and substantial speedups over the respective previous state-of-the-art.
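
To make the role of the per-technique sample counts concrete, the following sketch shows a standard multi-sample MIS estimator with the balance heuristic on a 1D toy integral. It does not implement the allocation strategy proposed in the abstract; the function name, the toy integrand, and the two sampling techniques are assumptions chosen for illustration.

```python
import numpy as np

def mis_estimate(f, samplers, pdfs, counts, rng):
    """Multi-sample MIS estimate of an integral using the balance heuristic.

    With weights w_i(x) = n_i p_i(x) / sum_k n_k p_k(x), the estimator
    sum_i (1/n_i) sum_j w_i(x_ij) f(x_ij) / p_i(x_ij)
    simplifies to sum_i sum_j f(x_ij) / sum_k n_k p_k(x_ij).
    """
    total = 0.0
    for sample, n_i in zip(samplers, counts):
        if n_i == 0:
            continue  # a technique with zero samples contributes nothing
        x = sample(rng, n_i)
        denom = sum(n_k * p_k(x) for n_k, p_k in zip(counts, pdfs))
        total += np.sum(f(x) / denom)
    return total

# Toy problem (assumed for illustration): integrate f(x) = x^2 over [0, 1].
f = lambda x: x ** 2
samplers = [
    lambda rng, n: 1.0 - rng.random(n),           # uniform on (0, 1]
    lambda rng, n: np.sqrt(1.0 - rng.random(n)),  # density 2x on (0, 1]
]
pdfs = [
    lambda x: np.ones_like(x),
    lambda x: 2.0 * x,
]
```

Calling `mis_estimate(f, samplers, pdfs, counts=(1000, 3000), rng=np.random.default_rng(0))` returns an estimate close to 1/3. Changing `counts` changes the variance but not the expected value, which is the degree of freedom a sample allocation scheme optimizes.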

Abstract:
Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the diffusion backwards process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks - the “edit-friendly” DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.

Abstract:
Radiance fields are powerful and, hence, popular models for representing the appearance of complex scenes. Yet, constructing them based on image observations gives rise to ambiguities and uncertainties. We propose a versatile approach for learning Gaussian radiance fields with explicit and fine-grained uncertainty estimates that impose only little additional cost compared to uncertainty-agnostic training. Our key observation is that uncertainties can be modeled as a low-dimensional manifold in the space of radiance field parameters that is highly amenable to Monte Carlo sampling. Importantly, our uncertainties are differentiable and, thus, allow for gradient-based optimization of subsequent captures that optimally reduce ambiguities. We demonstrate state-of-the-art performance on next-best-view planning tasks, including high-dimensional illumination planning for optimal radiance field relighting quality.

Abstract:
We introduce Lumiere – a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution – an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Abstract:
The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction to align basic motion patterns with the original video, and Appearance Refinement for precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework’s superior performance in fine-grained video editing, proving its capability to produce high-quality, temporally consistent outputs.

Abstract:
Single-view 3D hair reconstruction is challenging, due to the wide range of shape variations among diverse hairstyles. Current state-of-the-art methods are specialized in recovering un-braided 3D hairs and often take braided styles as their failure cases, because of the inherent difficulty of defining priors for complex hairstyles, whether rule-based or data-based. We propose a novel strategy to enable single-view 3D reconstruction for a variety of hair types via a unified pipeline. To achieve this, we first collect a large-scale synthetic multi-view hair dataset, SynMvHair, with diverse 3D hair in both braided and un-braided styles, and learn two diffusion priors specialized on hair. Then we optimize 3D Gaussian-based hair from the priors with two specially designed modules, i.e., view-wise and pixel-wise Gaussian refinement. Our experiments demonstrate that reconstructing braided and un-braided 3D hair from single-view images via a unified approach is possible and that our method achieves state-of-the-art performance in recovering complex hairstyles. It is worth mentioning that our method generalizes well to real images, although it learns hair priors from synthetic data. Code and data are available at https://unihair24.github.io

Abstract:
We present a simple algorithm for differentiable rendering of surfaces represented by Signed Distance Fields (SDF), which makes it easy to integrate rendering into gradient-based optimization pipelines. To tackle visibility-related derivatives that make rendering non-differentiable, existing physically based differentiable rendering methods often rely on elaborate guiding data structures or reparameterization with a global impact on variance. In this article, we investigate an alternative that embraces nonzero bias in exchange for low variance and architectural simplicity. Our method expands the lower-dimensional boundary integral into a thin band that is easy to sample when the underlying surface is represented by an SDF. We demonstrate the performance and robustness of our formulation in end-to-end inverse rendering tasks, where it obtains results that are competitive with or superior to existing work.

Abstract:
We introduce a neuro-symbolic transformer-based model that converts flat, segmented facade structures into procedural definitions using a custom-designed split grammar. To facilitate this, we first develop a semi-complex split grammar tailored for architectural facades and then generate a dataset comprising facades alongside their corresponding procedural representations. This dataset is used to train our transformer model to convert segmented, flat facades into the procedural language of our grammar. During inference, the model applies this learned transformation to new facade segmentations, providing a procedural representation that users can adjust to generate varied facade designs. This method not only automates the conversion of static facade images into dynamic, editable procedural formats but also enhances design flexibility, allowing for easy modifications.

Abstract:
Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to a lack of spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task. Blindly manipulating layout inputs of the localized model tends to cause low editing performance due to the intrinsic entanglement of object representation in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects and adopt the self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply a DDPM self-attention bucketing that can better reconstruct real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study.

Abstract:
Position-based Dynamics (PBD) and its extension, eXtended Position-based Dynamics (XPBD), have been predominantly applied to compliant constrained elastodynamics, with their potential in finite strain (visco-) elastoplasticity remaining underexplored. XPBD is often perceived to stand in contrast to other meshless methods, such as the Material Point Method (MPM). MPM is based on discretizing the weak form of governing partial differential equations within a continuum domain, coupled with a hybrid Lagrangian-Eulerian method for tracking deformation gradients. In contrast, XPBD formulates specific constraints, whether hard or compliant, to positional degrees of freedom. We revisit this perception by investigating the potential of XPBD in handling inelastic materials that are described with classical continuum mechanics-based yield surfaces and elastoplastic flow rules. Our inspiration is that a robust estimation of the velocity gradient is a sufficiently useful key to effectively tracking deformation gradients in XPBD simulations. By further incorporating implicit inelastic constitutive relationships, we introduce a plasticity in-the-loop updated Lagrangian augmentation to XPBD. This enhancement enables the simulation of elastoplastic, viscoplastic, and granular substances following their standard constitutive laws. We demonstrate the effectiveness of our method through high-resolution and real-time simulations of diverse materials such as snow, sand, and plasticine, and its integration with standard XPBD simulations of cloth and water.

Abstract:
Computing maximum/minimum distances between 3D meshes is crucial for various applications, e.g., robotics, CAD, and VR/AR. In this work, we introduce a highly parallel algorithm (gDist) optimized for Graphics Processing Units (GPUs), which is capable of computing the distance between two meshes with over 15 million triangles in less than 0.4 milliseconds (Fig. 1). Tested on benchmarks with varying characteristics, the algorithm achieves remarkable speedups over prior CPU-based and GPU-based algorithms on a commodity GPU (NVIDIA GeForce RTX 4090). Notably, the algorithm consistently maintains high-speed performance, even in challenging scenarios that pose difficulties for prior algorithms.

Abstract:
Meshes are ubiquitous in visual computing and simulation, yet most existing machine learning techniques represent meshes only indirectly, e.g. as the level set of a scalar field or deformation of a template, or as a disordered triangle soup lacking local structure. This work presents a scheme to directly generate manifold, polygonal meshes of complex connectivity as the output of a neural network. Our key innovation is to define a continuous latent connectivity space at each mesh vertex, which implies the discrete mesh. In particular, our vertex embeddings generate cyclic neighbor relationships in a halfedge mesh representation, which gives a guarantee of edge-manifoldness and the ability to represent general polygonal meshes. This representation is well-suited to machine learning and stochastic optimization, without restriction on connectivity or topology. We first explore the basic properties of this representation, then use it to fit distributions of meshes from large datasets. The resulting models generate diverse meshes with tessellation structure learned from the dataset population, with concise details and high-quality mesh elements. In applications, this approach not only yields high-quality outputs from generative models, but also enables directly learning challenging geometry processing tasks such as mesh repair.

Abstract:
Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to their reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD’s effectiveness in various applications such as image editing and inpainting.

Abstract:
We introduce DiffH2O, a new diffusion-based framework for synthesizing realistic, dexterous hand-object interactions from natural language. Our model employs a temporal two-stage diffusion process, dividing hand-object motion generation into grasping and interaction stages to enhance generalization to various object shapes and textual prompts. To improve generalization to unseen objects and increase output controllability, we propose grasp guidance, which directs the diffusion model towards a target grasp, seamlessly connecting the grasping and interaction stages through a motion imputation mechanism. We demonstrate the practical value of grasp guidance using hand poses extracted from images or grasp synthesis methods. Additionally, we provide detailed textual descriptions for the GRAB dataset, enabling fine-grained text-based control of the model output. Our quantitative and qualitative evaluations show that DiffH2O generates realistic hand-object motions from natural language, generalizes to unseen objects, and significantly outperforms existing methods on a standard benchmark and in perceptual studies.

Abstract:
Challenging to capture, and challenging to display on a cellphone screen, the panorama paradoxically remains both a staple and underused feature of modern mobile camera applications. In this work we address both of these challenges with a spherical neural light field model for implicit panoramic image stitching and re-rendering, able to accommodate depth parallax, view-dependent lighting, and local scene motion and color changes during capture. Fit during test-time to an arbitrary path panoramic video capture – vertical, horizontal, random-walk – these neural light spheres jointly estimate the camera path and a high-resolution scene reconstruction to produce novel wide field-of-view projections of the environment. Our single-layer model avoids expensive volumetric sampling, and decomposes the scene into compact view-dependent ray offset and color components, with a total model size of 80 MB per scene, and real-time (50 FPS) rendering at 1080p resolution. We demonstrate improved reconstruction quality over traditional image stitching and radiance field methods, with significantly higher tolerance to scene motion and non-ideal capture settings.

Abstract:
In this paper, we present an advanced approach to solving the inverse rig problem in blendshape animation, using high-quality corrective blendshapes. Our algorithm focuses on three key areas: ensuring high data fidelity in reconstructed meshes, achieving greater sparsity in weight distributions, and facilitating smoother frame-to-frame transitions. While the incorporation of corrective terms is a known practice, our method differentiates itself by employing a unique combination of ℓ1 norm regularization for sparsity and a temporal smoothness constraint through a roughness penalty, focusing on the sum of second differences in consecutive frame weights. A significant innovation in our approach is the temporal decoupling of blendshapes, which permits simultaneous optimization across entire animation sequences. This feature sets our work apart from existing methods and contributes to a more efficient and effective solution. Our algorithm exhibits a marked improvement in maintaining data fidelity and ensuring smooth frame transitions when compared to prior approaches that either lack smoothness regularization or rely solely on linear blendshape models. In addition to superior mesh resemblance and smoothness, our method offers practical benefits, including reduced computational complexity and execution time, achieved through a novel parallelization strategy using clustering methods. Our results not only advance the state-of-the-art in terms of fidelity, sparsity, and smoothness in inverse rigging but also introduce significant efficiency improvements.
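
The three terms named above (data fidelity, sparsity, and temporal roughness) can be written as a single objective over a whole sequence. The sketch below is a simplified illustration under stated assumptions: it uses a purely linear blendshape model without the corrective terms, squares the second differences, and introduces the hypothetical names `inverse_rig_objective`, `lam_sparse`, and `lam_smooth`; none of these are taken from the paper.

```python
import numpy as np

def inverse_rig_objective(W, B, targets, lam_sparse=0.1, lam_smooth=1.0):
    """Objective for fitting blendshape weights over an entire sequence.

    W:       (T, m) weights, one row per frame
    B:       (d, m) blendshape basis (linear model only, an assumption)
    targets: (T, d) target mesh offsets, one row per frame
    """
    # Data fidelity: squared error between reconstructed and target meshes.
    residual = W @ B.T - targets
    data = np.sum(residual ** 2)

    # Sparsity: l1 norm of all weights.
    sparsity = np.sum(np.abs(W))

    # Roughness: squared second differences of the weights across frames,
    # w[t+1] - 2 w[t] + w[t-1], which vanish for weights linear in time.
    second_diff = W[2:] - 2.0 * W[1:-1] + W[:-2]
    roughness = np.sum(second_diff ** 2)

    return data + lam_sparse * sparsity + lam_smooth * roughness
```

Weights that vary linearly in time incur no roughness penalty, whereas a single-frame spike is penalized at three consecutive second differences, which is what discourages frame-to-frame jitter.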

Abstract:
Traditional character animation specializes in characters with a rigidly articulated skeleton and a bipedal/quadrupedal morphology. This assumption simplifies the design of physically based animations, like locomotion, but comes at the price of excluding characters of arbitrary deformable geometries. To remedy this, we propose a spatio-temporal actuation subspace built off the natural vibrations of the character geometry. We show this actuation subspace is well suited for designing natural locomotion, without requiring user-provided guidance keyframes as is common in prior work. The resulting actuation is coupled to a reduced fast soft body simulation, allowing us to optimize locomotion for a wide variety of high-resolution deformable characters.

Abstract:
Joint injuries, and their long-term consequences, present a substantial global health burden. Wearable prophylactic braces are an attractive potential solution to reduce the incidence of joint injuries by limiting joint movements that are related to injury risk. Given human motion and ground reaction forces, we present a computational framework that enables the design of personalized braces by optimizing the distribution of microstructures and elasticity. As varied brace designs yield different reaction forces that influence kinematics and kinetics analysis outcomes, the optimization process is formulated as a differentiable end-to-end pipeline in which the design domain of microstructure distribution is parameterized onto a neural network. The optimized distribution of microstructures is obtained via a self-learning process to determine the network coefficients according to a carefully designed set of losses and the integrated biomechanical and physical analyses. Since knees and ankles are the most commonly injured joints, we demonstrate the effectiveness of our pipeline by designing, fabricating, and testing prophylactic braces for the knee and ankle to prevent potentially harmful joint movements.

Abstract:
The conventional mesh-based Level of Detail (LoD) technique, exemplified by applications such as Google Earth and many game engines, exhibits the capability to holistically represent a large scene, even the Earth, and achieves rendering with a space complexity of O(log n). This constrained data requirement not only enhances rendering efficiency but also facilitates dynamic data fetching, thereby enabling a seamless 3D navigation experience for users.

Abstract:
Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we are interested in extending the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose a method we call Dynamic Gaussian Marbles, which consists of three core modifications that target the difficulties of the monocular setting. First, we use isotropic Gaussian “marbles”, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, we employ a hierarchical divide-and-conquer learning strategy to efficiently guide the optimization towards solutions with globally coherent motion. Finally, we add image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, Dynamic Gaussian Marbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that Gaussian Marbles significantly outperforms other Gaussian baselines in quality, and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

Abstract:
We present a method for automatically producing human-like vocal imitations of sounds: the equivalent of “sketching,” but for auditory rather than visual representation. Starting with a simulated model of the human vocal tract, we first try generating vocal imitations by tuning the model’s control parameters to make the synthesized vocalization match the target sound in terms of perceptually-salient auditory features. Then, to better match human intuitions, we apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners. Finally, we show through several experiments and user studies that when we add this type of communicative reasoning to our method, it aligns with human intuitions better than matching auditory features alone does. This observation has broad implications for the study of depiction in computer graphics.

Abstract:
Volumetric modeling and neural radiance field representations have revolutionized 3D face capture and photorealistic novel view synthesis. However, these methods often require hundreds of multi-view input images and are thus inapplicable to cases with less than a handful of inputs. We present a novel volumetric prior on human faces that allows for high-fidelity expressive face modeling from as few as three input views captured in the wild. Our key insight is that an implicit prior trained on synthetic data alone can generalize to extremely challenging real-world identities and expressions and render novel views with fine idiosyncratic details like wrinkles and eyelashes. We leverage a 3D Morphable Face Model to synthesize a large training set, rendering each identity with different expressions, hair, clothing, and other assets. We then train a conditional Neural Radiance Field prior on this synthetic dataset and, at inference time, fine-tune the model on a very sparse set of real images of a single subject. On average, the fine-tuning requires only three inputs to cross the synthetic-to-real domain gap. The resulting personalized 3D model reconstructs strong idiosyncratic facial expressions and outperforms the state-of-the-art in high-quality novel view synthesis of faces from sparse inputs in terms of perceptual and photometric quality.

Abstract:
Video-driven 3D facial animation transfer aims to drive avatars to reproduce the expressions of actors. Existing methods have achieved remarkable results by constraining both geometric and perceptual consistency. However, geometric constraints (like those designed on facial landmarks) are insufficient to capture subtle emotions, while expression features trained on classification tasks lack fine granularity for complex emotions. To address this, we propose FreeAvatar, a robust facial animation transfer method that relies solely on our learned expression representation. Specifically, FreeAvatar consists of two main components: the expression foundation model and the facial animation transfer model. In the first component, we initially construct a facial feature space through a face reconstruction task and then optimize the expression feature space by exploring the similarities among different expressions. Benefiting from training on large amounts of unlabeled facial images and a re-collected expression comparison dataset, our model adapts freely and effectively to any in-the-wild input facial images. In the facial animation transfer component, we propose a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency. To make the entire process differentiable, we employ a trained neural renderer to translate rig parameters into corresponding images. Furthermore, unlike previous methods that require separate decoders for each avatar, we propose a dynamic identity injection module that allows for the joint training of multiple avatars within a single network. The comparisons show that our method achieves prominent performance even without introducing any geometric constraints, highlighting the robustness of our FreeAvatar. Our code will be made publicly available.

Abstract:
We present projected walk on spheres (PWoS), a novel pointwise and discretization-free Monte Carlo solver for surface PDEs with Dirichlet boundaries, as a generalization of the walk on spheres method (WoS) [Muller 1956; Sawhney and Crane 2020]. We adapt the recursive relationship of WoS designed for PDEs in volumetric domains to a volumetric neighborhood around the surface, and at the end of each recursion step, we project the sample point on the sphere back to the surface. We motivate this simple modification to WoS with the theory of the closest point extension used in the closest point method. To define the valid volumetric neighborhood domain for PWoS, we develop strategies to estimate the local feature size of the surface and to compute the distance to the Dirichlet boundaries on the surface extended in their normal directions. We also design a mean value filtering method for PWoS to improve the method’s efficiency when the surface is represented as a polygonal mesh or a point cloud. Finally, we study the convergence of PWoS and demonstrate its application to graphics tasks, including diffusion curves, geodesic distance computation, and wave propagation animation. We show that our method works with various types of surfaces, including a surface of mixed codimension.
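
For readers unfamiliar with the underlying recursion, the sketch below shows the classical walk on spheres estimator that PWoS generalizes, applied to the Laplace equation on the 2D unit disk. It is not PWoS: it contains no surface projection step, no local feature size estimate, and no mean value filtering. The function name, the unit-disk domain, and the parameters `n_walks` and `eps` are assumptions made for the example.

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, n_walks=2000, eps=1e-3, rng=None):
    """Estimate a harmonic function at (x, y) inside the unit disk.

    boundary_value(bx, by) gives the Dirichlet data at a boundary point.
    Each walk repeatedly jumps to a uniformly random point on the largest
    circle centered at the current point that fits inside the domain, and
    stops once it is within eps of the boundary.
    """
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            # Distance from the current point to the boundary of the unit disk.
            r = 1.0 - math.hypot(px, py)
            if r < eps:
                break
            theta = rng.uniform(0.0, 2.0 * math.pi)
            px += r * math.cos(theta)
            py += r * math.sin(theta)
        # Snap to the closest boundary point and read off the Dirichlet value.
        norm = math.hypot(px, py)
        total += boundary_value(px / norm, py / norm)
    return total / n_walks
```

With boundary data u(x, y) = x, which is itself harmonic, the estimate at an interior point should approach that point's x coordinate as the number of walks grows. In PWoS, according to the abstract, each such step would additionally be followed by projecting the sample point back onto the surface.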

Abstract:
The automated synthesis of high-quality 3D gestures from speech holds significant value for virtual humans and gaming. Previous methods primarily focus on synchronizing gestures with speech rhythm, often neglecting semantic gestures. These semantic gestures are sparse and follow a long-tailed distribution across the gesture sequence, making them challenging to learn in an end-to-end manner. Additionally, generating rhythmically aligned gestures that generalize well to in-the-wild speech remains a significant challenge. To address these issues, we introduce SIGGesture, a novel diffusion-based approach for synthesizing realistic gestures that are both high-quality and semantically pertinent. Specifically, we firstly build a robust diffusion-based foundation model for rhythmical gesture synthesis by pre-training it on a collected large-scale dataset with pseudo labels. Secondly, we leverage the powerful generalization capabilities of Large Language Models (LLMs) to generate appropriate semantic gestures for various speech transcripts. Finally, we propose a semantic injection module to infuse semantic information into the synthesized results during the diffusion reverse process. Extensive experiments demonstrate that SIGGesture significantly outperforms existing baselines, exhibiting excellent generalization and controllability.

Abstract:
The perception of flicker has been a prominent concern in illumination and electronic display fields for over a century. Traditional approaches often rely on Critical Flicker Frequency (CFF), primarily suited for high-contrast (full-on, full-off) flicker. To tackle varying contrast flicker, the International Committee for Display Metrology (ICDM) introduced a Temporal Contrast Sensitivity Function TCSF_IDMS within the Information Display Measurements Standard (IDMS). Nevertheless, this standard overlooks crucial parameters: luminance, eccentricity, and area. Existing models incorporating these parameters are inadequate for flicker detection, especially at low spatial frequencies. To address these limitations, we extend the TCSF_IDMS and combine it with a new spatial probability summation model to incorporate the effects of luminance, eccentricity, and area (elaTCSF). We train the elaTCSF on various flicker detection datasets and establish the first variable refresh rate (VRR) flicker detection dataset for further verification. Additionally, we contribute to resolving a longstanding debate on whether flicker is more visible in peripheral vision. We demonstrate how elaTCSF can be used to predict flicker due to low persistence in VR headsets, identify flicker-free VRR operational ranges, and determine flicker sensitivity in lighting design.

Abstract:
We introduce the Stability-Incorporated Neighborhood Graph (SING), a novel density-aware structure designed to capture the intrinsic geometric properties of a point set. We improve upon the spheres-of-influence graph by incorporating additional features to offer more flexibility and control in encoding proximity information and capturing local density variations. Through persistence analysis on our proximity graph, we propose a new clustering technique and explore additional variants incorporating extra features for the proximity criterion. Alongside the detailed analysis and comparison to evaluate its performance on various datasets, our experiments demonstrate that the proposed method can effectively extract meaningful clusters from diverse datasets with variations in density and correlation. Our application scenarios underscore the advantages of the proposed graph over classical neighborhood graphs, particularly in terms of parameter tuning.
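As background, the classical spheres-of-influence graph that SING builds on can be sketched in a few lines. The code below implements only that classical graph, not SING itself: it has no stability or density features and no persistence analysis. It uses the closed variant of the definition, in which touching spheres count as intersecting; definitions differ on that boundary case. Function and variable names are illustrative.

```python
import math
from itertools import combinations


def spheres_of_influence_graph(points):
    """Return (radii, edges) of the classical spheres-of-influence graph.

    Each point gets a sphere whose radius is the distance to its nearest
    neighbor; two points are connected when their spheres intersect.
    Assumes at least two points. Brute force, O(n^2) distance evaluations.
    """
    radii = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        # Closed variant: spheres that merely touch are treated as intersecting.
        if math.dist(points[i], points[j]) <= radii[i] + radii[j]:
            edges.add((i, j))
    return radii, edges
```

For the four collinear points `(0, 0), (1, 0), (10, 0), (11, 0)`, every radius is 1 and the only edges are within the two pairs, so the graph separates the two groups without any user-chosen distance threshold. That parameter-free adaptation to local density is the property the abstract refers to.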

Abstract:
We present a novel multi-camera, multi-modal vision system designed for industrial robotics applications. The system generates high-quality 3D point clouds, with a focus on improving the completeness and reducing hallucinations for collision avoidance across various geometries, materials, and lighting conditions. Our system incorporates several key advancements: (1) a modular and scalable Plenoptic Stereo Vision Unit that captures high-resolution RGB, polarization, and infrared (IR) data for enhanced scene understanding; (2) an Auto-Calibration Routine that enables the seamless addition and automatic registration of multiple stereo units, expanding the system’s capabilities; (3) a Deep Fusion Stereo Architecture - a state-of-the-art deep learning architecture trained fully on synthetic data that effectively fuses multi-baseline and multi-modal data for superior reconstruction accuracy. We demonstrate the impact of each design decision through rigorous testing, showing improved performance across varying lighting, geometry, and material challenges. To benchmark our system, we create an extensive industrial-robotics inspired dataset featuring sub-millimeter accurate ground truth 3D reconstructions of scenes with challenging elements such as sunlight, deep bins, transparency, reflective surfaces, and thin objects. Our system surpasses the performance of state-of-the-art high-resolution structured light on this dataset. We also demonstrate generalization to non-robotics polarization datasets. Interactive visualizations and videos are available at https://www.intrinsic.ai/publications/siggraphasia2024.

Abstract:
Prototyping and small volume production of custom imaging-grade lenses is difficult and expensive, especially for more complex aspherical shapes. Fluidic shaping has recently been proposed as a potential solution: It makes use of the atomic level smoothness of interfaces between liquids, where the shape of the interface can be carefully controlled by boundary conditions, buoyancy control and other physical parameters. If one of the liquids is a resin, its shape can be “frozen” by curing, thus creating a solid optical element. While fluidic shaping is a promising avenue, the shape space generated by this method is currently only described in the form of partial differential equations, which are incompatible with existing lens design processes. Moreover, we show that the existing PDEs are inaccurate for larger curvatures. In this work, we develop a new formulation of the shape space lenses generated by the fluidic shaping technique. It overcomes the inaccuracies of previous models, and, through a differentiable implementation, can be integrated into recent end-to-end optical design pipelines based on differentiable ray tracing. We extensively evaluate the model and the design pipeline with simulations, as well as initial physical prototypes.

Abstract:
Spectral rendering has received increasing attention in recent years. Yet, solutions to define spectral reflectances are mostly limited to uplifting techniques which deterministically augment existing RGB inputs. Only recently has uplifting been able to ensure a certain surface appearance under direct illuminants. Yet, prior work in this area limits artist expressiveness and is not well suited for designing the appearance of a scene, as indirect illumination is ignored entirely.

Abstract:
Prior approaches to the neural rendering of global illumination typically rely on complex network architectures and training strategies to model the global effects. This often leads to impractically high overheads for both training and inference. The neural radiosity technique marks a significant advancement by injecting the radiometric prior into the training process, allowing for efficient modeling of the global radiance fields using a lightweight network and grid-based representations. However, this method encounters difficulties in modeling dynamic scenes, as the high-dimensional feature space quickly becomes unmanageable as the number of varying scene parameters grows. In this work, we extend neural radiosity for variable scenes through a novel neural decomposition method. To achieve this, we first parameterize the animated scene with an explicit vector v, which conditions a high-dimensional radiance field Lθ. We then develop a practical representation for Lθ by decomposing the high-dimensional feature grid into 3D grids, 2D feature planes, and lightweight MLPs. This strategy effectively models the correlation between 3D spatial features and dynamic scene variables, while maintaining a practical memory and computational cost. Experimental results show that our method facilitates efficient dynamic global illumination rendering with practical runtime performance, outperforming previous state-of-the-art techniques with both reduced training and inference costs.

Abstract:
We propose a novel method to reconstruct non-line-of-sight (NLOS) scenes that combines polarization and time-of-flight light transport measurements. Unpolarized NLOS imaging methods reconstruct objects hidden around corners by inverting time-gated indirect light paths measured at a visible relay surface, but fail to reconstruct scene features depending on their position and orientation with respect to such surface. We address this limitation (known as the missing cone problem) by capturing the polarization state of light in time-gated imaging systems at picosecond time resolution, and introducing a novel inversion method that leverages directionality information of polarized measurements to reduce directional ambiguities in the reconstruction. Our method is capable of imaging features of hidden surfaces inside the missing cone space of state-of-the-art NLOS methods, yielding fine reconstruction details even when using a fraction of measured points on the relay surface. We demonstrate the benefits of our method in both simulated and experimental scenarios.

Abstract:
Barycentric coordinates are widely used in computer graphics, especially in shape deformation. Traditionally, barycentric coordinates are defined for polygonal domains. In this work, we relax this requirement by representing the boundary of the domain using a Bézier spline and extend the complex-valued Cauchy barycentric coordinates [Weber et al. 2009] to the Bézier case. Compared to the latest polynomial 2D Green coordinates [Michel and Thiery 2023], we obtain equivalent results. We further derive a numerical integration formula for the inverse mapping based on Cauchy’s integral formula, enabling deformation between curved cages through an intermediate step. Notably, our approach allows curved cages as input. We also provide expressions for the nth-order derivatives of the coordinates, which facilitate constrained deformations with position constraints. Through extensive experiments, we demonstrate the versatility of our coordinates for interactive deformation.

Abstract:
In this paper we study the problem of efficiently rendering images for embodied AI training workloads, where agent training involves rendering millions to billions of independent, low-resolution frames, often with simple lighting and shading, that serve as the agent’s observations of the world. To enable high-throughput training from images, we design a flexible, batch-mode rendering interface that allows state-of-the-art GPU-accelerated batch world simulators to efficiently communicate with high-performance rendering backends. Using this interface we architect and compare two high-performance renderers: one based on the GPU hardware-accelerated graphics pipeline and a second based on a GPU software implementation of ray tracing. To evaluate these renderers and encourage further research by the graphics community in this area, we build a rendering benchmark for this under-explored regime. We find that the ray tracing renderer outperforms the rasterization-based solution across the benchmark on a datacenter-class GPU, while also performing competitively in geometrically complex environments on a high-end consumer GPU. When tasked to render large batches of independent 128 × 128 images, the ray tracer can exceed 100,000 frames per second per GPU for simple scenes, and exceed 10,000 frames per second per GPU on geometrically complex scenes from the HSSD dataset.

Abstract:
Detailed human surface capture from multiple images is an essential component for many 3D production, analysis and transmission tasks. Yet producing millimetric precision 3D models in practical time, and actually verifying their 3D accuracy in a real-world capture context, remain key challenges due to the lack of specific methods and data for these goals. We propose two complementary contributions to this end. The first one is a highly scalable neural surface radiance field approach able to achieve millimetric precision by construction, while demonstrating high compute and memory efficiency. The second one is a novel dataset, MVMannequin, of clothed mannequin geometry captured with a high resolution hand-held 3D scanner paired with calibrated multi-view images, which makes it possible to verify the millimetric accuracy claim. Although our approach can produce such highly dense and precise geometry, we show how aggressive sparsification and optimizations of the neural surface pipeline allow estimations in minutes of computation time using only a few GB of GPU memory, while allowing for real-time millisecond neural rendering. On the basis of our framework and dataset, we show that our method achieves submillimetric accuracy and completeness for 77% of the points in less than 3 minutes of training time, with 68 viewpoints.

Abstract:
The L1 Hessian energy measures the norm of the Hessian of a function on a surface (and not the squared norm, as is common with many geometry applications that employ L2). Its minimizers tend to be locally linear with a sparse set of curved ridges. We introduce a fully-intrinsic discretization of this energy for triangle meshes and show that it can be optimized using off-the-shelf conic program solvers. We apply it to stylization, denoising, interpolation, hole-filling, and segmentation tasks. Our L1 approach exhibits multiple important differences from its more-familiar L2 counterpart: it preserves ridge-like features in the input, it naturally incorporates a flatness prior for reconstruction, and, at its extreme, it distills its input to an abstract, angular form.

Abstract:
Monte Carlo methods have been widely adopted in physics-based rendering. A key property of a Monte Carlo estimator is its variance, which dictates the convergence rate of the estimator. In this paper, we devise a mathematical formulation for derivatives of rendering variance with respect to not only scene parameters (e.g., surface roughness) but also sampling probabilities. Based on this formulation, we introduce unbiased Monte Carlo estimators for those derivatives. Our theory and algorithm enable variance-aware inverse rendering which alters a virtual scene and/or an estimator in an optimal way to offer a good balance between bias and variance. We evaluate our technique using several synthetic examples.
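To make the idea of differentiating variance with respect to sampling probabilities concrete, here is the standard identity for the simplest possible case. It is a textbook illustration under stated assumptions, not the formulation of the paper: a single-sample importance sampling estimator, an integrand f that does not depend on θ, a density that is positive wherever f is nonzero, and enough regularity to differentiate under the integral sign.

```latex
% Assumption: single-sample estimator F = f(X)/p_\theta(X), X \sim p_\theta,
% where only the sampling density p_\theta depends on \theta and I = \int f(x)\,dx.
\begin{align}
\mathrm{Var}[F] &= \int \frac{f(x)^2}{p_\theta(x)}\,dx - I^2, \\
\frac{\partial\,\mathrm{Var}[F]}{\partial\theta}
  &= -\int \frac{f(x)^2}{p_\theta(x)^2}\,\frac{\partial p_\theta(x)}{\partial\theta}\,dx
   = \mathbb{E}_{X\sim p_\theta}\!\left[-\frac{f(X)^2}{p_\theta(X)^3}\,\frac{\partial p_\theta(X)}{\partial\theta}\right].
\end{align}
```

Because I does not depend on θ in this setting, only the second moment contributes to the derivative, and the final expectation shows that the derivative can itself be estimated without bias by sampling from the same density.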

Abstract:
The joint reconstruction of shape and appearance for translucent objects from real-world data poses a challenge in computer graphics, especially when dealing with complex layered materials like leaves or paper. The traditional assumption of diffuse transmittance falls short, and more accurate Monte-Carlo-based models are often needed to reproduce their appearance. To accurately capture the translucent appearance, an acquisition system needs to be carefully designed. Additionally, there are three challenges for inverse rendering: First, a large number of unknown parameters make the optimization problem difficult. Second, the Monte Carlo (MC) renderer introduces noise, which the optimization is sensitive to, especially when dealing with complex material models such as rough dielectric surfaces and highly scattering participating media. Last, MC estimators using long light paths (up to 32 bounces in our case) create a large computation graph in memory, making the gradient back-propagation costly. To address those challenges, we present an affordable and fast acquisition pipeline that can capture spatially-varying reflectance and transmission at the same time, using a two-phase optimization. We first initialize the geometry with the traditional vision method and then fit a simple and fast appearance model. Thereafter, we use the estimated parameters to initialize a second optimization using a more expensive volumetric model, which converges faster and more reliably from this favorable starting position. We also introduce a way to analyze each parameter’s sensitivity to the noise in the measurements, which can be used in optimally selecting useful measurements for optimization. Furthermore, instead of iterating on the camera system, we also introduce a weighted ℓ2 loss as an alternative for selecting useful pixels from existing measurements.

Abstract:
Humans are uniquely sensitive to faces. Recognizing fine detail in faces plays an important role in social cognition and identity, and it is key to human interaction. In this work, we present the first quantitative study of the relative importance of face regions to human observers. We created a dataset of 960 unique models featuring localized geometry and texture distortions relevant to visual computing applications. We then conducted an extensive subjective study examining the perceptual saliency of facial regions through the lens of distortion visibility. Our study comprises over 18,000 comparisons and indicates non-trivial preferences across distortion types and facial areas. Our results provide relevant insights for algorithm design, and we demonstrate our data’s value in model compression applications.

Abstract:
One-light-at-a-time (OLAT) images sample a broader range of object appearance changes than images captured under constant lighting and are superior as input to object relighting. Although existing methods have produced reasonable relighting quality using OLAT images, they utilize surface-like representations, limiting their capacity to model volumetric objects, such as fur. Besides, their rendering process is time-consuming and still far from being usable in real-time applications. To address these issues, we propose OLAT Gaussians to build relightable representations of objects from multi-view OLAT images. We build our pipeline on 3D Gaussian Splatting (3DGS), which achieves real-time high-quality rendering. To augment 3DGS with relighting capability, we assign each Gaussian a learnable feature vector, serving as an index to query the object’s appearance field. Specifically, we decompose the appearance field into an incident illumination function and a scattering function. The former accounts for light transmittance and foreshortening effects, while the latter represents the object’s material properties to scatter light. Rather than using an off-the-shelf physically-based parametric rendering formulation, we model both functions using multi-layer perceptrons (MLPs). This makes our method suitable for various objects, e.g., opaque surfaces, semi-transparent volumes, fur, and fabrics. Given a camera view and a point light position, we compute each Gaussian’s color as the product of the light intensity, the incident illumination value, and the scattering value, and then render the target image through the 3DGS rasterizer. To enhance rendering quality, we further utilize a proxy mesh to provide OLAT Gaussians with normals to improve highlights and visibility cues to improve shadows. Extensive experiments demonstrate that our method produces state-of-the-art rendering quality with significantly more details in texture-rich areas than previous methods. Our method also achieves real-time rendering, allowing users to interactively modify camera views and point light positions to get immediate rendering results, which are not available from the offline rendering of previous methods.

Abstract:
Previous local guiding methods used 3D data structures to model spatial radiance variations but struggled with additional dimensions in the path integral, such as temporal changes in dynamic scenes. Extending these structures to higher dimensions also proves inefficient due to the curse of dimensionality. In this study, we investigate the potential of compact neural representations to model additional scene dimensions efficiently, thereby enhancing the performance of path guiding in specialized rendering applications, such as distributed effects including motion blur. We present an approach that models a higher dimensional spatio-temporal distribution through neural feature decomposition. Additionally, we present a cost-effective approximation that uses a lower-dimensional representation to model only a subspace, trained with a progressive strategy. We also investigate the benefits of modeling correlations with the additional dimensions on typical distributed ray tracing scenarios, including the motion blur effect in dynamic scenes, as well as spectral rendering. Experimental results demonstrate the effectiveness of our method in these applications.

Abstract:
Wire sculptures are important in both industrial applications and daily life. We introduce a novel fabrication strategy for wire sculptures with complex geometries: the target shape is tuned into a collision-free shape that the wire-bending machine can produce, and a human then bends it back to the target. The key challenge lies in tuning the fewest bending points, which is formulated as an "Optimizing Wire Reconfiguration" problem. We first fit the input target wire with consecutive line segments and circular segments to satisfy the bending manufacturing constraints for each segment, then generate the tuned wire through a bilevel optimization. This involves selecting the bending points at the upper level with a beam search strategy and determining the specific tuned angles at the lower level. We perform a thorough physical evaluation using a DIY wire-bending machine. The results show the effectiveness of our proposed approach in realizing a wide range of intricate and complex wire sculptures.

Abstract:
We present a novel method for computing a discrete skeleton from a shape represented by a point cloud or triangle mesh. Inspired by variational shape approximation, our approach optimizes the partitioning of the input shape by minimizing an error metric defined between medial axis samples (medial spheres) and their corresponding clusters. The metric combines plane-sphere and point-sphere distance terms and the balance between these two terms enables coarse skeletons to capture the main geometric features while denser skeletons achieve a uniform distribution of medial axis samples. The sampling of the medial axis is progressively refined through an automatic process that splits medial spheres with the highest errors. Our method’s efficiency also allows users to dynamically add or remove medial axis samples locally while the optimization process continuously updates the underlying partition. Skeleton connectivity is efficiently constructed by computing the dual of the optimized shape partition. Unlike previous approaches, our method does not rely on a predefined set of candidate spheres or an initial medial axis representation.

Abstract:
The stochastic nature of modern Monte Carlo (MC) rendering methods inevitably produces noise in rendered images for a practical number of samples per pixel. The problem of denoising these images has been widely studied, with most recent methods relying on data-driven, pretrained neural networks. In contrast, in this paper we propose a statistical approach to the denoising problem, treating each pixel as a random variable and reasoning about its distribution. Considering a pixel of the noisy rendered image, we formulate fast pair-wise statistical tests—based on online estimators—to decide which of the nearby pixels to exclude from the denoising filter. We show that for symmetric pixel weights and normally distributed samples, the classical Welch t-test is optimal in terms of mean squared error. We then show how to extend this result to handle non-normal distributions, using more recent confidence-interval formulations in combination with the Box-Cox transformation. Our results show that our statistical denoising approach matches the performance of state-of-the-art neural image denoising without having to resort to any computation-intensive pretraining. Furthermore, our approach easily generalizes to other quantities besides pixel intensity, which we demonstrate by showing additional applications to Russian roulette path termination and multiple importance sampling.
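The two ingredients named above, online estimators and the Welch t-test, can be sketched as follows. This is a generic illustration of those textbook components, not the authors' denoiser: the class and function names are hypothetical, the pixel weighting and Box-Cox extension are omitted, and turning the statistic into an accept/reject decision would additionally require a quantile of the t-distribution at the returned degrees of freedom.

```python
import math


class RunningStats:
    """Welford's online estimator of the mean and the unbiased sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance; requires at least two samples.
        return self._m2 / (self.n - 1)


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two pixels.

    `a` and `b` are RunningStats objects, each holding at least two samples,
    and at least one of them must have nonzero variance.
    """
    va = a.variance / a.n  # squared standard error of each mean
    vb = b.variance / b.n
    t = (a.mean - b.mean) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, dof


def stats_from(samples):
    """Convenience helper: feed samples one at a time, as a renderer would."""
    s = RunningStats()
    for x in samples:
        s.add(x)
    return s
```

Because the statistics are accumulated one sample at a time, nothing beyond a count, a mean, and a sum of squared deviations needs to be stored per pixel. A small |t| relative to the critical value suggests the two pixels are statistically compatible and could share a filter footprint.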

Abstract:
In inverse rendering, gradient-based methods, which have seen great progress in recent years, are typically used in conjunction with the Adam optimizer. While Adam usually improves convergence by temporally filtering gradients over previous iterations to reduce noise, it is not tailored to inverse rendering where the target signals (textures, volumes, or geometry) are usually piecewise smooth. Previous work has applied the inverse Laplacian operator to smooth gradients spatially, but this isotropic filtering can often lead to oversmoothing. We propose a spatiotemporal optimizer that can significantly speed up convergence over Adam by enforcing the optimization parameter updates to be piecewise smooth through a lightweight spatial domain cross-bilateral filter. We discuss different options for combining spatial filtering and Adam’s temporal filtering, and provide intuitions for different scenarios. We show that our filtering leads to significantly higher-quality reconstructions in different inverse problems including texture, volume and geometry recovery.

Abstract:
Previous neural sampling methods, primarily using analytical lobe mixtures and normalizing flows, often struggle with specular materials, particularly at grazing angles. Furthermore, they are limited to reflection, and do not handle transmission. Our key observation is that previous normalizing flows impose significant restrictions on their network architecture for easy computation of the Jacobian. However, for low-dimensional BSDF sampling, the Jacobian computation is not the bottleneck. Therefore, we propose to use diffusion models to importance sample full BSDFs. Our method has two variants, one for most reflective materials that learns a distribution on a disk, and the other for extremely specular reflective materials and full BSDFs, which learns a distribution on a sphere. Our equal-time evaluations show that our method outperforms normalizing flows and significantly surpasses them in certain specular materials.

Abstract:
Creating intelligent virtual agents with realistic conversational abilities necessitates a multimodal communication approach extending beyond text. Body gestures, in particular, play a pivotal role in delivering a lifelike user experience by providing additional context, such as agreement, confusion, and emotional states. This paper introduces an integration of a motion matching framework with a learning-based approach for generating gestures, suitable for multimodal, real-time, interactive conversational agents mimicking natural human discourse. Our gesture generation framework enables accurate synchronization with both the rhythm and semantics of spoken language accompanied by multimodal perceptual cues. It also incorporates gesture phasing theory from social studies to maintain critical gesture features while ensuring agile responses to unexpected interruptions and barging-in situations. Our system demonstrates responsiveness, fluidity, and quality beyond traditional turn-based gesture-generation methods.

Abstract:
For a fixed polygon, one can easily determine whether a point is inside or outside it using the winding number. However, deforming a given polygon based on a set of points with expected inside/outside labeling is much more difficult. It requires the winding number to be differentiable with respect to the locations of the inside/outside test point and the polygon vertices. We propose a method to address this even for a possibly self-intersecting 2D polygon through Gaussian kernel convolution. Our method can be applied to various problems such as resolving embedding issues (e.g., intersections), editing curves using an in-out brush, and offsetting curves with feature preservation.
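The easy half of the problem, the classical winding number of a fixed polygon, can be sketched as below. This is only the standard angle-sum test and includes none of the Gaussian smoothing that the abstract proposes. Its output is piecewise constant in the query point, so its gradient is zero almost everywhere, which illustrates why a smoothed formulation is needed for optimization. The function name is illustrative.

```python
import math


def winding_number(point, polygon):
    """Classical winding number of a closed polygon around `point`.

    `polygon` is a list of (x, y) vertices; the last vertex connects back to
    the first. The result is close to an integer when the point is not on
    the boundary: 0 outside, +1 inside a counterclockwise simple polygon,
    -1 inside a clockwise one.
    """
    px, py = point
    total = 0.0
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i][0] - px, polygon[i][1] - py
        bx, by = polygon[(i + 1) % n][0] - px, polygon[(i + 1) % n][1] - py
        # Signed angle subtended by edge (a, b) as seen from the query point.
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return total / (2.0 * math.pi)
```

For the counterclockwise unit square, the value is 1 (up to rounding) at the center and 0 at any exterior point. For self-intersecting polygons the same sum can yield values such as 2, which is one reason the winding number is preferred over parity tests when intersections must be handled.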

Abstract:
Existing 3D face parameterization methods are limited to human faces and/or require a large amount of manual work to prepare face-specific blendshapes. Unfortunately, many of the automated parameterization methods do not provide local controls for the different facial regions and methods that allow the integration of physics-based simulations also suffer from limited character compatibility and editability. We propose a human face anatomy-inspired 3D face parameterization method called Fabrig, which is quick to set up, transferable among various characters, easily editable, and compatible with physics-based simulations. Instead of using conventional volume-centric simulation for the face anatomy, our method innovatively uses cloth simulation for lighter computation. The parameterized faces support physics-based effects like collision and show skin details such as dynamic wrinkles. From our objective evaluation, we found that our method can parameterize the faces of various characters, ranging from realistic humans to non-humans, without any labor-demanding preparation work. The parameterized faces can be edited at the anatomical level while remaining intuitive to artists. Our evaluation results confirm that this new parameterization can accurately and naturally recreate the facial poses of a character or facial actions performed by a motion capture subject.

Abstract:
We present geometric methods for generating shapes that are characteristic of highly coiled hair. Different features become visually relevant when hairs are well-approximated by high-frequency helices instead of low-frequency curves, so we present algorithms for three such phenomena. First, a Fourier-based method for phase locking, the process by which disparate helices near the scalp coalesce into a single curl. Second, a method for period skipping which models individual helices deviating from the coalesced curl. Third, a non-linear optimization that directly generates the shapes of switchbacks, a.k.a. helical perversions, which heretofore could only be produced through direct physical simulation. By applying all three methods in tandem, we show that we can achieve richly detailed depictions of highly coiled hair.

Abstract:
Capturing materials from the real world avoids laborious manual material authoring. However, recovering high-fidelity Spatially Varying Bidirectional Reflectance Distribution Function (SVBRDF) maps from a few captured images is challenging due to its ill-posed nature. Existing approaches have made extensive efforts to alleviate this ambiguity issue by leveraging generative models with latent space optimization or extracting features with encoder-decoder variants. Although the rendered images at input views can match the input images, the problematic decomposition among maps leads to significant differences when rendered under novel views/lighting. We observe that for human eyes, besides individual images, the correlation (or the highlights variation) among input images also serves as an important hint to recognize the materials of objects. Hence, our key insight is to explicitly model this correlation in the SVBRDF acquisition network. To this end, we propose a correlation-aware encoder-decoder network to model the correlation features among the input images via a graph convolutional network by treating channel features from each image as a graph node. This way, the ambiguity among the maps is reduced significantly. However, several SVBRDF maps still tend to be over-smooth, leading to a mismatch in the novel-view rendering. The main reason is the uneven update of different maps caused by a single decoder for map interpretation. To address this issue, we further design an adapter-equipped decoder consisting of a main decoder and four tiny per-map adapters, where adapters are employed for individual map interpretation, together with fine-tuning, to enhance flexibility. As a result, our framework allows the optimization of the latent space with the input image feature embeddings as the initial latent vector and the fine-tuning of per-map adapters. Consequently, our method can outperform existing approaches both visually and quantitatively on synthetic and real data.

Abstract:
Synthesizing sound sources for modern physics-based animation is challenging due to rapidly moving, deforming, and vibrating interfaces that produce acoustic waves within the air domain. Not only must the methods synthesize sounds that are faithful and free of digital artifacts, but, in order to be practical, the methods should be easy to implement and support fast parallel hardware. Unfortunately, no current solutions satisfy these many conflicting constraints.

Abstract:
Interactive rendering of dynamic scenes with complex global illumination has been a long-standing problem in computer graphics. Recent advances in neural rendering demonstrate new promising possibilities. However, while existing methods have achieved impressive results, complex rendering effects (e.g., caustics) remain challenging. This paper presents a novel neural rendering method that is able to generate high-quality global illumination effects, including but not limited to caustics, soft shadows, and indirect highlights, for dynamic scenes with varying camera, lighting conditions, materials, and object transformations. Inspired by object-oriented transfer field representations, we employ deformable neural feature fields to implicitly model the impacts of individual objects or light sources on global illumination. By employing neural feature fields, our method gains the ability to represent high-frequency details, thus supporting complex rendering effects. We superpose these feature fields in latent space and utilize a lightweight decoder to obtain global illumination estimates, which allows our neural representations to spontaneously adapt to the contribution of individual objects or light sources to global illumination in a data-driven manner, thus further improving the quality. Our experiments demonstrate the effectiveness of our method on a wide range of scenes with complex light paths, materials, and geometries.

Abstract:
Fractured object reassembly is a challenging problem in computer vision and graphics with applications in industrial manufacturing and archaeology. Traditional methods based on shape descriptors and geometric registration often struggle with ambiguous features, resulting in lower accuracy. Recent data-driven methods are inherently affected by the representation and learning ability of the trained models. To address this, we propose a novel approach inspired by diffusion models and transformers. Our method applies diffusion denoising via a transformer to predict the pose parameter of each fragment, taking advantage of their global feature correlation and pose prior learning abilities. We evaluate our approach on a fractured object dataset and demonstrate superior performance compared to state-of-the-art methods. Our method offers a promising solution for accurate and robust fractured object reassembly, advancing the field in complex shape analysis and assembly tasks.

Abstract:
Importance sampling using a light tree (i.e., a hierarchy of light clusters) has been widely used for many-light rendering. This technique samples a light source by stochastically traversing the tree according to the importance of each node. While this importance should be close to the illumination integral for each node’s light cluster, it is infeasible to compute the exact solution. Therefore, existing methods used a rough approximation (e.g., upper bound), which results in significant Monte Carlo (MC) variance, especially for high-frequency microfacet BRDFs at grazing angles. In this paper, we present a more accurate approximation of the importance based on spherical Gaussians (SGs). Our method represents a light cluster with an SG light for each node, and analytically approximates the product integral of the SG light and a BRDF. Although high-quality SG lighting approximations have been studied, they could not be used for the node importance due to violations of an unbiased sampling constraint. To improve the sampling quality and satisfy the constraint for anisotropic microfacet BRDFs, we introduce a new high-quality SG lighting approximation by extending an NDF filtering method that has been used for specular antialiasing. For diffuse surfaces, we also present a simpler and more accurate SG lighting than the state-of-the-art SG approximation, satisfying the constraint. Using our method, we can efficiently reduce the MC variance for many-light scenes with modern physically plausible materials.
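
The stochastic traversal described above can be sketched as follows. This is a minimal illustration of generic light-tree sampling, not the paper's SG-based method: the `LightNode` class, the `sample_light` function, and the user-supplied `importance` callback are hypothetical names introduced here, and the tree is assumed to be binary. Any importance approximation (an upper bound, or an SG-based estimate) could be plugged in as the callback.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LightNode:
    """Node of a binary light tree (hypothetical layout); leaves hold a light index."""
    left: Optional["LightNode"] = None
    right: Optional["LightNode"] = None
    light_index: Optional[int] = None  # set only on leaf nodes


def sample_light(
    root: LightNode,
    importance: Callable[[LightNode], float],
    rng: random.Random,
) -> Tuple[Optional[int], float]:
    """Walk from the root to a leaf, picking each child with probability
    proportional to its importance. Returns (light index, selection probability).

    For the Monte Carlo estimator to stay unbiased, `importance` must be
    positive for every node whose cluster contributes to the shading point.
    """
    node, pmf = root, 1.0
    while node.light_index is None:
        w_left = importance(node.left)
        w_right = importance(node.right)
        total = w_left + w_right
        if total <= 0.0:
            return None, 0.0  # neither child is considered to contribute
        p_left = w_left / total
        if rng.random() < p_left:
            node, pmf = node.left, pmf * p_left
        else:
            node, pmf = node.right, pmf * (1.0 - p_left)
    return node.light_index, pmf
```

The returned probability is the product of the per-level choices, so dividing a sampled light's contribution by it gives the usual importance-sampled estimate. The closer the node importance is to the true illumination integral of its cluster, the lower the variance of that estimate.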

Abstract:
While explicit representations of shapes such as triangular and tetrahedral meshes are often used for boundary surfaces and 3D volumes bounded by closed surfaces, implicit representations of planar regions and volumetric regions defined by level-set functions have also found widespread applications in geometric modeling and simulations. However, an important computational tool, the L2-orthogonal Hodge decomposition for scalar and vector fields defined on implicit representations under commonly used Dirichlet/Neumann boundary conditions with proper correspondence to the topology presents additional challenges. For instance, the projection to the interior or boundary of the domain is not as straightforward as in the mesh-based frameworks. Thus, we introduce a comprehensive 5-component Hodge decomposition that unifies normal and tangential components in the Cartesian representation. Numerical experiments on various objects, including single-cell RNA velocity, validate the effectiveness of our approach, confirming the expected rigorous L2-orthogonality and the accurate cohomology.

Abstract:
Recent advances in text-to-image diffusion models have significantly enhanced image generation quality when trained on internet-scale data. However, existing methods are constrained by their reliance on image or scene-level conditions, limiting their ability to synthesize composable 3D objects in a complex scene. To address these limitations, we propose BlobGEN-3D, a novel approach that decouples compositional 3D scene representation from 2D image generation, enabling direct controllability in the 3D space while fully leveraging the capabilities of 2D diffusion models. Specifically, BlobGEN-3D utilizes object-level 3D blobs with rich textual descriptions as the 3D scene representation, which is amenable to 2D projection and is seamlessly integrable with 2D diffusion models. Based on this representation, we introduce an auto-regressive pipeline for freeview image generation by conditioning the pretrained blob-grounded 2D text-to-image diffusion model on the previously generated image. Our method has three key features: (i) modular representation of 3D scene elements; (ii) coherent cross-view 2D generation; and (iii) manipulation of object appearance in the generated image sequences. Our method not only competes with the existing multi-view and optimization-based approaches, but also offers object-level appearance control, which was not previously possible with alternatives that rely solely on scene-level descriptions or image captions.

Abstract:
Recent advancements in generative motion models have achieved remarkable results, enabling the synthesis of lifelike human motions from textual descriptions. These kinematic approaches, while visually appealing, often produce motions that fail to adhere to physical constraints, resulting in artifacts that impede real-world deployment. To address this issue, we introduce a novel method that integrates kinematic generative models with physics-based character control. Our approach begins by training a reward surrogate to predict the performance of the downstream non-differentiable control task, offering an efficient and differentiable loss function. This reward model is then employed to fine-tune a baseline generative model, ensuring that the generated motions are not only diverse but also physically plausible for real-world scenarios. The outcome of our processing is the Robot Motion Diffusion Model (RobotMDM), a text-conditioned kinematic diffusion model that interfaces with a reinforcement learning-based tracking controller. We demonstrate the effectiveness of this method on a challenging humanoid robot, confirming its practical utility and robustness in dynamic environments.

Abstract:
Text-to-image models have revolutionized content creation, enabling users to generate images from natural language prompts. While recent advancements in conditioning these models offer more control over the generated results, photography—a significant artistic domain—remains inadequately integrated into these systems. Our research identifies critical gaps in modeling camera settings and photographic terms within text-to-image synthesis. Vision-language models (VLMs) like CLIP and OpenCLIP, which typically drive the text conditions through cross-attention mechanisms of conditional diffusion models, struggle to represent numerical data like camera settings effectively in their textual space. To address these challenges, we present CameraSettings20k, a new dataset aggregated from RAISE [Dang-Nguyen et al. 2015], DDPD [Abuolaim and Brown 2020], and PPR10K [Liang et al. 2021]. Our curated dataset offers normalized camera settings for over 20,000 raw-format images, providing equivalent values standardized to a full-frame sensor. Furthermore, we introduce Camera Settings as Tokens, an embedding approach leveraging the LoRA adapter of Latent Diffusion Models (LDMs) to numerically control image generation based on photographic principles like focal length, aperture, film speed, and exposure time. Our experimental results demonstrate the effectiveness of the proposed approach to generate promising synthesized images obeying the photographic principles given the specified numerical camera settings. Furthermore, our work not only bridges the gap between camera settings and user-friendly photographic control in image synthesis but also sets the stage for future explorations into more physics-aware generative models.
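
To make "equivalent values standardized to a full-frame sensor" concrete, the sketch below applies the common crop-factor convention from photography. The abstract does not state how CameraSettings20k performs its normalization, so this is an assumption rather than the dataset's procedure; `CameraSettings` and `to_full_frame_equivalent` are hypothetical names.

```python
import math
from dataclasses import dataclass

# Diagonal of a 36 mm x 24 mm full-frame sensor.
FULL_FRAME_DIAGONAL_MM = math.hypot(36.0, 24.0)


@dataclass
class CameraSettings:
    focal_length_mm: float
    f_number: float
    iso: float
    exposure_time_s: float


def to_full_frame_equivalent(
    s: CameraSettings, sensor_width_mm: float, sensor_height_mm: float
) -> CameraSettings:
    """Convert settings to full-frame equivalents (assumed convention).

    crop factor = full-frame diagonal / sensor diagonal
    focal length and f-number scale with the crop factor,
    ISO scales with its square, exposure time is unchanged.
    """
    crop = FULL_FRAME_DIAGONAL_MM / math.hypot(sensor_width_mm, sensor_height_mm)
    return CameraSettings(
        focal_length_mm=s.focal_length_mm * crop,
        f_number=s.f_number * crop,
        iso=s.iso * crop ** 2,
        exposure_time_s=s.exposure_time_s,
    )
```

Under this convention, a 25 mm f/2 lens at ISO 100 on an 18 mm x 12 mm sensor (crop factor 2) corresponds to roughly 50 mm, f/4, and ISO 400 on full frame, with the same exposure time.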

Abstract:
We introduce an approach for simulating elastoplastic surfaces using quadratic through-the-thickness (Q3T) solid shell elements. Modeling the mechanics of deformable surfaces has been a cornerstone of graphics research for decades. Although thin shell models are suitable for many materials and applications, simulation-based planning of plastic forming processes requires attention to deformation in the thickness direction. Building on recent advances in the graphics community, we explore solid shell elements for modeling elastoplastic surfaces. Linear prism elements perform well for compressible materials such as thick cloth and foam mats. However, due to their inability to capture non-constant strain in the thickness direction, they suffer from severe locking artifacts when applied to incompressible and plastic materials. Q3T elements address this limitation with a minimal yet effective modification to linear prisms, resulting in significantly improved performance with only a moderate increase in computational cost. Through various examples, we demonstrate that Q3T elements closely match the qualitative behavior of reference simulations and provide accurate quantitative results compared to real-world deep drawing experiments.

Abstract:
Piano playing requires agile, precise, and coordinated hand control that stretches the limits of dexterity. Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR. In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music. To capture natural performances, we designed a markerless setup in which motions are reconstructed from multi-view videos using state-of-the-art pose estimation models. The motion data is further refined via inverse kinematics using the high-resolution MIDI key-pressing data obtained from sensors in a specialized Yamaha Disklavier piano. Leveraging the collected dataset, we developed a pipeline that can synthesize physically-plausible hand motions for musical scores outside of the dataset. Our approach employs a combination of imitation learning and reinforcement learning to obtain policies for physics-based bimanual control involving the interaction between hands and piano keys. To solve the sampling efficiency problem with the large motion dataset, we use a diffusion model to generate natural reference motions, which provide high-level trajectory and fingering (finger order and placement) information. However, the generated reference motion alone does not provide sufficient accuracy for piano performance modeling. We then further augmented the data by using musical similarity to retrieve similar motions from the captured dataset to boost the precision of the RL policy. With the proposed method, our model generates natural, dexterous motions that generalize to music from outside the training dataset.

Abstract:
This paper reconsiders how to distill knowledge from pretrained 2D diffusion models to guide 3D asset generation, in particular to generate complex 3D scenes: it should accept varied inputs, i.e., texts or images, to allow for flexible expression of requirement; objects in the scene should be style-consistent and decoupled with clearly modeled interactions, benefiting downstream tasks. We propose DIScene, a novel method for this task. It represents the entire 3D scene with a learnable structured scene graph: each node explicitly models an object with its appearance, textual description, transformation, geometry as a mesh attached with surface-aligned Gaussians; the graph’s edges model object interactions. With this new representation, objects are optimized in the canonical space and interactions between objects are optimized by object-aware rendering to avoid wrong back-propagation. Extensive experiments demonstrate the significant utility and superiority of our approach and that DIScene can greatly facilitate 3D content creation tasks.

Abstract:
Physics-based differentiable rendering requires estimating boundary path integrals emerging from the shift of discontinuities (e.g., visibility boundaries). Previously, although the mathematical formulation of boundary path integrals has been established, efficient and robust estimation of these integrals has remained challenging. Specifically, state-of-the-art boundary sampling methods all rely on primary-sample-space guiding precomputed using sophisticated data structures—whose performance tends to degrade for finely tessellated geometries.

Abstract:
Novel view synthesis of smoke scenes presents a challenging problem. Previous neural approaches have suffered from inadequate quality and inefficient training. We introduce NeuSmoke, an efficient framework for dynamic smoke reconstruction using neural transportation fields, enabling high-quality density reconstruction and novel-view synthesis from multi-view videos. Our framework consists of two stages. In the first stage, we design a novel neural fluid field representation, integrating the transport equation with neural transportation fields. This includes adaptive embedding of multiple time stamps to enhance the spatial-temporal consistency of the reconstructed smoke. In the second stage, we combine novel-view color and depth information, employing convolutional neural networks (CNNs) to refine the smoke reconstruction. Our model is over 10 times faster than previous physics-informed approaches. Extensive experiments demonstrate that our method surpasses existing techniques in novel view synthesis and volume density estimation on real-world and synthetic datasets.
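
For reference, the transport equation is commonly written as below for a density field advected by a velocity field. The abstract does not give the exact form used in NeuSmoke, so the symbols here (density \rho, velocity \mathbf{u}) are assumptions for illustration.

```latex
% Advection of a scalar density \rho(\mathbf{x}, t) by a velocity field \mathbf{u}(\mathbf{x}, t)
\frac{\partial \rho}{\partial t} + \mathbf{u} \cdot \nabla \rho = 0
```

In this form, density is carried along the flow without being created or destroyed, which is the kind of physical constraint that ties reconstructions at neighboring time stamps together.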

Abstract:
In imaging, various tasks require a comparison of images between displays with different emission spectra. Even if the displays are calibrated and predicted to match using traditional colorimetry, individual differences in color perception of human observers can cause colors on two displays to appear mismatched. This phenomenon, known as observer metameric failure, can cause problems at different stages of content production when multiple decision makers have to approve the image. In this paper, a new Calibrated Observer Metameric Failure Index (COMFI) is proposed for use in such cases to predict the possibility of observer metameric failure.

Abstract:
Automatically generating high-quality textures for complex scenes remains a significant challenge in computer graphics. Recent advances in text-to-texture synthesis using 2D diffusion models have yielded impressive results for individual objects but struggle to maintain style consistency and semantic alignment when applied to larger scenes. These methods often require extensive optimization time and substantial memory resources. To address these challenges, we present InstanceTex, a novel approach to creating realistic and style-consistent textures for large scenes containing multiple objects. The core idea of InstanceTex lies in the instance-level controllable texture synthesis, which utilizes an instance layout representation to allow precise semantic control over individual instances while maintaining overall style consistency. We also introduce a local synchronized multi-view diffusion strategy to improve local texture consistency by sharing the latent denoised content among neighboring views in a mini-batch. Additionally, we introduce Neural MipTexture, inspired by Mipmaps, specifically designed for scene texture mapping to minimize aliasing effects. Extensive texturing experiments on both indoor and outdoor scenes demonstrate that InstanceTex can produce high-quality and consistent textures that outperform existing texture generation methods in terms of quality and consistency.

Abstract:
Inkjet 3D printers produce solid shapes using very small voxels made of polymeric materials. While state-of-the-art methods for predicting the appearance of inkjet-printed objects assume a perfect grid, the printed patterns have an irregular material distribution due to complex spreading behavior. This irregularity makes appearance prediction tools imprecise. Here, we propose a fully differentiable method that models the material spreading behavior of inkjet printing. We use a differentiable simulator along with a differentiable volume renderer. Then, using an image of only one printed calibration pattern, we obtain a generalizable material spreading transformation that can be applied to an input, nominal grid of materials to produce the effective material grid. By taking into account the dynamics of the printing process, our method significantly outperforms state-of-the-art 3D printing appearance prediction models.

Abstract:
A mesh-based generative model of the inner-mouth system is presented, which includes teeth and gums for the upper and lower jaw, the tongue, and their placement inside the human head. The model is capable of capturing person-specific detail, enabling the creation of highly accurate avatars that exceed the quality of prior mesh-based representations. The method combines data from oral scans and facial performances captured in a multi-camera capture rig. The system employs a precise segmentation model that can differentiate complex tongue motion. A novel inverse-rendering formulation is used in a staged modeling procedure, producing accurate registration of tongue, teeth, and jaw. The system is demonstrated on novel held-out subjects, where we demonstrate highly accurate reconstructions that exceed prior mesh-based avatar representations.

Abstract:
This paper presents EVSplitting, an efficient and visually consistent splitting algorithm for 3D Gaussian Splatting (3DGS). It is designed to make operating on 3DGS as easy and effective as operating on other explicit 3D representations, making it ready for industrial production. The challenges of this goal are: 1) the huge number and complex attributes of 3DGS make it difficult to operate on 3DGS explicitly in a real-time and learning-free manner; 2) the visual effect of 3DGS is very difficult to maintain during explicit operations; and 3) the anisotropy of the Gaussians always leads to blurs and artifacts. As far as we know, no prior work addresses these challenges well. In this work, we introduce a direct and efficient 3DGS splitting algorithm to solve them. Specifically, we formulate 3DGS splitting as two minimization problems that aim to ensure visual consistency and to reduce Gaussian overflow across the boundary (splitting plane), respectively. First, we impose conservation of the zero-, first- and second-order moments of the weighted Gaussian distribution to guarantee visual consistency. Second, we reduce the boundary overflow with a special constraint on the aforementioned conservations. With these conservations and constraints, we derive a closed-form solution for the 3DGS splitting problem. This yields an easy-to-implement, plug-and-play, efficient and fundamental tool, benefiting various downstream applications of 3DGS.
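
As an illustration of what conserving the zero-, first- and second-order moments means, consider one weighted Gaussian with weight w, mean \mu, and covariance \Sigma that is split into two Gaussians (w_i, \mu_i, \Sigma_i), i = 1, 2. The notation is assumed here, and the abstract does not give the paper's exact formulation or its closed-form solution; the conditions below are the standard moment-matching identities for a Gaussian mixture.

```latex
\begin{aligned}
w &= w_1 + w_2 && \text{(zero-order: total weight)} \\
w\,\mu &= w_1\,\mu_1 + w_2\,\mu_2 && \text{(first-order: mean)} \\
w\left(\Sigma + \mu\mu^{\top}\right) &= \sum_{i=1}^{2} w_i\left(\Sigma_i + \mu_i\mu_i^{\top}\right) && \text{(second-order: raw second moment)}
\end{aligned}
```

These conditions leave free parameters in the two child Gaussians, which is where an additional constraint, such as limiting overflow across the splitting plane, can pick a particular solution.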

Abstract:
Existing research has shown that unpleasant haptic feedback, such as pain, can enhance the user experience and performance in various scenarios (e.g., entertainment and training). This paper introduces ThermOuch, a wearable thermo-haptic device that leverages the thermal grill illusion (TGI) to simulate pain sensations in virtual reality (VR) without causing actual invasive/non-invasive harm. The results of our user-perception experiments revealed that higher temperature-changing rates, particularly with increased warming, were associated with more intense pain perceived by the participants through our system. Furthermore, a higher ratio of warm-to-cool temperature transitions reduced the sensation of coldness prior to pain. Our experiments also showed that introducing an additional stimulus unit potentially heightened pain perception, and altering the spacing between stimulus units modified the perceived pain area. Lastly, the user study in VR demonstrated that ThermOuch significantly enhanced the sense of presence and body ownership for the participants, as well as elevated their biosignal-indicated arousal levels.